BCB 444/544 Lecture 19 A bit of: Protein Structure - Basics Protein Structure Visualization, Classification & Comparison #19_Oct05 BCB 444/544 F07 ISU Dobbs #19- Protein Structure Basics & Classification
Required Reading (before lecture) √ Mon Oct 1 - Lecture 17 Protein Motifs & Domain Prediction • Chp 7 - pp 85-96 √ Wed Oct 3 - Lecture 18 Protein Structure: Basics (Note chg in Lecture Schedule online) • Chp 12 - pp 173-186 √ Thurs Oct 4 & Fri Oct 5 - Lab 6 & Lecture 19 Protein Structure: Basics, Databases, Visualization, Classification & Comparison • Chp 13 - pp 187-199
BCB 544 - Extra Required Reading Assigned Mon Sept 24 BCB 544 Extra Required Reading Assignment: for 544 Extra HW#1 Task 2 • Pollard KS, …., Haussler D. (2006) An RNA gene expressed during cortical development evolved rapidly in humans. Nature 443: 167-172. • http://www.nature.com/nature/journal/v443/n7108/abs/nature05113.html doi:10.1038/nature05113 • PDF available on class website - under Required Reading Link
BCB 544 Projects (Optional for BCB 444) • For a better idea about what's involved in the Team Projects, please look over last year's expectations for projects: http://www.public.iastate.edu/~f2007.com_s.544/project.htm • Criteria for evaluation of projects (oral presentations) are summarized here: http://www.public.iastate.edu/%7Ef2007.com_s.544/homework/HW7.pdf Please note: a wrong URL (instead of that shown above) was included in the originally posted 544ExtraHW#1; the corrected version is posted now
Assignments & Announcements - #1 Students registered for BCB 444: Two Grading Options 1) Take Final Exam per original Grading Policies 2) Instead of taking Final Exam - you may participate in a Team Research Project If you choose #2, please do 3 things: • Contact Drena (in person) • Send email to Michael Terribilini (terrible@iastate.edu) • Complete 544 Extra HW#1 - Task 1.1 by noon on Mon Oct 1
Assignments & Announcements - #2
BCB 444s (Standard):
200 pts Midterm Exams = 100 points each
200 pts Homework & Laboratory assignments
100 pts Final Exam
500 pts Total for BCB 444
BCB 444p (Project):
200 pts Midterm Exams = 100 points each
200 pts Homework & Laboratory assignments
190 pts Team Research Project
590 pts Total for BCB 444p
BCB 544:
200 pts Midterm Exams = 100 points each
200 pts Homework & Laboratory assignments
100 pts Final Exam
200 pts Discussion Questions & Team Research Projects
700 pts Total for BCB 544
Assignments & Announcements #3 ALL: Homework #3 Due: Mon Oct 8 by 5 PM • HW544: HW544Extra #1 √ Due: Task 1.1 - Mon Oct 1 by noon Due: Task 1.2 & Task 2 - Fri Oct 12 by 5 PM (not Monday) • 444 "Project-instead-of-Final" students should also submit: • HW544Extra #1 • Due: Task 1.1 - Mon Oct 8 by noon • Due: Task 1.2 - Fri Oct 12 by 5 PM (not Monday) Task 2 NOT required!
QUESTIONS re: HW#3? Due Mon
HMM example from Eddy HMM paper: Toy HMM for Splice Site Prediction This is a new slide
An HMM for Occasionally Dishonest Casino Transition probabilities • Prob(Fair → Loaded) = 0.01 • Prob(Loaded → Fair) = 0.2 But, where do you start? "Begin" state not shown
Occasionally Dishonest Casino - HW#3 "Begin" state? 50:50 chance of starting with F vs L die
Calculating Different Paths to an Observed Sequence (This slide has been changed) [Figure labels: transition probability, emission probability] Calculations such as those shown on this slide are used to fill a matrix with probability values for every state at every position
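A minimal sketch of the path calculation described on this slide: the probability of one particular state path is the start probability times the product of a transition probability and an emission probability at each position. The transition values and the 50:50 start come from the casino slides. The loaded-die emissions (a 6 with probability 1/2, every other face with 1/10) and the names `START`, `TRANS`, `emit`, and `path_probability` are assumptions made for illustration.

```python
from fractions import Fraction as Fr

# Start and transition probabilities as given on the casino slides.
START = {"F": Fr(1, 2), "L": Fr(1, 2)}
TRANS = {
    "F": {"F": Fr(99, 100), "L": Fr(1, 100)},
    "L": {"F": Fr(1, 5), "L": Fr(4, 5)},
}


def emit(state, roll):
    """Emission probability of a roll (assumed loaded die: 6 -> 1/2, others 1/10)."""
    if state == "F":
        return Fr(1, 6)
    return Fr(1, 2) if roll == 6 else Fr(1, 10)


def path_probability(rolls, path):
    """Joint probability of the observed rolls and one specific state path."""
    prob = START[path[0]] * emit(path[0], rolls[0])
    for i in range(1, len(rolls)):
        # One transition term and one emission term per position.
        prob *= TRANS[path[i - 1]][path[i]] * emit(path[i], rolls[i])
    return prob


if __name__ == "__main__":
    for path in ("FFF", "LLL", "FLL"):
        print(path, float(path_probability([6, 2, 6], path)))
```

Exact fractions are used here only so that the products can be compared with hand calculations without rounding noise.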
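Calculate optimal path? Construct a matrix of probability values for every state at every residue How: one way = Viterbi Algorithm (v_k(i) = probability of the best path ending in state k at position i; e_k = emission probability; a_jk = transition probability from state j to state k)
• Initialization (i = 0): v_B(0) = 1 for the Begin state, v_k(0) = 0 for every other state k
• Recursion (i = 1, . . . , L): For each state k: v_k(i) = e_k(x_i) · max_j [ v_j(i-1) · a_jk ]
• Termination: P(x, π*) = max_k v_k(L)
To find π*, use trace-back, as in dynamic programming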
Viterbi for Calculating Most Probable Path*
Observed rolls x = 6, 2, 6 (columns: Begin | x1 = 6 | x2 = 2 | x3 = 6)
B: 1 | 0 | 0 | 0
F: 0 | (1/6)(1/2) = 1/12 | (1/6) max{(1/12)(0.99), (1/4)(0.2)} = 0.01375 | (1/6) max{(0.01375)(0.99), (0.02)(0.2)} = 0.00226875
L: 0 | (1/2)(1/2) = 1/4 | (1/10) max{(1/12)(0.01), (1/4)(0.8)} = 0.02 | (1/2) max{(0.01375)(0.01), (0.02)(0.8)} = 0.008
* Path within HMM that matches query sequence with highest probability
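The matrix on this slide can be reproduced with a short Viterbi sketch. It assumes the rolls are 6, 2, 6 (the order implied by the loaded-row emission factors 1/2, 1/10, 1/2) and that the loaded die emits a 6 with probability 1/2 and any other face with 1/10. The names `STATES`, `START`, `TRANS`, `emit`, and `viterbi` are illustrative, not taken from the lecture.

```python
# Casino HMM: transitions and the 50:50 start are from the slides;
# the loaded-die emission values are an assumption implied by the table.
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}


def emit(state, roll):
    """Emission probability of a roll from the fair (F) or loaded (L) die."""
    if state == "F":
        return 1 / 6
    return 0.5 if roll == 6 else 0.1


def viterbi(rolls):
    """Return (matrix, best_path, best_probability) for the observed rolls."""
    # Column for position 1: leave the Begin state, then emit the first roll.
    v = [{k: START[k] * emit(k, rolls[0]) for k in STATES}]
    back = []  # back[i][k] = best predecessor of state k at position i + 2
    for roll in rolls[1:]:
        prev, col, ptr = v[-1], {}, {}
        for k in STATES:
            best = max(STATES, key=lambda j: prev[j] * TRANS[j][k])
            col[k] = emit(k, roll) * prev[best] * TRANS[best][k]
            ptr[k] = best
        v.append(col)
        back.append(ptr)
    # Termination: best final state, then trace back through the pointers.
    last = max(STATES, key=lambda k: v[-1][k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return v, "".join(reversed(path)), v[-1][last]


if __name__ == "__main__":
    matrix, path, prob = viterbi([6, 2, 6])
    for column in matrix:
        print(column)
    print(path, prob)
```

Under these assumptions the last column holds 0.00226875 for F and 0.008 for L, so the trace-back starts from L.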
Total Probability Several different paths can result in observation x Probability that our model will emit x is: P(x) = Σ_π P(x, π), summed over all paths π
Calculating the Total Probability: (This slide has been changed)
Observed rolls x = 6, 2, 6 (columns: Begin | x1 = 6 | x2 = 2 | x3 = 6)
B: 1 | 0 | 0 | 0
F: 0 | (1/6)(1/2) = 1/12 | (1/6) sum{(1/12)(0.99), (1/4)(0.2)} = 0.022083 | (1/6) sum{(0.022083)(0.99), (0.020083)(0.2)} = 0.004313
L: 0 | (1/2)(1/2) = 1/4 | (1/10) sum{(1/12)(0.01), (1/4)(0.8)} = 0.020083 | (1/2) sum{(0.022083)(0.01), (0.020083)(0.8)} = 0.008144
Total probability = 0 + 0.004313 + 0.008144 = 0.012
Note: This is not the same as the matrix on the previous slide! Here each cell takes a sum (rather than a max) over the previous column, and the total probability is the sum of the last column
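The same matrix-filling idea gives the total probability if the max is replaced by a sum (the forward algorithm). The sketch below makes the same assumptions as the hand calculation: rolls 6, 2, 6, a 50:50 start, and a loaded die that emits a 6 with probability 1/2 and other faces with 1/10. The names `STATES`, `START`, `TRANS`, `emit`, and `forward` are illustrative.

```python
# Casino HMM: transitions and the 50:50 start are from the slides;
# the loaded-die emission values are an assumption implied by the table.
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}


def emit(state, roll):
    """Emission probability of a roll from the fair (F) or loaded (L) die."""
    if state == "F":
        return 1 / 6
    return 0.5 if roll == 6 else 0.1


def forward(rolls):
    """Return (matrix, total_probability) for the observed rolls."""
    f = [{k: START[k] * emit(k, rolls[0]) for k in STATES}]
    for roll in rolls[1:]:
        prev = f[-1]
        # Sum over all predecessor states instead of taking the best one.
        f.append({
            k: emit(k, roll) * sum(prev[j] * TRANS[j][k] for j in STATES)
            for k in STATES
        })
    # Total probability: sum of the last column.
    return f, sum(f[-1].values())


if __name__ == "__main__":
    matrix, total = forward([6, 2, 6])
    for column in matrix:
        print(column)
    print(round(total, 6))
```

Because every path contributes to the sum, the total is necessarily at least as large as the probability of the single best path found by Viterbi.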
A few more Details re: Profiles & HMMs • Smoothing or "Regularization" - method used to avoid "over-fitting" • Common problem in machine learning (data-driven) approaches • Limited training sample size causes over-representation of observed characters while "ignoring" unobserved characters • Result? Miss members of family not yet sampled (too many false negative hits) • Pseudocounts - adding artificial values for 'extra' amino acid(s) not observed in the training set • Treated as 'real' values in calculating probabilities • Improve predictive power of profiles & HMMs • Dirichlet mixture - commonly used mathematical model to simulate the aa distribution in a sequence alignment • To "correct" problems in an observed alignment based on limited number of sequences
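To make the pseudocount idea concrete, here is a small sketch that turns one alignment column into amino acid probabilities. It uses the simplest possible scheme, a constant pseudocount added to every residue (add-one smoothing by default), which is an assumption chosen for illustration and is not a Dirichlet mixture. The names `AMINO_ACIDS` and `column_probabilities` are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_probabilities(column, pseudocount=1.0):
    """Estimate residue probabilities for one alignment column.

    `column` is a string of residues observed at one position of an MSA.
    A constant pseudocount is added to every amino acid, so residues
    that were never observed still get a non-zero probability.
    """
    counts = Counter(column)
    total = len(column) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


if __name__ == "__main__":
    raw = column_probabilities("LLLLI", pseudocount=0.0)
    smoothed = column_probabilities("LLLLI")
    # Without pseudocounts an unobserved residue such as V scores zero.
    print(raw["L"], raw["V"])
    print(smoothed["L"], smoothed["V"])
```

With only five sequences in the column, the unsmoothed estimate would reject any family member carrying a residue other than L or I at this position, which is the false-negative problem the slide describes.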
Chp 7 - Protein Motifs & Domain Prediction SECTION II SEQUENCE ALIGNMENT Xiong: Chp 7 Protein Motifs and Domain Prediction • √Identification of Motifs & Domains in MSAs • √Motif & Domain Databases Using Regular Expressions • √Motif & Domain Databases Using Statistical Models • Protein Family Databases • Motif Discovery in Unaligned Sequences • √Sequence Logos
Motifs & Domains • Motif - short conserved sequence pattern • Associated with distinct function in protein or DNA • Avg = 10 residues (usually 6-20 residues) • e.g., zinc finger motif - in protein • e.g., TATA box - in DNA • Domain - "longer" conserved sequence pattern, defined as an independent functional and/or structural unit • Avg = 100 residues (range from 40-700 in proteins) • e.g., kinase domain or transmembrane domain - in protein • Domains may (or may not) include motifs
2 Approaches for Representing "Consensus" Information in Motifs & Domains • Regular expression - symbolic representation of information from MSA • e.g., protein phosphorylation site motif: [S,T]- X- [R,K] • Symbols represent specific or unspecified residues, spaces, etc. • 2 mechanisms for matching: • Exact • "Fuzzy" (inexact, approximate) - flexible, more permissive to detect "near matches" • Statistical model - includes probability information derived from MSA • e.g., PSSM, Profile, or HMM
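As an illustration of exact regular-expression matching, the phosphorylation site motif [S,T]-X-[R,K] from this slide can be written as the Python pattern `[ST].[RK]`. The sketch below wraps it in a lookahead so that overlapping sites are also reported. The function name `find_sites` and the test sequences are made up for the example.

```python
import re

# [S,T]-X-[R,K]: Ser or Thr, then any residue, then Arg or Lys.
# The lookahead (?=...) lets overlapping occurrences be found.
SITE = re.compile(r"(?=([ST].[RK]))")


def find_sites(sequence):
    """Return (1-based start position, matched residues) for every site."""
    return [(m.start() + 1, m.group(1)) for m in SITE.finditer(sequence)]


if __name__ == "__main__":
    print(find_sites("AASARKTGK"))
    print(find_sites("STKK"))
```

This is exact matching only: every position must satisfy the pattern, and no probability information is used, which is the limitation that statistical models such as PSSMs and HMMs address.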
Motif & Domain Databases Based on regular expressions: • PROSITE (InterPro includes PROSITE, PRINTS, etc.) • EMOTIF Limitation: these don't take probability info into account Based on statistical models: • PRINTS • BLOCKS • ProDom • Pfam • SMART • CDART • Reverse PSI-BLAST • READ your textbook & try some of these at home; there are distinct advantages/disadvantages associated with each • TAKE HOME LESSON: Always try several methods! (not just one!)
Protein Family Databases • In addition to databases of "related" protein sequences, based on shared motifs or domains (Pfam, BLOCKS, CDART), some databases "cluster" sequences into families based on near full-length sequence comparisons • COGs - Clusters of Orthologous Groups (at NCBI) • Mostly Prokaryotic sequences • KOG = newer Eukaryotic version • COGnitor - software to search database • ProtoNet - also clusters of homologous protein sequences • Advantages: tree-like hierarchical structure • Provides GO (gene ontology) annotations • Provides InterPro keywords
Motif Discovery in Unaligned Sequences Expectation Maximization (EM) - generate "random" alignment of all sequences, derive PSSM, iteratively match individual sequences to PSSM to edit & improve it Problems? Can hit a local optimum (premature convergence) Sensitive to initial alignment • MEME - Multiple EM for Motif Elicitation - modified EM, avoids local optimum issues; two-step procedure Gibbs Sampling - generate "trial" PSSM from random alignment first, as in EM, but leave one sequence out of initial alignment, then iteratively match PSSM to left-out sequences • Gibbs Sampler - web-based motif search via Gibbs sampling • Not mentioned in textbook: • Stochastic context-free grammars • Other "state of the art" approaches in recent literature, but not available in web-based servers (yet)
Chp 12 - Protein Structure Basics SECTION V STRUCTURAL BIOINFORMATICS Xiong: Chp 12 Protein Structure Basics • LAB 6 • Introduction to Protein DataBank - PDB • PyMol • Cn3D?
Chp 12 - Protein Structure Basics SECTION V STRUCTURAL BIOINFORMATICS Xiong: Chp 12 Protein Structure Basics • Amino Acids • Peptide Bond Formation • Dihedral Angles • Hierarchy • Secondary Structures • Tertiary Structures • Determination of Protein 3-Dimensional Structure • Protein Structure DataBank (PDB)
Protein Structure & Function • Protein structure - primarily determined by sequence • Protein function - primarily determined by structure • Globular proteins: compact hydrophobic core & hydrophilic surface • Membrane proteins: special hydrophobic surfaces • Folded proteins are only marginally stable • Some proteins do not assume a stable "fold" until they bind to something = Intrinsically disordered • Predicting protein structure and function can be very hard -- & fun!
4 Basic Levels of Protein Structure
Primary & Secondary Structure • Primary • Linear sequence of amino acids • Description of covalent bonds linking aa's • Secondary • Local spatial arrangement of amino acids • Description of short-range non-covalent interactions • Periodic structural patterns: α-helix, β-sheet
Tertiary & Quaternary Structure • Tertiary • Overall 3-D "fold" of a single polypeptide chain • Spatial arrangement of 2° structural elements; packing of these into compact "domains" • Description of long-range non-covalent interactions (plus disulfide bonds) • Quaternary • In proteins with > 1 polypeptide chain, spatial arrangement of subunits
"Additional" Structural Levels • Super-secondary elements • Motifs • Domains • Foldons
Amino Acids • Each of 20 different amino acids has different "R-Group" or side chain attached to Cα
Peptide Bond is Rigid and Planar
Hydrophobic Amino Acids
Charged Amino Acids
Polar Amino Acids
Certain Side-chain Configurations are Energetically Favored (Rotamers) Ramachandran plot: "Allowable" psi & phi angles
Glycine is Smallest Amino Acid R group = H atom • Glycine residues increase backbone flexibility because their R group is only an H atom
Proline is Cyclic • Proline residues reduce flexibility of polypeptide chain • Proline cis-trans isomerization is often a rate-limiting step in protein folding • Recent work suggests it may also regulate ligand binding in native proteins Andreotti (BBMB)
Cysteines can Form Disulfide (S-S) Bonds • Disulfide bonds (covalent) stabilize 3-D structures • In eukaryotes, disulfide bonds are often found in secreted proteins or extracellular domains
Globular Proteins Have a Compact Hydrophobic Core • Packing of hydrophobic side chains into interior is main driving force for folding • Problem? Polypeptide backbone is highly polar (hydrophilic) due to polar -NH and C=O in each peptide unit (which carry partial charges, although they are not ionized at the neutral pH ≈ 7 found in biological systems); these polar groups must be neutralized • Solution? Form regular secondary structures, e.g., α-helix, β-sheet, stabilized by H-bonds
Exterior Surface of Globular Proteins is Generally Hydrophilic • Hydrophobic core formed by packed secondary structural elements provides compact, stable core • "Functional groups" of protein are attached to this framework; exterior has more flexible regions (loops) and polar/charged residues • Hydrophobic "patches" on protein surface are often involved in protein-protein interactions
Protein Secondary Structures • Helices • Sheets • Loops • Coils
α-Helix: Stabilized by H-bonds between every ~4th residue in Backbone • Atom colors: C = black, O = red, N = blue, H = white • Look! - Partial charges on backbone are "neutralized" by hydrogen bonds (H-bonds) - red fuzzy vertical bonds
Certain Amino Acids are "Preferred" & Others are Rare in Helices • Ala, Glu, Leu, Met = good helix formers • Pro, Gly, Tyr, Ser = very poor • Amino acid composition & distribution varies, depending on location of helix in 3-D structure
β-Sheets - also Stabilized by H-bonds Between Backbone Atoms • Anti-parallel • Parallel
Loops • Connect helices and sheets • Vary in length and 3-D configurations • Are located on surface of structure • Are more "tolerant" of mutations • Are more flexible and can adopt multiple conformations • Tend to have charged and polar amino acids • Are frequently components of active sites • Some fall into distinct structural families (e.g., hairpin loops, reverse turns)
Coils • Regions of 2' structure that are not helices, sheets, or recognizable turns • Intrinsically disordered regions appear to play important functional roles
Chp 13 - Protein Structure Visualization, Comparison & Classification SECTION V STRUCTURAL BIOINFORMATICS Xiong: Chp 13 Protein Structure Visualization, Comparison & Classification • Protein Structural Visualization • Protein Structure Comparison • Protein Structure Classification