Identification of Protein Domains Eden Dror Menachem Schechter Computational Biology Seminar 2004
Overview • Introduction to protein domains. • Classification of homologs. • Representing a domain. • PSSM • HMM • Internet resources • Pfam • SMART • PROSITE • InterPro • Research example.
Protein domains • A discrete portion of a protein assumed to fold independently, and possessing its own function. • Mobile domain (“module”): a domain that can be found associated with different domain combinations in different proteins.
Protein domains • The assumption: The domain is the fundamental unit of protein structure and function. • Protein family – all proteins containing a specific domain.
What can we learn from them? • Common ancestry and homology relationships within a set of proteins. • Homology can be used to infer properties of a protein, such as function and localization. • Therefore, domains can be used to assign a new protein to a family and so infer its function.
Classification of homologs • Homology alone is not a sufficiently precise term to describe the evolutionary relationships between genes. • Homologous genes can arise in two major ways: • Gene duplication (within the same species). • Speciation (splitting of one species into two).
Classification of homologs • Orthologs: two genes from two different species that derive from a single gene in the last common ancestor of the species. • Paralogs: two genes that derive from a single gene that was duplicated within a genome.
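The two definitions can be read off a gene tree: a pair of genes is orthologous when their last common ancestor is a speciation event, and paralogous when it is a duplication. The sketch below assumes a hypothetical toy tree (`TREE`, its node names, and the event labels are invented for illustration) and is not taken from the seminar.

```python
# Hypothetical toy gene tree: node -> (parent, event). The event is
# "speciation" or "duplication" for internal nodes and None for leaves.
TREE = {
    "root":    (None,   "duplication"),  # ancient duplication into copies A and B
    "A":       ("root", "speciation"),   # copy A splits when the species split
    "B":       ("root", "speciation"),   # copy B splits when the species split
    "human_A": ("A", None),
    "worm_A":  ("A", None),
    "human_B": ("B", None),
    "worm_B":  ("B", None),
}


def ancestors(node):
    """Return the path from a node up to the root, starting with the node."""
    path = []
    while node is not None:
        path.append(node)
        node = TREE[node][0]
    return path


def classify(gene1, gene2):
    """Orthologs if the last common ancestor is a speciation, else paralogs."""
    seen = set(ancestors(gene1))
    lca = next(n for n in ancestors(gene2) if n in seen)
    return "orthologs" if TREE[lca][1] == "speciation" else "paralogs"


print(classify("human_A", "worm_A"))   # last common ancestor: speciation node A
print(classify("human_A", "human_B"))  # last common ancestor: root duplication
```

In this toy tree, `human_A` and `worm_B` are also paralogs, because the duplication at the root separates them even though they sit in different species.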
Classification of homologs • [Figure: diagram labelling ortholog (ortho) and paralog (para) relationships.]
Classification of homologs • Inparalogs - paralogs that evolved by gene duplication after the speciation event. • Outparalogs - paralogs that evolved by gene duplication before the speciation event.
Classification of homologs • When comparing human with worm: [Figure: diagram labelling inparalogs (in-para) and outparalogs (out-para).]
What can we learn from them? • Orthologous proteins are evolutionary, and typically functional, counterparts in different species. • Paralogous proteins are important for detecting lineage-specific adaptations. • Both can reveal information about a specific species or a set of species.
Protein domains – summary • By identifying domains we can: • Infer the function and localization of a protein. • Learn about a specific species. • Learn about a set of species as a group.
Domain representation • Different methods to represent (model) domains: • Patterns (regular expressions). • PSSM (Position specific score matrix). • HMM (Hidden Markov model).
PSSM • Position specific score matrix • A matrix holding the score for observing each amino acid at each position of the modeled sequence region. • Based on the independent probabilities P(a|i) of observing amino acid a in position i.
PSSM: Identifying a domain • Given a sequence and a PSSM: • Run over all positions. • Score each sub-sequence according to the matrix.
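As a minimal sketch of the scan described above, the code below builds a log-odds PSSM from an ungapped alignment and scores every window of a query sequence. The alignment, the uniform background frequency, and the pseudocount are assumptions chosen for illustration; real tools use measured background frequencies and more careful estimates of P(a|i).

```python
import math

# Hypothetical ungapped alignment of a short motif (toy data).
ALIGNMENT = ["ACDE", "ACDE", "ACEE", "GCDE"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumed uniform background frequency


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict per alignment column mapping amino acid -> log-odds."""
    pssm = []
    for column in zip(*alignment):
        total = len(column) + pseudocount * len(ALPHABET)
        scores = {}
        for aa in ALPHABET:
            p = (column.count(aa) + pseudocount) / total  # estimate of P(a|i)
            scores[aa] = math.log(p / BACKGROUND)
        pssm.append(scores)
    return pssm


def scan(sequence, pssm):
    """Score every window of len(pssm) residues; return (start, score) pairs."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pssm[i][aa] for i, aa in enumerate(window))
        hits.append((start, score))
    return hits


pssm = build_pssm(ALIGNMENT)
hits = scan("KKACDEKK", pssm)
best_start, best_score = max(hits, key=lambda h: h[1])
print(best_start, round(best_score, 3))
```

In practice a window is reported as a domain match only when its score exceeds a chosen threshold; taking the single best window, as here, is a simplification.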
HMM: Hidden Markov Model • Markov model: a way of describing a process that goes through a series of states. • Each state has a probability of transitioning to the other states. • xi is a random variable of state. • [Figure: chain of states x1, x2, x3, x4.]
HMM: Markov Model • Example: • States xi ∈ {0,1} • [Figure: example assignments of 0 or 1 to the states x1, x2, x3, x4.]
HMM: Markov Model • Transition matrix: $A$, with entries $a_{kl} = P(x_{i+1} = l \mid x_i = k)$.
HMM: Markov Model • State transition example: • States are the nucleotides A, T, G, C.
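To make the nucleotide example concrete, here is a small sketch of a first-order Markov chain over A, T, G, C. The transition probabilities and the uniform start distribution are made-up values for illustration, not numbers from the seminar.

```python
import random

STATES = "ATGC"
# Hypothetical transition matrix: A[k][l] = P(next state = l | current = k).
# Each row sums to 1; the numbers are invented for illustration.
A = {
    "A": {"A": 0.4, "T": 0.2, "G": 0.2, "C": 0.2},
    "T": {"A": 0.2, "T": 0.4, "G": 0.2, "C": 0.2},
    "G": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
    "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}
INITIAL = {s: 0.25 for s in STATES}  # assumed uniform start distribution


def chain_probability(seq):
    """Probability of a state sequence: start term times each transition."""
    p = INITIAL[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= A[prev][cur]
    return p


def sample_chain(length, rng):
    """Generate a sequence by repeatedly drawing the next state from A."""
    state = rng.choice(STATES)
    out = [state]
    for _ in range(length - 1):
        state = rng.choices(STATES, weights=[A[state][s] for s in STATES])[0]
        out.append(state)
    return "".join(out)


print(chain_probability("GCGC"))
print(sample_chain(10, random.Random(0)))
```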
HMM: Hidden Markov Model • Hidden Markov model: • Each state x emits an output y, at a specific probability. • We only know the output (observations). • Thus, the states are hidden. • [Figure: chain of states x1 to x4, each emitting an output y1 to y4.]
HMM: Hidden Markov Model • Example: states xi ∈ {0,1}, outputs yi ∈ {0,1} • [Figure: example assignments of 0 or 1 to the states x1 to x4 and the outputs y1 to y4.]
HMM: Hidden Markov Model • Emission matrix: $B$, with entries $b_k(o) = P(y_i = o \mid x_i = k)$.
HMM: What can we do with it? • Given (A, B), the transition and emission matrices, we can compute: • The probability of a given sequence of states and outputs. • The probability of a given output sequence. • The most likely sequence of states that generated a given output sequence.
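The three computations listed above can be sketched for a two-state HMM with binary outputs. The second uses the forward algorithm and the third the Viterbi algorithm, which are the standard methods for these tasks; the matrices `A` and `B` and the start distribution `PI` are hypothetical values chosen for illustration.

```python
# Hypothetical two-state HMM with binary outputs; all numbers are made up.
PI = [0.5, 0.5]               # assumed initial state distribution
A = [[0.9, 0.1], [0.2, 0.8]]  # A[k][l] = P(next state = l | state = k)
B = [[0.8, 0.2], [0.3, 0.7]]  # B[k][y] = P(output = y | state = k)


def joint_probability(states, outputs):
    """Probability of a specific state path together with its outputs."""
    p = PI[states[0]] * B[states[0]][outputs[0]]
    for i in range(1, len(states)):
        p *= A[states[i - 1]][states[i]] * B[states[i]][outputs[i]]
    return p


def forward_probability(outputs):
    """Probability of the outputs, summed over all state paths (forward)."""
    f = [PI[k] * B[k][outputs[0]] for k in range(2)]
    for y in outputs[1:]:
        f = [sum(f[k] * A[k][l] for k in range(2)) * B[l][y] for l in range(2)]
    return sum(f)


def viterbi(outputs):
    """Most likely state path for the given outputs (Viterbi)."""
    v = [PI[k] * B[k][outputs[0]] for k in range(2)]
    back = []  # back[i][l] = best predecessor of state l at step i + 1
    for y in outputs[1:]:
        ptr = [max(range(2), key=lambda k: v[k] * A[k][l]) for l in range(2)]
        v = [v[ptr[l]] * A[ptr[l]][l] * B[l][y] for l in range(2)]
        back.append(ptr)
    state = max(range(2), key=lambda k: v[k])
    path = [state]
    for ptr in reversed(back):  # trace the pointers back to the first step
        state = ptr[state]
        path.append(state)
    return path[::-1]


observed = [0, 0, 1]
print(joint_probability([0, 0, 0], observed))
print(forward_probability(observed))
print(viterbi(observed))
```

For long sequences the products underflow, so practical implementations work with log probabilities or rescale at each step; that is omitted here to keep the sketch short.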
HMM: What can we do with it? • Learning: • Given state and output sequences calculate the most probable (A, B). • Easy when the states are known. • Otherwise: use a training algorithm.
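When the state path is known, the most probable (A, B) can be obtained by counting transitions and emissions and normalizing each row. The sketch below assumes integer-coded states and outputs and an optional pseudocount; the function name and the toy data are hypothetical. When the states are hidden, the usual training algorithm is Baum-Welch, which is not shown here.

```python
from collections import Counter


def estimate_parameters(states, outputs, n_states=2, n_outputs=2, pseudo=0.0):
    """Estimate (A, B) by counting along a known state path.

    With pseudo=0.0 every state must occur at least once with an outgoing
    transition, otherwise its row total is zero and the division fails.
    """
    trans = Counter(zip(states, states[1:]))  # (k, l) -> transition count
    emit = Counter(zip(states, outputs))      # (k, y) -> emission count
    A, B = [], []
    for k in range(n_states):
        t_total = sum(trans[(k, l)] for l in range(n_states)) + pseudo * n_states
        e_total = sum(emit[(k, y)] for y in range(n_outputs)) + pseudo * n_outputs
        A.append([(trans[(k, l)] + pseudo) / t_total for l in range(n_states)])
        B.append([(emit[(k, y)] + pseudo) / e_total for y in range(n_outputs)])
    return A, B


# Toy training data (made up): a known state path and its outputs.
known_states = [0, 0, 1, 1, 0, 0, 0, 1]
known_outputs = [0, 0, 1, 1, 0, 1, 0, 1]
A_hat, B_hat = estimate_parameters(known_states, known_outputs)
print(A_hat)
print(B_hat)
```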
HMM: Profile HMM • Use HMM to represent sequence families. • A particular type of HMM suited to modeling multiple alignments. • (Assume we have a multiple alignment).
HMM: Trivial profile HMM • We begin with ungapped regions. • Each position corresponds to a state. • Transitions are of probability 1.
HMM: Trivial profile HMM • Let ei(a) be the independent probability of observing amino acid a in position i. • The probability of a new sequence x of length L, according to the model: $P(x \mid M) = \prod_{i=1}^{L} e_i(x_i)$
HMM: Trivial profile HMM • We can score the sequence x by its log-odds ratio: $S = \sum_{i=1}^{L} \log \frac{e_i(x_i)}{q_{x_i}}$ • Where $q_{x_i}$ is the probability of $x_i$ under a random model.
HMM: Trivial profile HMM • Consider the values $\log \frac{e_i(a)}{q_a}$, where $q_a$ is the probability of amino acid a under a random model. • They behave like elements in a score matrix. • The trivial profile HMM is equivalent to a PSSM.
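The equivalence can be checked numerically: taking the log of the ratio between the model probability (the product of the per-position emission probabilities) and the random-model probability gives the same number as summing per-position entries of a score matrix. The emission table `E`, the background `Q`, and the reduced four-letter alphabet are assumptions made to keep the example small.

```python
import math

# Hypothetical emission probabilities e_i(a) for a 3-position model over a
# reduced amino-acid alphabet, with assumed background frequencies q_a.
E = [
    {"A": 0.7, "C": 0.1, "D": 0.1, "E": 0.1},
    {"A": 0.1, "C": 0.7, "D": 0.1, "E": 0.1},
    {"A": 0.25, "C": 0.25, "D": 0.25, "E": 0.25},
]
Q = {"A": 0.25, "C": 0.25, "D": 0.25, "E": 0.25}


def model_probability(x):
    """Product of e_i(x_i) over the positions of the ungapped model."""
    p = 1.0
    for i, aa in enumerate(x):
        p *= E[i][aa]
    return p


def random_probability(x):
    """Product of background frequencies q for each residue of x."""
    p = 1.0
    for aa in x:
        p *= Q[aa]
    return p


# The score matrix: one log-odds entry per position and amino acid.
PSSM = [{aa: math.log(e[aa] / Q[aa]) for aa in e} for e in E]


def pssm_score(x):
    """Sum of the score-matrix entries selected by the sequence x."""
    return sum(PSSM[i][aa] for i, aa in enumerate(x))


x = "ACD"
print(math.log(model_probability(x) / random_probability(x)))
print(pssm_score(x))
```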
HMM: profile HMM • Let’s untrivialize by allowing for gaps: insertions and deletions. • Start off with the PSSM HMM.
HMM: profile HMM • Handling insertions: • Introduce new states Ij, which match insertions after position j. • These states emit according to the random (background) distribution.
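HMM: profile HMM • The score of a gap of length k, where $a$ denotes a transition probability and $M_j$ is the match state at position j: $\log a_{M_j I_j} + \log a_{I_j M_{j+1}} + (k-1) \log a_{I_j I_j}$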
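In the standard profile HMM treatment, which is assumed here, the insert states emit with background probabilities, so their emissions contribute nothing to the log-odds score and an insertion of length k is scored by transitions alone: one step into the insert state, k - 1 self-loops, and one step out. The function and parameter names below are hypothetical, as are the numeric values.

```python
import math


def gap_score(k, a_match_to_insert, a_insert_to_insert, a_insert_to_match):
    """Log score of an insertion of k residues after a match state."""
    if k < 1:
        raise ValueError("gap length k must be at least 1")
    return (math.log(a_match_to_insert)             # open: M_j -> I_j
            + (k - 1) * math.log(a_insert_to_insert)  # extend: I_j -> I_j
            + math.log(a_insert_to_match))          # close: I_j -> M_{j+1}


# Made-up transition probabilities for illustration.
for k in (1, 2, 3):
    print(k, gap_score(k, 0.1, 0.5, 0.5))
```

Each extra residue adds the same amount, the log of the self-loop probability, so the cost grows linearly with gap length, as in an affine gap penalty.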
HMM: profile HMM • Handling deletions: • Introduce silent states Dj. • These states do not emit.
HMM: profile HMM • The complete profile HMM:
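One way to picture the complete model is to list its states and allowed transitions. The sketch below assumes the commonly described architecture in which every match, insert, and delete state at position j can move to the next match state, the insert state at j, or the next delete state; the state naming (`Begin`, `M1`, `I0`, `D1`, `End`) is a convention chosen here, and some implementations restrict these transitions further.

```python
def profile_hmm_transitions(length):
    """Map each state to its allowed successors for `length` match states."""

    def targets(j):
        # Successors of any state at position j: the insert state at j,
        # then the next match state (or End), then the next delete state.
        out = [f"I{j}"]
        out.append(f"M{j + 1}" if j < length else "End")
        if j < length:
            out.append(f"D{j + 1}")
        return out

    transitions = {}
    for j in range(length + 1):
        sources = [f"I{j}"]
        sources.append(f"M{j}" if j > 0 else "Begin")
        if j > 0:
            sources.append(f"D{j}")  # delete states are silent (no emission)
        for s in sources:
            transitions[s] = targets(j)
    return transitions


for state, successors in profile_hmm_transitions(2).items():
    print(state, "->", successors)
```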
Internet resources • Databases of protein families. • Family information and identification. • Considerations: • Type of representation (pattern, PSSM, HMM). • Choice of seed multiple alignment proteins. • Quality control. • Database features (links, annotations, views). • Database Specificity (organism, functions).
Pfam • Protein families database of alignments and HMMs • Uses profile-HMMs to represent families. • For each family in Pfam you can: • Look at multiple alignments • View protein domain architectures • Examine species distribution • Follow links to other databases • View known protein structures
Pfam: Databases • Two databases: • Pfam-A – curated multiple alignments. • Grows slowly. • Quality controlled by experts. • Pfam-B – automatic clustering (ProDom derived). • Complements Pfam-A. • New sequences instantly incorporated. • Unchecked: false positives, etc.
Pfam: Features • Search by: Sequence, keyword, domain, taxonomy. • Browsing by family or genome. • Evolutionary tree
Pfam: Construction • Source of seed alignments: • Pfam-B families. • Published articles. • 'domain hunting' studies. • occasionally using entries from other databases (e.g. MEROPS for peptidases).
PROSITE • Database of protein families. • Matching according to simple patterns or PSSM profiles. • Browsing all proteins of a specific family. • The latest release contains 1696 protein families.
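PROSITE: Features • Comprehensive domain documentation. • All profile matches checked by experts. • Specificity/sensitivity: • Specificity: true-pos / (true-pos + false-pos), the fraction of reported matches that are correct. • Sensitivity: true-pos / (true-pos + false-neg), the fraction of true family members that are matched.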
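A small worked example of the two ratios, reading "all-pos" as all reported matches (true positives plus false positives). The counts are hypothetical and only illustrate the arithmetic.

```python
def specificity(true_pos, false_pos):
    """Fraction of reported matches that are true family members."""
    return true_pos / (true_pos + false_pos)


def sensitivity(true_pos, false_neg):
    """Fraction of true family members that the profile matched."""
    return true_pos / (true_pos + false_neg)


# Hypothetical counts: 90 correct matches, 10 wrong matches, 30 missed members.
tp, fp, fn = 90, 10, 30
print(specificity(tp, fp))  # 90 / 100
print(sensitivity(tp, fn))  # 90 / 120
```

Note that this use of "specificity" corresponds to what is often called precision elsewhere; it is not the true-negative rate.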