Advanced Bioinformatics (MB480/580)

>Sulfolobus virus 1 complete genome 15465 bp.
TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG
TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA
GATATACTGAGAGTCCTACGCGTTAGTTCAGGTCAGACAAGAGAGAACG
AAATCAATTCTGAAACAATTATTTGACCATGGTAAGGAACATGAAGATG
GAGTAATGAATGGTTATGGTTAGGGACTAAAATTATAAACGCCCATAAG
Learn How to:
● Assemble a genome and predict its:
- ORFs
- Promoters
● Annotate genome:
- Predict protein functions
- Model them if possible
- Re-design them if possible
● Predict functions by inference from a large amount of unrelated data
● Predict ncRNAs
● High-throughput methods and data interpretation
● Prepare the data for presentations & publications

What is Bioinformatics? • Choices: • The analysis of biological molecules using computers and statistical techniques • TRUE • The science of developing and utilizing computer databases and algorithms to accelerate and enhance biological research • also TRUE, but suits Computational Biology better

More definitions • The collection, organization and analysis of large amounts of biological data, using networks of computers and databases. • The process of developing tools and processes to quantify and collect data to study biological systems logically. • The science of informatics as applied to biological research.

Yet more definitions • Mark Gerstein’s definition: • Bioinformatics is conceptualizing biology in terms of macromolecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied maths, computer science, and statistics) to understand and organize the information associated with these molecules, on a large-scale. • The manuscript breaking down each part of the above statement will be e-mailed. • http://wiki.bioinformatics.org/Bioinformatics_FAQ

The important stuff • Bioinformatics brings together biological data from genome research with the theory and tools of mathematics, computer science and artificial intelligence. • Bioinformatics includes any application of computer technology and information science to: • Gather, organize, store and handle data. • Analyze, interpret and spread data. • Predict biological structure and function.

What is the information in Molecular Biology? • Central Dogma of Molecular Biology: DNA -> RNA -> Protein -> Phenotype • Molecules • Sequence, Structure, Function • Processes • Mechanism, Specificity, Regulation • Central Paradigm for Bioinformatics: Genomic Sequence Information -> mRNA (level) -> Protein Sequence -> Protein Structure -> Protein Function -> Phenotype • Large Amounts of Information • Standardized • Statistical • Most cellular functions are performed or facilitated by proteins. • Primary biocatalyst • Cofactor transport/storage • Mechanical motion/support • Immune protection • Control of growth/differentiation • Information transfer (mRNA) • Protein synthesis (tRNA/mRNA) • Some catalytic activity • Genetic material • (This slide is courtesy of Mark Gerstein.)

The language of biology is not easy to understand • Just like in spoken language, some words look very different but have the same meaning (car and automobile are synonyms; sequences of distantly related proteins are synonyms) • Some words look or sound very similar yet have different meanings (complement and compliment; eminent and imminent; allude and elude; decent and descent are homophones; GAG and TAG codons are homophones) • In spoken language, we came up with the rules, which is why most of the time we can trace back the origins of words • How do we trace the origins of Nature’s language?

Why is Bioinformatics important? • Supports experimental work • In some cases, it provides complementary data • More importantly, guides experimental work • Predictions based on data • Extension of experiments in new directions • To be believable, Bioinformatics predictions have to be verifiable • Statistical significance, or some other kind of significance score

When did Bioinformatics begin? • 10-15 years ago? • This is a common assumption • Bioinformatics existed even back in the 1970s • It went by a different name • It was underused because the amount of biological sequence data was small

Bioinformatics and Genome Biology
• The revolution driving enormous development in Bioinformatics and experimental sciences came from whole genome sequencing
Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., Bult, C. J., Tomb, J. F., Dougherty, B. A., Merrick, J. M., McKenney, K., Sutton, G., Fitzhugh, W., Fields, C., Gocayne, J. D., Scott, J., Shirley, R., Liu, L. I., Glodek, A., Kelley, J. M., Weidman, J. F., Phillips, C. A., Spriggs, T., Hedblom, E., Cotton, M. D., Utterback, T. R., Hanna, M. C., Nguyen, D. T., Saudek, D. M., Brandon, R. C., Fine, L. D., Fritchman, J. L., Fuhrmann, J. L., Geoghagen, N. S. M., Gnehm, C. L., McDonald, L. A., Small, K. V., Fraser, C. M., Smith, H. O. & Venter, J. C. (1995). "Whole-genome random sequencing and assembly of Haemophilus influenzae Rd." Science 269: 496-512. (Picture adapted from TIGR website, http://www.tigr.org)
• Integrative Data
1995, HI (bacteria): 1.6 Mb & 1600 genes done
1997, yeast: 13 Mb & ~6000 genes
1998, worm: ~100 Mb with 19 K genes
1999: >30 completed genomes!
2003, human: 3 Gb & 50 K genes...
Genome sequences now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than Shakespeare managed in a lifetime, although the latter make better reading. -- G A Petsko, Nature 401: 115-116 (1999)

What can we infer from sequence using Bioinformatics?

Make sense of subtle differences - About 90% of the mouse and human genomes are in syntenic blocks. [Waterston et al., Nature 2002]

What’s in the genome? • If we are so much alike in terms of genome, why are we so different? • Large variation in human population • Similar genes and similar genome organization between human and chimp (or even human and mouse), yet large phenotypic difference • The importance of non-coding parts of our genome became more obvious • Non-coding, regulatory RNAs • Binding sites for regulatory proteins • Other possibilities that are not obvious right now

Complexity of biological information 1. Finding regulatory motifs in DNA 2. Increasing the speed and reliability of functional annotation from sequence

The more we know, the better?

So we have a genome sequence … >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAGCGGAATACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAAGGAGGGATATACTGAGA GTCCTACGCGTTAGTTCAGGTCAGACAAGAGAGAACGTAAACAAATCAATTCTGAAACAATTATTTGACCATGGTAAGGAACATGAAGATGAAGAAGAGTAATGAATGGTTATGGTTAGG GACTAAAATTATAAACGCCCATAAGACTAACGGCTTTGAAAGTGCGATTATTTTCGGGAAACAAGGTACGGGAAAGACTACTTACGCCCTTAAGGTGGCAAAAGAAGTTTACCAGAGATT AGGACATGAACCGGACAAGGCATGGGAACTGGCCCTTGACTCTTTATTCTTTGAGCTTAAAGATGCATTGAGGATAATGAAAATATTCAGGCAAAATGATAGGACAATACCAATAATAAT TTTCGACGATGCTGGGATATGGCTTCAAAAATATTTATGGTATAAGGAAGAGATGATAAAGTTTTACCGTATATATAACATTATTAGGAATATAGTAAGCGGGGTGATCTTCACTACCCC TTCCCCTAACGATATAGCGTTTTATGTGAGGGAAAAGGGGTGGAAGCTGATAATGATAACGAGAAACGGAAGACAACCTGACGGTACGCCAAAGGCAGTAGCTAAAATAGCGGTGAATAA GATAACGATTATAAAAGGAAAAATAACAAATAAGATGAAATGGAGGACAGTAGACGATTATACGGTCAAGCTTCCGGATTGGGTATATAAAGAATATGTGGAAAGAAGAAAGGTTTATGA GGAAAAATTGTTGGAGGAGTTGGATGAGGTTTTAGATAGTGATAACAAAACGGAAAACCCGTCAAACCCATCACTACTAACGAAAATTGACGACGTAACAAGATAGTGATACGGGTAATG TCAGACCCCTTTTAGCCATTCCGCATACTTTTTATATTGCTCTTTCGCTATGCCGAAGAGCGATACGTAATGTTGCGTTAAAACGCGTGTCGGTTTACGCCCTTGAATAAAATCGATAAT ATCTAACGGTACGCTTAGCTCAGCCATCTTAGACGCTACGAATTTGCGGAAGTACTTTATCGCTATAGCGTCCTTATGACGTCGTTCAAAGTCCGCTATTGCCCACTTCGTCACCTCTAC TCTCTTCAGAGGCGTTATGTGGAATACATAGAAGACGCCCTTATATCCCCTAGTCCAACTAAGCGGATAATAACAGACGTCGTTACCGCAAATGTCCCTTTCGGGTTCCTTCAGCACTTT CAGTATTTCGCTCAGCCTAACGCCCGACTCGAGAGCGATACGGTAGATGAAGTAGACGTTTTCGCTATAGTCTTTTGCTAATTGTAACGTCCTTTTTATCTCTTCCAACGTTGGAATGTA GATATCAGCGTTCGCCTTCTTCACCTTTACCGCTTTCAATATTTTATCCGCAAATTCATCATGTATGATATTGCGTGACGCTAAGAAACGTGCAAAGAGTCGGTAAGCCTTCTGTGCGTC TCTCGTCTCTTTATACGGCTTTGATATAGCATTGATGTAGTCCTTTGCAGTTTTTTCGCTTATCCCCCTTTCGTTCATGAGATAGTCGTAGAACGCCTTTATGTTGCCGTCCGTCGCGTA TTGGCGCAAATTGGCAACCAACGCTATTTTACGTCGTTCAGTTCCCTCTTTTCCGCCTCCGGAGCCGGAGGTCCCGGGTTCAAATCCCGGCGGGTCCGCTTGTAGGGGAGTATCCCCTAC 
GACCCCTAATTTCATTTTTAGATATGATTCAACGACGTCAGCTAAAGGACCCACGTAACGCTCTTTTACCTCACCGTTTTCATACTCTAGCTTGTAAACATAATACCGCCCTTTCCTCTC GCGTAAAATATAATCCCCGTATTTATAACGCGTCTTATCTTTCGTCATTTCGCCTCACAGTATTATGGTTGCCAAAACGGGCTTATAAGCATTGGCAACCCGTTAATTTTTGCCGTTAAA ACACGTTGAATTGAAAGAAGACGGCAAAGAATCCACACAGGTAATACTAAAAAAGTAGTATTACTTACATTAGAAGGACTCATTTGTCCACCTTGTATTCTAGCCATGCTATCTCTGCCT TCAGCTCATCTAGCTTCCCCTTTATGTCTGTCAGGTCAAGGGGAACTCCTCTCATTAACCTGAGTTCGTTTTCGATTTTTTCAAGCTCCTTTTCCAACTCCTCTAGTTTCTCTAATTCCT TTAGTCGTTCTTCCAATTTCTTTTCCAATTTCCCCTTTGCGTCATTTATAATTATGCTTACTACCCAAACAATTCCTAAATCAGAAATAATTATTAACTCCTCTGAGTTGAATATCATTT TCCGCCCCTCGCTAAATACTCCTTAAAGCTCTGATAGAACCCCTTCAGACTAACCCGTAAGTCTGTTAGGTTCTTCCAGTATTGTAATGGGATTAAGTAATAGTAGCTTACTGCATCTCT CTCAAATTTGTCCTTCTTAATCTTTCCTTGCTTTTCTAAGTTGAGTATTTGCAGTGCTGAGATACATTTTAACTTGTCCTCAGCATCTGAATAGTGTATAAACCAAACCCTCCCCATAAC CTCATTCTGCTTTGCAACTTCTACTTTAGTGCTTAATATTGCGTAAACGCTTTCGCCGTATCTTTCTTTGCTCTGTTCTTCAGTCCATGAACTTCCCGTAATATCTATCCAAATTAAAGG ATAATATTCTGTCTTAGCCTTAACGTATAAAGTCAAATCGTATTTATCTTGCAGACCGCTATAGTATTGCTCATTTATTACATTAGTTAAAGTCCCCACGCCAGTTGGGCGGATATAAAC ATCAAAGTCTAACAAACCCTTAGCCCGCCACTTTGATAAAGAGATTAAGAGCTTTCCAAAAACTAGGTATTCTCGCCCTAAATAAGTTGAAGGGAGGATATAATCCTCAGCTTGATTACC CCAATACTTTAGCTTAAAATTAGTTTCAGCCATCTCACTCACCATATTGAAACGTGGGCTAGTATGTGAATCAGTACTGATGCTATTGCAAATAACACACTTGCAGTAGCAATTCCTATT ACAATCCATTTACCATAATCCACCTTAGTTTGTTGGTCAATATACTCGTTGATGATCTTTAGTATTTCTGGCTTTAGTTCTGATAATGAAAGGAAGACAGAGGCATAAAGTACTAAGGAG GATGTGAACAGATTATCCGCCTTTTCTGAAAGTTTATAAAGCTCATATCTTGCTCTCTCATAATCTTCATAATTAATAATTTCATCAAACTTTTCTACTTGCTCTTCATATTCTTTCTTC AGAGAGTAAGGAGTTGTCTTTTCAATTACTCCTAATTTTATTAACTTCTTAACAGCTTCCTTAAATCCTTGTTTATTGCTAGCATACGCTAAAGGGTCTTTTCCTTCTTGAGAAGCTCTA TAGATAACTATAGCACCATAAACAATATTTACAATATCGTATGGTAAGGAATACGCACCGATTTGGGCAATATCTTCAACTCTTCTTTGATCCATCTAGTTCACCTCTTTTTGATTTGTT TGTAGGTTTCTATCGCAGTTTTCAGCGATATCGCAAATAGCTTCCCCTTTTCCGTTAGGTATAGCCTCTTTTCGCCTCTTTCTTGACGCTCTTTCACGAAGCCCTCTTGTATTAGGAACT 
TTTTTGCATCATAAAAGGTGGCAGTGGACATGGGAAATTCTGCGTTTACTTTCTTGTATAGGTCATATGTTGCTATTCCTTCATTATCATATAGATAAGCCAATACTATGGCTTCGGGGT AGAAGAATGGTGTACTTTTCATATCCTCCTCACTCCTCAGCCTCTAATAGCTTAACTGCCTCCTCTATCAACTGTCCCATTGTCTTTCCAGTCTTTGCCTTAAGCCTCTGCAGAGTCTCA TATGTTTCCTCACTTATTGAAATGTTAAGCCTTTTGACTATCCTATCTTTCCTCTTCTCTATCATTTAGGTCACCTTGTTTATTGTTATTTGAAATACGTATCCGTCTTCGTCACATCGA AGTATAATTTTGTATCCATTATTAGCATATTCTACGTCAAAGTTCCCACAACAATAATTCGGGTCTTCGGACTCGTTATAGACTTTGCTCCAACCATCTTTTTGTAGTGCCTCTTCTAAG TAGTCTACTCTGATGAAGCCTTCATCATATTCGTTCAGTACCCTAAAGCTTATACTATCAATGCCTAATACGTCTAATAGCTTCAACAGATCGAATATAGGAACTTGCACCATCATTTCA GCTCACCTTAATGAGCTGATATAATTCCGCTTCTATCTTTTGAACTTGGAAGTATGCCTTGCCTAGCTTTTGCTTATCCATATTGCCCGTTATTCTATCAATCTTAATCTCGTGGATTAA TGATAATAGCTCTCTGACATCCTCATCAAGCATTTCAAATAATTCTTTCTCTAAGACTTCTTTACTCATTGTTTTTCACCTTAGCAAACTCATCTAACGTTGTTTGTCTCAGTTCTCTTT TCTTTATCAAATAAAATTCCGAATGTCCCTTCTTATTGTTATTACTGTACTTCATGTCAGTTCACTGCTTTGCCTTTATAAATCCTTGATCCGTTTGCTCAAAATTTGCGGGCTGGGCAT

Gene finding through learning atgccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagaggtcagtgactgatgatcgatgcatgtaa gaggatgcagctgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgctagatcgtaggtagtagctagatgcagggataaacacacggaggcgagtgatcggtaccgggctgaggtgttagctaatgatgagtacgtatgaggcaggatgagtgacccgatgaggctagatgcgatggatggatcgatgatcgatgcatggtgatgcgatgctagatgatgtgtgtcagtaagatg Gene Non-gene gcgatgcggctgctgagagcgtaggcccgagaggagagatgtaggaggaaggtttgatggtagttgtagatgattgtgtagttgtagctgatagtgatgatcgtag Gene?

atg caggtg ggtgag cagatg ggtgag cagttg ggtgag caggcc ggtgag tga

Is this all? Map looks better.

OK, so we’ll predict protein functions …

… maybe do a few experiments …

… and then enjoy glory (maybe money, too). Trevor Douglas and Mark Young (2006) Science 312, 873 - 875.

What can we do with Molecular Biology information? • Different levels of Molecular Biology information • DNA • Coding or non-coding • Meaningful or junk DNA? • RNA • Information transfer (mRNA, tRNA, rRNA) • Regulatory roles • Protein • Structure and function • Modifications

Molecular Biology Information in DNA and RNA • Raw DNA Sequence • 4 bases: AGCT • Coding or Not? • How do we parse the sequence into genes? • Because of introns, ~1 K in a gene could mean ~2 M in genome • Raw RNA Sequence • 4 bases: AGCU • mRNA, tRNA, rRNA • Regulatory RNAs • Secondary structure atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgca gcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatac atggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtg aaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatcca gcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattc ttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaact ggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgca ggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgt gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact gcaactcaaaaaactgtggatggtccatcagctaaagactggcgcggcggccgcggtgca tcacaaaacatcattccatcttcaacaggtgcagcgaaagcagtaggtaaagtattacct gcattaaacggtaaattaactggtatggctttccgtgttccaacgccaaacgtatctgtt gttgatttaacagttaatcttgaaaaaccagcttcttatgatgcaatcaaacaagcaatc aaagatgcagcggaaggtaaaacgttcaatggcgaattaaaaggcgtattaggttacact gaagatgctgttgtttctactgacttcaacggttgtgctttaacttctgtatttgatgca gacgctggtatcgcattaactgattctttcgttaaattggtatc . . . . . . caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaa caacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtgg cgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgctt gctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgg gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact acaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaacc aatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtc ggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaa aaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg
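The question of how to parse raw sequence into genes can be illustrated with a naive open reading frame (ORF) scan. This is only a sketch under simple assumptions: it looks at the three forward reading frames, treats ATG as the only start codon, and ignores introns and the reverse strand, so it is closer to the prokaryotic or viral case than to eukaryotic gene finding. The function name `find_orfs` and the `min_codons` parameter are hypothetical, chosen for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return sorted (start, end) index pairs of ATG..stop ORFs on the forward strand.

    `end` is exclusive and includes the stop codon; `min_codons` also counts the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the frame codon by codon.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ATG in this frame
    return sorted(orfs)

# Toy sequence (not from the slides): one ORF, ATG AAA TTT TAG, in frame 2.
print(find_orfs("CCATGAAATTTTAGGG"))
```

Real gene finders add codon-usage statistics, promoter signals, and splice-site models on top of this kind of scan.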

Molecular Biology information in protein sequences • 20 letter alphabet, more combinatorial variability than DNA (20^n possible sequences for n amino acids) • ACDEFGHIKLMNPQRSTVWY but not BJOUXZ • Strings of ~300 aa in an average protein (in bacteria), ~200 aa in a domain • More than 2 million unique protein sequences (more than 5.6 M of total sequences in the database) • We must be able to “transfer” the function from characterized proteins to uncharacterized ones based on some measure of similarity

Molecular Biology information in macromolecular structures • DNA/RNA/Protein • The majority of all structures are of proteins • Proteins are easier to crystallize and were thought to be more important

Organizing information: Redundancy and multiplicity help … • Fairly different sequences may have the same structure and function • Bad news: If they are very different, how do we find this? • Good news: Once they are found, we learn something more about structure and function • An organism has many similar genes and non-coding RNAs • Redundancy is present for essential genes and/or RNAs (rRNA) • A single gene may have multiple functions • Combining domains in eukaryotes produces large proteins • Genes are grouped into pathways; this is good

… though sometimes the path is difficult • Evolutionary distances do not help establish an initial relationship • Large differences (large evolutionary distances) between proteins are hard to identify and defend on statistical grounds without experiment • Evolutionary distances do help once the relationship is established • If the relationship between distant proteins is established, their conserved parts provide information about what is vital for function • Less conserved parts of proteins are less important for function and act as a scaffold • Given all these difficulties, how do we find hidden similarities?

Some things we can do using just sequence • Sequence (text string) comparisons • Sequence (text string) search • Sequence alignment • Finding short sequences in biological sequences • Significance statistics • Databases • Building, Querying • Learning patterns • Artificial Intelligence and Machine Learning • Mining for patterns and clustering them • Secondary structure prediction • Where are helices, strands and loops in proteins? • Finding trans-membrane helices • Tertiary structure prediction • Fold recognition and structure prediction • Active site identification

How are optimal alignments found? (Should we all pick the one we like?)

Aligning text strings … Which alignment is the best?
Raw Data:
T C A T G
C A T T G
2 matches, 0 gaps:
T C A T G
      | |
C A T T G
3 matches (2 end gaps):
T C A T G .
  | | |
. C A T T G
4 matches, 1 insertion:
T C A - T G
  | |   | |
. C A T T G
4 matches, 1 insertion:
T C A T - G
  | | |   |
. C A T T G
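The ungapped cases above can be checked by sliding one string along the other and counting identical characters at each offset. The sketch below assumes nothing beyond the two strings on the slide; the function name `count_matches` is made up for illustration, and gapped alignments need dynamic programming rather than this simple shift.

```python
def count_matches(x, y, offset):
    """Count identical characters when y is shifted right by `offset` relative to x."""
    matches = 0
    for j, ch in enumerate(y):
        i = j + offset  # position in x that y[j] lines up with
        if 0 <= i < len(x) and x[i] == ch:
            matches += 1
    return matches

x, y = "TCATG", "CATTG"
# Try every shift that leaves at least one overlapping column.
for offset in range(-len(y) + 1, len(x)):
    print(offset, count_matches(x, y, offset))
```

Offset 0 corresponds to the alignment with no gaps, and offset 1 to the alignment with two end gaps.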

Dynamic Programming to the rescue • What to do for Bigger String? SSDSEREEHVKRFRQALDDTGMKVPMATTNLFTHPVFKDGGFTANDRDVRRYALRKTIRNIDLAVELGAETYVAWGG REGAESGGAKDVRDALDRMKEAFDLLGEYVTSQGYDIRFAIEPKPNEPRGDILLPTVGHALAFIERLERPELYGVNP EVGHEQMAGLNFPHGIAQALWAGKLFHIDLNGQNGIKYDQDLRFGAGDLRAAFWLVDLLESAGYSGPRHFDFKPPRT EDFDGVWAS • Needleman-Wunsch (1970) provided the first automatic method • Dynamic Programming to Find Global Alignment • Local Alignment is sometimes better than Global • Needleman-Wunsch Test Data • ABCNYRQCLCRPM and AYCYNRCKCRBP

Make a dot plot (Similarity matrix) Put 1's where characters are identical.
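A dot plot of this kind can be sketched in a few lines. The example below is a minimal illustration, not the original figure: it reuses the short strings TCATG and CATTG from the alignment slide, and the helper names `dot_plot` and `show` are hypothetical.

```python
def dot_plot(x, y):
    """Return a len(x)-by-len(y) matrix with 1 where x[i] == y[j], else 0."""
    return [[1 if a == b else 0 for b in y] for a in x]

def show(x, y):
    """Print the matrix with y across the top and x down the side."""
    matrix = dot_plot(x, y)
    print("  " + " ".join(y))
    for a, row in zip(x, matrix):
        print(a + " " + " ".join("1" if v else "." for v in row))

show("TCATG", "CATTG")
```

Runs of 1's along a diagonal correspond to stretches where the two strings align without gaps.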

Scoring the alignment • The idea is to go through the matrix and find a shortest path to the bottom (it is actually done from the bottom backwards) • Caveat 1: This path also needs to have the highest score • Caveat 2: We have to score the gaps (insertions and deletions), since gaps are introduced by the alignment and do not exist in the proteins themselves

Global alignment by dynamic programming
Sequence X: MONTANA
Sequence Y: MONTANA
Scoring system: 5 for match; -2 for mismatch; -6 for gap
Dynamic programming matrix:
X\Y   -   M   O   N   T   A   N   A
-     0  -6 -12 -18 -24 -30 -36 -42
M    -6   5  -1  -7 -13 -19 -25 -31
O   -12  -1  10   4  -2  -8 -14 -20
N   -18  -7   4  15   9   3  -3  -9
T   -24 -13  -2   9  20  14   8   2
A   -30 -19  -8   3  14  25  19  13
N   -36 -25 -14  -3   8  19  30  24
A   -42 -31 -20  -9   2  13  24  35
Optimum alignment score: 35
X: MONTANA
Y: MONTANA
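The matrix for MONTANA against MONTANA can be reproduced with a short dynamic programming fill. This is a minimal sketch assuming the scoring system on the slide (5 for match, -2 for mismatch, -6 per gap position) and a linear gap penalty that also applies to end gaps; the function name `fill_matrix` is hypothetical.

```python
def fill_matrix(x, y, match=5, mismatch=-2, gap=-6):
    """Fill the global-alignment score matrix; rows follow x, columns follow y."""
    rows, cols = len(x) + 1, len(y) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align x[i-1] with y[j-1]
                score[i - 1][j] + gap,       # x[i-1] against a gap
                score[i][j - 1] + gap,       # y[j-1] against a gap
            )
    return score

matrix = fill_matrix("MONTANA", "MONTANA")
for row in matrix:
    print(" ".join(f"{v:4d}" for v in row))
print("Optimum alignment score:", matrix[-1][-1])
```

The optimum score is the bottom-right cell; each cell only depends on its upper, left, and upper-left neighbours, which is what makes the fill fast.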

What about gaps?
Sequence X: MONTTANA
Sequence Y: MONTANA
Scoring system: 5 for match; -2 for mismatch; -6 for gap
Dynamic programming matrix:
X\Y   -   M   O   N   T   A   N   A
-     0  -6 -12 -18 -24 -30 -36 -42
M    -6   5  -1  -7 -13 -19 -25 -31
O   -12  -1  10   4  -2  -8 -14 -20
N   -18  -7   4  15   9   3  -3  -9
T   -24 -13  -2   9  20  14   8   2
T   -30 -19  -8   3  14  18  12   6
A   -36 -25 -14  -3   8  19  16  17
N   -42 -31 -20  -9   2  13  24  18
A   -48 -37 -26 -15  -4   7  18  29
Optimum alignment score: 29
X: MONTTANA
Y: MON-TANA
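Recovering the gapped alignment itself requires a traceback from the bottom-right cell. The sketch below assumes the same scoring system as the slide and prefers diagonal moves when several moves tie, so it returns one optimal alignment among possibly several; the name `needleman_wunsch` and the tie-breaking order are illustrative choices, not taken from the slides.

```python
def needleman_wunsch(x, y, match=5, mismatch=-2, gap=-6):
    """Return (score, aligned_x, aligned_y) for one optimal global alignment."""
    n, m = len(x), len(y)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner, re-deriving which move produced each cell.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if x[i - 1] == y[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                ax.append(x[i - 1])
                ay.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            ax.append(x[i - 1])  # character of x against a gap in y
            ay.append("-")
            i -= 1
        else:
            ax.append("-")       # gap in x against a character of y
            ay.append(y[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(ax)), "".join(reversed(ay))

best, top, bottom = needleman_wunsch("MONTTANA", "MONTANA")
print("Optimum alignment score:", best)
print("X:", top)
print("Y:", bottom)
```

Placing the gap opposite either T gives the same score, so a different tie-breaking rule could report an equally good alignment.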

Scoring “real-life” alignments
Sequence X: MONTANABOBCATS
Sequence Y: MONTANAGRIZZLIES
Scoring system: 5 for match; -2 for mismatch; -6 for gap
Dynamic programming matrix:
X\Y   -   M   O   N   T   A   N   A   G   R   I   Z   Z   L   I   E   S
-     0  -6 -12 -18 -24 -30 -36 -42 -48 -54 -60 -66 -72 -78 -84 -90 -96
M    -6   5  -1  -7 -13 -19 -25 -31 -37 -43 -49 -55 -61 -67 -73 -79 -85
O   -12  -1  10   4  -2  -8 -14 -20 -26 -32 -38 -44 -50 -56 -62 -68 -74
N   -18  -7   4  15   9   3  -3  -9 -15 -21 -27 -33 -39 -45 -51 -57 -63
T   -24 -13  -2   9  20  14   8   2  -4 -10 -16 -22 -28 -34 -40 -46 -52
A   -30 -19  -8   3  14  25  19  13   7   1  -5 -11 -17 -23 -29 -35 -41
N   -36 -25 -14  -3   8  19  30  24  18  12   6   0  -6 -12 -18 -24 -30
A   -42 -31 -20  -9   2  13  24  35  29  23  17  11   5  -1  -7 -13 -19
B   -48 -37 -26 -15  -4   7  18  29  33  27  21  15   9   3  -3  -9 -15
O   -54 -43 -32 -21 -10   1  12  23  27  31  25  19  13   7   1  -5 -11
B   -60 -49 -38 -27 -16  -5   6  17  21  25  29  23  17  11   5  -1  -7
C   -66 -55 -44 -33 -22 -11   0  11  15  19  23  27  21  15   9   3  -3
A   -72 -61 -50 -39 -28 -17  -6   5   9  13  17  21  25  19  13   7   1
T   -78 -67 -56 -45 -34 -23 -12  -1   3   7  11  15  19  23  17  11   5
S   -84 -73 -62 -51 -40 -29 -18  -7  -3   1   5   9  13  17  21  15  16
Optimum alignment score: 16
X: MONTANA--BOBCATS
Y: MONTANAGRIZZLIES

The scoring depends on our choice of parameters
Sequence X: MONTANABOBCATS
Sequence Y: MONTANAGRIZZLIES
Scoring system: 5 for match (treating B=G as a match); -2 for mismatch; -6 for gap
Optimum alignment score: 23
X: MONTANAB--OBCATS
Y: MONTANAGRIZZLIES
Sequence X: MONTANABOBCATS
Sequence Y: MONTANAGRIZZLIES
Scoring system: 5 for match; -2 for mismatch; -1 for gap
Optimum alignment score: 26
X: MONTANA--------BOBCATS
Y: MONTANAGRIZZLIE------S
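The effect of the gap penalty can be checked with a score-only version of the same dynamic programming recurrence. This sketch keeps just one previous row in memory, which is enough when only the optimum score is needed; the function name `global_score` is hypothetical, and the B=G variant from the slide is not modelled here.

```python
def global_score(x, y, match=5, mismatch=-2, gap=-6):
    """Global alignment score with a linear gap penalty, using one row of memory."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i, a in enumerate(x, start=1):
        curr = [i * gap]  # first column: prefix of x against gaps only
        for j, b in enumerate(y, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

x, y = "MONTANABOBCATS", "MONTANAGRIZZLIES"
for gap in (-6, -1):
    print("gap penalty", gap, "-> optimum score", global_score(x, y, gap=gap))
```

With a gap penalty of -1, two gaps cost the same as one mismatch, so alignments with long gaps become as good as alignments with mismatched columns.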

How do we choose good scoring parameters? • A simple scoring scheme considers only sequence identity • More realistic scoring schemes consider sequence similarity, which is taken from substitution matrices • We measure the frequency of residue substitutions and normalize it by residue frequency in the database (LOG2 an/ad) • A zero in the substitution matrix means that the substitution occurs about as often as expected by chance • A score less than zero means that the substitution occurs less often than expected by chance • There is no universally good matrix
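The log-odds idea behind substitution scores can be sketched numerically. The example below is a simplification under stated assumptions: the frequencies are invented for illustration, the expected frequency is taken as the product of the two residue frequencies, and the scale factor of 2 (half-bit units) is an assumption rather than something stated on the slide. The function name `log_odds` is hypothetical.

```python
import math

def log_odds(observed_pair_freq, freq_a, freq_b, scale=2.0):
    """Rounded log-odds score: scale * log2(observed / expected by chance)."""
    expected = freq_a * freq_b  # chance of seeing the pair if residues were independent
    return round(scale * math.log2(observed_pair_freq / expected))

# Hypothetical residue frequencies of 0.25 each, so the chance expectation is 0.0625.
print(log_odds(0.125, 0.25, 0.25))      # pair seen more often than chance: positive
print(log_odds(0.0625, 0.25, 0.25))     # pair seen exactly as often as chance: zero
print(log_odds(0.015625, 0.25, 0.25))   # pair seen less often than chance: negative
```

Positive scores therefore reward substitutions that are enriched in real alignments, and negative scores penalize substitutions that are depleted.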

BLOSUM62 substitution matrix
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  8 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  7 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  6 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 10  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  6 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4

What is the limit of substitution matrices?

Why did substitution matrices fail? • The proteins in question are very distantly related and their substitution patterns are not properly rewarded by general matrices • Substitution matrices do not capture all families equally well because they are meant to be general • Can we build protein family-specific substitution matrices? • Yes, these are known as protein family profiles

How do we build protein family-specific matrix? Search protein database using BLOSUM62 matrix Build protein-family specific matrix (profile) and search protein database again ???

Detecting distant relationships using profiles (PSI-BLAST)

Position-specific scoring matrix (PSSM)
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
T -1  0  1  2 -2  0  1 -1  0 -2 -2  0 -1 -2 -1  0  3 -1 -1 -1
M -1 -2 -3 -3 -1 -2 -3 -3 -2  1  1 -2  7  0 -3 -3 -2 -1 -1  0
D -1  2  0  2 -3  3  1 -1  0 -3 -3  1 -2 -3 -1  0  0 -2 -2 -2
V -1 -4 -4 -5 -1 -4 -4 -5 -3  4  1 -4  0 -1 -4 -4 -1 -2 -2  5
I -1 -4 -5 -6 -1 -4 -5 -6 -4  5  1 -5  0 -1 -5 -5 -2 -2 -2  4
S  0 -1  0 -1 -1 -1 -1 -1 -1 -3 -3 -1 -2 -3 -2  4  3 -3 -2 -2
F -2 -4 -4 -5 -2 -4 -5 -5 -2  1  2 -4  1  6 -4 -4 -3  2  2  0
K -1  3 -1 -1 -4  1  0 -2  0 -3 -3  5 -2 -4 -2 -1 -1 -3 -3 -3
L -2 -4 -5 -6 -1 -4 -5 -6 -3  3  4 -4  2  1 -4 -5 -3 -1 -1  1
P -1 -3 -3 -2 -4 -3 -2 -3 -3 -4 -4 -2 -4 -4  7 -2 -2 -4 -4 -3
P -1 -1 -1  2 -4 -1  0 -1 -1 -4 -4 -1 -4 -4  6 -1 -1 -3 -3 -3
E  0  0  0  2 -4  1  4 -2 -1 -3 -3  1 -2 -4 -1 -1 -1 -3 -3 -2
L -2 -3 -4 -5 -2 -3 -4 -5 -2  2  4 -4  6  1 -4 -4 -2 -1 -1  1
N -1  0  3  0 -2  0  2 -1  0 -2  0  1 -1 -2 -1  0  0 -1 -1 -2
A  2  1 -1 -1 -1  0  0 -2  0  0  0  0  0 -1 -1 -1  0  0  0  0
K -1  1 -1 -1 -3  1  0 -2  0 -2  0  4 -1 -2 -2 -1 -1 -2 -1 -1
L -2 -4 -5 -5 -2 -4 -5 -5 -3  1  5 -4  1  1 -4 -5 -3 -1 -1  1
E -1  0  2  4 -4  0  3 -1  0 -5 -4  0 -4 -4 -1  0 -1 -3 -3 -4
S  1  1  0  0 -3  3  1 -1  0 -2 -2  1 -2 -2 -1  1  0 -2 -1 -2
V -1 -1 -2 -3 -1 -2 -3 -3  0  2  1 -2  1  2 -2 -2 -1  1  3  3
A  5 -3 -3 -2  0 -2 -2  0 -2 -2 -2 -2 -2 -3 -2  0 -1 -3 -3 -1
L -1  0 -1 -2 -1 -1 -1 -2  0  2  1  0  1  0 -2  0  0  1  2  0
K -1  2  0  0 -3  1  2 -1  0 -3 -3  3 -2 -4 -1  1  0 -3 -2 -2
E -1  1  0  0 -2  2  2 -2  3 -2  0  1 -1 -2 -1 -1  0 -1 -1 -1
K -1  1  1  0 -3  0  0  2  0 -4 -3  3 -3 -3 -1  0 -1 -2 -2 -3
K -1  0 -1 -1 -2  0  0 -3  0  0 -1  3  3 -1 -2 -1  0 -1  0  1
S -1 -1  2  0 -2  0  0 -1  0 -3 -3  0 -2 -3 -1  3  2 -2 -2 -3
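Scoring a sequence against a PSSM means looking up, at each position, the score of the residue actually present and adding the values up. The sketch below copies only the first three rows of the matrix above (positions T, M, D) and assumes an ungapped comparison; real profile searches such as PSI-BLAST also handle gaps and statistics. The names `pssm_score` and `best_window` are hypothetical, and `best_window` assumes the sequence is at least as long as the profile.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # column order of the matrix

# First three rows of the PSSM (profile positions T, M, D).
PSSM_ROWS = [
    [-1, 0, 1, 2, -2, 0, 1, -1, 0, -2, -2, 0, -1, -2, -1, 0, 3, -1, -1, -1],
    [-1, -2, -3, -3, -1, -2, -3, -3, -2, 1, 1, -2, 7, 0, -3, -3, -2, -1, -1, 0],
    [-1, 2, 0, 2, -3, 3, 1, -1, 0, -3, -3, 1, -2, -3, -1, 0, 0, -2, -2, -2],
]

def pssm_score(segment, pssm=PSSM_ROWS):
    """Sum the position-specific scores of `segment`, one residue per profile row."""
    if len(segment) != len(pssm):
        raise ValueError("segment length must equal the number of profile positions")
    return sum(row[AMINO_ACIDS.index(aa)] for row, aa in zip(pssm, segment))

def best_window(sequence, pssm=PSSM_ROWS):
    """Slide the profile along `sequence`; return (best score, start index)."""
    width = len(pssm)
    scored = [(pssm_score(sequence[i:i + width], pssm), i)
              for i in range(len(sequence) - width + 1)]
    return max(scored)

print(pssm_score("TMD"))
print(pssm_score("TMQ"))
print(best_window("GGTMDGG"))
```

Unlike a general substitution matrix, the same residue can score differently at different positions, which is what lets a profile reward the substitution pattern of one particular family.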

Combining sequence similarity with SS information

Can we quickly scan for common protein families? • YES, many databases are available • Instead of comparing our query to other sequences, we compare it to a database of profiles (also called Hidden Markov Models or HMMs) • Profiles (and HMMs) capture the average preference for residues at all positions; they are probabilistic representations of protein families • Try these databases: • PFAM (http://pfam.wustl.edu) • SMART (http://smart.embl-heidelberg.de)

Profile HMMs have other uses • Profile HMMs represent a phylogenetic footprint of a given protein family • Also used for secondary structure prediction • Predictions of trans-membrane proteins • Prediction of protein disorder • Most of these predictors are based on machine-learning algorithms that are trained on known data and can extract subtle patterns

Questions? • Mensur Dlakic • Department of Microbiology • 111 Lewis Hall • Tel: 994-6576 • mdlakic@montana.edu • Micro office: 109 Lewis Hall
