VL Algorithmische BioInformatik (19710), WS2015/2016, Week 8 - Monday

Tim Conrad, AG Medical Bioinformatics, Institut für Mathematik & Informatik, Freie Universität Berlin

Lecture topics
Part 1: Background Basics (4)
  1. The Nucleic Acid World
  2. Protein Structure
  3. Dealing with Databases
Part 2: Sequence Alignments (3)
  4. Producing and Analyzing Sequence Alignments
  5. Pairwise Sequence Alignment and Database Searching
  6. Patterns, Profiles, and Multiple Alignments
Part 3: Evolutionary Processes (3)
  7. Recovering Evolutionary History
  8. Building Phylogenetic Trees
Part 4: Genome Characteristics (4)
  9. Revealing Genome Features
  10. Gene Detection and Genome Annotation
Part 5: Secondary Structures (4)
  11. Obtaining Secondary Structure from Sequence
  12. Predicting Secondary Structures
Part 6: Tertiary Structures (4)
  13. Modeling Protein Structure
  14. Analyzing Structure-Function Relationships
Part 7: Cells and Organisms (8)
  15. Proteome and Gene Expression Analysis
  16. Clustering Methods and Statistics
  17. Systems Biology

MUMmer: Algorithm
• Read two genomes
• Find the Maximal Unique Matches (MUMs) of the genomes using a suffix tree
• Sort and order the MUMs using the longest increasing subsequence (LIS)
• Close the gaps in the alignment (SNPs, mutation regions, repeats, tandem repeats)
• Output the alignment
  • MUMs
  • regions that do not match exactly

Suffix tree
• Purpose: find matching substrings of a string quickly
• Definition: a compact representation of all suffixes of an input string S
• Can be built in O(m) time and space, where m = |S|
• Searching for a substring X takes O(n) time, where n = |X|

Suffix Trees
• Example: TORONTO$
• ‘$’ is the terminating character
[Figure: suffix tree for TORONTO$ with labeled edges and numbered leaves]

Suffix Trees
• Example: TORONTO$
• Searching for ‘ONT’
[Figure: suffix tree for TORONTO$ with the search path for ‘ONT’ highlighted]
• Result: ‘ONT’ occurs at position 3 in S
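The search for ‘ONT’ can be reproduced with a minimal sketch. It assumes an uncompressed suffix trie instead of a real suffix tree, so construction takes O(m^2) time and space rather than O(m); the search itself still walks one character per step. The names `Node`, `build_suffix_trie` and `find_occurrences` are made up for this illustration.

```python
class Node:
    """One node of a naive (uncompressed) suffix trie."""
    def __init__(self):
        self.children = {}   # character -> child Node
        self.start = None    # start position of the suffix, set at leaves only


def build_suffix_trie(text):
    """Insert every suffix of text; text should end with a unique '$'."""
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
        node.start = start
    return root


def find_occurrences(root, pattern):
    """Walk down along pattern, then collect all leaves below that point."""
    node = root
    for ch in pattern:
        if ch not in node.children:
            return []
        node = node.children[ch]
    starts, stack = [], [node]
    while stack:
        current = stack.pop()
        if current.start is not None:
            starts.append(current.start)
        stack.extend(current.children.values())
    return sorted(starts)


trie = build_suffix_trie("TORONTO$")
print(find_occurrences(trie, "ONT"))  # [3]
print(find_occurrences(trie, "TO"))   # [0, 5]
```

Positions are 0-based, which matches the slide's result of position 3 for ‘ONT’.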

Maximal Unique Match
• Sequences in genomes A and B that:
  • occur exactly once in A and in B
  • are not contained in any larger such sequence
Genome A: tcgatcGACGATCGC…AGCATAAcgact
Genome B: gcattaGACGATCGC…AGCATAAtcca
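The definition can be checked directly on toy strings. The following is a brute-force sketch, not the suffix-tree method used by MUMmer: it enumerates exact matches that cannot be extended to the left or right and keeps those whose string occurs exactly once in each input. The function names and the toy sequences are assumptions made for illustration.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    count, pos = 0, text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count


def find_mums(a, b, min_len=1):
    """Return MUMs as (start_in_a, start_in_b, length); brute force, small inputs only."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip matches that could be extended to the left (not maximal).
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            word = a[i:i + k]
            unique = count_occurrences(a, word) == 1 and count_occurrences(b, word) == 1
            if k >= min_len and unique:
                mums.append((i, j, k))
    return mums


print(find_mums("AAGCTTCC", "GGGCTTAA", min_len=2))  # [(0, 6, 2), (2, 2, 4)]
```

Raising `min_len` to 3 leaves only the match `GCTT`, which mirrors the minimum MUM length used in practice to suppress short chance matches.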

Finding, sorting MUMs
• MUM: internal node whose subtree contains exactly one leaf from each genome
• With a single scan of the suffix tree, find all MUMs
• Sort the MUMs by their position in genome A

Finding MUMs from a suffix tree

Matching MUMs
• Order of the MUMs in A: 1 2 3 4 5 6 7
• Order of the MUMs in B: 1 3 2 6 4 5 7
• Select the longest consistent set of MUMs occurring in the same order in A and B: 1 2 4 5 7

Choosing MUMs
• A configuration can be represented uniquely as a permutation:
  • P = {1, 2, 3, 4, 6, 7, 5}
  • LIS(P) = {1, 2, 3, 4, 6, 7}
• Determining the optimal sequence of MUMs reduces to finding the LIS of P

IS Definition
• Increasing subsequence: values (strictly) increase from left to right
• Sequence P = {4, 2, 1, 5, 8, 6, 9, 10}
• Examples of two increasing subsequences: {4, 5, 9} or {2, 5, 6, 9, 10}
• The longest increasing subsequence (LIS) can be found with a greedy algorithm (find a minimum cover)
• Cover of P: a set of decreasing subsequences of P that together contain all numbers of P

Matching MUMs
• Sorting and LIS take O(K log K), which is O(N)
  • K: the number of MUMs; N: the genome length
  • K << N / log N
• Actually two steps: greedily finding a minimum cover in O(K log K), then reading the LIS off the cover in O(K)
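The two steps can be sketched as follows, assuming the MUMs are already sorted by position in A and labeled by their rank in B. Each value is placed greedily on the leftmost pile whose top is not smaller (each pile is a decreasing subsequence, found by binary search), and back-pointers allow the LIS to be read off in linear time. The function name is hypothetical.

```python
from bisect import bisect_left


def longest_increasing_subsequence(seq):
    """Greedy cover with binary search, then traceback; O(k log k) overall."""
    tops = []                  # tops[p] = current top value of pile p
    top_index = []             # index in seq of that top value
    prev = [None] * len(seq)   # back-pointer to an element on the previous pile
    for i, value in enumerate(seq):
        pile = bisect_left(tops, value)   # leftmost pile with top >= value
        if pile == len(tops):
            tops.append(value)
            top_index.append(i)
        else:
            tops[pile] = value
            top_index[pile] = i
        prev[i] = top_index[pile - 1] if pile > 0 else None
    # Start at the top of the last pile and follow the back-pointers.
    result = []
    i = top_index[-1] if top_index else None
    while i is not None:
        result.append(seq[i])
        i = prev[i]
    return result[::-1]


print(longest_increasing_subsequence([1, 3, 2, 6, 4, 5, 7]))  # [1, 2, 4, 5, 7]
print(longest_increasing_subsequence([1, 2, 3, 4, 6, 7, 5]))  # [1, 2, 3, 4, 6, 7]
```

Both calls reproduce the selections shown on the slides; the number of piles equals the length of the LIS.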

Closing the Gaps
• After the global MUM alignment is found, local gaps need to be closed
• Gap: interruption in the MUM alignment
• Types of gaps:
  • SNP (Single Nucleotide Polymorphism)
  • Insertion
  • Highly polymorphic region
  • Repeat
• How?
  • Long gaps: repeat the procedure using a shorter minimum length for MUMs
  • Short gaps: Smith-Waterman alignments
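For short gaps, a Smith-Waterman alignment of the two gap sequences can be sketched as below. This is a textbook version with a linear gap penalty; the scoring values (match 2, mismatch -1, gap -2) are assumptions chosen for illustration and are not the parameters used by MUMmer.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (best score, aligned a, aligned b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        score = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + score:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# A gap region that differs by a single SNP (T vs. C at the fourth base).
print(smith_waterman("GATTACA", "GATCACA"))  # (11, 'GATTACA', 'GATCACA')
```

The quadratic cost of the dynamic programming table is the reason this step is reserved for short gaps.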

Closing the Gaps

Some results from the original MUMmer paper, “Alignment of whole genomes”, Delcher et al.
[Figure panels: FASTA, 25mers, MUMmer]
… 1000 bp segments. Pairs of sequences that were at least 50% identical over 80% of the match appear as points in the plot.
Figure 7. Alignment of M. genitalium and M. pneumoniae using FASTA (top), 25mers (middle) and MUMs (bottom). In all three plots, a point indicates a ‘match’ between the genomes. In the FASTA plot a point corresponds to similar genes. In the 25mer plot, each point indicates a 25-base sequence that occurs exactly once in each genome. In the MUM plot, points correspond to MUMs as defined in the main text.

Some results
• Align two closely related bacteria, M. genitalium (580 kbp) and M. pneumoniae (816 kbp)
  • Time: 6.5 s suffix tree; 0.02 s finding LIS; 116 s alignments
  • Longest MUM 281 bp; 16 MUMs > 100 bp; <50% identical
• Align two highly homologous strains of M. tuberculosis, 4.4 million bp
  • Time: 5 s suffix tree construction; 45 s sorting MUMs; 5 s Smith-Waterman alignments
  • Longest MUM 24,563 bp; 249 MUMs > 5000 bp; >90% identical

Some results
• Alignment of two syntenic sequences from human chromosome 12 and mouse chromosome 6 (225 kbp)
  • Time: 29 s in total, 1.6 s for the suffix tree
  • Longest MUM 117 bp; 10 MUMs > 50 bp

MUMmer 2
• Problems with MUMmer 1
  • Aligns only DNA sequences
  • Needs lots of memory
  • Cannot align incomplete genomes
• Solution: MUMmer 2
  • 3x faster than MUMmer 1
  • Requires 1/3 of the space
  • Can align protein sequences and incomplete genomes
  • Parallel alignment
• Delcher et al., Nucleic Acids Research (2002)
• http://www.tigr.org/software/mummer/MUMmer2.pdf

MUMmer 2
• Alternative way to find the initial exact matches
  • Identify where the query sequence would branch off from the tree, to find all matches
• Unique match
  • Wherever a branch occurs at a tree position with just a single leaf beneath it
• Maximal match
  • Use suffix links to find the next match (extended match)
  • By checking the character immediately preceding the start of this match, we can determine whether it is a maximal match
• Find all maximal matches: time proportional to the length of the query

Suffix Trees
• MUMmer wants to find all maximal unique matches for all suffixes:
  • E.g., for query ACCGTGCGTC, we want:
    • ACCGTGCGTC
    • CCGTGCGTC
    • CGTGCGTC
    • GTGCGTC
    • …
  • Up to some reasonable limit…
• Idea: don’t go back to the root of the tree each time…
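The idea of not restarting at the root can be illustrated without a tree. If the query suffix starting at position i matches the reference for L characters, the suffix starting at i+1 is already known to match for at least L-1 characters, so matching resumes from there. The sketch below assumes Python's substring test `in` as a stand-in for walking the suffix tree, so it shows the bookkeeping only and not the linear running time; the function name is hypothetical.

```python
def matching_statistics(reference, query):
    """For each query position, length of the longest match starting there."""
    lengths = []
    length = 0
    for i in range(len(query)):
        # Moving to the next suffix drops one leading character; the rest of
        # the previous match is still valid (this is what a suffix link gives).
        length = max(length - 1, 0)
        while i + length < len(query) and query[i:i + length + 1] in reference:
            length += 1
        lengths.append(length)
    return lengths


print(matching_statistics("ACGTACGT", "CGTTAC"))  # [3, 2, 1, 3, 2, 1]
```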

Suffix Trees
• Suffix Links
  • All internal, non-root nodes have a suffix link to another node
  • If x is a single character and α is a (possibly empty) string, then the node v whose path-label (the string spelled on the path from the root to v) is xα has a suffix link to the node v’ whose path-label is α

Suffix Links
The dotted lines indicate the suffix links. If you start at the blue node and follow the suffix links from there (from blue, to green, to first gray, to second gray), and look at the strings leading from the root to each node, you will see that each string is the previous one with its first character removed.
Source: http://stackoverflow.com/questions/10168097/how-and-when-to-create-a-suffix-link-in-suffix-tree
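Following a chain of suffix links can be tried out on a small example. The sketch below assumes an uncompressed suffix trie, in which every non-root node receives a link (in a compressed suffix tree only internal nodes carry them). It uses the fact that the link of the child reached by character c from node u is the child reached by c from the link of u. Class and function names are made up for this illustration.

```python
from collections import deque


class TrieNode:
    def __init__(self, label=""):
        self.children = {}   # character -> child TrieNode
        self.label = label   # path-label from the root, stored for display only
        self.link = None     # suffix link


def build_trie_with_links(text):
    root = TrieNode()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            if ch not in node.children:
                node.children[ch] = TrieNode(node.label + ch)
            node = node.children[ch]
    # Breadth-first: a node's link is set before its children are processed.
    queue = deque()
    for child in root.children.values():
        child.link = root
        queue.append(child)
    while queue:
        node = queue.popleft()
        for ch, child in node.children.items():
            child.link = node.link.children[ch]
            queue.append(child)
    return root


def follow_links(root, word):
    """Path-labels seen when following suffix links from the node for word."""
    node = root
    for ch in word:
        node = node.children[ch]
    labels = []
    while node is not None:
        labels.append(node.label)
        node = node.link
    return labels


print(follow_links(build_trie_with_links("TORONTO$"), "ONT"))  # ['ONT', 'NT', 'T', '']
```

Each step removes the first character of the path-label, ending at the root with the empty string.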

Streaming algorithm - unique match
The match is unique, because there is a single leaf below this position in the tree.

Streaming algorithm - maximal match
• Suffix links are used to find the extended match

MUMmer 2
• Improvements
  • Uses only 20 bytes per bp (MUMmer 1: 38 bytes), Kurtz (1999)
  • Builds the suffix tree for the shorter sequence
  • Finds MUMs by streaming the second sequence against the suffix tree, Chang-Lawler (1994)
  • Clusters the matches

New in MUMmer 2: Clustering step
• Purpose: align an unfinished assembly that still needs rearrangement
• Cluster MUMs
  • After the matches are identified, the interval length between matches is checked
  • If the interval length between matches is less than a user-defined gap length, the matches are joined into a cluster
• Find the Longest Increasing Subsequence
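The clustering rule can be sketched as a single pass over the sorted matches. This is a simplified illustration, not MUMmer's actual clustering code: matches are assumed to be tuples (start in A, start in B, length), and a match joins the current cluster when its distance to the previous match is below `max_gap` in both sequences. The function name, the tuple layout, and the choice to test both sequences are assumptions.

```python
def cluster_matches(matches, max_gap):
    """Group matches whose gaps to the previous match are below max_gap."""
    clusters = []
    for match in sorted(matches):
        if clusters:
            prev_a, prev_b, prev_len = clusters[-1][-1]
            gap_a = match[0] - (prev_a + prev_len)
            gap_b = match[1] - (prev_b + prev_len)
            if gap_a < max_gap and abs(gap_b) < max_gap:
                clusters[-1].append(match)
                continue
        clusters.append([match])   # gap too large: start a new cluster
    return clusters


example = [(0, 0, 10), (15, 14, 8), (500, 40, 12), (515, 55, 9)]
for cluster in cluster_matches(example, max_gap=20):
    print(cluster)
# [(0, 0, 10), (15, 14, 8)]
# [(500, 40, 12), (515, 55, 9)]
```

The second cluster lies far away in A but close by in B, which is the kind of rearrangement the clustering step is meant to tolerate.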

NUCmer (NUCleotide MUMmer)
• For the finishing phase of assembly
• Multiple-contig alignment program
• Uses MUMmer 2
• Can
  • Compare assemblies at different stages of a project
  • Compare unfinished genomes to a closely related genome (speeds up the finishing step)
  • Compare the outputs of two different assembly programs

NUCmer
• Inputs: two multi-fasta files
• Output: alignment of every contig in the first file to every sequence in the second file
• Algorithm
  • Create a map of all contig positions within each file
  • Concatenate the contigs in each file
  • Run MUMmer to find MUMs
  • Map the matches back to the separate contigs
  • Cluster MUMs
  • (Modified) Smith-Waterman DP alignment to align the sequence between MUMs
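The position map, the concatenation, and the map-back step can be sketched as follows. The separator character `#` (used so that no match can span two contigs) and the function names are assumptions for illustration; the sketch does not reproduce NUCmer's file handling.

```python
from bisect import bisect_right


def concatenate_contigs(contigs, separator="#"):
    """contigs: list of (name, sequence). Returns (joined sequence, start offsets)."""
    offsets, parts, position = [], [], 0
    for _name, sequence in contigs:
        offsets.append(position)
        parts.append(sequence)
        position += len(sequence) + len(separator)
    return separator.join(parts), offsets


def map_back(position, contigs, offsets):
    """Translate a position in the joined sequence to (contig name, local position)."""
    index = bisect_right(offsets, position) - 1   # last contig starting at or before position
    name, sequence = contigs[index]
    local = position - offsets[index]
    if local >= len(sequence):
        raise ValueError("position falls on a separator")
    return name, local


contigs = [("c1", "ACGT"), ("c2", "GG"), ("c3", "TTTA")]
joined, offsets = concatenate_contigs(contigs)
print(joined)                           # ACGT#GG#TTTA
print(map_back(6, contigs, offsets))    # ('c2', 1)
```

Binary search over the sorted offsets keeps each map-back lookup logarithmic in the number of contigs.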

PROmer
• Protein-based alignment program
• Input: two multi-fasta files
• Technique:
  • Translate DNA into amino acids (AA) in all 6 reading frames
  • Map each protein to its DNA sequence
  • Concatenate all potential proteins
  • Run MUMmer, cluster MUMs based on DNA coordinates
  • Examine a series of consecutive, consistent matches
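The first step, translation in all six reading frames, can be sketched with the standard genetic code. The frame labels (`+1` to `-3`), the use of `X` for codons containing unknown bases, and the function names are assumptions for illustration; PROmer's own conventions may differ.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return a dict mapping frame label ('+1'..'+3', '-1'..'-3') to protein string."""
    dna = dna.upper()
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
print(frames["+1"])  # MA*
print(frames["-1"])  # LGH
```

Because each amino acid position corresponds to three DNA positions plus a frame offset, the matches found on the protein level can be mapped back to DNA coordinates for clustering.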

Campylobacter PROmer analysis
Fouts et al. (PLoS Biol. 2005), “Major structural differences and novel potential virulence mechanisms from the genomes of multiple campylobacter species”
• One genome is used as the x-axis for all four pairwise comparisons
• X-shape characteristic of collinearity interrupted by inversions around the origin or terminus of replication
• Loss of collinearity in more distant comparisons

Some results
• Align P. yoelii (5x coverage) and P. falciparum (8x coverage), size 25 Mb
  • PROmer: time < 1 h
  • Blast: time ~ weeks
• >70% of human chromosome 14 is duplication of part of chromosome 2
• Align E. coli (4.7 Mb) and V. cholerae (3 Mb) on a 1 GHz desktop computer
  • MUMmer 1: 74 s, 293 MB
  • MUMmer 2: 27 s, 100 MB

Improvements MUMmer 3
• Optimized suffix-tree library
  • Faster and requires 25% less memory (see Kurtz et al.)
• Non-unique maximal matches
• GUI
• Now open source
• Align human vs. human genome
  • Computer: Sun SPARC, Solaris OS, 64 GB, 950 MHz
  • Size: 2,839 Mbp
  • Time: suffix tree 4.7 h, 4 GB memory; query 101.5 h; total 4.5 days

Benchmarks MUMmer 2.1 vs. 3.0 (MUMmer 3.0, page 4)

Human Gut metagenome
Percent Identity Plot (PIP) of random shotgun reads to a complete Bifidobacterium genome and a good-quality draft Methanobrevibacter genome.
Gill et al. (Science, 2006), “Metagenomic analysis of the human distal gut microbiome”
Bifidobacteria are anaerobic bacteria. They are ubiquitous, endosymbiotic inhabitants of the gastrointestinal tract, vagina and mouth (B. dentium) of mammals, including humans. Some bifidobacteria are used as probiotics.

Mauve Multiple Genome Aligner
• Able to identify and align collinear regions of multiple genomes even in the presence of rearrangements
  • Find and extend seed matches
  • Group into locally collinear blocks
  • Align intervening regions
• Darling et al. Genome Res. 2004 Jul;14(7):1394-403.

Progressive Mauve alignment of 12 E. coli genomes


The next sessions

Today: book, sections 11.1-11.3

Overview: Proteins 101

Protein Functions
• How do proteins do so much?
  • Proteins FOLD spontaneously
  • Assume a characteristic 3D SHAPE
  • Shape depends on the particular amino acid sequence
  • Shape gives SPECIFIC function

What is protein structure?

Proteins are linear polymers that fold up by themselves…mostly.

Secondary Structure
[Figure sources: http://www.abcte.org/files/previews/biology/ and http://bioweb.wku.edu/courses/biol22000/3AAprotein/images/]

What are proteins made of?

The parts of a protein
[Figure: amino acid chain with terminal H and OH groups]
• “Backbone”: N, C, C, N, C, C…
• R: “side chain”
