
Introduction to Bioinformatics: Lecture II. From Molecular Processes to String Matching

Jarek Meller, Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC.


Presentation Transcript


  1. Introduction to Bioinformatics: Lecture II. From Molecular Processes to String Matching. Jarek Meller, Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC. JM - http://folding.chmcc.org

  2. Outline of the lecture
     • Sequence approximation in computational molecular biology: the premise and the limits
     • Getting ready for analysis of exact string matching and sequence alignment algorithms: some definitions and interplay with biology
     • The notion of string/sequence similarity
     • Substitution matrices for sequence alignment

  3. Before we start: literature watch. A draft of the rat genome has been published (RGSPC, Nature 428). Figure labels: genome sizes of rat (R) 2.75 Gb, mouse (M) 2.5 Gb and human (H) 2.9 Gb; within R, 0.7 Gb is unique and 1.1 Gb is common with both H and M. What are the first conclusions from the comparison with other mammalian genomes? What approaches and tools have been used to perform this comparative analysis?

  4. Biological Polymers and Central Dogma
     Bio-Polymer (alphabet): Process (algorithm)
     • DNA (A, T, G, C): replication, transcription
     • mRNA (U, A, C, G): splicing, translation
     • Proteins (20 a.a.): folding, interactions
     Lipids, polysaccharides, membranes, signal transduction, environmental signals, etc.
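
Treating transcription and translation as string operations over the alphabets above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names are invented here, the input is assumed to be the coding (sense) DNA strand, the standard genetic code is assumed, and splicing is ignored.

```python
from itertools import product

# Standard genetic code, with codons enumerated over the bases U, C, A, G
# (third base varying fastest); '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("UCAG", repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Return the mRNA string for a DNA coding strand (T replaced by U)."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate codon by codon until a stop codon or the end of the string."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical toy input, chosen only for illustration.
mrna = transcribe("ATGGTGCTGTCTTAA")
print(mrna, translate(mrna))
```

Each step maps a string over one alphabet to a string over another, which is the sense in which the slide pairs every biopolymer with a "process (algorithm)".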

  5. Complexity of “DNA computing” http://www.genecrc.org/site/lc/lc2d.htm

  6. Get the relevant sequences to compare them: conservation and differences
     Problem → Algorithms → Programs
     • Sequencing → fragment assembly problem → the Shortest Superstring Problem → Phrap (Green, 1994)
     • Gene finding → Hidden Markov Models, pattern recognition methods → GenScan (Burge & Karlin, 1997)
     • Sequence comparison → pairwise and multiple sequence alignments → dynamic programming, heuristic methods → BLAST (Altschul et al., 1990)
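
The link between fragment assembly and the Shortest Superstring Problem can be made concrete with a small Python sketch of the textbook greedy heuristic: repeatedly merge the two fragments with the largest suffix-prefix overlap. This is an approximation for error-free toy fragments, not the method used by Phrap; the function names and the example reads are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    """Greedy approximation to the shortest common superstring."""
    # Drop duplicates and fragments fully contained in another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the best pair, writing the overlapping part only once.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags[0] if frags else ""


reads = ["ACGTAC", "TACGGA", "GGATTC"]  # hypothetical error-free reads
print(greedy_superstring(reads))
```

The greedy choice is not guaranteed to give the shortest superstring, and real reads contain errors and repeats, which is why practical assemblers rely on more elaborate overlap scoring.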

  7. Redundancy in biological systems
     An example: sperm whale vs. human myoglobin:
     Query: 1   MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE 60
                M LS+GEWQLVL+VW KVEAD+ GHGQ++LIRLFK HPETLEKFD+FKHLK+E EMKASE
     Sbjct: 1   MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE 60
     Query: 61  DLKKHGVTVLTALGAILKKKGHHEAELKPFAQSHATKHKIPIKYLEFISEAIIHVLHSRH 120
                DLKKHG TVLTALG ILKKKGHHEAE+KP AQSHATKHKIP+KYLEFISE II VL S+H
     Sbjct: 61  DLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKH 120
     Query: 121 PGNFGADAQGAMNKALELFRKDIAAKYKELGYQG 154
                PG+FGADAQGAMNKALELFRKD+A+ YKELG+QG
     Sbjct: 121 PGDFGADAQGAMNKALELFRKDMASNYKELGFQG 154
     Exercise: find the sequence of 1mba in the PDB and “blast” it against nr using NCBI.
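
How similar the two myoglobin sequences are can be quantified directly from the gapless alignment above. The following Python sketch computes percent identity; the sequences are copied from the slide, while the function name and the choice of percent identity as the summary statistic are assumptions made here for illustration.

```python
# Aligned sequences copied from the slide (no gaps, 154 positions each).
QUERY = (
    "MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE"
    "DLKKHGVTVLTALGAILKKKGHHEAELKPFAQSHATKHKIPIKYLEFISEAIIHVLHSRH"
    "PGNFGADAQGAMNKALELFRKDIAAKYKELGYQG"
)
SBJCT = (
    "MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE"
    "DLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKH"
    "PGDFGADAQGAMNKALELFRKDMASNYKELGFQG"
)


def percent_identity(a, b):
    """Percentage of aligned positions carrying the same residue."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


print(f"{percent_identity(QUERY, SBJCT):.1f}% identical positions")
```

Percent identity ignores the conservative substitutions marked with "+" in the BLAST output; capturing those requires a substitution matrix.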

  8. Limits of the sequence approximation
     • All the information and various fingerprints of information processing at the molecular level (via interactions etc.), including adjustment to physiologically relevant external signals, seem to be included in nucleotide and protein sequences
     • However, there are limits to this simple approximation: actual understanding of molecular processes requires structure, chemistry, kinetics and thermodynamics
     • On the other hand, a deeper understanding of the nature of biological objects and processes greatly facilitates sequence-based studies by suggesting critical features, similarity measurements etc.

  9. Strings, sequences and string operations String vs. sequence duality will be important for exact vs. inexact string matching
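
The slide's definitions did not survive in the transcript. Assuming the common convention that a substring is a contiguous block of characters while a subsequence only has to preserve their order, the distinction can be sketched in Python as follows; the function names are hypothetical.

```python
def is_substring(pattern, text):
    """True if pattern occurs in text as a contiguous block."""
    return pattern in text


def is_subsequence(pattern, text):
    """True if the characters of pattern occur in text in the same order,
    not necessarily next to each other."""
    remaining = iter(text)
    # 'ch in remaining' consumes the iterator up to the first match,
    # so later characters are searched only in what is left.
    return all(ch in remaining for ch in pattern)


text = "LSEGEWQ"
print(is_substring("GEW", text), is_subsequence("GEW", text))
print(is_substring("LEQ", text), is_subsequence("LEQ", text))
```

Under this convention, exact matching looks for substrings, whereas inexact matching and alignment tolerate the intervening characters that subsequences allow.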

  10. Beyond the letters: how to find better models (e.g. GC content for gene finding) http://www.imb-jena.de/IMAGE_BPDIR.html
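
GC content, mentioned here as a feature for gene finding, is straightforward to compute. The Python sketch below reports it for a whole sequence and along a sliding window; the function names, window size and toy sequence are illustrative assumptions, not values from the lecture.

```python
def gc_content(seq):
    """Fraction of G and C characters in a DNA string."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window, step=1):
    """GC content of each full-length window along the sequence."""
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


dna = "AATTGGCC"  # hypothetical toy sequence
print(gc_content(dna))
print(windowed_gc(dna, window=4, step=4))
```

A windowed profile like this is one simple way to look beyond individual letters; regions whose GC content departs from the genomic background can then be flagged for closer inspection.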

  11. Another example: active sites, functional motifs and multiple alignment

  12. Distance and similarity measures

  13. Edit distance vs. substitution score
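
The slide body is missing from the transcript, so the following Python sketch only illustrates the two notions named in its title under common assumptions: edit distance is taken to be the unit-cost Levenshtein distance (insertions, deletions and substitutions each cost 1), and the substitution score is a gapless sum of per-position scores with illustrative match and mismatch values.

```python
def edit_distance(s, t):
    """Unit-cost Levenshtein distance computed by dynamic programming."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, a in enumerate(s, 1):
        curr = [i]
        for j, b in enumerate(t, 1):
            curr.append(min(
                prev[j] + 1,             # delete a from s
                curr[j - 1] + 1,         # insert b into s
                prev[j - 1] + (a != b),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def substitution_score(s, t, match=1, mismatch=-1):
    """Score of a gapless alignment of two equal-length strings."""
    if len(s) != len(t):
        raise ValueError("gapless scoring needs equal-length strings")
    return sum(match if a == b else mismatch for a, b in zip(s, t))


print(edit_distance("kitten", "sitting"))
print(substitution_score("ACGT", "ACGA"))
```

A distance is minimized and is zero for identical strings, whereas a score is maximized; replacing the fixed match and mismatch values with entries of a substitution matrix gives the scores used in protein alignment.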

  14. Substitution matrices for protein sequence alignment: learning and extrapolating from examples
      • PAM matrices (Dayhoff et al.): extrapolate to longer evolutionary times from data for very similar proteins with more than 85% sequence identity (short evolutionary time). $s(a,b \mid t) = \log \frac{P(b \mid a,t)}{q_b}$, e.g. $P(b \mid a,2) = \sum_c P(b \mid c,1)\,P(c \mid a,1)$
      • BLOSUM matrices (Henikoff & Henikoff): derived from multiple alignments of more distantly related proteins (e.g. BLOSUM50 with 50% sequence identity). $s(a,b) = \log \frac{p_{ab}}{q_a q_b}$, where $p_{ab} = F_{ab} / \sum_{cd} F_{cd}$
      • Expected score: $\sum_{ab} q_a q_b\, s(a,b) = -\sum_{ab} q_a q_b \log \frac{q_a q_b}{p_{ab}} = -H(q \,\|\, p)$
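
Both constructions can be tried on a toy two-letter alphabet. The Python sketch below is an illustration under assumptions, not the procedure used to build the published matrices: the pair counts and one-step probabilities are made up, counts are given for ordered pairs so that background frequencies are simple marginals, and base-2 logarithms are chosen arbitrarily.

```python
import math


def log_odds_scores(pair_counts):
    """BLOSUM-style scores s(a,b) = log2(p_ab / (q_a q_b)) from counts F_ab.

    pair_counts maps ordered pairs (a, b) to counts and is assumed to be
    symmetric and to contain every pair of letters.
    """
    total = sum(pair_counts.values())
    p = {pair: n / total for pair, n in pair_counts.items()}
    q = {}
    for (a, _b), p_ab in p.items():
        q[a] = q.get(a, 0.0) + p_ab  # background frequency as a marginal
    scores = {(a, b): math.log2(p_ab / (q[a] * q[b]))
              for (a, b), p_ab in p.items()}
    expected = sum(q[a] * q[b] * s for (a, b), s in scores.items())
    return scores, expected


def extrapolate(p1):
    """PAM-style step: P(b|a,2) = sum_c P(b|c,1) P(c|a,1).

    p1[a][b] holds the one-step probability P(b|a,1).
    """
    letters = list(p1)
    return {a: {b: sum(p1[a][c] * p1[c][b] for c in letters)
                for b in letters}
            for a in letters}


# Hypothetical counts and probabilities for a two-letter alphabet.
counts = {("A", "A"): 40, ("A", "B"): 10, ("B", "A"): 10, ("B", "B"): 40}
scores, expected = log_odds_scores(counts)
p2 = extrapolate({"A": {"A": 0.99, "B": 0.01},
                  "B": {"A": 0.02, "B": 0.98}})
print(scores, expected)
print(p2)
```

In this toy case pairs seen more often than chance get positive scores, rarer pairs get negative scores, and the expected score under the background distribution is negative, in line with the relative-entropy expression on the slide.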

  15. Summary

  16. Web resources and materials for the course
      • Protein Modeling Lab
      • Remote access to PML and the Citrix software
      • All lectures and other materials available electronically from the PML servers
      • Electronic tests and homework, web submission interfaces
      • The web site for the Introduction to Bioinformatics course
      • Updates
      http://folding.chmcc.org
      http://folding.chmcc.org/protlab/protlab.html
      http://folding.chmcc.org/intro2bioinfo/intro2bioinfo.html
