
Molecular biology in silico

Mikhail Gelfand, Research and Training Center “Bioinformatics”, Institute for Information Transmission Problems, RAS. AlBio06, Moscow, July 2006.


Presentation Transcript


  1. Molecular biology in silico. Mikhail Gelfand, Research and Training Center “Bioinformatics”, Institute for Information Transmission Problems, RAS. AlBio06, Moscow, July 2006

  2. Propaganda. Red: papers (experiments); blue: sequence fragments

  3. Complete genomes. GOLD db. (III.2006): 361 complete genomes. Incomplete (in the process): 952 bacteria, 58 archaea, 607 eukaryotes (incl. ESTs), 46 metagenomes

  4. More propaganda. Most genes will never be studied in experiment. Even in E. coli: only 20-30 new genes per year (hundreds are still uncharacterized). Bioinformatics = molecular biology in silico • ~2% of all recent papers in biological journals • Essential component of biological research • Make predictions about function and regulation of genes (many quite reliable!) • Metabolic reconstruction and prediction of phenotype given genome • Identify really interesting cases, fill gaps in knowledge • “Universally missing genes” – not a single known gene even for ~10% of reactions of central metabolism. No genes for >40% of reactions overall • “Conserved hypothetical genes” (5-15% of any bacterial genome) – essential, but unknown function

  5. Haemophilus influenzae, 1995

  6. Vibrio cholerae, 2000

  7. How? Similarity to known proteins • Useful for many purposes (allows one to annotate 50-75% of genes in a bacterial genome) • Necessary first step • May be automated • … to some extent … • in particular, care is needed to avoid overly specific predictions • Problem: propagation of annotation errors • Boring (nothing new)

  8. Noradrenaline transporter in an archaeon?
     SOURCE      Methanococcus jannaschii.
       ORGANISM  Methanococcus jannaschii
                 Archaea; Euryarchaeota; Methanococcales; Methanococcaceae; Methanococcus.
     FEATURES             Location/Qualifiers
          source          1..492
                          /organism="Methanococcus jannaschii"
                          /db_xref="taxon:2190"
          Protein         1..492
                          /product="sodium-dependent noradrenaline transporter"
          CDS             1..492
                          /gene="MJ1319"
                          /note="similar to EGAD:HI0736 percent identity: 38.5; identified by sequence similarity; putative"
                          /coded_by="U67572:71..1549"
                          /transl_table=11
     Now corrected: Hypothetical sodium-dependent transporter MJ1319.

  9. Similarity to hypothetical proteins: somebody else’s errors… The correct annotation

  10. Genes with curious functional assignments • C75604: Probable head morphogenesis protein, Deinococcus radiodurans • O05360: Automembrane protein H, Yersinia enterocolitica • Q8TID9: Benzodiazepine (valium) receptor TspO, Methanosarcina acetivorans • NP_069403: DR-beta chain MHC class II, Archaeoglobus fulgidus

  11. Errors in experimental papers
     SwissProt:
       DEFINITION  Hypothetical 43.6 kDa protein.
       ACCESSION   P48012
       ...
       KEYWORDS    Hypothetical protein.
       SOURCE      Debaryomyces occidentalis
         ORGANISM  Debaryomyces occidentalis
                   Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Debaryomyces.
       [CAUTION]   Was originally (Ref.1) thought to be 3-isopropylmalate dehydrogenase (LEU2).
     PIR:
       DEFINITION  3-isopropylmalate dehydrogenase (EC 1.1.1.85) - yeast (Schwanniomyces occidentalis).
       ACCESSION   S55845
       KEYWORDS    oxidoreductase.

  12. SwissProt entry DSDX_ECOLI -!- CAUTION: An ORF called dsdC was originally (Ref.3) assigned to the wrong DNA strand and thought to be a D-serine deaminase activator, it was then resequenced by Ref.2 and still thought to be "dsdC", but this time to function as a D-serine permease. It is Ref.1 that showed that dsdC is another gene and that this sequence should be called dsdX. It should also be noted that the C-terminal part of dsdX (from 338 onward) was also sequenced (Ref.6 and Ref.7) and was thought to be a separate ORF (don't worry, we also had difficulties understanding what happened!).

  13. Positional clustering • Genes that are located in immediate proximity tend to be involved in the same metabolic pathway or functional subsystem • mainly in prokaryotes, very weak in eukaryotes • caused by operon structure, but not only • horizontal transfer of loci containing several functionally linked operons • compartmentalisation of products in the cytoplasm • very weak evidence • stronger if observed in many unrelated genomes • May be measured • e.g. the STRING database/server (P. Bork, EMBL) • and other sources
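A minimal sketch of how such a measurement could look, assuming each genome is available as an ordered list of ortholog labels. The toy gene orders, the `max_gap` parameter, and both function names are hypothetical; this is not how STRING computes its scores, and chromosome circularity and strand are ignored.

```python
from collections import Counter

def neighbor_pairs(gene_order, max_gap=1):
    """Unordered pairs of distinct genes at most max_gap positions apart."""
    pairs = set()
    for i, gene in enumerate(gene_order):
        for j in range(i + 1, min(i + 1 + max_gap, len(gene_order))):
            if gene != gene_order[j]:
                pairs.add(frozenset((gene, gene_order[j])))
    return pairs

def clustering_support(genomes, max_gap=1):
    """Count, for every gene pair, the number of genomes where it is adjacent."""
    counts = Counter()
    for order in genomes.values():
        counts.update(neighbor_pairs(order, max_gap))
    return counts

# Toy gene orders (hypothetical, for illustration only).
genomes = {
    "genome_A": ["trpA", "trpB", "trpC", "x1", "x2"],
    "genome_B": ["x2", "trpB", "trpA", "x3", "trpC"],
    "genome_C": ["x1", "trpC", "x3", "trpA", "trpB"],
}
support = clustering_support(genomes)
print(support[frozenset(("trpA", "trpB"))])  # adjacent in all three toy genomes
print(support[frozenset(("trpB", "trpC"))])  # adjacent in one toy genome only
```

A pair seen together in one genome is weak evidence; the count over many unrelated genomes is what carries the signal.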

  14. STRING: trpB – positional clusters

  15. Functionally dependent genes tend to cluster on chromosomes in many different organisms Vertical axis: number of gene pairs with association score exceeding a threshold. Control: same graph, random re-labeling of vertices

  16. More genomes (stronger links) => highly significant clustering

  17. Especially in linear pathways (right)

  18. Fusions • If two (or more) proteins form a single multidomain protein in some organism, they all are likely to be tightly functionally related • Very useful for the analysis of eukaryotes • Sometimes useful for the analysis of prokaryotes

  19. STRING: trpB – fusions

  20. Phyletic patterns • Functionally linked genes tend to occur together • Enzymes with the same function (isozymes) have complementary phyletic profiles

  21. STRING: trpB – co-occurrence (phyletic profiles)

  22. Phyletic profiles in the Phe/Tyr pathway shikimate kinase

  23. Chorismate biosynthesis pathway (E. coli) Archaeal shikimate-kinase

  24. Arithmetics of phyletic patterns
     3-dehydroquinate dehydratase (EC 4.2.1.10):
         Class I (AroD)        COG0710   aompkzyq---lb-e----n---i--
       + Class II (AroQ)       COG0757   ------y-vdr-bcefghs-uj----
       = Two forms combined              aompkzyqvdrlbcefghsnuj-i--
     Shikimate kinase (EC 2.7.1.71):
         Typical (AroK)        COG0703   ------yqvdrlbcefghsnuj-i--
       + Archaeal-type         COG1685   aompkz--------------------
       = Two forms combined              aompkzyqvdrlbcefghsnuj-i--
     Shikimate dehydrogenase (EC 1.1.1.25):
         AroE                  COG0169   aompkzyqvdrlbcefghsnuj-i--
     5-enolpyruvylshikimate 3-phosphate synthase (EC 2.5.1.19):
         AroA                  COG0128   aompkzyqvdrlbcefghsnuj-i--
     Chorismate synthase (EC 4.2.3.5):
         AroC                  COG0082   aompkzyqvdrlbcefghsnuj-i--
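The "arithmetic" on this slide can be checked mechanically. The sketch below treats each pattern as a string with one position per genome ("-" meaning the gene is absent) and uses the pattern strings from the slide; the function names `combine` and `overlap` are illustrative, not part of the COG tools.

```python
def combine(p, q):
    """Union of two phyletic patterns: a genome counts if either form is present."""
    if len(p) != len(q):
        raise ValueError("patterns must cover the same genomes")
    return "".join(a if a != "-" else b for a, b in zip(p, q))

def overlap(p, q):
    """Genomes (one-letter codes) that carry both forms."""
    return [a for a, b in zip(p, q) if a != "-" and b != "-"]

aro_d      = "aompkzyq---lb-e----n---i--"   # COG0710, class I dehydratase
aro_q      = "------y-vdr-bcefghs-uj----"   # COG0757, class II dehydratase
aro_k      = "------yqvdrlbcefghsnuj-i--"   # COG0703, typical shikimate kinase
aro_k_arch = "aompkz" + "-" * 20            # COG1685, archaeal-type kinase
aro_e      = "aompkzyqvdrlbcefghsnuj-i--"   # COG0169, present wherever the pathway is

print(combine(aro_d, aro_q) == aro_e)        # the two dehydratase classes fill the pattern
print(combine(aro_k, aro_k_arch) == aro_e)   # so do the two kinase types
print(overlap(aro_k, aro_k_arch))            # kinases: strictly complementary
print(overlap(aro_d, aro_q))                 # dehydratases: a few genomes have both
```

Strict complementarity (an empty overlap) is the cleanest signal of non-orthologous displacement; a small overlap, as for the two dehydratase classes, is still compatible with it.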

  25. Distribution of association scores (monotonic for subunits, bimodal for isozymes)

  26. E.g. transporters • Transporters of end products of metabolic pathways may substitute the entire pathway • Transporters of compounds for catabolic pathways co-occur with pathways • Transporters for intermediates substitute upstream parts of pathways

  27. Example: bioY

  28. Other approaches to phyletic patterns • Gene signatures of lifestyles • e.g. thermophily: reverse gyrase is the only gene specific to all hyperthermophiles (bacterial and archaeal) • see COGs • Regulators and signals

  29. Example: bioR. Gene: black arrow; candidate site: red dot

  30. Comparative analysis of regulation • Phylogenetic footprinting: regulatory sites are more conserved than non-coding regions in general and are often seen as conserved islands in alignments of gene upstream regions • Consistency filtering: regulons (sets of co-regulated genes) are conserved => • true sites occur upstream of orthologous genes • false sites are scattered at random
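The consistency filter can be sketched as a simple vote across genomes. This is a minimal illustration assuming that candidate sites have already been mapped to ortholog groups of the downstream genes; the toy data, the `min_genomes` threshold, and the function name are hypothetical.

```python
from collections import Counter

def consistent_regulon(candidate_sites, min_genomes=2):
    """Keep ortholog groups with a candidate site upstream in >= min_genomes genomes."""
    counts = Counter()
    for groups in candidate_sites.values():
        counts.update(set(groups))
    return {group for group, n in counts.items() if n >= min_genomes}

# Toy input: genome -> ortholog groups with a candidate site upstream.
candidate_sites = {
    "genome_A": {"ribD", "ypaA", "geneX"},
    "genome_B": {"ribD", "ypaA", "geneY"},
    "genome_C": {"ribD", "geneZ"},
}
print(sorted(consistent_regulon(candidate_sites)))
```

Sites upstream of `geneX`, `geneY`, and `geneZ` each appear in a single genome and are discarded as likely false positives, while sites upstream of orthologs in several genomes are retained.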

  31. Enzymes • Identification of a gap in a pathway (universal, taxon-specific, or in individual genomes) • Search for candidates assigned to the pathway by co-localization and co-regulation (in many genomes) • Prediction of general biochemical function from (distant) similarity and functional patterns • Tentative filling of the gap • Verification by analysis of phylogenetic patterns: • Absence in genomes without this pathway • Complementary distribution with known enzymes for the same function

  32. Transporters • Identification of candidates assigned to the pathway by co-localization and co-regulation (in many genomes) • Prediction of general function by analysis of transmembrane segments and similarity • Prediction of specificity by analysis of phylogenetic patterns: • End product if present in genomes lacking this pathway (substituting the biosynthetic pathway for an essential compound) • Input metabolite if absent in genomes without the pathway (catabolic, also precursors in biosynthetic pathways) • Entry point in the middle if substituting an upper or side part of the pathway in some genomes
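The first two specificity rules can be written as a small decision function. This is a sketch under simplifying assumptions: each genome either has the whole pathway or lacks it, so the third rule (an entry point in the middle) is not covered. The function name and the placeholder genome names are hypothetical.

```python
def predict_specificity(has_transporter, has_pathway):
    """Apply the phyletic-pattern rules to two sets of genome names."""
    if has_transporter - has_pathway:
        # Transporter found where the pathway is missing: it must supply the product.
        return "end product"
    if has_transporter:
        # Transporter never occurs without the pathway: it feeds the pathway.
        return "input metabolite or precursor"
    return "undetermined"

# Toy example: three genomes carry the candidate but lack the pathway.
has_transporter = {"S. pyogenes", "E. faecalis", "Listeria sp.", "genome_with_pathway"}
has_pathway = {"genome_with_pathway", "other_genome"}
print(predict_specificity(has_transporter, has_pathway))
print(predict_specificity({"g1"}, {"g1", "g2"}))
```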

  33. 5’ UTR regions of riboflavin genes from bacteria

  34. Conserved secondary structure of the RFN-element. Capitals: invariant (absolutely conserved) positions. Lower case letters: strongly conserved positions. Dashes and stars: obligatory and facultative base pairs. Degenerate positions: R = A or G; Y = C or U; K = G or U; B = not A; V = not U. N: any nucleotide. X: any nucleotide or deletion
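The degenerate codes in this legend translate directly into a pattern search. The sketch below handles the sequence letters only: base pairing and the "X" (nucleotide or deletion) symbol are left out. The consensus `GRYNB` and the test sequence are made up for illustration and are not part of the RFN element.

```python
import re

# Degenerate RNA codes as defined in the slide legend.
DEGENERATE = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "[AG]", "Y": "[CU]", "K": "[GU]",
    "B": "[CGU]",   # not A
    "V": "[ACG]",   # not U
    "N": "[ACGU]",  # any nucleotide
}

def consensus_to_regex(consensus):
    """Compile a degenerate consensus into a regex that also finds overlapping hits."""
    body = "".join(DEGENERATE[symbol] for symbol in consensus.upper())
    return re.compile("(?=" + body + ")")

def find_matches(consensus, rna):
    """Return 0-based start positions of all matches of the consensus in rna."""
    return [m.start() for m in consensus_to_regex(consensus).finditer(rna.upper())]

print(find_matches("GRYNB", "AAGGCUCGACAG"))
```

Upper- and lower-case consensus letters are treated alike here; distinguishing invariant from strongly conserved positions would need a scoring scheme rather than an exact match.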

  35. RFN: the mechanism of regulation • Transcription attenuation • Translation attenuation

  36. Early observation: an uncharacterized gene (ypaA) with an upstream RFN element

  37. Phylogenetic tree of RFN-elements (regulation of riboflavin biosynthesis). Branch labels: no riboflavin biosynthesis; duplications; no riboflavin biosynthesis

  38. YpaA: riboflavin (vitamin B2) transporter in Gram-positive bacteria • 5 predicted transmembrane segments => a transporter • Upstream RFN element (likely co-regulation with riboflavin genes) => transport of riboflavin or a precursor • S. pyogenes, E. faecalis, Listeria sp.: ypaA, no riboflavin pathway => transport of riboflavin. Prediction: YpaA is a riboflavin transporter (Gelfand et al., 1999). Validation: • YpaA transports flavins (riboflavin, FMN, FAD) (by genetic analysis, Kreneva et al., 2000) • ypaA is regulated by riboflavin (by microarray expression study, Lee et al., 2001) • … via attenuation of transcription (and to some extent inhibition of translation) (Winkler et al., 2003)

  39. A new family of nickel/cobalt transporters • No experimental data • No structural data • Specificity predicted by comparative genomics • … and then validated in experiment • Mutational analysis under way

  40. Conserved signal upstream of nrd genes

  41. Identification of the candidate regulator by the analysis of phyletic patterns • COG1327: the only COG with exactly the same phylogenetic pattern as the signal • “large scale” on the level of major taxa • “small scale” within major taxa: • absent in small parasites among alpha- and gamma-proteobacteria • absent in Desulfovibrio spp. among delta-proteobacteria • absent in Nostoc sp. among cyanobacteria • absent in Oenococcus and Leuconostoc among Firmicutes • present only in Treponema denticola among four spirochetes
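The search for a family with the same phyletic pattern as the signal can be sketched as a pattern comparison. The presence/absence strings and family names below are toy data, and `max_mismatches` is a hypothetical tolerance; the real analysis compared patterns across the sequenced genomes at both the "large" and "small" scales listed above.

```python
def mismatches(p, q):
    """Number of genomes in which two presence/absence patterns disagree."""
    if len(p) != len(q):
        raise ValueError("patterns must cover the same genomes")
    return sum(a != b for a, b in zip(p, q))

def matching_families(signal_pattern, family_patterns, max_mismatches=0):
    """Families whose pattern differs from the signal's in at most max_mismatches genomes."""
    return sorted(
        name for name, pattern in family_patterns.items()
        if mismatches(signal_pattern, pattern) <= max_mismatches
    )

# Toy patterns: one character per genome, "1" = present, "0" = absent.
signal = "1101101"
families = {
    "family_1": "1101101",   # identical to the signal
    "family_2": "1111111",   # universal family
    "family_3": "1101001",   # one disagreement
    "family_4": "0010010",   # complementary
}
print(matching_families(signal, families))
print(matching_families(signal, families, max_mismatches=1))
```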

  42. COG1327 “Predicted transcriptional regulator, consists of a Zn-ribbon and ATP-cone domains”: regulator of the riboflavin pathway?

  43. Additional evidence • sometimes clustered with nrd genes or with replication genes dnaB, dnaI, polA • candidate signals upstream of other replication-related genes • dNTP salvage • topoisomerase I, replication initiator dnaA, chromosome partitioning, DNA helicase II • experimental confirmation in Streptomyces (Borovok et al., 2004)

  44. Multiple sites (nrd genes): FNR, DnaA, NrdR

  45. Mode of regulation • Repressor (overlaps with promoters) • Co-operative binding: • most sites occur in tandem (> 90% cases) • the distance between the copies (centers of palindromes) equals an integer number of DNA turns: • mainly (94%) 30-33 bp, in 84% 31-32 bp – 3 turns • 21 bp (2 turns) in Vibrio spp. • 41-42 bp (4 turns) in some Firmicutes
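The helical-turn arithmetic behind these distances can be checked directly. The sketch assumes a helical repeat of about 10.5 bp per turn for B-form DNA, a value not stated on the slide; the `tolerance` of 0.15 turn and the function names are illustrative choices.

```python
BP_PER_TURN = 10.5  # assumed helical repeat of B-form DNA, in base pairs

def helical_phase(distance_bp, bp_per_turn=BP_PER_TURN):
    """Return (nearest whole number of turns, offset from it in turns)."""
    turns = distance_bp / bp_per_turn
    nearest = round(turns)
    return nearest, abs(turns - nearest)

def same_face(distance_bp, tolerance=0.15):
    """True if two site centers lie on roughly the same face of the helix."""
    return helical_phase(distance_bp)[1] <= tolerance

for distance in (21, 31, 32, 41, 42, 26):
    turns, offset = helical_phase(distance)
    print(distance, turns, round(offset, 3), same_face(distance))
```

Distances of 21, 31-32, and 41-42 bp come out within a tenth of a turn of 2, 3, and 4 turns, whereas a spacing such as 26 bp would put the two sites on nearly opposite faces of the helix.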

  46. Combined regulatory network for iron homeostasis genes in α-proteobacteria. [Figure: regulators Irr, RirA, IscR, and Fur; signals: iron ([+Fe], [-Fe]), heme, and the FeS status of the cell; targets: siderophore uptake, Fe2+/Fe3+ uptake, iron uptake systems, heme uptake, iron storage ferritins, iron-requiring enzymes [iron cofactor], FeS synthesis, heme synthesis, and transcription factors.] The connecting lines denote regulatory interactions, with the thickness reflecting the frequency of the interaction in the analyzed genomes. The suggested negative or positive mode of operation is shown by the dead end or arrow end of the line.

  47. Distribution of Irr, Fur/Mur, MntR, RirA, and IscR regulons in α-proteobacteria. [Table: presence (+) or absence (-) of the Fe and Mn regulons (Fur/Mur, MntR, Irr, RirA, IscR) in each genome, listed with its abbreviation and group. Group A, Rhizobiales (Rhizobiaceae and relatives): Sinorhizobium meliloti (SM), Rhizobium leguminosarum (RL), Rhizobium etli (RHE), Agrobacterium tumefaciens (AGR), Mesorhizobium loti (ML), Mesorhizobium sp. BNC1 (MBNC), Brucella melitensis (BME), Bartonella quintana and spp. (BQ). Group B, Bradyrhizobiaceae: Bradyrhizobium japonicum (BJ), Rhodopseudomonas palustris (RPA), Nitrobacter hamburgensis (Nham), Nitrobacter winogradskyi (Nwi). Group C, Rhodobacterales: Rhodobacter capsulatus (RC), Rhodobacter sphaeroides (Rsph), Silicibacter sp. TM1040 (STM), Silicibacter pomeroyi (SPO), Jannaschia sp. CC51 (Jann), Rhodobacterales bacterium HTCC2654 (RB2654), Roseobacter sp. MED193 (MED193), Roseovarius nubinhibens ISM (ISM), Roseovarius sp. 217 (ROS217), Loktanella vestfoldensis SKA53 (SKA53), Sulfitobacter sp. EE-36 (EE36), Oceanicola batsensis HTCC2597 (OB2597). Group D: Oceanicaulis alexandrii HTCC2633 (OA2633, Hyphomonadaceae), Caulobacter crescentus (CC, Caulobacterales), Parvularcula bermudensis HTCC2503 (PB2503, Parvularculales), Erythrobacter litoralis (ELI), Novosphingobium aromaticivorans (Saro), Sphingopyxis alaskensis RB2256 (Sala), and Zymomonas mobilis (ZM) (Sphingomonadales), Gluconobacter oxydans (GOX), Rhodospirillum rubrum (Rrub), and Magnetospirillum magneticum (Amb) (Rhodospirillales). Others: Pelagibacter ubique HTCC1002 (PU1002, SAR11 cluster); Rickettsia and Ehrlichia species (Rickettsiales).] '#?' in the RirA column denotes the absence of the rirA gene in an unfinished genomic sequence and the presence of candidate RirA-binding sites upstream of the iron uptake genes.

  48. Phylogenetic tree of the Fur family of transcription factors in α-proteobacteria - I. [Figure: tree with the following labeled branches. Fur in γ- and β-proteobacteria (Escherichia coli P0A9A9, Pseudomonas aeruginosa Q03456, Neisseria meningitidis P0A0S7); Fur in ε-proteobacteria (Helicobacter pylori O25671); Fur in Firmicutes (Bacillus subtilis P54574); Mur in α-proteobacteria: regulator of manganese uptake genes (sit, mntH); Fur in α-proteobacteria: regulator of iron uptake and metabolism genes; Irr.]

  49. Sequence logos for the identified Fur-binding sites in the D group of α-proteobacteria (Erythrobacter litoralis, Caulobacter crescentus, Novosphingobium aromaticivorans, Zymomonas mobilis, Oceanicaulis alexandrii, Sphingopyxis alaskensis, Rhodospirillum rubrum, Gluconobacter oxydans, Parvularcula bermudensis, Magnetospirillum magneticum). Identified Mur-binding sites of α-proteobacteria (the A, B, and C groups). Sequence logos for the known Fur-binding sites in Escherichia coli and Bacillus subtilis

  50. Phylogenetic tree of the Fur family of transcription factors in α-proteobacteria - II. [Figure: tree with the following labeled branches. Fur in γ- and β-proteobacteria (Escherichia coli P0A9A9, Pseudomonas aeruginosa Q03456, Neisseria meningitidis P0A0S7); Fur in ε-proteobacteria (Helicobacter pylori O25671); Fur in Firmicutes (Bacillus subtilis P54574); Mur / Fur in α-proteobacteria; Irr in α-proteobacteria: regulator of iron homeostasis.]
