Prioritization of targets for Structural Genomics

Peer Bork, EMBL & MDC, Heidelberg & Berlin, bork@embl-heidelberg.de, http://www.bork.embl-heidelberg.de/

Prioritising targets for Structural Genomics
• Homology-based coverage
• Complexes and functional modules
• Candidates for complex diseases
• Associating genes to diseases

Intellectual challenge [figure; axis labels: coverage, time]

Xray selection protocol (Oct 1999)
Filters and number of proteins remaining:
• Human, <500 aa, annotated in sequence databases: 32349
• Filter for 98% redundancy, splice forms, fragments: 20724
• Match to clones available at German resource center: 6016
• EST matches protein in N-terminal region: 4755
• Proteins have no homologue with known 3D (fast check): 1827
• No transmembrane region or other composition bias: 1102
• Proteins have no homologue with known 3D (sensitive check): 602
Side labels on the slide: distinct expression protocol; distinct NMR protocol
Functional features of the 602: 347 known, 255 unknown
Medical relevance likely: 71
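The selection funnel can be summarized numerically. The following is a minimal sketch that uses only the counts shown on the slide; the shortened step labels and the function name `retention` are illustrative, and assigning the count 1827 to the fast 3D-homologue check is an assumption based on the order of the numbers.

```python
# Counts as listed on the slide. Assigning 1827 to the fast check is an
# assumption; the labels are shortened for display.
FUNNEL = [
    ("human, <500 aa, annotated", 32349),
    ("98% redundancy filter", 20724),
    ("clone available", 6016),
    ("EST match in N-terminal region", 4755),
    ("no 3D homologue (fast check)", 1827),
    ("no transmembrane region / composition bias", 1102),
    ("no 3D homologue (sensitive check)", 602),
]

def retention(funnel):
    """Return (label, count, fraction of previous step, fraction of start)."""
    rows = []
    start = funnel[0][1]
    previous = start
    for label, count in funnel:
        rows.append((label, count, count / previous, count / start))
        previous = count
    return rows

for label, count, step_fraction, total_fraction in retention(FUNNEL):
    print(f"{label:45s} {count:6d} {step_fraction:7.1%} {total_fraction:7.1%}")
```

Read this way, fewer than 2% of the initial proteins survive all filters, and the largest relative drops are the clone-availability match and the fast homology check.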

Criteria for target selection from sequence
• No similar sequence with known fold
• Everything that crystallizes in a given species
• Everything from certain pathways/complexes/compartments
• Everything with certain properties (e.g. thermophilic, kinase function)
• ‘All’ disease genes
• Everything else left over
• …

Structural Biology and Bioinformatics
• Target prediction for structural genomics
• Zooming out: Protein interactions
• Zooming in: SNPs and 3D structures

Interaction prediction: Berend Snel, Rich Copley + Martijn Huynen

Function prediction via genomic context information
Gene context:
- Gene fusion as distinct neighborhood subset
- Conserved gene neighborhood in genomes
- Conserved co-occurrence of genes in species (‘phylogenetic profile’, ‘COG pattern’)
- Surrounding and shared regulatory elements
Knowledge-based context:
- Pathway data (can overrule homology!)
- Gene expression data (co-expression etc.)
- Protein interaction/localisation
- Scientific literature
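Conserved co-occurrence (the phylogenetic profile idea) can be illustrated with a small sketch. It assumes each gene is represented by the set of genomes in which an orthologue is found and compares profiles with the Jaccard index. The gene and genome names, the threshold, and the choice of Jaccard similarity are all illustrative assumptions, not the method behind the slides.

```python
from itertools import combinations

# Hypothetical presence data: gene -> genomes in which an orthologue occurs.
PROFILES = {
    "geneA": {"g1", "g2", "g4"},
    "geneB": {"g1", "g2", "g4"},
    "geneC": {"g2", "g3"},
    "geneD": {"g1", "g2", "g3", "g4"},
}

def jaccard(a, b):
    """Shared genomes divided by all genomes in which either gene occurs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def co_occurring_pairs(profiles, threshold=0.75):
    """Return (gene1, gene2, score) for profile pairs at or above threshold."""
    pairs = []
    for (g1, p1), (g2, p2) in combinations(sorted(profiles.items()), 2):
        score = jaccard(p1, p2)
        if score >= threshold:
            pairs.append((g1, g2, score))
    return pairs

for gene1, gene2, score in co_occurring_pairs(PROFILES):
    print(gene1, gene2, round(score, 2))
```

A gene present in every genome, such as the hypothetical geneD, matches many profiles without being informative, so a real analysis would need to correct for how widespread each gene is.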

Context methods in Mycoplasma: Fusion, neighborhood, co-occurrence
MG total: 480 genes
Presence in conserved operons: 213
[Venn diagram of Fusion, Co-occurrence in genomes, and Conserved neighborhood; values shown: 27, 54, 178]

STRING server for context retrieval: www.bork.embl-heidelberg.de/STRING
Tryptophan biosynthesis

Gene neighborhood reflects connections between Tryptophan and Shikimate biosynthesis

Modularity in “genomic association space”
Networks based on conserved gene neighborhood reveal ‘natural’ subsystems
[Network figure with two labelled groups, Shikimate pathway and Tryptophan synthesis pathway; node labels: tyrA, asd, aroB, truA, aroC, aroE, hemK, hyp, trpF, trpC, trpE, trpG, trpA, trpD, trpB, hyp, 2c-rr]
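One simple way to think about subsystems in such a network is as connected components of the neighborhood graph. The sketch below shows that idea only; the edges are invented for illustration and are not read off the figure, and connected components are a stand-in for whatever clustering was actually used.

```python
from collections import defaultdict

# Hypothetical neighborhood links between genes; NOT taken from the figure.
EDGES = [
    ("trpA", "trpB"),
    ("trpB", "trpC"),
    ("aroB", "aroC"),
    ("aroC", "aroE"),
]

def modules(edges):
    """Group genes into connected components of an undirected graph."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen = set()
    components = []
    for node in sorted(graph):
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:  # iterative depth-first traversal
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            component.add(current)
            stack.extend(graph[current] - seen)
        components.append(sorted(component))
    return components

print(modules(EDGES))
```

A single link between two groups would merge them into one component, which is the kind of cross-pathway connection the slides point out between the two pathways.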

Applications of interaction predictions: The Cellzome* yeast factory
Goal: Functional characterization of all multiprotein assemblies as fast as possible (2001), move to human
Methods: TAP tagging/co-purification + mass-spec
Results based on ca. 1400 human orthologues:
• 1700 genes in 230 complexes, ca. 130 of them novel
• 3885 interactions (involving 1995 genes) predicted based on genomic context, 27% overlap, complementary
*Proteomics company founded at EMBL in June 2000, currently >90 employees
Data provided by

Predicting candidate genes for genetically inherited diseases
• Association of genes to diseases
• Analysis of non-synonymous SNPs

Shamil Sunyaev

SNP data currently have the fastest growth rate. Integration with other data is the key to better understanding.

SNPs and mutations
90% of human genetic variation is due to single nucleotide polymorphisms (SNPs)
• mapping tool
• association with complex phenotypes (multifactorial diseases, drug responses, etc.)
• human evolution
Definitions:
Disease mutation: allele frequency usually <<1%
SNP: allele frequency >1%
cSNP: SNP in coding region
Nonsynonymous SNP: affects amino acid sequence
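The definitions above nest: a nonsynonymous SNP is a cSNP, which is a SNP. A minimal sketch of that hierarchy follows, assuming allele frequency is given as a fraction. The function name and the label "rare variant" for anything at or below the 1% threshold are illustrative and not from the slide.

```python
def classify_variant(allele_frequency, in_coding_region, changes_amino_acid):
    """Return the nested labels that apply to a single-nucleotide variant."""
    labels = []
    if allele_frequency > 0.01:  # slide definition: SNP has allele frequency >1%
        labels.append("SNP")
        if in_coding_region:
            labels.append("cSNP")
            if changes_amino_acid:
                labels.append("nonsynonymous SNP")
    else:
        # Illustrative label; disease mutations usually fall in this range.
        labels.append("rare variant")
    return labels

print(classify_variant(0.06, True, True))
print(classify_variant(0.002, True, True))
```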

ESTs reveal SNPs and alternative splice sites... but also lots of errors!!!
[Figure: ESTs (EST1 to EST6) aligned to an mRNA with 5’ UTR, coding region, and 3’ UTR; base labels shown: C, A, A, A, T, T. Panel labels: SNP prediction (>700 libraries!); Prediction of alternative splicing (many different tissues and age groups!)]
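Because EST data contain many errors, a candidate SNP is more credible when each allele is seen in more than one read. The sketch below illustrates that filter on reads that are assumed to be already aligned and of comparable coordinates; the reads, the `min_support` threshold, and the function name are hypothetical and do not describe the pipeline behind the slide.

```python
from collections import Counter

def candidate_snps(aligned_reads, min_support=2):
    """Return {column: {base: count}} where >= 2 bases each reach min_support."""
    if not aligned_reads:
        return {}
    length = max(len(read) for read in aligned_reads)
    hits = {}
    for col in range(length):
        counts = Counter(
            read[col]
            for read in aligned_reads
            if col < len(read) and read[col] in "ACGT"
        )
        supported = {base: n for base, n in counts.items() if n >= min_support}
        if len(supported) >= 2:
            hits[col] = supported
    return hits

# Hypothetical aligned ESTs: column 2 varies in several reads, column 5 in one.
READS = ["ACGTAC", "ACGTAC", "ACTTAC", "ACTTAC", "ACGTAG"]
print(candidate_snps(READS))
```

With these toy reads, the variation at column 2 is kept, while the single deviating base at column 5 is treated as a likely sequencing error.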

Mapping SNPs onto 3D: Identifying those that damage proteins. Rules taken from protein engineering and multiple sequence analysis.

Selected polymorphic sites mapped onto 3D. Minor allele frequency: High (≥5%), Low (<5%)

Selection of Mutations for 3D mapping
Data sources: SWISSPROT, OMIM, HGBASE, Chakravati WEB, HSSP
Filters:
• Keywords: ‘3D STRUCTURE’ and ‘DISEASE MUTATION’
• Keywords: ‘3D STRUCTURE’ and ‘POLYMORPHISM’ but not ‘DISEASE MUTATION’
• Allelic variants with frequency >1% in a pool of ‘normal’ individuals
• Blastx search against PDB
• Check all proteins identified in the resulting sets for close homologues in other species (>90% identity) and take mutations
Resulting 3 sets:
1. 551 disease mutations (baddies)
2. 86 allelic variants (‘don’t know’)
3. 225 and 261 neutral mutations between species (goodies) in proteins of sets 1 and 2, respectively

How many sites are in structurally and functionally “important” regions?
• Disease mutation sites (baddies): 90%
• Polymorphic sites (don’t know): 29%
• Interspecies mutations (goodies): 8%
‘Important’ = surface accessibility <10%, active site, S-S bond
Hence: Predicting phenotypic effects of cSNPs!
Sunyaev/Ramensky/Bork, Trends Genet. 16(00)191
Sunyaev/Ramensky/Koch/Lathe/Bork, unpubl.
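The ‘important’ criterion on this slide is a simple disjunction of three conditions, which can be written down directly. The sketch below assumes each site already carries a relative surface accessibility in percent plus two annotation flags; the `Site` class, field names, and example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Site:
    accessibility_percent: float   # relative surface accessibility of the residue
    in_active_site: bool = False
    in_disulfide_bond: bool = False

def is_important(site):
    """Slide rule: accessibility <10%, or active site, or S-S bond."""
    return (
        site.accessibility_percent < 10
        or site.in_active_site
        or site.in_disulfide_bond
    )

def fraction_important(sites):
    """Share of sites that meet the rule; 0.0 for an empty collection."""
    sites = list(sites)
    if not sites:
        return 0.0
    return sum(is_important(s) for s in sites) / len(sites)

# Hypothetical sites, for illustration only.
EXAMPLE = [
    Site(3.0),                           # buried
    Site(45.0, in_active_site=True),     # exposed but catalytic
    Site(60.0),                          # exposed, no annotation
    Site(30.0, in_disulfide_bond=True),  # part of an S-S bond
]
print(fraction_important(EXAMPLE))
```

Applying such a rule to each of the three mutation sets is what yields percentages of the kind quoted on the slide.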

Prediction of risk factors
Of 36 SNPs with predicted phenotypic effects (from a well-characterized SNP pool), 5 are already known to be disease-associated:

Gene                            Disease risk                                    Frequency   Mutation effect
HFE                             hemochromatosis                                 6%          destroyed SS-bond
Fructose-biphosphate aldolase   fructose intolerance                            >1%         destroyed core
NAD(P)H dehydrogenase           benzene toxicity (post-chemotherapy leukemia)   4-20%       unfavorable substitution
α-1-antichymotrypsin            familial obstructive lung disease               >1%         destroyed core
α-1-antitrypsin                 emphysema                                       2-4%        destroyed core

Structural Biology and Bioinformatics
• Target prediction for structural genomics
• Zooming out: Protein interactions
• Zooming in: SNPs and 3D structures

Credits, g2D: Carolina Perez, Miguel Andrade

Literature mining for associating genotypes to phenotypes
Data sources: RefSeq (10 329 sequences), MEDLINE (10 725 796 articles)
Vocabularies, in slide order: MeSH C (phenotype), MeSH D (chemistry), Gene Ontology (gene biochemistry)
Term counts, in slide order: 6 992 terms, 5 070 terms, 2 379 terms
Term pairs linked through articles, in slide order: 6 023 924 pairs, 98 969 pairs

Phenotype (MeSH C): Acidosis, Renal Tubular; Acidosis; Hypokalemia; Nephrocalcinosis; Sjogren’s Syndrome; Alkalosis; Kidney Diseases; Kidney Failure, Chronic; Nephritis, Interstitial; Fanconi Syndrome; …
GO (Gene Ontology): Carbonate dehydratase; Hydrogen-transporting ATP synthase; Hydrogen/potassium-exchanging ATPase; Hydrogen-transporting two-sector ATPase; Proton transport; Vacuolar hydrogen-transporting ATPase (synonym: VATPase); Pyruvate carboxylase; Aminobutyrate catabolism; Succinate-semialdehyde dehydrogenase; …
Other labels on the slide: MeSH D; I; II; 7q33-q34; MEDLINE; RefSeq; LocusLink; Golden Path

Association to Craniofrontonasal dysplasia
MeSH C (symptoms, manifestations), with counts in brackets: Craniosynostoses [15]; Craniofacial Dysostosis [7]; Hypertelorism [6]; Mental Retardation [4]; Bone Diseases, Developmental [3]
MeSH D (chemicals, proteins, drugs): Receptors, Fibroblast Growth Factor; Fibroblast Growth Factor; Bone morphogenetic proteins; Collagen; Keratan sulfate; DNA probes; Chondroitin etc...
GO (functions), with scores:
• 0.0241 FGF receptor signaling pathway (process)
• 0.0130 fibroblast growth factor receptor (function)
• 0.0061 MAPKKK cascade (process)
• 0.0011 skeletal development (process)
• 0.0001 integral plasma membrane protein (component)
• 0.0000 signal transduction (process)
Link weights shown between terms (their endpoints are not recoverable from the text): 0.4615, 0.2500, 0.1176, 0.0588, 0.0905, 0.0322, 0.0285, 0.0092, 0.0526, 0.0058, 0.0215, 0.0014, 0.0032, 0.0010, 0.0017, 0.0531, 0.0546, 0.0119, 0.0434, 0.0052, 0.0112, 0.0153, 0.0109, 0.0092, 0.0075, 0.0046

From homology to disease association
chromosome X, band Xp22
RefSeq hits with their scores, each followed by the GO-scores of its annotated terms:
0.0123 NP_002002 fibroblast growth factor receptor 4, isoform 1 precursor - Human
    0.0130 fibroblast growth factor receptor (function)
    0.0241 FGF receptor signaling pathway (process)
    0.0000 integral plasma membrane protein (component)
0.0083 NP_006644 suc1-associated neurotrophic factor target 2 - Human
    0.0000 signal transduction (process)
    0.0241 FGF receptor signaling pathway (process)
    0.0009 peripheral plasma membrane protein (component)
0.0075 NP_000595 fibroblast growth factor receptor 1, isoform 1 precursor - Human
    0.0130 fibroblast growth factor receptor (function)
    0.0061 MAPKKK cascade (process)
    0.0011 skeletal development (process)
    0.0007 oncogenesis (process)
    0.0241 FGF receptor signaling pathway (process)
    0.0000 integral plasma membrane protein (component)
0.0026 NP_034336 fibroblast growth factor receptor 1 - Mouse
    0.0000 ATP binding (function)
    0.0000 membrane fraction (component)
    0.0000 signal transduction (process)
    0.0000 protein tyrosine kinase (function)
    0.0130 fibroblast growth factor receptor (function)
...
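The sequence scores listed on this slide agree, to within the displayed precision, with the arithmetic mean of the GO-scores shown for each sequence. That is an observation about the numbers on the slide, not a confirmed description of the scoring method. The sketch below checks it and ranks the four sequences accordingly; all values are copied from the slide, and the variable and function names are illustrative.

```python
# GO-scores per RefSeq accession, copied from the slide.
GO_SCORES = {
    "NP_002002": [0.0130, 0.0241, 0.0000],
    "NP_006644": [0.0000, 0.0241, 0.0009],
    "NP_000595": [0.0130, 0.0061, 0.0011, 0.0007, 0.0241, 0.0000],
    "NP_034336": [0.0000, 0.0000, 0.0000, 0.0000, 0.0130],
}

# Sequence scores as displayed on the slide.
LISTED = {
    "NP_002002": 0.0123,
    "NP_006644": 0.0083,
    "NP_000595": 0.0075,
    "NP_034336": 0.0026,
}

def mean_score(scores):
    """Arithmetic mean of a sequence's GO-scores (assumed aggregation)."""
    return sum(scores) / len(scores)

ranked = sorted(GO_SCORES, key=lambda acc: mean_score(GO_SCORES[acc]), reverse=True)
for accession in ranked:
    computed = mean_score(GO_SCORES[accession])
    print(accession, f"computed={computed:.4f}", f"listed={LISTED[accession]:.4f}")
```

Averaging means that a sequence annotated with many low-scoring terms can rank below one with few but relevant terms, which matches the order of the two FGF receptor 1 entries on the slide.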

Benchmark of 100 disease genes: score correlates with prediction accuracy
[Plot; axis labels: Rank of true gene, Log R-score; series labels: bench 100, not annotated, bench 10]
