
Comprehensive strategy for integrated target selection in structural genomics

Burkhard Rost, CUBIC, Columbia University http://cubic.bioc.columbia.edu/mis/talks/ http://cubic.bioc.columbia.edu


Presentation Transcript


  1. Burkhard Rost, CUBIC, Columbia University http://cubic.bioc.columbia.edu/mis/talks/ http://cubic.bioc.columbia.edu Comprehensive strategy for integrated target selection in structural genomics

  2. Comprehensive strategy for integrated target selection • Our research goal and current reality • Unit: sequence-structure families • Goal: cover entire families with good models • STAGE 1: CHOP + CLUP + filtering -> novel automatic organization of sequence-structure space • STAGE 2: Refined, manual selection -> model all family members? stop-work/hold-work? • STAGE 3: Explore experimental structure • Answers and perspectives • How many structures needed for completion? • Euka-proka-archae: overlap? • Why collaborate on targets? • Multiplexing helpful? • High-throughput protein production in eukaryotes?

  3. Computational biology & bioinformatics

  4. Sequence-structure family Sequence-structure family U Sequence-structure family U’

  5. EVA: comparative modelling [Figure: cumulative distribution; axis labels: Accuracy, Coverage; PSI-BLAST 10^-3] Marc Marti Renom & Andrej Sali (UCSF) http://eva.compbio.ucsf.edu/~eva/cm/ http://cubic.bioc.columbia.edu/eva V Eyrich, MA Marti-Renom, D Przybylski, A Fiser, F Pazos, A Valencia, A Sali & B Rost (2001) Bioinformatics 17, 1242-1243; MA Marti-Renom, MS Madhusudhan, A Fiser, B Rost, A Sali (2002) Structure 10, 435-440

  6. How to decide when we exclude/include? C Sander & R Schneider 1991 Proteins, 9, 56-68; B Rost 1999 Prot Engng, 12, 85-94
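Slide 6 cites the two HSSP-curve papers without reproducing the curve itself. The sketch below shows one way the length-dependent identity threshold from Rost 1999 could be coded; the constants are written from memory of that paper and should be checked against it, and the function names `hssp_threshold` and `hssp_distance` are hypothetical. The "HSSP-distance" used on a later slide is taken here to mean the observed percentage identity minus the curve value at the alignment length.

```python
import math


def hssp_threshold(length: int, n: float = 0.0) -> float:
    """Percentage identity on the HSSP curve for a given alignment length.

    Constants as recalled from Rost (1999); treat them as an assumption.
    `n` shifts the curve up or down (n = 0 is the curve itself).
    """
    if length <= 11:
        base = 100.0
    elif length <= 450:
        base = 480.0 * length ** (-0.32 * (1.0 + math.exp(-length / 1000.0)))
    else:
        base = 19.5
    return n + base


def hssp_distance(percent_identity: float, length: int) -> float:
    """Distance of an alignment from the curve (positive = above the curve)."""
    return percent_identity - hssp_threshold(length)


if __name__ == "__main__":
    for length in (50, 100, 200, 450):
        print(length, round(hssp_threshold(length), 1))
```

Under this reading, short alignments need much higher identity than long ones before a pair is treated as belonging to the same sequence-structure family.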

  7. Scooping families from proteomes, in practice Problems: • domains • overlaps

  8. Choose targets: single-linkage clustering • ~100,000 eukaryotic proteins (yeast, fly, worm, weed, human) • 22 112 clusters • 46 318 in largest cluster: NONSENSE! • Conclusions: • NO clustering of full-length proteins • have to chop into structural-domain-like fragments (single-linkage DOES work on PrISM) • Liu, Hegyi, Acton, Montelione & Rost 2003 Proteins, in press; Liu & Rost 2003 Proteins, submitted
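The failure mode on slide 8 can be illustrated with a toy example: under single linkage, one multi-domain protein is enough to chain two unrelated families into a single cluster. The sketch below uses a small union-find; the protein names, domain labels, and the `single_linkage` helper are all hypothetical and are not the NESG pipeline.

```python
def single_linkage(items, links):
    """Group items so that any two linked items end up in the same cluster."""
    parent = {x: x for x in items}

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for x in items:
        clusters.setdefault(find(x), set()).add(x)
    return sorted(clusters.values(), key=len, reverse=True)


# Full-length proteins: P2 carries domains A and B, so it links to both
# P1 (domain A only) and P3 (domain B only).
full_length = single_linkage(
    ["P1", "P2", "P3"],
    [("P1", "P2"), ("P2", "P3")],
)

# After chopping into domain-like fragments, the A and B families separate.
fragments = single_linkage(
    ["P1_A", "P2_A", "P2_B", "P3_B"],
    [("P1_A", "P2_A"), ("P2_B", "P3_B")],
)

if __name__ == "__main__":
    print(len(full_length), "cluster(s) for full-length proteins")
    print(len(fragments), "cluster(s) for fragments")
```

With full-length proteins everything collapses into one cluster; with fragments the two domain families stay apart, which is the motivation for chopping before clustering.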

  9. CHOP proteins into structural domains Liu & Rost 2003 Proteins, submitted

  10. CHOP: dissection of proteins into domains • Average domain length: in proteins with ≥ 2 domains: ~100 residues; in proteins with 1 domain: 1.7-3 times longer • Single-domain proteins: 61% in PDB, 28% in 62 proteomes • Liu, Hegyi, Acton, Montelione & Rost 2003 Proteins, in press; Liu & Rost 2003 Proteins, submitted

  11. To take or not to take Take if > 50 globular residues and no known 3D
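The rule on slide 11 is simple enough to write as a predicate. This is a minimal sketch; how "globular residues" and "known 3D" are actually determined is not stated on the slide, so both are passed in as precomputed values, and the function name is hypothetical.

```python
def take_fragment(globular_residues: int, has_known_3d: bool) -> bool:
    """Slide 11 rule: take if > 50 globular residues and no known 3D structure."""
    return globular_residues > 50 and not has_known_3d


if __name__ == "__main__":
    print(take_fragment(120, has_known_3d=False))  # long, no structure
    print(take_fragment(50, has_known_3d=False))   # exactly 50 is not > 50
    print(take_fragment(120, has_known_3d=True))   # structure already known
```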

  12. Structural residue coverage in reality (any) 53% of residues to do ! ~28% ~19% J Liu & B Rost 2002 Bioinformatics, 18, 922-933

  13. If you believe 53% is pessimistic ... 53% residue coverage today based on E-value 1!!

  14. Clustering after CHOP: 21,000 fragment clusters (Jinfeng) • 103 796 eukaryotic proteins (Yeast, Fly, Worm, Arabidopsis, Human/30) -> 247 222 domain-like fragments -> 167 717 no PDB (E-value 10^-1, HSSP-distance -3) • 44 718 not good 4 us (membrane, coil, SEG, NORS, signal peptide) • 122 999 2 go, 95 330 non-singleton • Liu, Montelione & Rost 2003 Proteins, in press
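The counts on slide 14 form a filtering funnel, and the arithmetic can be checked directly. The sketch below only uses numbers from the slide; reading 95 330 as a subset of the 122 999 remaining fragments is an assumption, and the variable names are hypothetical.

```python
fragments = 247_222      # domain-like fragments after CHOP
no_pdb = 167_717         # fragments without a PDB match
not_suitable = 44_718    # membrane, coil, SEG, NORS, signal peptide
non_singleton = 95_330   # assumed to be a subset of the remaining fragments

# Fragments left after removing the unsuitable ones.
remaining = no_pdb - not_suitable

share_no_pdb = 100 * no_pdb / fragments
share_remaining = 100 * remaining / fragments
share_non_singleton = 100 * non_singleton / remaining

if __name__ == "__main__":
    print(f"remaining fragments: {remaining}")
    print(f"no PDB match: {share_no_pdb:.1f}% of all fragments")
    print(f"still to go: {share_remaining:.1f}% of all fragments")
    print(f"non-singleton: {share_non_singleton:.1f}% of remaining")
```

The subtraction reproduces the 122 999 figure on the slide, so the three counts are mutually consistent.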

  15. Computational biology & bioinformatics

  16. Main goal of Stage 2 analysis (Diana Murray, Cornell) • Refine Stage 1 automatic target selection through manual sequence analysis • Concept: USE comparative modeling and structural features directly for refined target selection • For each sequence-structure family from Stage 1: predict the minimal set of exp. structures needed to model the entire family at high quality.
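Finding a minimal set of experimental structures that covers a family is, in the abstract, a set-cover problem. Stage 2 as described is a manual analysis, so the greedy heuristic below is only an illustration of the goal, not the authors' method. The family members are the target IDs from the slide 19 example; which candidate can model which members is a hypothetical input chosen to match that example.

```python
def greedy_structure_set(members, can_model):
    """Pick structures greedily until every family member is modeled.

    members:   iterable of target IDs in the family
    can_model: dict mapping a candidate structure to the set of members
               it is predicted to model well (hypothetical input)
    Returns (chosen structures, members left uncovered).
    """
    uncovered = set(members)
    chosen = []
    while uncovered:
        # Sorting the candidates makes tie-breaking deterministic.
        best = max(sorted(can_model), key=lambda t: len(can_model[t] & uncovered))
        gained = can_model[best] & uncovered
        if not gained:
            break  # no candidate covers anything that is still uncovered
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


family = ["HR291", "AR1731", "HR2295", "KR12", "DR11"]
candidates = {
    "HR291": {"HR291", "AR1731", "HR2295"},  # cluster A in the example
    "KR12": {"KR12", "DR11"},                # cluster B in the example
    "DR11": {"DR11"},                        # hypothetical weaker candidate
}
chosen, uncovered = greedy_structure_set(family, candidates)

if __name__ == "__main__":
    print("structures needed:", chosen)
    print("left uncovered:", sorted(uncovered))
```

Greedy selection is not guaranteed to be minimal in general, but for small families it gives a quick estimate of how many structures are needed.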

  17. Refinement protocol 4 new 3D • Target re-prioritization based on weekly PDB updates (Diana Murray, Cornell) • Toolbox: Input: PDB + NESG cluster 1. Fold recognition and sequence-to-structure profiles 2. Comparative modeling (PrISM, Nest) 3. Structure evaluation tools (e.g. Verify3d) 4. Calculate biophysical properties • Recommend 2 do additional structure if: 1) NESG-cluster members poorly modeled 2) Biophysical properties of models incompatible with known function 3) Models suggest novel functionality
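The three conditions on slide 17 amount to an "any of" rule. A minimal sketch follows, assuming each condition has already been assessed (manually or otherwise) and is supplied as a boolean; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClusterAssessment:
    """Outcome of the Stage 2 checks for one NESG cluster (hypothetical type)."""
    members_poorly_modeled: bool
    properties_incompatible_with_function: bool
    models_suggest_novel_function: bool


def reasons_for_additional_structure(a: ClusterAssessment) -> list:
    """Return the slide 17 conditions that hold; empty list means no recommendation."""
    reasons = []
    if a.members_poorly_modeled:
        reasons.append("cluster members poorly modeled")
    if a.properties_incompatible_with_function:
        reasons.append("biophysical properties incompatible with known function")
    if a.models_suggest_novel_function:
        reasons.append("models suggest novel functionality")
    return reasons


if __name__ == "__main__":
    assessment = ClusterAssessment(False, False, True)
    reasons = reasons_for_additional_structure(assessment)
    print("recommend additional structure:", bool(reasons), reasons)
```

Returning the reasons rather than a bare boolean keeps the justification attached to each recommendation.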

  18. Example of stop-work recommendation (Diana Murray, Cornell) • Target status: IR21 solved (PDB: 1MOS); ET28 Purified; JR15 Expressed; TT777 Expressed; GR7 Expressed; AR12 Cloned; WR204 Selected; XR4 Expressed • Experimental structure of IR21 yielded high-quality models for all members of this NESG sequence/structure family -> stop work • SPINE/ZebaView

  19. Two structures required to cover family: predicted by Stage 2 analysis and verified by Stage 3 analysis (Diana Murray, Cornell) • NESG family: HR291 (99% identical to 1P9O), AR1731, HR2295, KR12, DR11 • breaks into two clusters: A = (HR291, AR1731, HR2295) and B = (KR12, DR11) • Recommendation: solve structure of KR12 (purified)

  20. Model suggests novel function: 30S ribosomal protein S27 (Diana Murray, Cornell) • [Figure panels: archaeal structure; model for human homologue] • NESG ID: GR2; PDB ID: 1QXF, Archaeoglobus fulgidus • S27e protein has only archaeal and eukaryotic members. • Archaea and eukaryotes share a conserved hydrophobic motif (yellow). • Only eukaryotes have an N-terminal extension, and their models have strikingly different electrostatic properties. • Human protein recommended for structure determination!

  21. Summary Stage 2 refinement (Diana Murray, Cornell) • Statistics: • Many families currently under investigation • Hold-work recommendation: • family member at advanced experimental stage • predicted to yield good models for entire family -> hold-work for members at early exp. stages; re-assess once structure done!
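The hold-work rule on slide 21 can be sketched as a small function over target statuses. The status names come from the slide 18 example; their ordering, the choice of "Purified" as the start of the "advanced" stage, and the function name are assumptions made for illustration only.

```python
# Assumed ordering of experimental stages, earliest first.
STAGES = ["Selected", "Cloned", "Expressed", "Purified", "Solved"]


def hold_work(targets, predicted_good_models, advanced_from="Purified"):
    """Return the family members to put on hold, sorted by ID.

    targets: dict mapping target ID to its current stage
    predicted_good_models: True if the leading member's structure is
        predicted to yield good models for the entire family
    """
    rank = {stage: i for i, stage in enumerate(STAGES)}
    cutoff = rank[advanced_from]
    has_leader = any(rank[s] >= cutoff for s in targets.values())
    if not (has_leader and predicted_good_models):
        return []
    return sorted(t for t, s in targets.items() if rank[s] < cutoff)


family_status = {
    "IR21": "Solved", "ET28": "Purified", "JR15": "Expressed",
    "TT777": "Expressed", "GR7": "Expressed", "AR12": "Cloned",
    "WR204": "Selected", "XR4": "Expressed",
}
on_hold = hold_work(family_status, predicted_good_models=True)

if __name__ == "__main__":
    print("hold work on:", on_hold)
```

Once the leading structure is done, the family would be re-assessed, as the slide notes, so a hold is a temporary state rather than a final decision.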

  22. Computational biology & bioinformatics

  23. Exploit structure to speculate about function (Sharon Goldsmith & Barry Honig) • 43 no previous annotation about function (defined by ‘no publication in biological journal’); 39 analyzed • 31 result in some predictions about function • 8 clear success: functional annotation achieved, e.g. predicted active site based on structure; typically: confirmation of annotation transfer • 23 some hints (16 ‘hypothetical proteins’), e.g. some clue about active site; mostly completely new! • 8 no clue

  24. Answers • How many structures needed for completion? • Euka-proka-archae: overlap? • Why collaborate on targets? • Multiplexing helpful? • High-throughput protein production in eukaryotes?

  25. How many targets for prokaryotes + archae? 16,000 min 8,000 give: 72% fragments 72% proteins 67% residues

  26. How many targets for euka-proka-archae? 8,000 8,000 give: 67% fragments 67% proteins 59% residues BUT: 50% of residues remaining

  27. Overlap between euka-proka-archae? • ~60% of fragments from eukaryotes no sequence-structure family member from prokaryotes or archae • much higher for ‘largest 8,000’: • 2,690 (34%) proka+archae only • 4,277 (53%) euka only • 1,033 (13%) mix • surprisingly small overlap overall • even lower for largest families • most big families are eukaryotic!
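The percentages on slide 27 follow directly from the three counts, which sum to the 8,000 largest families. A short check, using only numbers from the slide (the dictionary keys are shorthand labels):

```python
counts = {
    "proka+archae only": 2690,
    "euka only": 4277,
    "mix": 1033,
}

total = sum(counts.values())

# Share of each class among the largest families, rounded to whole percent.
shares = {label: round(100 * n / total) for label, n in counts.items()}

if __name__ == "__main__":
    print("total families:", total)
    for label, pct in shares.items():
        print(f"{label}: {pct}%")
```

Only the "mix" class contains members from both eukaryotes and prokaryotes or archaea, which is the small overlap the slide points out.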

  28. Why collaborate on target list? 32% overlap competition between consortia has already hampered success-rate considerably!

  29. Does multiplexing help? Date: 2003-07-28 ~4% Multiplex DOUBLES success rate!

  30. Integrated strategy • NESG unique, comprehensive, integrated strategy optimized to organize sequence space in structural terms: • Stage 1: CHOP+CLUP+filter yields high success in focusing on sequence-structure families • Stage 2: detailed refinement embeds comparative models into selection and optimizes structural coverage for family • Stage 3: use experimental structure to increase structural family coverage and to allow functional exploitation • Needed to do ‘em all: • ~38,000 non-singletons • 8,000 largest -> 50% of the residues that remain! • Genomics: Surprises + our structural perspective changed the ‘world’! The revolutions continue ...

  31. Thanksgiving • Data: Jinfeng Liu (CUBIC), Hedi Hegyi & Phil Carter (CUBIC), Marc Marti-Renom (UCSF) • NESG: Guy Montelione (Rutgers), Barry Honig (Columbia), Diana Murray (Cornell, NYC), Tom Acton (Rutgers), Liang Tong & John Hunt (Columbia), George DeTitta (Buffalo), Cheryl Arrowsmith (Toronto), Wayne Hendrickson (Columbia) • EVA: Andrej Sali & Marc Marti-Renom (UCSF), Alfonso Valencia (Madrid), Volker Eyrich, Ingrid Koh & Dariusz Przybylski (CUBIC) • $$: NIH/NSF
