Case Study #4: The Encyclopedia of Life: A Toolkit for High-Throughput Proteomics. Mark A. Miller, Principal Investigator, Integrative BioSciences.
Output from Genome Sequencing Project:
BASE COUNT      531 a    506 c    464 g    496 t
ORIGIN
        1 aaacacaact ggtatctctt ccaaggcttg acagaccatt tcgctgtcat ggctctcatc
       61 accagtttgc aggatgtccg gctcgacatg ctggcatatt ttgttgcctt tctcgtcgta
      121 gtatccgtcg tacgaaagaa gctggccccg caacccagcg catacttgct caatcccaga
      181 cgctggtacg agtttaccga tgctcgtgca gtctcagaag tccttcacac cacccgccaa
      241 accctcgaag aatggttcca caagcaccca acaacccctg tccgcctgac aaccgatttc
      301 ggtgaaatga cctttttgcc tcccactctg gccgatgaaa tcaagagtga taagcgtctc
      361 agcttcatca aggcagctaa tgattcggta tgtggaaccc gttaaataca tggcaggccc
      421 atatagctaa acaaccaata ggccttccac actgaaatcc ccggttttga gcctttccgc
      481 gagggcggaa gaaatgaggc agcactgatc aaggaggtta ttcacggtca attgaagaaa
      541 actctgagta agccgggaac ccatccagat tataacatgc catgtcgatg ctaattgttt
      601 ttgtttagac aagatgacct ttccattggc tcaagaaacc cagctggctg ttgaacacta
      661 cctcggtgct aacaagggta aggaccatcg cgacacgttg ttgactttaa gctgattgtc
      721 tatagaatgg cacaagattc gactcagaga cgcactgcta cccctggtca ctagaatctc
      781 aacacgtatc tttttgggtg aagatctatg ccaaaatgac aaatggatta gcatcacttc
      841 ggaatacgct gccaacagtc tcgaggtcgc aaaccgcctg cgcgtctggc ccaagtacat
      901 gcgttacgtc gtttcatact tctctccagg atgcggaatt ctacgaaacc aggtcaagaa
      961 tgctcgcgaa ctaatcactc ccattgttga acgccgtcga tccgaggaaa agggtaagga
     1021 atacaatgat tctctgggct ggtttgagaa gactgccaaa caagcgtaca accctgctgc
     1081 tacccaacta ttcctttctg ctgtatctgt ccacaccacc accgatctca tctgccaatg
     1141 tttggaagat attgccgctc accctgaaat catcaagccc ctgcaggaag agatcaggag
     1201 agttattgcc gaagaagggt ggaacacgaa ggctatgtac aagatgttcc ttctcgacag
     1261 cgtatttaag gaaacccaac gattgaaacc cattcaagtt ggtaagttga acaattatcg
     1321 cttttgtaaa gtcgcttgct tacctcaaat cagcttcaat ggtgcgagaa gcgcagtccg
     1381 acatcacact
Goal of Genome Sequencing Project: 30,000 10^7 [slide background: the sequence record shown above]
Goal of EOL Project: [slide background: the sequence record shown above] Software pipeline
The EOL project has three goals:
• Putative functional and 3-D structure assignment through a large computation
• Integration of annotations with output from key biological resources
• Innovative information delivery systems, including a peer-to-peer EOL Notebook
ENCYCLOPEDIA of LIFE: a collaborative project based at the San Diego Supercomputer Center (SDSC), open to scientists and biological resources worldwide. http://eol.sdsc.edu
Step 1. Create a genome annotation pipeline
Individual genome sequences are run through annotation tools from the community (Tool 1, Tool 2, Tool 3, Tool 4), and each tool produces its own output.
The EOL project 1) incorporates the best sequence analysis tools, 2) analyses all genomes through an automated annotation process, and 3) uses web tools to serve the results to the community.
Features:
• annotation of all genomes by automated program portfolio
• all runs stored in federated database
• federation of local and public databases at API level
• results served via SOAP server
• interface facilitates novel queries
• interface facilitates data management and exchange
[Diagram labels: ALL genome sequences; annotation tools from the community (Tool 1, Tool 2, Tool 3, Tool 4); International Data Providers; Federated DATABASES; SOAP Services; P2P; HTTP Pages; Third party providers]
Annotation Strategy
Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline
High cpu requirements; so far “embarrassingly parallel”
• Show value in the annotation pipeline in a manual 1 genome run
• Port the pipeline to local and partner resources
• Run the pipeline remotely on distributed local resources. APST: Globus, Condor friendly, but also Globus, Condor independent. Running on EOL Cluster, Sun Ultra, 4 Sun E10’s.
• Run the pipeline remotely on partner resources using APST
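"Embarrassingly parallel" here means that each sequence can be annotated without waiting on any other, which is what makes it practical to spread the work over many machines. The following is a minimal sketch of that property only: it uses Python's standard thread pool as a stand-in for distributed hosts, and `annotate` is a hypothetical placeholder for one pipeline run. It does not use or imitate APST, Globus, or Condor.

```python
from concurrent.futures import ThreadPoolExecutor


def annotate(sequence: str) -> dict:
    """Hypothetical stand-in for one pipeline run on a single sequence."""
    gc = sum(sequence.count(base) for base in "gc")
    return {"length": len(sequence), "gc": gc}


def annotate_all(sequences: list[str], workers: int = 4) -> list[dict]:
    """Annotate every sequence independently; results keep the input order."""
    # No task reads another task's result, so they can run in any order
    # and on any worker.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(annotate, sequences))


results = annotate_all(["aaacacaact", "ggtatctctt", "ccaaggcttg"])
print(results)
```

Because the tasks share no state, adding workers changes only the elapsed time and not the results, which is why the same job list can be staged to local or partner resources.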
The iGAP Pipeline Today
Input: Arabidopsis protein sequences
• structure info: SCOP, PDB
• sequence info: NR, PFAM
Prediction of:
• signal peptides (SignalP, PSORT)
• transmembrane (TMHMM, PSORT)
• coiled coils (COILS)
• low complexity regions (SEG)
Building FOLDLIB:
• PDB chains, SCOP domains, PDP domains, CE matches PDB vs. SCOP
• 90% sequence non-identical
• minimum size 25 aa
• coverage (90%, gaps <30, ends <30)
Pipeline steps:
• Create PSI-BLAST profiles for protein sequences
• Structural assignment of domains by PSI-BLAST on FOLDLIB
• Structural assignment of domains by 123D on FOLDLIB (only sequences w/out A-prediction)
• Functional assignment by PFAM, NR, PSI-Pred assignments (only sequences w/out A-prediction)
• Domain location prediction by sequence
• Store assigned regions in DB
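The repeated note "only sequences w/out A-prediction" suggests a cascade in which a later, more expensive method sees only the sequences that earlier methods left unassigned. The slide does not define "A-prediction", so that reading is an assumption. A minimal sketch of such a cascade follows; the stage functions are hypothetical toys and do not call PSI-BLAST, 123D, or any other real tool.

```python
from typing import Callable, Optional

# A stage returns an assignment label, or None when it finds nothing.
Stage = Callable[[str], Optional[str]]


def run_cascade(sequences: list[str], stages: list[tuple[str, Stage]]):
    """Pass sequences through the stages in order; stop at the first hit."""
    assigned: dict[str, tuple[str, str]] = {}
    remaining = list(sequences)
    for name, stage in stages:
        still_unassigned = []
        for seq in remaining:
            label = stage(seq)
            if label is None:
                still_unassigned.append(seq)  # goes on to the next stage
            else:
                assigned[seq] = (name, label)
        remaining = still_unassigned
    return assigned, remaining


# Hypothetical toy stages, for illustration only.
def toy_profile_search(seq: str) -> Optional[str]:
    return "fold-A" if seq.startswith("MK") else None


def toy_threading(seq: str) -> Optional[str]:
    return "fold-B" if len(seq) >= 25 else None


stages = [("profile search", toy_profile_search), ("threading", toy_threading)]
assigned, unassigned = run_cascade(["MKTAYIAK", "G" * 30, "ACD"], stages)
print(assigned)
print(unassigned)
```

Ordering the stages from cheapest to most expensive means the costly methods run on the smallest possible set of sequences.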
Where are we now?
Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline
High cpu requirements; so far embarrassingly parallel
• Current pipeline rate: about 1 cpu hour/sequence
• ~800 genomes (and growing) = ~10^7 ORFs (and growing)
• ~10^7 cpu hours* (> 1000 cpu years, and growing)
• Allocated DataStar cpu hours in an NRAC year: 4 × 10^6
*for one pass through the pipeline!
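The arithmetic behind this slide can be checked directly. The sketch below reads the slide's figures as powers of ten (about 10^7 ORFs, 4 × 10^6 allocated cpu hours per year) and assumes a 365-day year; the variable names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years

orfs = 1e7                      # ~10^7 ORFs across ~800 genomes
cpu_hours_per_sequence = 1.0    # "about 1 cpu hour/sequence"
allocated_hours_per_year = 4e6  # allocation for one NRAC year

# One pass through the pipeline.
cpu_hours = orfs * cpu_hours_per_sequence
cpu_years = cpu_hours / HOURS_PER_YEAR
allocation_years = cpu_hours / allocated_hours_per_year

print(f"{cpu_hours:.0e} cpu hours is about {cpu_years:.0f} cpu years")
print(f"one pass needs {allocation_years:.1f} years of the allocation")
```

Under these assumptions one pass comes to roughly 1,140 cpu years, consistent with the slide's "> 1000 cpu years", and to 2.5 times the yearly allocation, which is the gap that motivates running on partner resources.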
Annotation Strategy
Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline
High cpu requirements; so far embarrassingly parallel
Run the pipeline remotely on partner resources using APST. In production:
• SDSC
• NPACI Grid
• BII, Singapore
• Monash University, Australia
• TITEC, Japan
• Belfast E-Science Center, Ireland
• UFCG, Brazil
• University of Michigan, USA
• Teragrid, USA
Annotation Strategy
[Architecture diagram. Labels: User; USER Interface; Tomcat; Struts; Axis; Web Services; Workflow Manager; APST; Job Status Database; home resource dagger.sdsc.edu; distributed resources, each receiving staged jobs]
OK, we CAN do this annotation, but should we?
• The biologists are investigating first-pass annotations of the human and fly genomes.
• Before embarking upon a calculation of this magnitude, we must show a clear value in the results!
In the meantime...
• First, we can allow investigators to deploy the pipeline concept.
• We can make our IT accomplishments valuable by putting them in the hands of investigators.
Make the tools generic: The original EOL pipeline consisted of a 2000-line Bourne shell script. It was rigid, fragile, and cumbersome.
Kepler provides a flexible environment for workflow creation
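One way to see what a workflow environment buys over a monolithic shell script is to declare the steps and their dependencies as data and let a scheduler derive the order. The sketch below shows that general idea with Python's standard `graphlib` module; it is not Kepler's API, and the step names are hypothetical labels loosely based on the pipeline stages.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it starts.
# Step names are hypothetical, chosen only to mirror the pipeline stages.
workflow = {
    "predict_features": set(),
    "build_profiles": set(),
    "assign_structure": {"build_profiles"},
    "assign_function": {"assign_structure", "predict_features"},
    "store_results": {"assign_function"},
}


def execution_order(steps: dict[str, set[str]]) -> list[str]:
    """Return one valid run order that respects every dependency."""
    return list(TopologicalSorter(steps).static_order())


order = execution_order(workflow)
print(order)
```

Adding, removing, or reordering a tool then means editing one entry in the table rather than rewriting control flow spread through a long script.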
Step 2. Create a federated database from a collection of independent data projects…
Data Sources:
• Protein Data Bank (15,000 structures)
• Protein Kinase Resource (12,000 kinases)
• Joint Center for Structural Genomics (4,000 targets)
• Encyclopedia of Life (55 genomes annotated)
Portals: ADIT Interface, CE Portal, Tracking/Analysis Portal, Data Entry/Analysis Portal
Web Services: SOAP
[Diagram labels: Encyclopedia of Life; PDB; JCSG; Protein Kinase Resource; database engines DB2, DB2, Oracle, MySQL]
Unite databases with Information Integrator middleware…
Phase II: Interface allows scientists access to all federated databases
Data Sources:
• Encyclopedia of Life (55 annotated genomes)
• Small Molecule QM Database (all PDB molecules)
• Alliance for Cell Signaling (3,000 mol. pages)
• Protein Kinase Resource (12,000 kinases)
• Joint Center for Structural Genomics (4,000 targets)
• Protein Data Bank (15,000 structures)
Portals: CE Portal, GAMESS Portal, Encyclopedia of Life Notebook
Web Services: SOAP
[Diagram layers: Portals; Interface (query interface); Federation; Data Sources]
…6 databases can be queried by the entire community through a user interface, and portals will create new functionalities for data manipulation.
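The point of federation is that a user asks one question and the interface fans it out to every independent source. A minimal sketch of that pattern follows, assuming each source exposes the same tiny table; it uses in-memory SQLite databases as stand-ins for the DB2, Oracle, and MySQL back ends, the source names and rows are made up, and nothing here reflects the Information Integrator product's own API.

```python
import sqlite3


def make_source(rows: list[tuple[str, str]]) -> sqlite3.Connection:
    """Build one independent in-memory database with a shared toy schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entries (protein_id TEXT, note TEXT)")
    conn.executemany("INSERT INTO entries VALUES (?, ?)", rows)
    return conn


# Hypothetical sources and rows, for illustration only.
sources = {
    "structures": make_source([("P1", "solved structure")]),
    "kinases": make_source([("P1", "kinase family entry"),
                            ("P2", "kinase family entry")]),
    "annotations": make_source([("P2", "predicted fold")]),
}


def federated_lookup(protein_id: str) -> list[tuple[str, str]]:
    """Ask every source the same question and merge the answers."""
    hits = []
    for name, conn in sources.items():
        cursor = conn.execute(
            "SELECT note FROM entries WHERE protein_id = ?", (protein_id,)
        )
        hits.extend((name, note) for (note,) in cursor)
    return hits


print(federated_lookup("P1"))
```

Each source keeps its own storage and ownership; only the lookup layer knows that more than one database exists.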
Summary
• The EOL project is a community-based effort to serve the academic community.
• EOL has created tools to deploy jobs across distributed resources. The tools can be adapted to a variety of applications.
• EOL adds value by adding structure information and confidence predictions to conventional sequence tools.
• The EOL project has begun to federate data among public biology data resources, starting with local SDSC resources.
• The EOL project has an open interface for sequence querying, and provides structure viewers for analyzing homologous structures.
• Continued growth will be spurred by user input.
The Next Goal: The Virtual Cell
[Diagram labels: DNA; RNA; lipids; Modeling Software]