Bioinformatics at USDA-ARS Livestock Issues Research Unit. Scot E. Dowd, Joaquin Zaragoza, Mel Oliver and Paxton Payton. Projects. Future: Interactive neural network based models to describe and predict gene expression in Livestock and Pathogens
Bioinformatics at USDA-ARS Livestock Issues Research Unit • Scot E. Dowd, Joaquin Zaragoza, Mel Oliver and Paxton Payton
Projects • Future: Interactive neural-network-based models to describe and predict gene expression in livestock and pathogens • Present: Various projects, in various states, leading to the future • Molecular modeling • Gene finding • Distributed BLAST • Whole-genome comparison • Functional genomics and pathways • Pathway- or system-targeted microarray design
Functional Genomics • Functional genomics / Gene Ontology: a controlled vocabulary • Define, annotate, categorize, and describe large genetic datasets (e.g. EST, mRNA) • We have developed a custom curated database for functional-domain BLAST (regular BLAST and RPS-BLAST using KOG, COG, Pfam, HMMER, SMART domains) • Ultimately this will become a comprehensive .NET suite of analyses for microarray design, from new sequence all the way to result visualization.
Ontology • Annotation: propagation of error in definitions • Ca
BLAST: need for speed (II) • We are working with roughly 5,000-100,000 queries against 1 GB databases • One query takes a fairly fast PC 3 minutes to complete (dual 3.2 GHz Xeon, 6 GB RAM, RAID 0 SCSI-320 HD) • Other methods: MPI-BLAST, WU-BLAST, threaded BLAST, SGE-BLAST, commercial Turbo BLAST, DNAStar, etc.
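To make the scale concrete, the figures on this slide can be turned into a rough serial run time. This is only back-of-the-envelope arithmetic in Python, assuming every query takes the quoted 3 minutes and that queries run strictly one after another; the function name `serial_hours` is made up for the illustration.

```python
def serial_hours(n_queries: int, minutes_per_query: float = 3.0) -> float:
    """Wall-clock hours needed if the queries run one at a time on one machine."""
    return n_queries * minutes_per_query / 60.0


if __name__ == "__main__":
    # Lower and upper ends of the query range quoted on the slide.
    for n in (5_000, 100_000):
        hours = serial_hours(n)
        print(f"{n:>7} queries: {hours:,.0f} h ({hours / 24:.1f} days)")
```

Under these assumptions the small end of the range is already about 250 hours of compute, and the large end is about 5,000 hours, which is the motivation for distributing the work.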
BLAST ALGORITHM • Query (e.g. a 1000-letter word): Cgtcgctcgctgtaagtac • Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990) A basic local alignment search tool. Journal of Molecular Biology 215, 403-410. • Which database sequence is most similar to my query? • Databases: one of ours is 60 GB worth of letters • BLAST generates statistics based upon similarity and substitution probabilities; in simplest form, purine to purine is better than purine to pyrimidine • Slide along a 4 GB database, find a word match, and try to extend it
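The "find a word match and try to extend" idea can be sketched in a few lines of Python. This is a toy illustration, not the published BLAST algorithm: it uses exact word seeds and ungapped extension only, and the score values (+2 match, -1 purine/purine or pyrimidine/pyrimidine mismatch, -2 purine/pyrimidine mismatch), the `x_drop` cutoff, and all function names are assumptions chosen for the example.

```python
PURINES = {"A", "G"}


def score(a: str, b: str) -> int:
    """Toy substitution score; the numeric values are illustrative assumptions."""
    if a == b:
        return 2
    same_class = (a in PURINES) == (b in PURINES)
    return -1 if same_class else -2  # within-class mismatch is penalized less


def extend(query, subject, qi, si, step, x_drop):
    """Ungapped extension in one direction; returns (best score gained, length kept)."""
    best = total = 0
    best_len = length = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        total += score(query[qi], subject[si])
        length += 1
        if total > best:
            best, best_len = total, length
        elif best - total > x_drop:
            break  # score has fallen too far below the best seen; stop extending
        qi += step
        si += step
    return best, best_len


def seed_and_extend(query, subject, word=4, x_drop=4):
    """Return (score, query_start, subject_start, length) of the best hit, or None."""
    # Index every word of the subject so seeds are found by dictionary lookup.
    index = {}
    for i in range(len(subject) - word + 1):
        index.setdefault(subject[i:i + word], []).append(i)

    best_hit = None
    for q in range(len(query) - word + 1):
        for s in index.get(query[q:q + word], []):
            seed = 2 * word  # an exact seed scores +2 per letter
            right, rlen = extend(query, subject, q + word, s + word, +1, x_drop)
            left, llen = extend(query, subject, q - 1, s - 1, -1, x_drop)
            hit = (seed + left + right, q - llen, s - llen, word + llen + rlen)
            if best_hit is None or hit[0] > best_hit[0]:
                best_hit = hit
    return best_hit


if __name__ == "__main__":
    print(seed_and_extend("ACGTACGT", "TTACGTACGTTT"))
```

A production search additionally evaluates the statistical significance of each hit and allows gaps, both of which are left out here to keep the seed-and-extend step visible.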
BLASTX as example: translate the query into 6 reading frames, then search the database with these 6 sequences using a word size of 3 • Time to BLAST • Up to a point, run time decreases as the number of slaves available increases • Average test machines (2.4 GHz / 1 GB RAM / SATA-150) • e.g. 90 sequences on 13 CPUs in 3 min vs. 90 sequences on 1 CPU in 38.5 min (roughly a 13-fold speedup) • 350 MB database, gigabit LAN
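The six-frame translation step mentioned for BLASTX can be illustrated with a short Python sketch. It assumes the standard genetic code and plain A/C/G/T input; the function names are made up for the example, and codons containing any other letter are translated as "X".

```python
# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement each base, then reverse, to read the opposite strand 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; '*' marks a stop, 'X' an unknown codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna: str) -> list:
    """Three forward frames followed by three reverse-complement frames."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


if __name__ == "__main__":
    for frame, protein in enumerate(six_frames("ATGGCC"), start=1):
        print(frame, protein)
```

Each of the six protein strings is then searched separately, which is part of why a BLASTX query costs more than a nucleotide-versus-nucleotide search of the same length.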
.NET Distributed BLAST • Take advantage of unused laboratory compute resources • Provide an easy, powerful tool for distributing BLAST • Target environment: Windows LAN • Current open-source distributed BLAST applications • Require a server-class master or a version of UNIX • Difficult to set up, configure databases, compile, and submit jobs • No fault tolerance for large jobs
W.ND BLAST: A bioinformatician promoting Windows? • .NET C# • First tests: Condor, MPI, a ported remote shell • Contractor • Project Manager • Database formatter • Worker machines • Job leasing • Output processing HT backend apps
Functionality • Network bandwidth will eventually become the limiting factor • Fault tolerant to worker failure • Resumes upon reboot if the Contractor fails • No statistical problems with search results • Complete BLAST database on each worker node if resources allow • Easy to install, a breeze to use
.NET Distributed BLAST • Queue at each node • Contractor allows a maximum of two query sequences in each node's queue • Ensures the application waits a minimal amount of time between completing one job and starting the next • Thread per node • Makes use of .NET asynchronous delegates (AD); scalability ??? • Thread invokes BLAST on the remote node • Upon completion, the remote node sends a "finished" message to the Contractor • The Contractor collects the results and performs a validity check • Once results are verified, the remote worker starts BLAST on its queued sequence and the Contractor allocates a future job
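The per-node queue rule above can be sketched as a small Python simulation. This is not the W.ND BLAST code (which the slides describe as C#/.NET with one thread per node); it is a single-threaded sketch in which the class layout and the method names `fill` and `finished` are assumptions, and running BLAST and validating results are reduced to a bookkeeping step.

```python
from collections import deque

MAX_QUEUED = 2  # from the slide: at most two query sequences per node queue


class Contractor:
    """Hands out queries so that each worker always has its next job waiting."""

    def __init__(self, queries, workers):
        self.pending = deque(queries)
        self.queues = {w: deque() for w in workers}
        self.done = []  # (worker, query) pairs in completion order
        self.fill()

    def fill(self):
        """Top up the worker queues one slot at a time, round-robin."""
        for _ in range(MAX_QUEUED):
            for queue in self.queues.values():
                if len(queue) < MAX_QUEUED and self.pending:
                    queue.append(self.pending.popleft())

    def finished(self, worker):
        """Handle a worker's 'finished' message for the query at its queue head."""
        query = self.queues[worker].popleft()
        self.done.append((worker, query))  # result collection/validation goes here
        self.fill()  # the worker moves on to its queued query while we refill
        return query


if __name__ == "__main__":
    c = Contractor(["q1", "q2", "q3", "q4", "q5"], ["nodeA", "nodeB"])
    c.finished("nodeA")
    print({w: list(q) for w, q in c.queues.items()}, list(c.pending))
```

Keeping a second query queued at each node means a worker never idles while the Contractor validates the previous result, at the cost of at most one extra query per node to reschedule if that node fails.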
.NET Distributed BLAST • Fault tolerance, revisited • Task migration handled through application-level checkpointing • When a worker encounters a fault or crashes, the Contractor redirects the failed node's sequence to another worker node • Minimal loss of time • Integrating QoS functionality (currently in the works) • Decrease priority when the workstation is in use, based upon a remote system call checking CPU %, memory, etc. • GUI allows increasing or decreasing priority: rev gauges and throttles • Storage requirement limitations: redirect query to another database source (working with the 10-connection limitation in XP Pro)
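The redirect-on-failure behaviour can be illustrated with a minimal Python sketch. It assumes the Contractor tracks each worker's unfinished queries in a per-worker queue and treats the individual query sequence as the unit of checkpointing; the function name `handle_worker_failure` and the data layout are hypothetical, not taken from the actual application.

```python
from collections import deque


def handle_worker_failure(queues, pending, failed):
    """Return a failed worker's unfinished queries to the front of the pending queue.

    queues:  dict mapping worker name -> deque of queries assigned but not verified
    pending: deque of queries not yet assigned to any worker
    failed:  name of the worker that crashed or stopped responding
    """
    lost = queues.pop(failed, deque())  # also stops scheduling onto that worker
    # extendleft() inserts items one by one at the left, reversing their order,
    # so reverse first to keep the original query order.
    pending.extendleft(reversed(lost))
    return list(lost)


if __name__ == "__main__":
    queues = {"nodeA": deque(["q1", "q2"]), "nodeB": deque(["q3"])}
    pending = deque(["q4"])
    print(handle_worker_failure(queues, pending, "nodeA"), list(pending))
```

Because only the queries still sitting in the failed node's queue are redone, the time lost to a crash is bounded by the work on those few sequences rather than by the whole job.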
Future Directions • Quality of Service • Allow the Contractor to set priority for the application • Contractor fault tolerance • Large-network optimization • Sub-Contractors • Asynchronous delegate thread limit: ewww, kewl WEB SERVICE! • Shadow (Sub) Contractors: network load balancing
The End! • Questions? • Suggestions? • Advice? • Even Criticism?