1 / 41

High performance computational analysis of DNA sequences from different environments

High performance computational analysis of DNA sequences from different environments. Rob Edwards Computer Science Biology. edwards.sdsu.edu. www.theseed.org. Outline. There is a lot of sequence Tools for analysis More computers Can we speed analysis.

josh
Download Presentation

High performance computational analysis of DNA sequences from different environments

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. High performancecomputational analysis ofDNA sequences from different environments Rob Edwards Computer Science Biology edwards.sdsu.edu www.theseed.org

  2. Outline • There is a lot of sequence • Tools for analysis • More computers • Can we speed analysis

  3. How much has been sequenced? 100 bacterial genomes Environmental sequencing First bacterial genome 1,000 bacterial genomes Number of known sequences Year

  4. How much will be sequenced? Everybody in USA Everybody in San Diego One genome from every species 100 people Most major microbial environments All cultured Bacteria

  5. Metagenomics(Just sequence it) 200 liters water 5-500 g fresh fecal matter 50 g soil Concentrate and purify bacteria, viruses, etc Epifluorescent Microscopy Extract nucleic acids Sequence Publish papers

  6. The SEED Family

  7. The metagenomics RAST server

  8. Automated Processing

  9. Summary View

  10. Metagenomics ToolsAnnotation & Subsystems

  11. Metagenomics ToolsPhylogenetic Reconstruction

  12. Metagenomics ToolsComparative Tools

  13. Outline • There is a lot of sequence • Tools for analysis • More computers

  14. How much data so far 986 metagenomes 79,417,238 sequences 17,306,834,870 bp (17 Gbp) Average: ~15-20 M bp per genome ~300 GS20 ~300 FLX ~300 Sanger

  15. Computes

  16. Linear compute complexity

  17. Just waiting …

  18. Overall compute time ~19 hours of compute per input megabyte Hours of Compute Time Input size (MB)

  19. How much so far 986 metagenomes 79,417,238 sequences 17,306,834,870 bp (17 Gbp) Average: ~15-20 M bp per genome Compute time (on a single CPU): 328,814 hours = 13,700 days = 38 years ~300 GS20 ~300 FLX ~300 Sanger

  20. Outline • There is a lot of sequence • Tools for analysis • More computers • Can we speed analysis

  21. Shannon’s Uncertainty • Shannon’s Uncertainty – Peter’s surprisal p(xi) is the probability of the occurrence of each base or string

  22. Surprisal in Sequences

  23. Uncertainty Correlates With Similarity

  24. But it’s not just randomness…

  25. Uncertainty in complete genomes Which has more surprisal: coding regions or non-coding regions? Coding regionsNon-coding regions

  26. More extreme differences with 6-mers Coding regionsNon-coding regions

  27. Can we predict proteins • Short sequences of 100 bp • Translate into 30-35 amino acids • Can we predict which are real and could be doing something? • Test with bacterial proteins

  28. Kullback-Leibler Divergence Difference between two probability distributions Difference between amino acid composition and average amino acid composition Calculate KLD for 372 bacterial genomes

  29. KLD varies by bacteria Colored by taxonomy of the bacteria

  30. KLD varies by bacteria

  31. Most divergent genomes • Borreliagarinii–Spirochaetes • Mycoplasmamycoides– Mollicutes • Ureaplasmaparvum– Mollicutes • Buchneraaphidicola– Gammaproteobacteria • Wigglesworthiaglossinidia–Gammaproteobacteria

  32. Divergence and metabolism Bifidobacterium Salmonella Bacillus Chlamydophila Nostoc Mean of all bacteria

  33. Divergence and amino acids Ureaplasma Wigglesworthia Borrelia Buchnera Mycoplasma Bacteria mean Archaea mean Eukaryotic mean

  34. Predicting KLD KLD per genome y = 2x2-2x+0.5 Percent G+C

  35. Summary • Shannon’s uncertainty could predict useful sequences • KLD varies too much to be useful and is driven by %G+C content

  36. New solutions for old problems?

  37. Xen and the art of imagery

  38. The cell phone problem

  39. Searching the seed by SMS AUTO SEEDSEARCHES 1 2 3 4 5 6 7 8 9 * 0 # seed search histidine coli @ 22 proteins in E. coli SEED databases GMAIL.COM ) ) edwards. sdsu. edu ) ) ) ) ) ) Anywhere Idaho GMCS429 Argonne

  40. Challenges • Too much data • Not easy to prioritize • New models for HPC needed • New interfaces to look at data

  41. Acknowledgements • SajiaAkhter • Rob Schmieder • Nick Celms • Sheridan Wright • Ramy Aziz • FIG • The mg-RAST team • Rick Stevens • Peter Salamon • Barb Bailey • Forest Rohwer • AncaSegall

More Related