
Parallel computational methods for sequence analysis

Andrew Meade (A.Meade@Reading.ac.uk), School of Biological Sciences. Molecular sequence growth rates: from 600 to 100 million sequences in 25 years. Human Genome Project.


Presentation Transcript


  1. Andrew Meade (A.Meade@Reading.ac.uk), School of Biological Sciences • Parallel computational methods for sequence analysis

  2. Molecular sequence growth rates • From 600 to 100 million sequences in 25 years • Human Genome Project

  3. Molecular sequence growth rates • 18 million new sequences a year (2007–2008) • Rate of growth is accelerating • Doubling every 2 years • Likely to continue with new sequencing technology • The cost, time and technical ability required have all fallen

  4. It's worse than it looks • Lack of suitable tools for sequence analysis • Analysis methods don't always scale linearly • Methods have changed: simple heuristics → statistical methods; simple rules → more realistic models; descriptive results → biological process; subsystem analysis → systems biology • Computing power is a major rate-limiting step • The gap between data and analytical methods is widening

  5. Tools for genomic analysis
     Current tools: • Co-opted for purpose • Designed for smaller data sets • Limited to a single computer • External data required • Hard to generalise
     Required tools: • Custom built • Limited by available hardware • Use available computers • Models derived from data • Identify informative information in the data

  6. 454 parallel sequencing • Fast: 400–600 million bases per 10 hours • Human genome in 100 hours; the HGP took 13 years • Cheap: 20¢ per kb, currently $12 • Human genome for $100,000; the HGP cost $10 billion • Accurate: 99% accurate at the 400th base • Small chunks: 400–800 bases per sequence • As with parallel computing, it is hard to convert raw power into useful results • The catch: analysis

  7. 454 sequencing • Sequence populations of bacteria (16S) taken from cow guts under different experimental conditions • Identify how changes in feed affect bacterial populations • 332,000 sequences in total • £8,000 using 454, previously over £2 million

  8. 454 sequencing analysis • Find how closely related the sequences are to each other • Perform an approximate match between all pairs of sequences, allowing for insertions, deletions and mutations • 332,000^2 × 0.5 ≈ 5.5 × 10^10 comparisons • 874 years on a single computer • Trivially parallel task: easy to distribute over nodes, different clusters and different OS / hardware
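The all-pairs step on slide 8 can be sketched as follows. This is a minimal illustration, not the method used in the talk: the slides do not name a matching algorithm, so plain Levenshtein edit distance stands in for the "approximate match", and the function names (`edit_distance`, `pair_count`, `pairs_for_node`) and the round-robin row assignment are assumptions made for this example.

```python
from typing import Iterator, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: each insertion, deletion or substitution costs 1.

    Stand-in for the approximate match on the slide (assumption).
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def pair_count(n: int) -> int:
    """Number of unordered pairs among n sequences."""
    return n * (n - 1) // 2


def pairs_for_node(n: int, node: int, n_nodes: int) -> Iterator[Tuple[int, int]]:
    """Yield the (i, j) pairs, i < j, assigned to one node.

    Rows are dealt out round-robin, so every pair goes to exactly one node
    and no communication between nodes is needed (hypothetical scheme).
    """
    for i in range(node, n, n_nodes):
        for j in range(i + 1, n):
            yield (i, j)


if __name__ == "__main__":
    print(pair_count(332_000))                  # exact number of comparisons
    print(edit_distance("ACGTTGA", "ACTTGGA"))  # distance between two toy reads
    print(list(pairs_for_node(5, node=0, n_nodes=2)))
```

The exact pair count, n(n − 1)/2, agrees with the slide's rounded 5.5 × 10^10. Dividing 874 years by that count gives roughly half a second per comparison on a single computer, and because no pair depends on any other, the work can be split across nodes as shown.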

  9. 454 sequencing analysis 2 • Cluster the sequences from the previous step to find which species are present and in what quantities • 102 GB of data; the code was distributed to reduce memory and processing requirements • Linear scaling (memory, CPU) up to 200 nodes • Problems with disk access

  10. Bayesian phylogenetic inference • Infer evolutionary histories (phylogenies) from molecular data • Widely used in all areas of biology • Used to investigate how genes and proteins change and adapt to their environment • How viruses spread and mutate • Reconstruct ancestral genes and proteins • Used in conservation studies to identify species that are most at risk of extinction and most valuable to conserve

  11. Mammal mitochondrial data • 44 taxa • 13 protein-coding regions • 16,400 nucleotides

  12. Mammal mitochondrial scaling (plot of run time against number of computers) • ~70 days • 60 computers: ~2 days
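The scaling figures on slide 12 can be turned into a speedup and a parallel efficiency. The sketch below assumes that the roughly 70-day figure is the run time on one computer and the roughly 2-day figure is the run time on 60 computers. The transcript does not label the 70-day point, so this reading is an assumption, and the function names are illustrative.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """How many times faster the parallel run is than the serial run."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, n_computers: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return speedup(t_serial, t_parallel) / n_computers


if __name__ == "__main__":
    # Assumed reading of the slide: ~70 days on 1 computer, ~2 days on 60.
    t1_days, t60_days, n = 70.0, 2.0, 60
    print(f"speedup    = {speedup(t1_days, t60_days):.1f}x")
    print(f"efficiency = {efficiency(t1_days, t60_days, n):.0%}")
```

Under that reading the speedup is 35 times on 60 computers, or about 58% of ideal linear scaling. Both input figures are approximate, so the result is only indicative.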
