Understanding the Computational Challenges in Large-Scale Genomic Analysis

Presentation Transcript


  1. Understanding the Computational Challenges in Large-Scale Genomic Analysis Glenn K. Lockwood, Ph.D. User Services Group San Diego Supercomputer Center

  2. Acknowledgments STSI • Kristopher Standish* • Tristan M. Carland • Nicholas J. Schork* SDSC • Wayne Pfeiffer • Mahidhar Tatineni • Rick Wagner • Christopher Irving Janssen R&D • Chris Huang • Sarah Lamberth • Zhenya Cherkas • Carrie Brodmerkel • Ed Jaeger • Martin Dellwo • Lance Smith • Mark Curran • Sandor Szalma • Guna Rajagopal * Now at the J. Craig Venter Institute

  3. Outline • What's the problem? • The 438-genome project with Janssen • The scientific premise • 4-step computational procedure • Gained insight • Costs of population-scale sequencing • Designing systems for genomics • Final remarks

  4. Computational Challenges in Large-Scale Genomic Analysis Computational Demands

  5. Cost of Sequencing Source: NHGRI (genome.gov)

  6. Sequencing Cost vs. Transistor Cost Source: Intel (ark.intel.com)

  7. Sequencing is only the Beginning

  8. Scaling up to Populations • Petco Park = 40,000 people • Gordon = 16,384 cores, 65,536 GB memory, 307,200 GB solid-state disk

  9. Core Computational Demands • Read mapping and variant calling • Required by almost all (human) genomic studies • Multi-step pipeline to refine mapped genome • Pipelines are similar in principle, different in detail • Statistical analysis (science) • Population-scale solutions don't yet exist • Difficult to define computational requirements • Storage • Short-term capacity for study • Long-term archiving • Cost: store vs. re-sequence

  10. Computational Challenges in Large-Scale Genomic Analysis 438 Patients: A Case Study

  11. Scientific Problem Rheumatoid Arthritis (RA) • autoimmune disorder • permanent joint damage if left untreated • patients respond to treatments differently Janssen developed a new treatment (TNFα inhibitor golimumab) • more effective in some patients • ...but patients respond to treatments differently X-ray showing permanent joint damage resulting from RA. Source: Bernd Brägelmann

  12. Scientific Goals Can we predict if a patient will respond to the new treatment? • Sequence whole genome of patients undergoing treatment in clinical trials • Correlate variants (known and novel) with patients' response/non-response to treatment • Develop a predictive model based on called variants and patient response

  13. Computational Goals • Problem: original mapped reads and individually called variants provided no insight • BWA-aln pipeline • SOAPsnp and SAMtools pileup to call SNPs and indels • Solution: re-align all reads with newer algorithms and employ group variant calling • BWA-mem pipeline • GATK HaplotypeCaller for high-quality called variants
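
In rough terms, the re-analysis amounts to two commands per data set: a BWA-MEM alignment converted to BAM, and a joint (group) HaplotypeCaller run over many BAMs. The sketch below is a minimal Python driver illustrating that shape; it assumes bwa, samtools, java, and the GATK 2.7 GenomeAnalysisTK.jar are installed, and the reference, read-group fields, and file names are placeholders rather than the project's actual settings.

    # Minimal sketch of the re-alignment and group variant-calling steps described
    # on this slide. Assumes bwa, samtools, java, and GATK 2.7 (GenomeAnalysisTK.jar)
    # are on the PATH; all file names are placeholders.
    import subprocess

    REF = "reference.fa"   # placeholder reference genome

    def realign(sample, fq1, fq2, threads=8):
        """Re-map one sample's paired-end reads with BWA-MEM and emit a BAM."""
        cmd = (f"bwa mem -t {threads} -R '@RG\\tID:{sample}\\tSM:{sample}' "
               f"{REF} {fq1} {fq2} | samtools view -bS - > {sample}.bam")
        subprocess.run(cmd, shell=True, check=True)

    def call_group(bams, out_vcf="group.vcf", mem_gb=28):
        """Joint (group) variant calling with GATK 2.7 HaplotypeCaller."""
        cmd = ["java", f"-Xmx{mem_gb}g", "-jar", "GenomeAnalysisTK.jar",
               "-T", "HaplotypeCaller", "-R", REF, "-o", out_vcf]
        for bam in bams:
            cmd += ["-I", bam]
        subprocess.run(cmd, check=True)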

  14. Computational Approach 1. Understand computational requirements 2. Develop workflow 3. Load dataset into HPC resource 4. Perform full-scale computations

  15. Computational Approach 1. Understand computational requirements 2. Develop workflow 3. Load dataset into HPC resource 4. Perform full-scale computations

  16. Step #1: Data Requirements • Input Data • raw reads from 438 full human genomes (fastq.gz) • 50 TB of compressed data from Janssen R&D • Output Data • + 50 TB of high-quality mapped reads • + small amount (< 1 TB) of called variants • Intermediate (Scratch) Data • + 250 TB (more = better) • Performance • Data must be stored online • High bandwidth to storage (> 10 GByte/s)

  17. Step #1: Compute & Other Requirements • Processing Requirements • perform read mapping on all genomes; 9-step pipeline to achieve high-quality read mapping • perform variant calling on groups of genomes; 5-step pipeline to do group variant calling • Engineering Requirements • FAST turnaround (all 438 genomes done in < 2 months) requires a high-capacity supercomputer (many CPUs) • EFFICIENT (minimum core-hours used) requires a data-oriented architecture (RAM, SSDs, IO)

  18. Can we even satisfy these requirements? Data Requirements: SDSC Data Oasis • 1,400 terabytes of project data storage • 1,600 terabytes of fast scratch storage • Data available to every compute node Processing/Eng'g Requirements: SDSC Gordon • 1024 compute nodes • 64 GB of RAM each • 300 GB local SSD each • Access Data Oasis at up to 100 GB/s

  19. Computational Approach 1. Understand computational requirements 2. Develop workflow 3. Load dataset into HPC resource 4. Perform full-scale computations

  20. Step #2: The Workflow [Slide diagram: raw binary input data passes through a workflow engine and applications to produce user results (a VCF file); the workflow rests on storage, compute hardware, and resource policies]

  21. Step #2: The Workflow • Automate as much as you can • tired researchers are frequent causes of failure • self-documenting and reproducible • Understand the problem • your data - volume, metadata • your applications - in, out, and in between • your hardware - collaborate with user services! • Don't rely on prepackaged workflow engines* • every parallel computing platform is unique • work with user services * Unless you really know what you are doing

  22. Step #2: The Workflow 2. Understand the problem • your data - volume, metadata • your applications - in, out, and in between • your hardware - collaborate with user services! How do we "understand the problem?" Test a pilot pipeline on a subset of input and thoroughly characterize it.
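
One practical way to "thoroughly characterize" a pilot pipeline (not necessarily how the team did it) is to wrap each stage in GNU time and record walltime and peak memory, which are exactly the numbers needed for the packing decisions on the next slides. A minimal sketch, with the stage command as a placeholder:

    # Minimal sketch: characterize one pipeline stage of a pilot run with GNU time.
    # Peak RSS and walltime per stage feed directly into the job-packing analysis.
    import subprocess

    def characterize(stage_cmd):
        """Run a stage under /usr/bin/time -v and report its peak memory in GB."""
        result = subprocess.run(["/usr/bin/time", "-v"] + stage_cmd,
                                capture_output=True, text=True)
        for line in result.stderr.splitlines():
            if "Maximum resident set size" in line:   # GNU time reports kbytes
                peak_gb = int(line.split()[-1]) / 1024**2
                print(f"{' '.join(stage_cmd)}: peak RSS ~ {peak_gb:.2f} GB")
        return result.returncode

    # Example (placeholder command):
    # characterize(["samtools", "sort", "pilot.bam", "pilot.sorted"])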

  23. Characterizing the Applications Map (BWA MEM) → sam to bam (SAMtools) → Merge Lanes (SAMtools) → Sort (SAMtools) → Mark Duplicates (Picard) → Target Creator (GATK) → IndelRealigner (GATK) → Base Quality Score Recalibration (GATK) → Print Reads (GATK)

  24. Characterizing the Hardware Gordon - 1024 compute nodes • Two 8-core CPUs • 64 GB RAM (4 GB/core) • 300 GB SSD per node* Resource Policy: • Jobs allocated in half-node increments (8 cores, 32 GB RAM) • Jobs are charged per core-hour (8 core-hours per real hour) How do we efficiently pack each pipeline stage? * mounted via iSER. Above: 256x 300 GB SSDs

  25. Mapping Application Demands to Hardware Map: 8 threads • sam to bam: 1 thread • Merge Lanes: 1 thread • Sort: 1 thread • Mark Dups: 1 thread • Target Creator: 1 thread • IndelRealigner: 1 thread • BQSR: 8 threads • Print Reads: 8 threads

  26. Packing a Thread Per Core Map: 8 threads • sam to bam: 1 thread • Merge Lanes: 1 thread • Sort: 1 thread • Mark Dups: 1 thread • Target Creator: 1 thread • IndelRealigner: 1 thread • BQSR: 8 threads • Print Reads: 8 threads

  27. Memory-Limited Steps Map: 6.49 GB • sam to bam: 0.0056 GB • Merge Lanes: 0.0040 GB • Sort: 1.73 GB • Mark Duplicates: 5.30 GB • Target Creator: 2.22 GB • IndelRealigner: 5.27 GB • BQSR: 20.0 GB • Print Reads: 15.2 GB

  28. Packing Based on Threads AND Memory Map: 1 / job • sam to bam: 8 / job • Merge Lanes: 8 / job • Sort: 8 / job • Mark Duplicates: 4 / job • Target Creator: 4 / job • IndelRealigner: 5 / job • BQSR: 1 / job • Print Reads: 1 / job
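
The packing rule is simply the tighter of two limits per half-node job (8 cores, 32 GB): cores divided by threads per task, and memory divided by peak RSS per task. The sketch below applies that rule to the thread and memory numbers from slides 25-27; note the production packings above leave extra headroom (or account for other constraints) on a few stages, so they are slightly more conservative than this naive calculation.

    # Sketch: concurrent tasks per half-node job (8 cores, 32 GB) for each stage,
    # limited by whichever runs out first: cores or memory. Thread counts and peak
    # memory are taken from slides 25-27; the packings actually used (slide 28)
    # leave additional headroom for some stages.
    STAGES = {                     # name: (threads per task, peak GB per task)
        "Map (BWA MEM)":   (8,  6.49),
        "sam to bam":      (1,  0.0056),
        "Merge Lanes":     (1,  0.0040),
        "Sort":            (1,  1.73),
        "Mark Duplicates": (1,  5.30),
        "Target Creator":  (1,  2.22),
        "IndelRealigner":  (1,  5.27),
        "BQSR":            (8, 20.0),
        "Print Reads":     (8, 15.2),
    }

    CORES, MEM_GB = 8, 32          # half-node allocation on Gordon

    for name, (threads, mem) in STAGES.items():
        by_cores = CORES // threads
        by_mem = int(MEM_GB // mem)
        print(f"{name}: {min(by_cores, by_mem)} tasks/job")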

  29. Our Final Read Mapping Pipeline Map: 1 / job • sam to bam: 8 / job • Merge Lanes: 8 / job • Sort: 8 / job • Mark Duplicates: 4 / job • Target Creator: 4 / job • IndelRealigner: 5 / job • BQSR: 1 / job • Print Reads: 1 / job

  30. Summary of Read Mapping

  31. Summary of Read Mapping

  32. Disk IO: Another Limiting Resource SAMtools Sort • External sort - data does not fit into memory • Breaks each BAM file into 600 - 800 smaller BAMs • Constantly shuffles data between disk and RAM 16 cores (2 CPUs) per node, so... • 16 BAMs processed per node → 1.8 TB of input data • 10,000 - 12,000 files generated and accessed → 3.6 TB of intermediate data
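
The per-node file counts follow directly from packing one sort per core, as this rough check shows (chunk counts per BAM are the 600 - 800 quoted above):

    # Rough check of the per-node file and data-volume numbers quoted on this slide.
    sorts_per_node = 16              # one samtools sort per core
    chunks_per_bam = (600, 800)      # temporary BAMs created by the external sort
    print([sorts_per_node * c for c in chunks_per_bam])   # -> [9600, 12800] files
    # The slide quotes roughly 2x the input volume as intermediate temporaries:
    print(2 * 1.8, "TB of intermediate data")             # -> 3.6 TB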

  33. Disk IO: Another Limiting Resource High-performance parallel filesystems - can Data Oasis handle this? 16 files, 1.8 TB of input data → 10,000 files, 3.6 TB of scratch data → repeatedly open/close 10,000 files. Local SSDs are necessary to deliver the IOPS.

  34. Disk IO: Another Limiting Resource Can using an SSD work around this? 16 files, 1.8 TB of input data → 10,000 files, 3.6 TB of scratch data → repeatedly open/close 10,000 files. Each node's SSD cannot handle 3.6 TB of data.

  35. BigFlash Nodes • Aggregate 16x SSDs on a single compute node to get • 4.4 TB RAID0 array • 3.8 GB/s bandwidth • 320,000 IOPS

  36. SAMtools Sort I/O Profile • Fully loaded compute node sustained for 5h: • 3,500 IOPS • 1.0 GB/s read • For reference, spinning disk can deliver: • 200 IOPS • 175 MB/s
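
Put differently, matching that sustained per-node profile with spinning disks would take a small disk array per node, as this quick comparison of the figures quoted above shows:

    # Quick comparison: how many spinning disks would match one node's sustained
    # samtools-sort I/O profile (numbers from this slide)?
    node_iops, node_bw_mb = 3500, 1000
    disk_iops, disk_bw_mb = 200, 175
    print("disks needed for IOPS:", -(-node_iops // disk_iops))        # ceil -> 18
    print("disks needed for bandwidth:", -(-node_bw_mb // disk_bw_mb)) # ceil -> 6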

  37. Characterizing the Applications What about the variant calling pipeline?

  38. Haplotype Caller - Threads • 100 hours for ¼ genome • 25 patients per group • 20 groups • MASSIVE data reduction • SNP/Indel recalibration steps were insignificant (dimensions approx. drawn to scale)
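
A back-of-envelope reading of those numbers, assuming each (group × quarter-genome) chunk runs as an independent ~100-hour job that would occupy a whole workstation:

    # Back-of-envelope: HaplotypeCaller workload implied by this slide, assuming
    # each (group x quarter-genome) chunk is an independent ~100-hour job.
    groups, quarters, hours_per_job = 20, 4, 100
    total_job_hours = groups * quarters * hours_per_job
    print(total_job_hours, "job-hours")                        # 8000
    print(total_job_hours / 24 / 30.4, "months run one at a time")   # ~11 months
    # Run all 80 chunks concurrently on Gordon and walltime collapses toward
    # ~100 hours (4-5 days), roughly consistent with the totals on slide 48.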

  39. Haplotype Caller - Memory • Average HC job required 28 GB RAM • Up to 42 GB needed (dimensions approx. drawn to scale)

  40. Computational Approach 1. Understand computational requirements 2. Develop workflow 3. Load dataset into HPC resource 4. Perform full-scale computations

  41. Step #3: Loading Data • Input: raw reads from 438 full human genomes • 50 TB of compressed, encrypted data from Janssen • 4,230 files (paired-end .fastq.gz) How do you get this data into a supercomputer? They don't exactly have USB ports Pictured: 2 of 12 Data Oasis racks; each blue light is 2x4 TB disks

  42. Loading Data via Network External • 60 Gbit/s connectivity to outside world • 100 Gbit/s connection installed and being tested Internal • 60 Tbit/s switching capacity • 100 Gbit/s from Data Oasis to edge coming in summer Gordon • 20 Gbit/s from IO nodes to Data Oasis • 40 Gbit/s dedicated storage fabric (IB) Pictured: 2x Arista 7508 switches, core of the HPC network

  43. Loading Data via Network External • 60 Gbit/s connectivity to outside world • 100 Gbit/s connection installed and being tested Internal • 60 Tbit/s switching capacity • 100 Gbit/s from Data Oasis to edge coming in summer Gordon • 20 Gbit/s from IO nodes to Data Oasis • 40 Gbit/s dedicated storage fabric (IB) We can move a lot of data Pictured: 2x Arista 7508 switches, core of the HPC network

  44. Loading Data via "Other Means" • Most labs and offices are not wired like SDSC • Janssen's 50 TB of data behind a DS3 link (45 Mbit/sec) • Highest-bandwidth approach... Left: 18 x 4 TB HDDs from Janssen. Right: BMW 328i, capable of > 1 Tbit/sec

  45. Step #3: Loading Data • Use the network if possible: • .edu, federal labs (Internet2, ESnet, etc) • Cloud (e.g., Amazon S3) • 100 MB/sec to us-west-2 disk-to-disk • 50 MB/sec to us-east-1 disk-to-disk • SDSC Cloud Services (> 1.6 GB/sec) • Sneakernet is high bandwidth, high latency • Sneakernet is error-prone and tedious • Neither scalable nor future-proof • For Janssen, it was more time-effective to transfer results at 50 MB/s than via USB drives and a car trunk
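
The trade-off is easy to quantify: at the bandwidths quoted above, moving 50 TB takes months over a DS3 link but days over a decent wide-area connection.

    # How long does 50 TB take at the bandwidths mentioned in this case study?
    DATASET_TB = 50
    links = {                        # name: sustained rate in MB/s
        "DS3 (45 Mbit/s)": 45 / 8,
        "disk-to-disk to us-east-1": 50,
        "disk-to-disk to us-west-2": 100,
        "SDSC Cloud Services": 1600,
    }
    for name, mb_per_s in links.items():
        days = DATASET_TB * 1e6 / mb_per_s / 86400
        print(f"{name}: {days:.1f} days")
    # DS3: ~103 days; 50 MB/s: ~11.6 days; 100 MB/s: ~5.8 days; 1.6 GB/s: ~0.4 days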

  46. Computational Approach 1. Understand computational requirements 2. Develop workflow 3. Load dataset into HPC resource 4. Perform full-scale computations

  47. Footprint on Gordon: CPUs and Storage Used 257 TB of Data Oasis scratch (16% of Gordon's scratch) and 5,000 cores (30% of Gordon's cores)

  48. Time to Completion... • Overall: • 36 core-years of compute used in 6 weeks • equivalent to 310 cores running 24/7 • Read Mapping Pipeline • 5 weeks including time for learning • 16 days actually computing • Over 2.5 years of 24/7 compute on a single 8-core workstation (> 4 years realistically) • Variant Calling (GATK 2.7 Haplotype Caller) • 5 days and 3 hours on Gordon • 10.5 months of 24/7 compute on a 16-core workstation
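
The headline figures are internally consistent; a quick arithmetic check of the overall numbers:

    # Quick consistency check of the totals on this slide.
    core_years, weeks = 36, 6
    weeks_per_year = 365.25 / 7
    print(round(core_years * weeks_per_year / weeks))   # ~313 cores, i.e. the
                                                        # "310 cores running 24/7"
    print(round(core_years * 365.25 * 24))              # ~315,000 core-hours total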

  49. Computational Challenges in Large-Scale Genomic Analysis Gained Insight

  50. Characteristics of Read Mapping [Chart: Real Time (researcher time) vs. Core Hours ($$$)]
