Understanding the Computational Challenges in Large-Scale Genomic Analysis

Presentation Transcript


  1. Understanding the Computational Challenges in Large-Scale Genomic Analysis Glenn K. Lockwood, Ph.D. User Services Group San Diego Supercomputer Center

  2. Acknowledgments STSI • Kristopher Standish* • Tristan M. Carland • Nicholas J. Schork* SDSC • Wayne Pfeiffer • Mahidhar Tatineni • Rick Wagner • Christopher Irving Janssen R&D • Chris Huang • Sarah Lamberth • Zhenya Cherkas • Carrie Brodmerkel • Ed Jaeger • Martin Dellwo • Lance Smith • Mark Curran • Sandor Szalma • Guna Rajagopal * Now at the J. Craig Venter Institute

  3. Outline • What's the problem? • The 438-genome project with Janssen • The scientific premise • 4-step computational procedure • Gained insight • Costs of population-scale sequencing • Designing systems for genomics • Final remarks

  4. Computational Challenges in Large-Scale Genomic Analysis Computational Demands

  5. Cost of Sequencing Source: NHGRI (genome.gov)

  6. Sequencing Cost vs. Transistor Cost Source: Intel (ark.intel.com)

  7. Sequencing is only the Beginning

  8. Scaling up to Populations • Petco Park = 40,000 people • Gordon = 16,384 cores, 65,536 GB memory, 307,200 GB solid-state disk

  9. Core Computational Demands • Read mapping and variant calling • Required by almost all (human) genomic studies • Multi-step pipeline to refine mapped genome • Pipelines are similar in principle, different in detail • Statistical analysis (science) • Population-scale solutions don't yet exist • Difficult to define computational requirements • Storage • Short-term capacity for study • Long-term archiving • Cost: store vs. re-sequence

  10. Computational Challenges in Large-Scale Genomic Analysis 438 Patients: A Case Study

  11. Scientific Problem Rheumatoid Arthritis (RA) • autoimmune disorder • permanent joint damage if left untreated • patients respond to treatments differently Janssen developed a new treatment (TNFα inhibitor golimumab) • more effective in some patients • ...but patients respond to treatments differently X-ray showing permanent joint damage resulting from RA. Source: Bernd Brägelmann

  12. Scientific Goals Can we predict if a patient will respond to the new treatment? • Sequence whole genome of patients undergoing treatment in clinical trials • Correlate variants (known and novel) with patients' response/non-response to treatment • Develop a predictive model based on called variants and patient response

  13. Computational Goals • Problem: original mapped reads and individually called variants provided no insight • BWA-aln pipeline • SOAPsnp and SAMtools pileup to call SNPs and indels • Solution: re-align all reads with newer algorithms and employ group variant calling • BWA-mem pipeline • GATK HaplotypeCaller for high-quality called variants
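
In rough terms, the re-analysis amounts to two commands per data set: a BWA-MEM alignment converted to BAM, and a joint (group) HaplotypeCaller run over many BAMs. The sketch below is a minimal Python driver illustrating that shape; it assumes bwa, samtools, java, and the GATK 2.7 GenomeAnalysisTK.jar are installed, and the reference, read-group fields, and file names are placeholders rather than the project's actual settings.

    # Minimal sketch of the re-alignment and group variant-calling steps described
    # on this slide. Assumes bwa, samtools, java, and GATK 2.7 (GenomeAnalysisTK.jar)
    # are on the PATH; all file names are placeholders.
    import subprocess

    REF = "reference.fa"   # placeholder reference genome

    def realign(sample, fq1, fq2, threads=8):
        """Re-map one sample's paired-end reads with BWA-MEM and emit a BAM."""
        cmd = (f"bwa mem -t {threads} -R '@RG\\tID:{sample}\\tSM:{sample}' "
               f"{REF} {fq1} {fq2} | samtools view -bS - > {sample}.bam")
        subprocess.run(cmd, shell=True, check=True)

    def call_group(bams, out_vcf="group.vcf", mem_gb=28):
        """Joint (group) variant calling with GATK 2.7 HaplotypeCaller."""
        cmd = ["java", f"-Xmx{mem_gb}g", "-jar", "GenomeAnalysisTK.jar",
               "-T", "HaplotypeCaller", "-R", REF, "-o", out_vcf]
        for bam in bams:
            cmd += ["-I", bam]
        subprocess.run(cmd, check=True)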

  14. Computational Approach 1. Understand computational requirements 2. Develop workflow 3. Load dataset into HPC resource 4. Perform full-scale computations

  15. Computational Approach 1. Understand computational requirements 2. Develop workflow 3. Load dataset into HPC resource 4. Perform full-scale computations

  16. Step #1: Data Requirements • Input Data • raw reads from 438 full human genomes (fastq.gz) • 50 TB of compressed data from Janssen R&D • Output Data • + 50 TB of high-quality mapped reads • + small amount (< 1 TB) of called variants • Intermediate (Scratch) Data • + 250 TB (more = better) • Performance • Data must be stored online • High bandwidth to storage (> 10 GByte/s)

  17. Step #1: Compute & Other Requirements • Processing Requirements • perform read mapping on all genomes; 9-step pipeline to achieve high-quality read mapping • perform variant calling on groups of genomes; 5-step pipeline to do group variant calling • Engineering Requirements • FAST turnaround (all 438 genomes done in < 2 months) requires a high-capacity supercomputer (many CPUs) • EFFICIENT (minimum core-hours used) requires a data-oriented architecture (RAM, SSDs, IO)

  18. Can we even satisfy these requirements? Data Requirements: SDSC Data Oasis • 1,400 terabytes of project data storage • 1,600 terabytes of fast scratch storage • Data available to every compute node Processing/Eng'g Requirements: SDSC Gordon • 1024 compute nodes • 64 GB of RAM each • 300 GB local SSD each • Access Data Oasis at up to 100 GB/s

  19. Computational Approach 1. Understand computational requirements 2. Develop workflow 3. Load dataset into HPC resource 4. Perform full-scale computations

  20. Step #2: The Workflow [Slide diagram: raw binary input data passes through a workflow engine and applications to produce user results (a VCF file); the workflow rests on storage, compute hardware, and resource policies]

  21. Step #2: The Workflow • Automate as much as you can • tired researchers are frequent causes of failure • self-documenting and reproducible • Understand the problem • your data - volume, metadata • your applications - in, out, and in between • your hardware - collaborate with user services! • Don't rely on prepackaged workflow engines* • every parallel computing platform is unique • work with user services * Unless you really know what you are doing

  22. Step #2: The Workflow 2. Understand the problem • your data - volume, metadata • your applications - in, out, and in between • your hardware - collaborate with user services! How do we "understand the problem?" Test a pilot pipeline on a subset of input and thoroughly characterize it.
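
One practical way to "thoroughly characterize" a pilot pipeline (not necessarily how the team did it) is to wrap each stage in GNU time and record walltime and peak memory, which are exactly the numbers needed for the packing decisions on the next slides. A minimal sketch, with the stage command as a placeholder:

    # Minimal sketch: characterize one pipeline stage of a pilot run with GNU time.
    # Peak RSS and walltime per stage feed directly into the job-packing analysis.
    import subprocess

    def characterize(stage_cmd):
        """Run a stage under /usr/bin/time -v and report its peak memory in GB."""
        result = subprocess.run(["/usr/bin/time", "-v"] + stage_cmd,
                                capture_output=True, text=True)
        for line in result.stderr.splitlines():
            if "Maximum resident set size" in line:   # GNU time reports kbytes
                peak_gb = int(line.split()[-1]) / 1024**2
                print(f"{' '.join(stage_cmd)}: peak RSS ~ {peak_gb:.2f} GB")
        return result.returncode

    # Example (placeholder command):
    # characterize(["samtools", "sort", "pilot.bam", "pilot.sorted"])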

  23. Characterizing the Applications Map (BWA MEM) → sam to bam (SAMtools) → Merge Lanes (SAMtools) → Sort (SAMtools) → Mark Duplicates (Picard) → Target Creator (GATK) → IndelRealigner (GATK) → Base Quality Score Recalibration (GATK) → Print Reads (GATK)

  24. Characterizing the Hardware Gordon - 1024 compute nodes • Two 8-core CPUs • 64 GB RAM (4 GB/core) • 300 GB SSD per node* Resource Policy: • Jobs allocated in half-node increments (8 cores, 32 GB RAM) • Jobs are charged per core-hour (8 core-hours per real hour) How do we efficiently pack each pipeline stage? * mounted via iSER. Above: 256x 300 GB SSDs

  25. Mapping Application Demands to Hardware Map: 8 threads • sam to bam: 1 thread • Merge Lanes: 1 thread • Sort: 1 thread • Mark Dups: 1 thread • Target Creator: 1 thread • IndelRealigner: 1 thread • BQSR: 8 threads • Print Reads: 8 threads

  26. Packing a Thread Per Core Map: 8 threads • sam to bam: 1 thread • Merge Lanes: 1 thread • Sort: 1 thread • Mark Dups: 1 thread • Target Creator: 1 thread • IndelRealigner: 1 thread • BQSR: 8 threads • Print Reads: 8 threads

  27. Memory-Limited Steps Map: 6.49 GB • sam to bam: 0.0056 GB • Merge Lanes: 0.0040 GB • Sort: 1.73 GB • Mark Duplicates: 5.30 GB • Target Creator: 2.22 GB • IndelRealigner: 5.27 GB • BQSR: 20.0 GB • Print Reads: 15.2 GB

  28. Packing Based on Threads AND Memory Map: 1 / job • sam to bam: 8 / job • Merge Lanes: 8 / job • Sort: 8 / job • Mark Duplicates: 4 / job • Target Creator: 4 / job • IndelRealigner: 5 / job • BQSR: 1 / job • Print Reads: 1 / job
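
The packing rule is simply the tighter of two limits per half-node job (8 cores, 32 GB): cores divided by threads per task, and memory divided by peak RSS per task. The sketch below applies that rule to the thread and memory numbers from slides 25-27; note the production packings above leave extra headroom (or account for other constraints) on a few stages, so they are slightly more conservative than this naive calculation.

    # Sketch: concurrent tasks per half-node job (8 cores, 32 GB) for each stage,
    # limited by whichever runs out first: cores or memory. Thread counts and peak
    # memory are taken from slides 25-27; the packings actually used (slide 28)
    # leave additional headroom for some stages.
    STAGES = {                     # name: (threads per task, peak GB per task)
        "Map (BWA MEM)":   (8,  6.49),
        "sam to bam":      (1,  0.0056),
        "Merge Lanes":     (1,  0.0040),
        "Sort":            (1,  1.73),
        "Mark Duplicates": (1,  5.30),
        "Target Creator":  (1,  2.22),
        "IndelRealigner":  (1,  5.27),
        "BQSR":            (8, 20.0),
        "Print Reads":     (8, 15.2),
    }

    CORES, MEM_GB = 8, 32          # half-node allocation on Gordon

    for name, (threads, mem) in STAGES.items():
        by_cores = CORES // threads
        by_mem = int(MEM_GB // mem)
        print(f"{name}: {min(by_cores, by_mem)} tasks/job")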

  29. Our Final Read Mapping Pipeline Map: 1 / job • sam to bam: 8 / job • Merge Lanes: 8 / job • Sort: 8 / job • Mark Duplicates: 4 / job • Target Creator: 4 / job • IndelRealigner: 5 / job • BQSR: 1 / job • Print Reads: 1 / job

  30. Summary of Read Mapping

  31. Summary of Read Mapping

  32. Disk IO: Another Limiting Resource SAMtools Sort • External sort - data does not fit into memory • Breaks each BAM file into 600 - 800 smaller BAMs • Constantly shuffles data between disk and RAM 16 cores (2 CPUs) per node, so... • 16 BAMs processed per node → 1.8 TB of input data • 10,000 - 12,000 files generated and accessed → 3.6 TB of intermediate data
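
The per-node file counts follow directly from packing one sort per core, as this rough check shows (chunk counts per BAM are the 600 - 800 quoted above):

    # Rough check of the per-node file and data-volume numbers quoted on this slide.
    sorts_per_node = 16              # one samtools sort per core
    chunks_per_bam = (600, 800)      # temporary BAMs created by the external sort
    print([sorts_per_node * c for c in chunks_per_bam])   # -> [9600, 12800] files
    # The slide quotes roughly 2x the input volume as intermediate temporaries:
    print(2 * 1.8, "TB of intermediate data")             # -> 3.6 TB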

  33. Disk IO: Another Limiting Resource High-performance parallel filesystems - can Data Oasis handle this? 16 files, 1.8 TB of input data → 10,000 files, 3.6 TB of scratch data → repeatedly open/close 10,000 files. Local SSDs are necessary to deliver the IOPS.

  34. Disk IO: Another Limiting Resource Can using an SSD work around this? 16 files, 1.8 TB of input data → 10,000 files, 3.6 TB of scratch data → repeatedly open/close 10,000 files. Each node's SSD cannot handle 3.6 TB of data.

  35. BigFlash Nodes • Aggregate 16x SSDs on a single compute node to get • 4.4 TB RAID0 array • 3.8 GB/s bandwidth • 320,000 IOPS

  36. SAMtools Sort I/O Profile • Fully loaded compute node sustained for 5h: • 3,500 IOPS • 1.0 GB/s read • For reference, spinning disk can deliver: • 200 IOPS • 175 MB/s
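
Put differently, matching that sustained per-node profile with spinning disks would take a small disk array per node, as this quick comparison of the figures quoted above shows:

    # Quick comparison: how many spinning disks would match one node's sustained
    # samtools-sort I/O profile (numbers from this slide)?
    node_iops, node_bw_mb = 3500, 1000
    disk_iops, disk_bw_mb = 200, 175
    print("disks needed for IOPS:", -(-node_iops // disk_iops))        # ceil -> 18
    print("disks needed for bandwidth:", -(-node_bw_mb // disk_bw_mb)) # ceil -> 6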

  37. Characterizing the Applications What about the variant calling pipeline?

  38. Haplotype Caller - Threads • 100 hours for ¼ genome • 25 patients per group • 20 groups • MASSIVE data reduction • SNP/Indel recalibration steps were insignificant (dimensions approx. drawn to scale)
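
A back-of-envelope reading of those numbers, assuming each (group × quarter-genome) chunk runs as an independent ~100-hour job that would occupy a whole workstation:

    # Back-of-envelope: HaplotypeCaller workload implied by this slide, assuming
    # each (group x quarter-genome) chunk is an independent ~100-hour job.
    groups, quarters, hours_per_job = 20, 4, 100
    total_job_hours = groups * quarters * hours_per_job
    print(total_job_hours, "job-hours")                        # 8000
    print(total_job_hours / 24 / 30.4, "months run one at a time")   # ~11 months
    # Run all 80 chunks concurrently on Gordon and walltime collapses toward
    # ~100 hours (4-5 days), roughly consistent with the totals on slide 48.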

  39. Haplotype Caller - Memory • Average HC job required 28 GB RAM • Up to 42 GB needed (dimensions approx. drawn to scale)

  40. Computational Approach 1. Understand computational requirements 2. Develop workflow 3. Load dataset into HPC resource 4. Perform full-scale computations

  41. Step #3: Loading Data • Input: raw reads from 438 full human genomes • 50 TB of compressed, encrypted data from Janssen • 4,230 files (paired-end .fastq.gz) How do you get this data into a supercomputer? They don't exactly have USB ports Pictured: 2 of 12 Data Oasis racks; each blue light is 2x4 TB disks

  42. Loading Data via Network External • 60 Gbit/s connectivity to outside world • 100 Gbit/s connection installed and being tested Internal • 60 Tbit/s switching capacity • 100 Gbit/s from Data Oasis to edge coming in summer Gordon • 20 Gbit/s from IO nodes to Data Oasis • 40 Gbit/s dedicated storage fabric (IB) Pictured: 2x Arista 7508 switches, core of the HPC network

  43. Loading Data via Network External • 60 Gbit/s connectivity to outside world • 100 Gbit/s connection installed and being tested Internal • 60 Tbit/s switching capacity • 100 Gbit/s from Data Oasis to edge coming in summer Gordon • 20 Gbit/s from IO nodes to Data Oasis • 40 Gbit/s dedicated storage fabric (IB) We can move a lot of data Pictured: 2x Arista 7508 switches, core of the HPC network

  44. Loading Data via "Other Means" • Most labs and offices are not wired like SDSC • Janssen's 50 TB of data behind a DS3 link (45 Mbit/sec) • Highest-bandwidth approach... Left: 18 x 4 TB HDDs from Janssen. Right: BMW 328i, capable of > 1 Tbit/sec

  45. Step #3: Loading Data • Use the network if possible: • .edu, federal labs (Internet2, ESnet, etc) • Cloud (e.g., Amazon S3) • 100 MB/sec to us-west-2 disk-to-disk • 50 MB/sec to us-east-1 disk-to-disk • SDSC Cloud Services (> 1.6 GB/sec) • Sneakernet is high bandwidth, high latency • Sneakernet is error-prone and tedious • Neither scalable nor future-proof • For Janssen, it was more time-effective to transfer results at 50 MB/s than via USB drives and a car trunk
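
The trade-off is easy to quantify: at the bandwidths quoted above, moving 50 TB takes months over a DS3 link but days over a decent wide-area connection.

    # How long does 50 TB take at the bandwidths mentioned in this case study?
    DATASET_TB = 50
    links = {                        # name: sustained rate in MB/s
        "DS3 (45 Mbit/s)": 45 / 8,
        "disk-to-disk to us-east-1": 50,
        "disk-to-disk to us-west-2": 100,
        "SDSC Cloud Services": 1600,
    }
    for name, mb_per_s in links.items():
        days = DATASET_TB * 1e6 / mb_per_s / 86400
        print(f"{name}: {days:.1f} days")
    # DS3: ~103 days; 50 MB/s: ~11.6 days; 100 MB/s: ~5.8 days; 1.6 GB/s: ~0.4 days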

  46. Computational Approach 1. Understand computational requirements 2. Develop workflow 3. Load dataset into HPC resource 4. Perform full-scale computations

  47. Footprint on Gordon: CPUs and Storage Used 257 TB of Data Oasis scratch (16% of Gordon's scratch) and 5,000 cores (30% of Gordon's cores)

  48. Time to Completion... • Overall: • 36 core-years of compute used in 6 weeks • equivalent to 310 cores running 24/7 • Read Mapping Pipeline • 5 weeks including time for learning • 16 days actually computing • Over 2.5 years of 24/7 compute on a single 8-core workstation (> 4 years realistically) • Variant Calling (GATK 2.7 Haplotype Caller) • 5 days and 3 hours on Gordon • 10.5 months of 24/7 compute on a 16-core workstation
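
The headline figures are internally consistent; a quick arithmetic check of the overall numbers:

    # Quick consistency check of the totals on this slide.
    core_years, weeks = 36, 6
    weeks_per_year = 365.25 / 7
    print(round(core_years * weeks_per_year / weeks))   # ~313 cores, i.e. the
                                                        # "310 cores running 24/7"
    print(round(core_years * 365.25 * 24))              # ~315,000 core-hours total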

  49. Computational Challenges in Large-Scale Genomic Analysis Gained Insight

  50. Characteristics of Read Mapping [Chart: Real Time (researcher time) vs. Core Hours ($$$)]
