Massively Parallel Computing for Protein Alignment

Bertil Schmidt • School of Computer Engineering • Nanyang Technological University • Singapore

Contents • Motivation • Smith-Waterman Algorithm • Parallelization on the Hybrid Architecture • Parallelization on the Fuzion 150 • Performance Evaluation • Conclusion and Future Work

Motivation • Genetic sequence databases are growing exponentially • Database growth rate will continue for the foreseeable future, since multiple concurrent genome projects have begun, with more to come

Motivation • Discovered sequences are analyzed by comparison with databases • Complexity of sequence comparison is proportional to the product of query size and database size ⇒ Analysis too slow on sequential computers • Two possible approaches • Heuristics, e.g. BLAST, FastA, but the more efficient the heuristics, the worse the quality of the results • Parallel Processing, get high-quality results in reasonable time

Full Genome Comparison • Mycobacterium Smegmatis, Mycobacterium Tuberculosis • 3918 Protein Sequences, 1,329,298 AminoAcids • 4289 Protein Sequences, 1,359,008 AminoAcids • related Organisms, but Tuberculosis causes a disease ⇒ find common and different parts • 16 × 10^6 pairwise sequence comparisons • Project with IMCB, Thomas Dick

Protein Sequence Alignment • BLAST, FastA, Smith-Waterman
GGHSRLILSQLGEEG.RLLAIDRDPQAIAVAKT....IDDPRFSII
|||::::| : |::| ||:::||||:|:|||:: ::| |::::
GGHAERFL.E.GLPGLRLIGLDRDPTALDVARSRLVRFAD.RLTLV
[Figure: Smith-Waterman, FastA, and BLAST placed along two axes, Search Speed (Slower to Faster) and Data Quality (Lower to Higher)]

Smith-Waterman Algorithm • Optimal local alignment of two sequences • Performs an exhaustive search for the optimal local alignment • Complexity O(nm) for sequence lengths n and m • Based on the 'dynamic programming' (DP) algorithm • Fill the DP matrix using a substitution (mutation) matrix • Find the maximal value (score) in the matrix • Trace back from the score until a 0 value is reached

Smith-Waterman Algorithm • Aligning S1 and S2 of length l1 and l2 using Recurrences: • Calculate three possible ways to extend the alignment • by one AminoAcid (AA) in each sequence • by one AA in the first sequence and align it with a gap in the second • by one AA in the second sequence and align it with a gap in the first
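The recurrence formulas themselves are not present in this transcript. As a hedged reconstruction, the widely used form of the Smith-Waterman recurrences with affine gap penalties is sketched below; the symbols H, E, F, sbt, α (gap open) and β (gap extension) are assumptions taken from the standard formulation, not from the slide.

```latex
\begin{align*}
H(i,j) &= \max\{\,0,\; E(i,j),\; F(i,j),\; H(i-1,j-1) + sbt(S1_i, S2_j)\,\} \\
E(i,j) &= \max\{\,H(i,j-1) - \alpha,\; E(i,j-1) - \beta\,\} \\
F(i,j) &= \max\{\,H(i-1,j) - \alpha,\; F(i-1,j) - \beta\,\}
\end{align*}
% for 1 <= i <= l1 and 1 <= j <= l2, with
% H(i,0) = E(i,0) = H(0,j) = F(0,j) = 0
```

In this form the diagonal term extends the alignment by one AA in each sequence, while E and F cover the two cases in which an AA is aligned with a gap.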

Smith-Waterman Algorithm • Align S1 = ATCTCGTATGATG, S2 = GTCTATCAC (α = 1, β = 1)
         A   T   C   T   C   G   T   A   T   G   A   T   G
     0   0   0   0   0   0   0   0   0   0   0   0   0   0
 G   0   0   0   0   0   0   2   1   0   0   2   1   0   2
 T   0   0   2   1   2   1   1   4   3   2   1   1   3   2
 C   0   0   1   4   3   4   3   3   3   2   1   0   2   2
 T   0   0   2   3   6   5   4   5   4   5   4   3   2   1
 A   0   2   2   2   5   5   4   4   7   6   5   6   5   4
 T   0   1   4   3   4   4   4   6   5   9   8   7   8   7
 C   0   0   3   6   5   6   5   5   5   8   8   7   7   7
 A   0   2   2   5   5   5   5   4   7   7   7  10   9   8
 C   0   1   1   4   4   7   6   5   6   6   6   9   9   8
Marked cells (optimal local alignment path): 2, 4, 3, 5, 7, 9, 8, 10
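The fill, maximum search, and traceback steps can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind the slides: the scoring values (match +2, mismatch -1, linear gap penalty 1) are assumptions chosen because they reproduce the maximum score of 10 shown for this example, and the function name is hypothetical.

```python
def smith_waterman(s1, s2, match=2, mismatch=-1, gap=1):
    """Return (best score, aligned part of s1, aligned part of s2).

    Assumed scoring: match/mismatch values and a linear gap penalty.
    s1 runs along the columns, s2 along the rows of the DP matrix.
    """
    rows, cols = len(s2) + 1, len(s1) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the DP matrix and remember the position of the maximum.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s2[i - 1] == s1[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # one AA in each sequence
                          H[i - 1][j] - gap,      # AA of s2 against a gap
                          H[i][j - 1] - gap)      # AA of s1 against a gap
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the maximum until a 0 value is reached.
    i, j = best_pos
    a1, a2 = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s2[i - 1] == s1[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a1.append(s1[j - 1]); a2.append(s2[i - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i][j - 1] - gap:
            a1.append(s1[j - 1]); a2.append("-"); j -= 1
        else:
            a1.append("-"); a2.append(s2[i - 1]); i -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


score, top, bottom = smith_waterman("ATCTCGTATGATG", "GTCTATCAC")
print(score)
print(top)
print(bottom)
```

A real protein search would replace the match/mismatch values with a substitution matrix such as BLOSUM or PAM.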

Parallel Architectures for Bioinformatics • Embedded Massively Parallel Accelerators • Systola 1024: PC add-on board with 1024 processors • Fuzion 150: 1536 processors on a single chip • Other accelerators: Decypher, Biocellerator, GeneMatcher2, Kestrel, SAMBA

Parallel Architectures for Bioinformatics • Hybrid Computer • combines SIMD and MIMD paradigm within a parallel architecture • [Figure: 16 Systola1024 nodes connected by a high speed Myrinet switch]

Previous Applications • Volume Visualization • Automatic Visual Quality Control (Opel) • Cryptography • Computer Tomography • Video Compression • Range of Transforms (Fourier, Wavelet, Hough, Radon) • Computer Graphics

Architecture of Systola 1024 • Instruction Systolic Array: • 32 × 32 mesh of processing elements • wavefront instruction execution • [Figure: ISA processors with RAM NORTH, RAM WEST, Controller, program memory, ISA Interface, and host computer bus]

Instruction Systolic Array • wavefront instruction execution ⇒ fast accumulation operations (e.g. row sum, broadcast, ringshift) • [Figure: instructions moving through the processor array, with row selectors and column selectors]

Parallelization of Smith-Waterman • matrix cells along a single diagonal are computed in parallel • comparison is performed in l1 + l2 − 1 steps on l1 PEs • [Figure: the DP matrix of the example with PEs P1, P2, ..., P13 along the S1 axis (length l1), S2 along the other axis (length l2), and one diagonal marked]
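The diagonal-wise schedule can be illustrated with a sequential Python sketch in which the inner loop stands in for what the PEs would do simultaneously. It is only a model of the data dependencies, not code for the ISA: the scoring values (match +2, mismatch -1, linear gap penalty 1) and the function name are assumptions.

```python
def wavefront_score(s1, s2, match=2, mismatch=-1, gap=1):
    """Compute the best local alignment score anti-diagonal by anti-diagonal.

    Returns (best score, number of diagonal steps). Column index j plays
    the role of the PE number (one PE per character of s1).
    """
    l1, l2 = len(s1), len(s2)
    H = [[0] * (l1 + 1) for _ in range(l2 + 1)]
    best, steps = 0, 0

    # Cells with i + j == d only depend on diagonals d-1 and d-2,
    # so all cells of one diagonal are independent of each other.
    for d in range(2, l1 + l2 + 1):
        for j in range(max(1, d - l2), min(l1, d - 1) + 1):  # "in parallel"
            i = d - j
            sub = match if s2[i - 1] == s1[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] - gap,
                          H[i][j - 1] - gap)
            best = max(best, H[i][j])
        steps += 1
    return best, steps


best, steps = wavefront_score("ATCTCGTATGATG", "GTCTATCAC")
print(best, steps)
```

For the example sequences (l1 = 13, l2 = 9) the loop runs over 13 + 9 − 1 = 21 diagonals, which matches the step count stated on the slide.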

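Mapping onto Systola 1024 • a: query sequence (length equal to 1024) • b, c: subject sequences • Subject sequences can be pipelined with only a one-step delay ⇒ k steps for subject sequence of length k • Efficient routing on the ISA: Row Ringshift and Broadcast • [Figure: query characters a0 ... a1023 stored on the 32 × 32 array, 32 per row (a0 ... a31, a32 ... a63, ..., a992 ... a1023); subject sequences bk ... b1 b0 and ... c1 c0 are streamed in one after another]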

Performance Evaluation • Scan times in seconds for TrEMBL 14 (351,834 Protein Sequences) for various query sequence lengths
Query sequence length               256    512   1024   2048   4096
Systola 1024 (scan time, s)         294    577   1137   2241   4611
  speedup to PIII 850                 5      6      6      6      6
Cluster of 16 Systolas (s)           20     38     73    142    290
  speedup to PIII 850                81     86     91     94     94
• Parallel implementation scales linearly with sequence length and number of PCs
• Computing time dominates data transfer time
• Need a state-of-the-art architecture
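The linear-scaling claim can be checked directly from the scan times on this slide. The sketch below assumes that the figures read as (scan time in seconds, speedup) pairs per query length; the variable and function names are hypothetical.

```python
# Scan times in seconds as read from the slide (assumed column alignment).
query_lengths = [256, 512, 1024, 2048, 4096]
single_board_s = [294, 577, 1137, 2241, 4611]   # one Systola 1024
cluster_16_s = [20, 38, 73, 142, 290]           # cluster of 16 Systolas
BOARDS = 16


def parallel_efficiency(t_single, t_cluster, n):
    """Fraction of the ideal n-fold speedup that the cluster achieves."""
    return t_single / (n * t_cluster)


efficiencies = [parallel_efficiency(a, b, BOARDS)
                for a, b in zip(single_board_s, cluster_16_s)]

# Ratio of scan times when the query length doubles (ideal value: 2).
doubling = [single_board_s[k + 1] / single_board_s[k]
            for k in range(len(single_board_s) - 1)]

for n, e in zip(query_lengths, efficiencies):
    print(f"query length {n}: cluster efficiency {e:.2f}")
print("time ratios per doubling:", [round(r, 2) for r in doubling])
```

Under this reading the cluster reaches more than 90% of the ideal 16-fold speedup for every query length, and each doubling of the query length roughly doubles the scan time.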

Fuzion 150 Architecture • 0.25-µm, single-chip, SIMD architecture • Linear SIMD Array of 1536 PEs, each with 2 Kbytes DRAM, driven by a SIMD Controller • 1536 PEs @ 200 MHz ⇒ 300 GOPS • 600 GB/s on-chip, 6.4 GB/s off-chip bandwidth • Multithreading (control units interact via semaphores) • developed by Clearspeed Technology (UK) for graphics, networking processing • [Figure: block diagram with Instruction Fetch, Local Memory, FUZION Bus, 32-bit EPU (ARC), Host AGP, Rambus with 1, 2 or 4 Channels (6.4 GB/s), Video I/O, and Display]
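The quoted peak rate follows from the PE count and the clock frequency. The short calculation below assumes one operation per PE per clock cycle, which is an assumption made here for illustration and is not stated on the slide.

```python
PES = 1536             # processing elements on the chip
CLOCK_HZ = 200e6       # 200 MHz
OPS_PER_CYCLE = 1      # assumption: one operation per PE per cycle

peak_gops = PES * CLOCK_HZ * OPS_PER_CYCLE / 1e9
print(f"peak rate: {peak_gops:.1f} GOPS")
```

The result, about 307 GOPS, is consistent with the rounded figure of 300 GOPS.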

Fuzion 150 Architecture • PE: ALU (8 bits), Register file 32 Bytes, PE Memory 2 KByte DRAM, connections to Left PE and Right PE • 6 blocks (Block 0 to Block 5) of 256 PEs each: PE (0,0) ... PE (0,255), PE (1,0) ... PE (1,255), ..., PE (5,0) ... PE (5,255) • [Figure: PE blocks with Block I/O Channel and Local Memory attached to the Fuzion Bus; instructions are supplied to every PE]

Fuzion 150 - Debugger

Mapping onto the Fuzion 150 • a: query sequence (length equal to 1536) • b, c: subject sequences • No fast global communication ⇒ 2-step local communication • Subject sequence can be pipelined with only a one-step delay • [Figure: query characters a0 a1 ... a255 in Block 0, a511 a510 ... a256 in Block 1, ..., a1535 a1534 ... a1280 in Block 5; subject sequences bk ... b1 b0 and ... c1 c0 are streamed in one after another]

Performance Evaluation • Scan times in seconds for TrEMBL 14 (351,834 Protein Sequences) for various query sequence lengths
Query sequence length               256    512   1024   2048   4096
Fuzion 150 (scan time, s)            12     22     42     82    162
  speedup to PIII 850               136    151    157    163    165
• Parallel implementation scales linearly with sequence length
• Computing time dominates data transfer time

Performance Evaluation • Normalized time comparison for a 10 Mbase search on different parallel architectures with different query lengths • 4× faster than 16K-PE MasPar • 6× faster than Kestrel • 5× faster than SAMBA (special-purpose 3-board architecture)

Conclusions and Future Work • Demonstrated how fine-grained parallel architectures can be applied efficiently for Comparative Genomics • Significant runtime savings for full genome comparisons and database searching ⇒ More Discovery Is Possible at a good price-performance ratio • Accelerating other Bioinformatics Applications, e.g. Hidden Markov Models • Build a next generation architecture at Center for High Performance Embedded Systems, NTU • Integration of accelerators in a Grid Environment
