Massively Parallel Computing for Protein Alignment
Bertil Schmidt
School of Computer Engineering
Nanyang Technological University, Singapore

Contents
• Motivation
• Smith-Waterman Algorithm
• Parallelization on the Hybrid Architecture
• Parallelization on the Fuzion 150
• Performance Evaluation
• Conclusion and Future Work

Motivation
• Genetic sequence databases are growing exponentially
• This growth will continue for the foreseeable future, since multiple concurrent genome projects have begun, with more to come

Motivation
• Newly discovered sequences are analyzed by comparing them with databases
• The complexity of sequence comparison is proportional to the product of query size and database size
• Analysis is too slow on sequential computers
• Two possible approaches:
  • Heuristics, e.g. BLAST, FastA: the more efficient the heuristic, the worse the quality of the results
  • Parallel processing: high-quality results in reasonable time

Full Genome Comparison
• Genomes compared: Mycobacterium smegmatis and Mycobacterium tuberculosis
• Protein sets: 3918 protein sequences (1,329,298 amino acids) and 4289 protein sequences (1,359,008 amino acids)
• Related organisms, but M. tuberculosis causes disease → find common and differing parts
• 16 × 10^6 pairwise sequence comparisons
• Project with IMCB, Thomas Dick

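To make the size of this comparison concrete, the figures on the slide can be multiplied out. The sketch below assumes an all-against-all comparison of the two protein sets and a cost of one matrix cell per pair of residues (the O(nm) behaviour of dynamic-programming alignment); the variable names are illustrative only.

```python
# Protein counts and total amino acids of the two genomes (figures from the slide).
proteins_a, residues_a = 3918, 1_329_298
proteins_b, residues_b = 4289, 1_359_008

# Assumption: every protein of one genome is compared with every protein of the other.
pairwise_comparisons = proteins_a * proteins_b

# Summing len(x) * len(y) over all pairs equals the product of the two residue totals.
dp_cells = residues_a * residues_b

print(f"{pairwise_comparisons:,} pairwise comparisons")  # 16,804,302 pairwise comparisons
print(f"{dp_cells:.2e} matrix cells to fill")
```

The roughly 16.8 million protein pairs are where the "16 × 10^6" on the slide comes from.
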
Protein Sequence Alignment
• BLAST, FastA, Smith-Waterman

GGHSRLILSQLGEEG.RLLAIDRDPQAIAVAKT....IDDPRFSII
|||::::| : |::| ||:::||||:|:|||::    ::| |::::
GGHAERFL.E.GLPGLRLIGLDRDPTALDVARSRLVRFAD.RLTLV

[Figure: Smith-Waterman, FastA and BLAST placed on a scale; search speed runs from slower (Smith-Waterman) to faster (BLAST), data quality from higher (Smith-Waterman) to lower (BLAST)]

Smith-Waterman Algorithm
• Optimal local alignment of two sequences
• Performs an exhaustive search for the optimal local alignment
• Complexity O(nm) for sequence lengths n and m
• Based on the dynamic programming (DP) algorithm:
  • Fill the DP matrix using a substitution (mutation) matrix
  • Find the maximal value (score) in the matrix
  • Trace back from the maximal score until a 0 value is reached

Smith-Waterman Algorithm
• Aligning S1 and S2 of lengths l1 and l2 using recurrences
• Each matrix cell considers three possible ways to extend the alignment:
  • by one amino acid (AA) in each sequence
  • by one AA in the first sequence, aligned with a gap in the second
  • by one AA in the second sequence, aligned with a gap in the first

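The three extension cases can be written out as recurrences. The form below is the widely used affine-gap formulation and is an assumption here, not a transcription of the slide: the names H, E, F, the substitution function sbt, and the gap penalties α and β are introduced for illustration, with i running over S1 and j over S2.

```latex
\begin{align*}
H(i,j) &= \max\bigl\{\,0,\; E(i,j),\; F(i,j),\; H(i-1,j-1) + \mathrm{sbt}(S_1[i], S_2[j])\,\bigr\} \\
E(i,j) &= \max\bigl\{\,H(i,j-1) - \alpha,\; E(i,j-1) - \beta\,\bigr\} \\
F(i,j) &= \max\bigl\{\,H(i-1,j) - \alpha,\; F(i-1,j) - \beta\,\bigr\} \\
H(i,0) &= E(i,0) = H(0,j) = F(0,j) = 0, \qquad 1 \le i \le l_1,\; 1 \le j \le l_2
\end{align*}
```

In this sketch the diagonal term extends the alignment by one AA in each sequence, F extends it by one AA in the first sequence against a gap in the second, and E does the reverse. When α = β the gap cost is linear in the gap length.
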
Smith-Waterman Algorithm
• Align S1 = ATCTCGTATGATG and S2 = GTCTATCAC, with α = 1, β = 1

        A  T  C  T  C  G  T  A  T  G  A  T  G
     0  0  0  0  0  0  0  0  0  0  0  0  0  0
  G  0  0  0  0  0  0  2  1  0  0  2  1  0  2
  T  0  0  2  1  2  1  1  4  3  2  1  1  3  2
  C  0  0  1  4  3  4  3  3  3  2  1  0  2  2
  T  0  0  2  3  6  5  4  5  4  5  4  3  2  1
  A  0  2  2  2  5  5  4  4  7  6  5  6  5  4
  T  0  1  4  3  4  4  4  6  5  9  8  7  8  7
  C  0  0  3  6  5  6  5  5  5  8  8  7  7  7
  A  0  2  2  5  5  5  5  4  7  7  7 10  9  8
  C  0  1  1  4  4  7  6  5  6  6  6  9  9  8

• Highlighted traceback path (scores along the path): 2, 4, 3, 5, 7, 9, 8, 10

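The fill, maximum search, and traceback steps can be sketched for this example as follows. The scoring is an assumption: a match score of +2, a mismatch score of -1, and a linear gap penalty of 1 are not stated on the slide, but they give the maximal score of 10 and the traceback scores listed for this example. The function name `smith_waterman` is illustrative.

```python
def smith_waterman(s1, s2, match=2, mismatch=-1, gap=1):
    """Return (best score, aligned part of s1, aligned part of s2)."""
    rows, cols = len(s2) + 1, len(s1) + 1
    # h[i][j]: rows follow s2, columns follow s1; row 0 and column 0 stay 0.
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Step 1: fill the DP matrix and remember the maximal value.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s2[i - 1] == s1[j - 1] else mismatch
            h[i][j] = max(0,
                          h[i - 1][j - 1] + sub,  # one character in each sequence
                          h[i][j - 1] - gap,      # character of s1 against a gap
                          h[i - 1][j] - gap)      # character of s2 against a gap
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Step 2: trace back from the maximal score until a 0 is reached.
    i, j = best_pos
    a1, a2 = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if s2[i - 1] == s1[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            a1.append(s1[j - 1]); a2.append(s2[i - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i][j - 1] - gap:
            a1.append(s1[j - 1]); a2.append("-"); j -= 1
        else:
            a1.append("-"); a2.append(s2[i - 1]); i -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


score, top, bottom = smith_waterman("ATCTCGTATGATG", "GTCTATCAC")
print(score, top, bottom)  # 10 TCGTATGA TC-TATCA
```

The traceback passes through the cells scored 2, 4, 3, 5, 7, 9, 8, 10, which is the highlighted path on the slide.
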
Parallel Architectures for Bioinformatics
• Embedded Massively Parallel Accelerators
  • Systola 1024: PC add-on board with 1024 processors
  • Fuzion 150: 1536 processors on a single chip
• Other accelerators: Decypher, Biocellerator, GeneMatcher2, Kestrel, SAMBA

Parallel Architectures for Bioinformatics
• Hybrid computer: combines the SIMD and MIMD paradigms within one parallel architecture
[Figure: 16 Systola 1024 boards connected through a high-speed Myrinet switch]

Previous Applications
• Volume Visualization
• Automatic Visual Quality Control (Opel)
• Cryptography
• Computer Tomography
• Video Compression
• Range of Transforms (Fourier, Wavelet, Hough, Radon)
• Computer Graphics

Architecture of Systola 1024
• Instruction Systolic Array (ISA):
  • 32 × 32 mesh of processing elements
  • wavefront instruction execution
[Figure: block diagram showing the ISA, RAM NORTH, RAM WEST, controller, program memory, interface processors, and the host computer bus]

Instruction Systolic Array
• Wavefront instruction execution → fast accumulation operations (e.g. row sum, broadcast, ringshift)
[Figure: instruction stream (+, -, *) with row selectors and column selectors propagating through the array]

Parallelization of Smith-Waterman
• Matrix cells along a single diagonal are computed in parallel
• The comparison is performed in l1 + l2 − 1 steps on l1 PEs
[Figure: DP matrix for S1 = ATCTCGTATGATG (length l1, one PE per character, P1 to P13) and S2 = GTCTATCAC (length l2), with the cells of one diagonal highlighted]

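The diagonal schedule can be sketched in sequential code as follows. This is an illustration of the scheduling idea only, not the Systola implementation: the inner loop stands in for what the PEs would do simultaneously, and the scoring (match +2, mismatch -1, linear gap 1) and the function name are assumptions.

```python
def sw_score_by_diagonals(s1, s2, match=2, mismatch=-1, gap=1):
    """Return (best local alignment score, number of parallel steps)."""
    l1, l2 = len(s1), len(s2)
    # h[i][j]: rows follow s2, columns follow s1; row 0 and column 0 stay 0.
    h = [[0] * (l1 + 1) for _ in range(l2 + 1)]
    best = 0
    steps = l1 + l2 - 1  # one step per anti-diagonal

    for t in range(steps):
        # PE p owns column p of the matrix and handles row t - p in step t.
        first_pe = max(0, t - l2 + 1)
        last_pe = min(l1 - 1, t)
        for p in range(first_pe, last_pe + 1):  # these cells are independent
            i, j = t - p + 1, p + 1
            sub = match if s1[j - 1] == s2[i - 1] else mismatch
            h[i][j] = max(0,
                          h[i - 1][j - 1] + sub,  # computed two steps earlier
                          h[i - 1][j] - gap,      # computed one step earlier
                          h[i][j - 1] - gap)      # computed one step earlier
            best = max(best, h[i][j])
    return best, steps


print(sw_score_by_diagonals("ATCTCGTATGATG", "GTCTATCAC"))  # (10, 21)
```

Every cell on a diagonal depends only on cells from the two preceding diagonals, which is why all cells of one diagonal can be updated at the same time. For the example, 13 + 9 − 1 = 21 steps replace 13 × 9 = 117 sequential cell updates.
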
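Mapping onto Systola 1024
• a: query sequence (length 1024); b, c: subject sequences
• Subject sequences can be pipelined with only one step delay → k steps for a subject sequence of length k
• Efficient routing on the ISA: Row Ringshift and Broadcast
[Figure: query characters placed on the 32 × 32 array in rows of 32 (a31 a30 … a0, a63 a62 … a32, …, a1023 a1022 … a992); subject sequences bk … b1 b0 and … c1 c0 enter the array one after another]
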
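The benefit of pipelining subject sequences can be estimated with a simple step count. The model below is an assumption made for illustration, not a measured property of the Systola 1024: it charges one step per subject character, one idle step between consecutive subjects, and a one-time latency of (number of PEs − 1) steps for the wavefront to cross the array. The function name `scan_steps` is hypothetical.

```python
def scan_steps(subject_lengths, num_pes, pipelined=True):
    """Estimate the number of array steps needed to scan all subject sequences."""
    if not subject_lengths:
        return 0
    if pipelined:
        # The wavefront latency is paid once; each further subject costs its
        # length k plus a one-step delay between consecutive subjects.
        delays = len(subject_lengths) - 1
        return (num_pes - 1) + sum(subject_lengths) + delays
    # Without pipelining every subject pays the full wavefront latency.
    return sum(k + num_pes - 1 for k in subject_lengths)


lengths = [300, 500, 200]  # made-up subject lengths
print(scan_steps(lengths, 1024))                   # 2025
print(scan_steps(lengths, 1024, pipelined=False))  # 4069
```

Under this model the cost per subject approaches its length k once the pipeline is full, which matches the "k steps for a subject sequence of length k" statement on the slide.
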
Performance Evaluation
• Scan times in seconds for TrEMBL 14 (351,834 protein sequences) for various query sequence lengths

  Query sequence length              256    512   1024   2048   4096
  Systola 1024, time (s)             294    577   1137   2241   4611
    speedup to PIII 850                5      6      6      6      6
  Cluster of 16 Systolas, time (s)    20     38     73    142    290
    speedup to PIII 850               81     86     91     94     94

• Parallel implementation scales linearly with sequence length and number of PCs
• Computing time dominates data transfer time → need a state-of-the-art architecture

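The claim of linear scaling with the number of PCs can be checked against the reported scan times. The sketch below reads the figures as (time, speedup) pairs per query length and computes the parallel efficiency of the 16-board cluster relative to a single board; the variable and function names are illustrative.

```python
query_lengths = [256, 512, 1024, 2048, 4096]
single_board_s = [294, 577, 1137, 2241, 4611]  # Systola 1024 scan times (s)
cluster_s = [20, 38, 73, 142, 290]             # cluster of 16 Systolas (s)
NUM_BOARDS = 16


def cluster_efficiency(t_single, t_cluster, boards=NUM_BOARDS):
    """Fraction of the ideal `boards`-fold speedup that the cluster achieves."""
    return t_single / (boards * t_cluster)


efficiencies = [round(cluster_efficiency(s, c), 2)
                for s, c in zip(single_board_s, cluster_s)]
for length, eff in zip(query_lengths, efficiencies):
    print(f"query length {length:5d}: efficiency {eff:.2f}")
```

The efficiency rises from about 0.92 for the shortest query to about 0.99 for the longest. That is consistent with near-linear scaling in the number of PCs, and the fixed overheads matter less as the query grows.
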
Fuzion 150 Architecture
• 0.25-µm, single-chip SIMD architecture
• 1536 PEs @ 200 MHz → 300 GOPS
• 600 GB/s on-chip, 6.4 GB/s off-chip bandwidth
• Multithreading (control units interact via semaphores)
• Developed by Clearspeed Technology (UK) for graphics and network processing
[Figure: block diagram showing the linear SIMD array of 1536 PEs, each with 2 KB DRAM; SIMD controller; instruction fetch; local memory; FUZION bus; 32-bit EPU (ARC); host (AGP); Rambus with 1, 2 or 4 channels (6.4 GB/s); video I/O; display]

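The headline throughput figure follows from the PE count and the clock rate. The short calculation below assumes one operation per PE per clock cycle, which the slide does not state explicitly; the derived bytes-per-cycle figure is likewise only a back-of-the-envelope reading of the on-chip bandwidth.

```python
NUM_PES = 1536
CLOCK_HZ = 200e6             # 200 MHz
ON_CHIP_BANDWIDTH = 600e9    # 600 GB/s, as quoted on the slide

# Assumption: every PE completes one operation per clock cycle.
peak_gops = NUM_PES * CLOCK_HZ / 1e9

# On-chip bandwidth spread evenly over all PEs and cycles.
bytes_per_pe_per_cycle = ON_CHIP_BANDWIDTH / (NUM_PES * CLOCK_HZ)

print(f"peak throughput: {peak_gops:.1f} GOPS")                # peak throughput: 307.2 GOPS
print(f"bytes per PE per cycle: {bytes_per_pe_per_cycle:.2f}")
```

The result of 307.2 GOPS is what the slide rounds to 300 GOPS.
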
Fuzion 150 Architecture
[Figure: structure of one PE: 8-bit ALU, 32-byte register file, 2 KB DRAM PE memory, instruction input, links to the left and right PE, block I/O channel]
[Figure: six blocks of 256 PEs each, Block 0 with PE (0,0) … PE (0,255) up to Block 5 with PE (5,0) … PE (5,255), together with local memory and the Fuzion bus]

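Mapping onto the Fuzion 150
• a: query sequence (length 1536); b, c: subject sequences
• No fast global communication → 2-step local communication
• Subject sequences can be pipelined with only one step delay
[Figure: query characters distributed over the six blocks (Block 0: a0 a1 … a255; Block 1: a511 a510 … a256; …; Block 5: a1535 a1534 … a1280); subject sequences bk … b1 b0 and … c1 c0 enter the array one after another]
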
Performance Evaluation
• Scan times in seconds for TrEMBL 14 (351,834 protein sequences) for various query sequence lengths

  Query sequence length      256    512   1024   2048   4096
  Fuzion 150, time (s)        12     22     42     82    162
    speedup to PIII 850      136    151    157    163    165

• Parallel implementation scales linearly with sequence length
• Computing time dominates data transfer time

Performance Evaluation
• Normalized time comparison for a 10 Mbase search on different parallel architectures with different query lengths
• 4× faster than 16K-PE MasPar
• 6× faster than Kestrel
• 5× faster than SAMBA (special-purpose 3-board architecture)

Conclusions and Future Work
• Demonstrated how fine-grained parallel architectures can be applied efficiently to comparative genomics
• Significant runtime savings for full genome comparisons and database searching → more discovery is possible at a good price-performance ratio
• Future work:
  • Accelerating other bioinformatics applications, e.g. Hidden Markov Models
  • Building a next-generation architecture at the Center for High Performance Embedded Systems, NTU
  • Integrating accelerators into a Grid environment