
Towards Tera-scale Performance for Longest Common Subsequence using Graphics Processors

Adnan Ozsoy, Arun Chauhan, Martin Swany
School of Informatics and Computing, Indiana University, Bloomington


Presentation Transcript


Longest Common Subsequence

• A subsequence of a string of symbols is derived from the original string by deleting some elements without changing the order of the rest. For example, {b,c,e} is a subsequence of {a,b,c,d,e}.
• Finding the longest common subsequence (LCS) between two or more strings is a classic computer science problem, widely encountered in several fields:
• The sequence alignment problem in bioinformatics
• Voice and image analysis
• Social network analysis, such as matching event and friend suggestions
• Computer security, such as virus signature matching
• Pattern identification in data mining

Objective

Optimize MLCS on GPUs by leveraging its semi-regular structure: identify a regular core of MLCS that consists of highly regular, data-parallel bit-vector operations, and combine it with a relatively irregular post-processing step that is performed more efficiently on the CPUs.

Traditional Solutions

• LCS over an arbitrary number of input sequences is NP-hard; for a constant number of sequences it is solvable in polynomial time.
• Two main scenarios:
• one-to-one matching: two input sequences are compared
• one-to-many matching (MLCS): one query sequence is compared to a set of sequences, called subject sequences
• A straightforward way to solve MLCS is to perform a one-to-one LCS against each subject sequence. The most popular solution to the one-to-one LCS problem is dynamic programming.

Dynamic Programming Approach

Fill a score matrix H through a scoring mechanism (Equation 2, based on Equation 1). The best score is the length of the LCS, and the actual subsequence can be recovered by tracing back through the matrix.
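Equations 1 and 2 appear only as images in the original poster and are not part of this transcript; the scoring mechanism they refer to is the standard LCS recurrence over input strings x (length m) and y (length n):

H_{i,j} =
  \begin{cases}
    0 & \text{if } i = 0 \text{ or } j = 0 \\
    H_{i-1,j-1} + 1 & \text{if } x_i = y_j \\
    \max(H_{i-1,j},\ H_{i,j-1}) & \text{otherwise}
  \end{cases}

The LCS length is H_{m,n}, computed in O(mn) time and, when the full matrix is kept for traceback, O(mn) space.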
Challenges

The scoring matrix creates three-way dependencies: each cell depends on its left, upper, and upper-left neighbors. This prevents parallelization along the rows or columns. A possible solution is to compute all the cells on an anti-diagonal in parallel, as illustrated in the poster. Problems with this approach:
• Parallelism is limited at the beginning and the end of the matrix
• Memory access patterns are not amenable to coalescing
• O(N^2) memory requirement
• Poor distribution of workload
• Sub-optimal utilization of GPU resources
• Lack of heterogeneous (CPU/GPU) resource awareness

Proposed Solution

• Observations:
• Matching information for every single element in the sequences is required.
• A binary matrix representation summarizes the matching result of each symbol.
• The computation of such a matrix is highly data parallel and maps efficiently onto GPU threads: homogeneous workload distribution and no control-flow divergence.
• Each entry in the binary matrix can be stored as a single bit. This reduces the space requirement and allows bit operations on whole words, i.e., word-size parallelism.
• Matches can be pre-computed for each symbol in the alphabet; the resulting per-symbol bit-vectors are called alpha strings (a host-side precomputation sketch appears after the optimization list below).

Workflow, from the poster's pipeline diagram: the CPU builds alpha strings from the query sequence; the alpha strings and the subject sequences are moved to the GPU; CUDA blocks compute the LCS lengths, with each thread consuming one subject sequence and referring to the alpha strings through bit operations (OR, AND, XOR, shift); among all the returned LCS lengths, the top N are identified; the actual LCS for those top-N sequences is then calculated on the CPU.

Bit-vector Approach

Row-wise bit operations on the binary matrix can be used to compute a derived matrix that provides a quick readout of the length of the LCS. Allison and Dix proposed solving the length-of-LCS problem using a bit representation in 1986 (Equation 3); the algorithm follows and marks each anti-chain to identify the k-dominant matches in each row. Crochemore et al.'s algorithm uses fewer bit operations, which leads to better performance (Equation 4). (Illustration courtesy of Crochemore et al.)

Parallelizing and Optimizing on GPUs

Steps taken (hedged sketches of the alpha-string precomputation, the per-thread kernel, and the streaming setup follow this list):
• Intra- vs. inter-task parallelism: one subject sequence is assigned to each CUDA thread
• Memory spaces: constant and shared memory usage
• Hiding data-copying latencies: overlap transfers with asynchronous execution
• Leveraging multiple GPUs
• Streaming sequences: multiple runs of execution to consume large sets
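The poster includes no source code; the following is a minimal host-side sketch of the alpha-string precomputation described under Proposed Solution, assuming 64-bit words and a table indexed by raw byte value (for small alphabets such as DNA, symbols can be remapped to shrink the table). The function name build_alpha_strings and the flat vector layout are illustrative, not taken from the poster.

#include <cstdint>
#include <vector>

// Sketch (not the authors' code): precompute one match bit-vector per symbol
// value, marking every query position where that symbol occurs.
// Bit k of word w is set iff query[w * 64 + k] == symbol.
std::vector<uint64_t> build_alpha_strings(const char* query, size_t query_len) {
    size_t words = (query_len + 63) / 64;         // 64-bit words per bit-vector
    std::vector<uint64_t> alpha(256 * words, 0);  // one bit-vector per byte value
    for (size_t i = 0; i < query_len; ++i) {
        unsigned char sym = (unsigned char)query[i];
        alpha[sym * words + i / 64] |= 1ULL << (i % 64);
    }
    return alpha;  // copied to the GPU once, before any kernel runs
}

Because the masks depend only on the query, they are computed once and shared by every thread; on the GPU they are natural candidates for the constant or shared memory mentioned in the optimization list.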
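Next, a minimal single-word sketch of the per-thread length kernel, assuming the query fits in one 64-bit word (longer queries need multi-word bit-vectors with explicit carry propagation) and an alpha table indexed by raw byte value as in the previous sketch. The row update V' = (V + (V AND M)) OR (V AND NOT M) follows the published Crochemore et al. formulation, since the poster's Equation 4 is only an image; all names and the packed data layout are illustrative.

#include <cstdint>

// Sketch (not the authors' kernel): one CUDA thread per subject sequence,
// which is the inter-task parallelism described above.
// alpha:    256 match masks built from the query (previous sketch).
// subjects: subject sequences packed back to back; offsets/lengths index them.
__global__ void lcs_length_kernel(const uint64_t* __restrict__ alpha,
                                  const uint8_t* __restrict__ subjects,
                                  const uint32_t* __restrict__ offsets,
                                  const uint32_t* __restrict__ lengths,
                                  uint32_t* __restrict__ lcs_len,
                                  uint32_t num_subjects, uint32_t query_len) {
    uint32_t t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= num_subjects) return;

    uint64_t V = ~0ULL;                  // all ones: no matches recorded yet
    const uint8_t* seq = subjects + offsets[t];
    for (uint32_t j = 0; j < lengths[t]; ++j) {
        uint64_t M = alpha[seq[j]];      // match mask for this subject symbol
        V = (V + (V & M)) | (V & ~M);    // Crochemore-style row update
    }
    // LCS length = number of zero bits among the query_len low bits of V.
    uint64_t used = (query_len == 64) ? ~0ULL : ((1ULL << query_len) - 1);
    lcs_len[t] = query_len - __popcll(V & used);
}

A launch such as lcs_length_kernel<<<(num_subjects + 255) / 256, 256>>>(...) gives every thread an identical loop body with no control-flow divergence; only the subject lengths differ, which is what the streaming and batching steps smooth out.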
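Finally, the "hiding data-copying latencies" and "streaming sequences" steps map naturally onto CUDA streams. The sketch below double-buffers batches of subject sequences across two streams so the copy for one batch overlaps the kernel for another; operations queued on the same stream serialize, so reusing a buffer two batches later is safe. It reuses lcs_length_kernel from the previous sketch; the host arrays are assumed pinned (cudaHostAlloc), which asynchronous copies require, the offsets are assumed batch-relative, and every batching name here is an assumption, not the authors' interface.

#include <cuda_runtime.h>
#include <cstdint>

// Defined in the previous sketch.
__global__ void lcs_length_kernel(const uint64_t*, const uint8_t*,
                                  const uint32_t*, const uint32_t*,
                                  uint32_t*, uint32_t, uint32_t);

// Sketch (not the authors' code): consume a large subject set in batches,
// overlapping host-to-device copies with kernel execution.
void run_streamed(const uint64_t* d_alpha, const uint8_t* h_subjects,
                  const uint32_t* h_offsets, const uint32_t* h_lengths,
                  uint8_t* d_buf[2], uint32_t* d_off[2], uint32_t* d_len[2],
                  uint32_t* d_lcs_len, const size_t* batch_start,
                  const size_t* batch_bytes, const uint32_t* batch_count,
                  const uint32_t* result_start, int num_batches,
                  uint32_t query_len) {
    cudaStream_t streams[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&streams[i]);

    for (int b = 0; b < num_batches; ++b) {
        int u = b % 2;                   // double-buffer across two streams
        cudaStream_t s = streams[u];
        uint32_t n = batch_count[b];
        cudaMemcpyAsync(d_buf[u], h_subjects + batch_start[b],
                        batch_bytes[b], cudaMemcpyHostToDevice, s);
        cudaMemcpyAsync(d_off[u], h_offsets + result_start[b],
                        n * sizeof(uint32_t), cudaMemcpyHostToDevice, s);
        cudaMemcpyAsync(d_len[u], h_lengths + result_start[b],
                        n * sizeof(uint32_t), cudaMemcpyHostToDevice, s);
        lcs_length_kernel<<<(n + 255) / 256, 256, 0, s>>>(
            d_alpha, d_buf[u], d_off[u], d_len[u],
            d_lcs_len + result_start[b], n, query_len);
    }
    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(streams[i]);   // drain both pipelines
        cudaStreamDestroy(streams[i]);
    }
}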

Results

[Figures in this panel: Parallel bit-vector based length of LCS, CPU vs. GPU comparison; Combined results for each device and four different implementations; Comparison of different alphabet sizes; Register signatures for the different versions; CUDA Occupancy Calculator (courtesy of NVIDIA); Testbed configurations.]

Metrics

The unit used to report results is cell updates per second (CUPS). With R the length of the query reference, T the total length of the subject sequences, and S the time it takes to compute, CUPS = (R x T) / S. For example, comparing a 1,000-symbol query against subject sequences totalling 10^9 symbols in one second yields 10^12 CUPS, i.e., one TeraCUPS.

Conclusion & Future Work

• Bit-vector operations exploit word-size parallelism.
• Inter-task parallelism: each one-to-one LCS comparison is an independent computation.
• Allison-Dix and Crochemore et al. algorithms implemented on GPUs.
• Analysis and design steps for GPU-specific memory optimizations.
• Post-processing benefits from the heterogeneous CPU/GPU environment.
• Multi-GPU execution achieves TeraCUPS performance for the MLCS problem, a first for an LCS algorithm.
• 8.3x better performance than the parallel CPU implementation.
• Sustainable performance with very large data sets.
• Two orders of magnitude better performance than previous related work.

Future work:
• Investigate gap penalties, as in Needleman-Wunsch and Smith-Waterman
• Distribute the computation over multiple nodes
• Apply the approach to the wider problem space of string matching

Acknowledgements

Indiana University; NVIDIA Hardware Request Program.

Contact Information

aozsoy@indiana.edu, achauhan@indiana.edu, swany@indiana.edu