Parallel EST Clustering by Kalyanaraman, Aluru, and Kothari


Presentation Transcript


  1. Parallel EST Clustering by Kalyanaraman, Aluru, and Kothari. Nargess Memarsadeghi, CMSC 838 Presentation

  2. Talk Overview • Overview of talk • Motivation • Background • Techniques • Evaluation • Related work • Observations CMSC 838T – Presentation

  3. Motivation: EST Clustering • Problem: EST Clustering • Cluster fragments of cDNA • Related to ‘fragment assembly’ problem • Detecting overlapping fragments • Overlaps can be computed: • Pairwise alignment algorithm • Dynamic programming • Alternative: • Approximate overlap detection algorithms • Dynamic programming

  4. Motivation • Common tools: • Take too long • Days for 100,000 ESTs • Run out of memory • This paper: • PaCE: • Parallel Clustering of ESTs • Efficient parallel EST clustering • Space-efficient algorithm • Reduce total work • Reduce run-time

  5. Background: EST Clustering Tools • Three traditional software tools: • Originally designed for fragment assembly: • TIGR Assembler • Phrap • CAP3 • One parallel software tool: • UICLUSTER: assumes ESTs are from the 3’ end

  6. EST Clustering Tools • Basic approach • Find pairs of similar sequences • Align similar pairs • Dynamic programming • Quality of EST clustering • Phrap: fastest • Avoids dynamic programming • Relies on approximation, lower quality • CAP3: fewest erroneous clusters

  7. EST Clustering Tools’ Performance • With 50,000 maize ESTs • Using a PC with dual 450 MHz Pentium processors and 512 MB RAM: • TIGR: ran out of memory • Phrap: 40 min • CAP3: > 24 hours • With 100,000 maize ESTs • All ran out of memory • CAP3 would require 4 days

  8. Goal • Space efficient algorithm • Space requirement linear in the size of the input data set • Reduce total work • Without sacrificing quality of clustering • Reduce run-time and facilitate the clustering of large data sets • Through parallel processing • Scale memory with # of processors

  9. Approach • Expense: • Pairwise alignment (time + memory) • Promising pairs ≈ pairs sharing a common string s of length |s| = w • Cost: if the common string has length |s| = l > w, the same pair is generated l − w + 1 times
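
The l − w + 1 repeat count on this slide can be checked with a small sketch. It assumes a naive window-lookup scheme (not PaCE's suffix-tree method); the function name and the two example sequences are hypothetical.

```python
def shared_windows(a: str, b: str, w: int) -> int:
    """Count position pairs (i, j) with a[i:i+w] == b[j:j+w]."""
    # Index every length-w window of b by its content.
    index = {}
    for j in range(len(b) - w + 1):
        index.setdefault(b[j:j + w], []).append(j)
    # Each matching window of a reports the pair (a, b) once more.
    return sum(len(index.get(a[i:i + w], ())) for i in range(len(a) - w + 1))


# Hypothetical example: the two sequences share one substring of length l = 7.
common = "ACGTTCA"
a = "GGG" + common + "GGG"
b = "TTT" + common + "TTT"
w = 4
hits = shared_windows(a, b, w)  # the same pair is reported len(common) - w + 1 times
```

With a window lookup the pair is rediscovered once per shared window, which is the redundancy the trie-based approach on the next slide is meant to avoid.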

  10. Approach (Cont ..) • Approach: • Use trie structure • Identify promising pairs • Merge clusters with strong overlaps • Avoid storing/testing all similar pairs • Parallel EST Clustering Software: • Generalized Suffix Tree (GST) • Multiple processors: • One maintains and updates EST clusters • Others generate batches of promising pairs and perform alignment

  11. Approach (Cont …)

  12. Tries • Index for each char • N leaves • Height N

  13. Suffix Tries (Cont ..) • TRIM suffix trie

  14. Suffix Tries (Cont ..) • Indices • Storage O(n), though the constant is high • Common string • Longest common substring

  15. Suffix Tries (Cont ..) • [Figure: suffix tree with edges labeled a, b, and $ and leaves numbered 1 to 5] • Given a pattern P = ab, we traverse the tree according to the pattern.
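
Pattern lookup by traversal can be sketched with a naive, uncompacted suffix trie. The string behind the slide's figure is not recoverable from the transcript, so `abab$` below is a hypothetical example, and the function names and the `#leaf` marker are illustrative choices.

```python
def build_suffix_trie(text: str) -> dict:
    """Naive suffix trie as nested dicts: one node per character, O(n^2) space."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["#leaf"] = start + 1  # 1-based start position of this suffix
    return root


def occurrences(root: dict, pattern: str) -> list:
    """Walk down the trie along the pattern, then collect all leaves below."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # the pattern does not occur in the text
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "#leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("abab$")  # hypothetical text; "$" is the end marker
positions = occurrences(trie, "ab")
```

Every leaf below the node reached by the pattern is a suffix that starts with the pattern, so the leaf labels are the match positions. A compacted trie (suffix tree) stores the same information in O(n) nodes.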

  16. Parallel Generation of GST • GST: Generalized Suffix Tree • Compacted trie • Longest common prefix found in constant time • Used for on-demand pair generation • Sequential: O(nl) • Parallel: O(nl/p)

  17. Parallel Generation of GST (Cont …) • Previous implementations: • CRCW/CREW PRAM model • Work-optimal • Involves alphabetical ordering of characters • Unrealistic assumptions • synchronous operation of processors • infinite network bandwidth • no memory contention • Not practically efficient

  18. Parallel Generation of GST (Cont …) • Paper’s approach: • ESTs equally distributed among processors • Each processor: • Partitions suffixes of its ESTs into buckets • Buckets are distributed to the processors: • All suffixes in a bucket are allocated to the same processor • Total # of suffixes allocated to a processor ≈ O(nl/p)
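
A minimal sketch of the bucketing step follows. The slide does not say how buckets are keyed or assigned; keying on the first `w` characters of each suffix and assigning buckets greedily to the least-loaded processor are both assumptions, as are all names. Suffixes shorter than `w` are ignored here.

```python
from collections import defaultdict


def bucket_suffixes(ests, w):
    """Group suffixes, stored as (est_id, offset), by their first w characters."""
    buckets = defaultdict(list)
    for est_id, seq in enumerate(ests):
        for off in range(len(seq) - w + 1):
            buckets[seq[off:off + w]].append((est_id, off))
    return buckets


def assign_buckets(buckets, p):
    """Greedy assignment (assumed): largest bucket first, to the least-loaded processor."""
    loads = [0] * p
    owner = {}
    ordered = sorted(buckets.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for prefix, members in ordered:
        target = loads.index(min(loads))
        owner[prefix] = target  # every suffix in this bucket goes to one processor
        loads[target] += len(members)
    return owner, loads


ests = ["ACGTAC", "GTACGT"]  # hypothetical toy input
buckets = bucket_suffixes(ests, w=2)
owner, loads = assign_buckets(buckets, p=2)
```

Because a bucket is never split, each processor can build the subtree for its buckets independently, while the per-processor load stays close to the total suffix count divided by `p`.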

  19. Parallel Generation of GST (Cont …) • Each bucket’s processor: • Computes the compacted trie of all its suffixes • Cannot use sequential construction • Suffixes of a string • are not all in the same bucket • Each bucket: • Subtree in the GST • Nodes: • Depth-first search traversal of the trie • Pointer to the rightmost child

  20. On-demand Pair Generation • A pair should be generated if • The ESTs share a substring of length ≥ threshold • Maximal • Leaves in a common node • Share a substring of length = depth of node • Parallel algorithm • Each processor works with its trie if • Depth of its root in GST < threshold

  21. On-demand Pair Generation • To process • Sort internal nodes • Decreasing order of depth • Lists of a node • Generated after the node is processed • Removed after its parent is processed • Limits space to O(nl) • Run time ≈ # pairs generated + cost of sorting • Rejected pairs increase run-time by a factor of 2 • Eliminating duplicates reduces run-time

  22. Parallel Clustering • Master-Slave paradigm: • Master processor: • Maintains and updates clusters • Using union-find data structure • Receives messages from slave processors • A batch of next promising pairs generated by slave • Results of the pairwise alignment • Determines which ones to explore • Determines if merging should occur • Slave processors: • Generate pairs on demand • Perform pairwise alignments of pairs dispatched by the master processor
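
The master's bookkeeping can be sketched with a standard union-find structure. This is a generic textbook version with path compression and union by size, not code from PaCE; the class name, `select_pairs`, and the example values are hypothetical.

```python
class ClusterSet:
    """Union-find over EST ids 0..n-1; each set is one cluster."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        # Path halving keeps the trees shallow.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Merge the clusters of a and b; return False if already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def select_pairs(clusters, batch):
    """Keep only pairs whose ESTs are still in different clusters."""
    return [(i, j) for i, j in batch if clusters.find(i) != clusters.find(j)]


clusters = ClusterSet(5)
clusters.union(0, 1)  # an alignment result reported a strong overlap
clusters.union(1, 2)
to_align = select_pairs(clusters, [(0, 2), (2, 3), (3, 4)])
```

Filtering a batch against the current clustering is what lets the master skip alignments for pairs that an earlier merge has already placed in the same cluster.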

  23. Parallel Clustering (Cont…) Organization of Parallel Clustering Software • Batch of promising pairs generated + results of pairwise alignment • Batchsize or fewer # of pairs + results of pairwise alignment on each pair • [Figure: one Master P exchanging messages with several Slave P nodes]

  24. Parallel Clustering (Cont..) • To start: • Slave P starts with 3× batchsize pairs • Sends the 3rd batch to Master P • Starts alignment on 1st batch • Sends results on 1st + a newly generated batch • While waiting to receive results from Master P, aligns 2nd batch • Processor always has the next batch to work on between: • Submitting the results of the previous batch • Receiving another set of pairs

  25. Parallel Clustering (Cont..) • Improve and control quality • Parameters: • Match and mismatch scores • Gap penalties • Post processing: • Detection of alternative splicing • Consulting protein databases • Organism specific
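
To show where the match, mismatch, and gap parameters enter, here is a sketch of a generic Smith-Waterman local alignment score with a linear gap penalty. It is a stand-in for illustration, not the alignment routine used in PaCE, and the default scores are arbitrary assumptions.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (Smith-Waterman, linear gaps)."""
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + step,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                cur[j - 1] + gap,    # gap in a
            )
            best = max(best, cur[j])
        prev = cur
    return best


score = local_alignment_score("GGACGTCC", "TTACGTAA")  # hypothetical sequences
```

Keeping only two rows makes the memory cost linear in the shorter dimension. Raising the mismatch and gap penalties makes merges more conservative, which trades false positives for false negatives.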

  26. Experimental environment • Used C and MPI • Tested • Quality of software: • Arabidopsis thaliana (due to availability of its genome) • Run-time behavior: • 50,000 Maize ESTs with 32-processor IBM SP • # of processors • Data size • (# of Promising pairs) vs data size • Batchsize vs (# processors) • # of Clusters • Master processor’s time

  27. Quality Assessment • To assess quality • A data set and its correct clustering • ESTs from the plant Arabidopsis thaliana • Splice program • Align ESTs to the genome • Discard ESTs that • Don’t align • Align in multiple spots

  28. Quality Assessment (Cont …) • False negative: • A pair in correct clustering is not paired in the output • 5% • False positive: • A pair not in correct clustering appears in results • Negligible (< 0.04%) • Due to conservative nature of algorithm
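
The pair-based definitions above can be sketched as follows. The slide does not state the denominators behind its percentages; dividing false negatives by the number of correct pairs and false positives by the number of predicted pairs is an assumption, and the names and toy clusterings are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """All unordered pairs of ESTs that share a cluster."""
    pairs = set()
    for members in clusters:
        pairs.update(combinations(sorted(members), 2))
    return pairs


def pair_error_rates(truth, predicted):
    """Return (false_negative_rate, false_positive_rate) over EST pairs."""
    true_pairs = co_clustered_pairs(truth)
    pred_pairs = co_clustered_pairs(predicted)
    # Assumed normalization: FN over correct pairs, FP over predicted pairs.
    fn = len(true_pairs - pred_pairs) / len(true_pairs) if true_pairs else 0.0
    fp = len(pred_pairs - true_pairs) / len(pred_pairs) if pred_pairs else 0.0
    return fn, fp


truth = [{1, 2, 3}, {4, 5}]          # hypothetical correct clustering
predicted = [{1, 2}, {3}, {4, 5}]    # hypothetical tool output
rates = pair_error_rates(truth, predicted)
```

In this toy case EST 3 is left as a singleton, so two of the four correct pairs are missed and no incorrect pair is introduced, which mirrors the conservative behavior the slide describes.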

  29. Quality Assessment • Distribution of the number of singleton and non-singleton clusters for the benchmark set of 168,200 Arabidopsis ESTs.

  30. Quality Assessment (Cont..)

  31. Run-time Assessment • Experiment with 50,000 maize ESTs: • 32-processor IBM SP-2 • 16 minutes

  32. Run-time Assessment (Cont …) • Run-time (in seconds) spent in various components of PaCE for 20,000 ESTs (p: number of processors).

  33. Run-time Assessment (Cont ..) • Run-time as a function of batchsize • Small batchsize • Increase in communication overhead • Large batchsize • Slaves less responsive to the need of generating pairs • Slave does not use latest clustering results • Optimal batchsize • Determined by experiment • Master processor’s time • Fixed batchsize, increase in # of processors • Gradual increase in Master P’s time • With 32 processors, increase < 1% • Using 1 Master Processor is not a bottleneck

  34. Results • Space linear in the size of the input data set • Reduced total work without sacrificing quality • Reduced run-time • Parallel processors • Eliminating pairs • Facilitate clustering • Scale memory with # of processors

  35. Observations • PaCE: Approaches EST clustering problem directly • Better than • CAP3 • Phrap • TIGR Assembler • Compare time/quality • TGICL (TIGR Gene Indices Clustering Tools) • Support for PVM • MegaBlast • STACK • Large data sets • Lots of Processors • Can improve clustering time? • Clustering algorithm

  36. References • http://www.cs.berkeley.edu/~kubitron/courses/cs258-S02/lectures/eval10-logp.pdf • A. Apostolico, C. Iliopoulos, G. M. Landau, B. Schieber, and U. Vishkin. Parallel construction of a suffix tree with applications. Algorithmica, 3:347–365, 1988.
