Genome-scale Disk-based Suffix Tree Indexing

Genome-scale Disk-based Suffix Tree Indexing Phoophakdee and Zaki

Outline • Suffix Tree introduction • Application in Bioinformatics • Trellis • Trellis performance • Conclusion

Example Suffix Tree • Sequence • ACGACG$ • What are suffix links?
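The example slide can be made concrete with a brute-force sketch. It lists the suffixes of ACGACG$ and finds the internal (branching) nodes by checking which substrings are followed by at least two different characters. A suffix link then points from the node labelled xα to the node labelled α, i.e. the same label with its first character dropped. The function names are hypothetical, and the quadratic enumeration is for illustration only; it is not how a real suffix tree builder works.

```python
from collections import defaultdict

def suffixes(text):
    """All suffixes of text, longest first."""
    return [text[i:] for i in range(len(text))]

def internal_node_labels(text):
    """Path labels of internal nodes (root excluded), found by brute force.

    A substring labels an internal node when it is followed by at least
    two distinct characters somewhere in the text.
    """
    followers = defaultdict(set)
    for i in range(len(text)):
        for j in range(i + 1, len(text)):
            followers[text[i:j]].add(text[j])
    return sorted(label for label, nxt in followers.items() if len(nxt) >= 2)

def suffix_links(text):
    """Map each internal node label to its suffix-link target.

    The empty string stands for the root.
    """
    return {label: label[1:] for label in internal_node_labels(text)}

links = suffix_links("ACGACG$")
# ACG -> CG -> G -> root
```

For ACGACG$ the internal nodes are ACG, CG and G, and the suffix links chain them together down to the root.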

Suffix tree runtime • Time complexity • Construction of suffix tree: • O(n) time and space where n is the size of the text being searched • Substring Search: • O(m) time where m is size of substring/search pattern • Comparison with the Knuth-Morris-Pratt and Boyer-Moore algorithms, which preprocess the pattern and must scan the text again for every query
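The O(m) search bound can be illustrated with a minimal sketch. It uses an uncompressed suffix trie built naively in O(n²) time and space, which is a simplification and not the O(n) construction mentioned above; only the lookup shows the point, since it follows one edge per pattern character regardless of text length. The names `build_suffix_trie` and `contains` are hypothetical.

```python
def build_suffix_trie(text):
    """Naive suffix trie as nested dicts (O(n^2), for illustration only)."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def contains(trie, pattern):
    """Return True if pattern occurs in the indexed text.

    Walks one edge per pattern character, so the cost is O(m).
    """
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("ACGACG$")
found = contains(trie, "GAC")      # pattern present in the text
missing = contains(trie, "GG")     # pattern absent from the text
```

Once the index exists, each further query costs only O(m), whereas a scanning algorithm pays for the full text on every query.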

Application in Bioinformatics • Database search • Exact matching • Approximate matching* • Longest common substring • Genome alignment* • Structural motifs* • Tandem repeats* • Sequence comparison

Problems with Genome-scale suffix trees • Efficient O(n) suffix tree construction algorithms, e.g. Ukkonen’s algorithm • Tree must fit entirely in main memory • Genomes are very large • Human genome is 3 Gbp (0.75 GB at 2 bits per base) • Data structure no longer fits in memory
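The 0.75 GB figure follows from packing each of the four DNA bases into 2 bits. The sketch below reproduces that arithmetic and adds a rough index-size estimate. The per-base overhead passed to `index_size_bytes` is an illustrative assumption, not a number taken from the slides, and both function names are hypothetical.

```python
def packed_size_bytes(num_bases, bits_per_base=2):
    """Size of a DNA sequence when each base is packed into a few bits."""
    return num_bases * bits_per_base / 8

def index_size_bytes(num_bases, bytes_per_base):
    """Rough index size, given an ASSUMED per-base overhead in bytes."""
    return num_bases * bytes_per_base

human_genome_bases = 3_000_000_000
sequence_bytes = packed_size_bytes(human_genome_bases)   # 0.75e9 bytes
# 10 bytes per base is a hypothetical overhead chosen only for illustration.
tree_bytes = index_size_bytes(human_genome_bases, 10)    # 30e9 bytes
```

Even a modest per-base overhead makes the tree many times larger than the sequence itself, which is why the in-memory algorithms stop being practical at genome scale.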

What Trellis solves • Prevents data skew in prefix partitioning • Bad data skew with prefix partitioning leads to prefix partitions that may not fit into memory. • Caused by the non-uniform distribution of alphabet symbols in DNA • Efficient disk-based implementation • Functions under low memory constraints • Efficient disk I/O usage • Able to recover suffix links
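To see why fixed-length prefix partitioning suffers from skew, the sketch below counts how many suffixes fall into each partition when suffixes are grouped by their first k characters. The toy sequence and the function name `prefix_partition_sizes` are assumptions made for illustration.

```python
from collections import Counter

def prefix_partition_sizes(text, k):
    """Number of suffixes that start with each length-k prefix."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

# A deliberately skewed toy sequence: one symbol dominates.
skewed = "AAAAAAAACGT"
sizes_k1 = prefix_partition_sizes(skewed, 1)   # A gets 8 of 11 suffixes
sizes_k2 = prefix_partition_sizes(skewed, 2)   # AA gets 7 of 10 suffixes
```

With a non-uniform sequence, one partition holds most of the suffixes, and lengthening every prefix by the same amount does not fix the imbalance. The subtree for the largest partition is the one that may not fit into memory.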

Trellis Steps • Prefix Creation Phase • Partitioning Phase • Merging Phase • Suffix Link Recovery Phase (Optional)

Trellis Overview

Merging Phase

Threshold (t) • Determines partition of sequence • Suffix subtree fits into memory during partitioning phase. • Determines cutoff for prefix set inclusion • Recombined prefixed suffix subtree will fit entirely into memory during merging phase. • Allows input string and two sets of internal nodes to fit entirely into memory during suffix link recovery phase
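One way to picture how the threshold t can act as a cutoff for prefix set inclusion is the sketch below. It assumes a simple scheme in which a prefix is accepted once it occurs at most t times and is otherwise extended by one character. This is an in-memory illustration under that assumption, not the authors' implementation, and `variable_length_prefixes` is a hypothetical name.

```python
from collections import Counter

def variable_length_prefixes(text, t):
    """Return {prefix: occurrence count} with every count <= t.

    Prefixes that occur more than t times are extended by one character
    and recounted, so frequent prefixes end up longer than rare ones.
    """
    if t < 1:
        raise ValueError("t must be at least 1")
    accepted = {}
    pending = Counter(text)          # start with all length-1 prefixes
    length = 1
    while pending:
        too_big = set()
        for prefix, count in pending.items():
            if count <= t:
                accepted[prefix] = count
            else:
                too_big.add(prefix)
        length += 1
        # Recount only the positions whose shorter prefix was too frequent.
        pending = Counter(
            text[i:i + length]
            for i in range(len(text))
            if text[i:i + length - 1] in too_big
        )
    return accepted

prefixes = variable_length_prefixes("ACGACG$", 1)
```

Every suffix falls under exactly one accepted prefix, so the counts sum to the number of suffixes, and no single prefix group exceeds t.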

Trellis Overview

Performance • O(n²) time and O(n) space (where n is the sequence length) • Comparison to TDD • Currently the only other algorithm that scales up to genome level • Same time complexity • TDD does not calculate suffix links

Suffix Tree Construction

Query Times

Conclusion • Efficient disk-based suffix tree generation that works well with limited memory • Suffix links are recoverable • Future work • Extend to larger alphabets • Buffer input sequence • Parallelize partitioning and merging
