180 likes | 465 Views
Genome-scale Disk-based Suffix Tree Indexing. Phoophakdee and Zaki. Outline. Suffix Tree introduction Application in Bioinformatics Trellis Trellis performance Conclusion. Example Suffix Tree. Sequence ACGACG$ What are Suffix Links. Suffix tree runtime. Time complexity
E N D
Genome-scale Disk-based Suffix Tree Indexing Phoophakdee and Zaki
Outline • Suffix Tree introduction • Application in Bioinformatics • Trellis • Trellis performance • Conclusion
Example Suffix Tree • Sequence • ACGACG$ • What are Suffix Links
Suffix tree runtime • Time complexity • Construction of suffix tree: • O(n) time and space where n is the size of the text being searched • Substring Search: • O(m) time where m is size of substring/search pattern • Knuth-Morris-Pratt and Boyer-Moore algorithm comparison
Application in Bioinformatics • Database search • Exact matching • Approximate matching* • Longest common substring • Genome alignment* • Structural motifs* • Tandem repeats* • Sequence comparison
Problems with Genome-scale suffix trees • Efficient O(n) suffix tree generating algorithms • Tree must fit entirely in main memory • e.g. Ukkonen’s algorithm • Genomes are very large • Human genome is 3 Gbp (0.75 GB) • Data structure no longer able to fit in memory
What Trellis solves • Prevents data skew in prefix partitioning • Bad data skew with prefix partitioning leads to prefix partitions that may not fit into memory. • From non-uniform distribution of alphabit/DNA • Efficient disk-base implementation • Function under low memory constraints • Efficient disk IO usage • Able to recover suffix links
Trellis Steps • Prefix Creation Phase • Partitioning Phase • Merging Phase • Suffix Link Recovery Phase (Optional)
Threshold (t) • Determines partition of sequence • Suffix subtree fits into memory during partitioning phase. • Determines cutoff for prefix set inclusion • Recombined prefixed suffix subtree will fit entirely into memory during merging phase. • Allows input string and two sets of internal nodes to fit entirely into memory during suffix link recovery phase
Performance • O(n2) time and O(n) space (where n is sequence length) • Comparison to TDD • Currently only other algorithm that scales up to genome level • Same time complexity • Does not calculate suffix links
Conclusion • Efficient disk-based suffix tree generation that works well with limited memory • Suffix links are recoverable • Future work • Extend to larger alphabets • Buffer input sequence • Parallelize partitioning and merging