A Modified Burrows-Wheeler Transformation for Case-insensitive Search with Application to Suffix Array Compression

Kunihiko Sadakane, Department of Information Science, University of Tokyo

Presentation Transcript


  1. A Modified Burrows-Wheeler Transformation for Case-insensitive Search with Application to Suffix Array Compression. Kunihiko Sadakane, Department of Information Science, University of Tokyo

  2. Promising Techniques
     • Block Sorting Compression [Burrows, Wheeler 94]: faster than PPMs; decoding is much faster; comparable performance with PPMs.
     • Suffix Array [Manber, Myers 93]: a search data structure; can find any substring; more memory-efficient than suffix trees.
     We unify compression and search by using both. Key: the Burrows-Wheeler Transformation (BWT).

  3. Block Sorting Compression
     • The Burrows-Wheeler Transformation (BWT) permutes the text symbols according to the lexicographic order of their suffixes.
     • The permuted text becomes more compressible.

  4. Novel Feature of Block Sorting
     • The BWT is defined by the suffix array (the sorted indexes of the suffixes).
     • The suffix array is recovered from the compressed text during decoding.
     So the suffix array can be compressed by Block Sorting. However, a suffix array obtained this way cannot be used for case-insensitive search.
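The recovery of the suffix array during decoding can be illustrated with a minimal Python sketch. It assumes the BWT is taken over sorted cyclic rotations in ASCII order and that the decoder is told which sorted row holds the unrotated text; the function name `inverse_bwt` and the parameter `primary` are illustrative assumptions, not details from the slides. The input `cCABAb` is the BWT of the string AbcABC under those assumptions.

```python
def inverse_bwt(last, primary):
    """Decode a BWT string and recover the suffix array along the way.

    last    -- the BWT output (last column of the sorted rotations)
    primary -- assumed side information: the sorted row holding the
               unrotated text (the row whose suffix-array entry is 0)
    """
    n = len(last)
    # A stable sort of the last column yields the first column; lf[i] is
    # the row that starts with the symbol stored at last[i].
    order = sorted(range(n), key=lambda i: last[i])
    lf = [0] * n
    for first_row, last_row in enumerate(order):
        lf[last_row] = first_row

    text = [""] * n
    suffix_array = [0] * n
    row, pos = primary, 0
    for _ in range(n):
        suffix_array[row] = pos      # this row starts at text position pos
        pos = (pos - 1) % n
        text[pos] = last[row]        # last[row] cyclically precedes that row
        row = lf[row]                # move to the row starting one symbol earlier
    return "".join(text), suffix_array


decoded, sa = inverse_bwt("cCABAb", primary=1)
print(decoded)  # AbcABC
print(sa)       # [3, 0, 4, 5, 1, 2]
```

The suffix array falls out of the same walk that rebuilds the text, so no extra sorting is needed on the decoding side.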

  5. Our Contribution
     • We propose a Modified Burrows-Wheeler Transformation (MBWT).
     • It is used for compressing a text and its suffix array.
     • The decoded suffix array can be used for case-insensitive search.
     • Any unification function can be used (not only case-insensitive search).
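To make the case-insensitive search concrete, here is a minimal Python sketch of a binary search over a suffix array built from the unified (lowercased) text. The helper names `unify`, `build_suffix_array` and `find_all` are hypothetical, the suffix array is built by naive sorting for brevity, and the sample text is the one shown on the application slides.

```python
def unify(s):
    # Assumed unification function for this sketch: lowercase everything.
    return s.lower()


def build_suffix_array(unified):
    # Naive construction: sort suffix start positions by the suffix itself.
    return sorted(range(len(unified)), key=lambda i: unified[i:])


def find_all(unified, sa, pattern):
    """Return all start positions where pattern occurs, ignoring case."""
    p = unify(pattern)
    m = len(p)
    # Lower bound: first suffix whose m-symbol prefix is >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if unified[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-symbol prefix is > p.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if unified[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


text = "xyz Abc XYZ ABC"
unified = unify(text)
sa = build_suffix_array(unified)
print(find_all(unified, sa, "abc"))  # [4, 12]
print(find_all(unified, sa, "XyZ"))  # [0, 8]
```

Because the suffixes are ordered by their unified form, all case variants of a pattern occupy one contiguous range of the suffix array, which is what makes a single binary search sufficient.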

  6. An Application: Distributed Web Search Robots
     Web sites → search robots → collected text (xyz Abc XYZ ABC) → compress by Block Sorting → transfer via network

  7. Search Server
     transfer via network → decode → text (xyz Abc XYZ ABC) and suffix arrays (3 10 8 5 2 7 ...; 14 2 8 3 9 5 10 ...) → merge into database (8 4 100 251 58 ...) → suffix array on disk

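  8. The original BWT
     Input text: AbcABC

     Rotations of the input text:
       0 AbcABC
       1 bcABCA
       2 cABCAb
       3 ABCAbc
       4 BCAbcA
       5 CAbcAB

     After sorting (index, rotation, last symbol set apart):
       3 ABCAb c
       0 AbcAB C
       4 BCAbc A
       5 CAbcA B
       1 bcABC A
       2 cABCA b

     Suffix array: 3 0 4 5 1 2
     BWTed text (last symbols): c C A B A b
     First symbols of the sorted rotations: A A B C b c
     The BWT maps the input text to the BWTed text; the reverse BWT maps it back.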
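The example on this slide can be reproduced with a short Python sketch. It sorts cyclic rotations naively, which is fine for a six-symbol string but is not how a practical block sorter would do it; the function name `bwt` is an illustrative choice. Python compares strings by code point, so capital letters sort before small letters, as on the slide.

```python
def bwt(text):
    """Return (suffix_array, bwt_text) from naively sorted cyclic rotations."""
    n = len(text)
    # Sort rotation start positions by the rotation that begins there.
    suffix_array = sorted(range(n), key=lambda i: text[i:] + text[:i])
    # The output symbol for each row is the one that cyclically precedes
    # the rotation, i.e. the last symbol of that rotation.
    last = "".join(text[i - 1] for i in suffix_array)
    return suffix_array, last


sa, last = bwt("AbcABC")
print(sa)    # [3, 0, 4, 5, 1, 2]
print(last)  # cCABAb
```

Note that `A` and `a` end up in different parts of the sorted order, which is why this suffix array cannot serve a case-insensitive search directly.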

  9. Unification
     Unification treats equivalent characters as the same symbol.
     • Unify capital and small letters (tolower): DCC = dcc
     • Unify double-byte and single-byte codes in Japanese EUC code: ＡＢＣ (a3c1 a3c2 a3c3) = ABC (41 42 43)
     • Unify Japanese Hiragana and Katakana: あいうえお = アイウエオ
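A unification function covering the three cases on this slide might look like the following Python sketch. It works on Unicode strings rather than on EUC bytes, which is an assumption made for readability; the names `unify_char` and `unify` are hypothetical. Each input symbol maps to exactly one output symbol, so positions in the unified text line up with positions in the original text.

```python
def unify_char(ch):
    """Map one character to its representative; always returns one character."""
    code = ord(ch)
    if 0xFF01 <= code <= 0xFF5E:
        # Full-width ASCII variants (double-byte in EUC) -> plain ASCII.
        ch = chr(code - 0xFEE0)
    elif 0x30A1 <= code <= 0x30F6:
        # Katakana -> the Hiragana at the same offset.
        ch = chr(code - 0x60)
    if "A" <= ch <= "Z":
        # ASCII-only tolower, to keep the mapping one symbol to one symbol.
        ch = chr(ord(ch) + 32)
    return ch


def unify(text):
    return "".join(unify_char(ch) for ch in text)


print(unify("DCC"))                      # dcc
print(unify("ＡＢＣ") == unify("ABC"))    # True
print(unify("アイウエオ"))                # あいうえお
```

Restricting the lowercase step to ASCII is a deliberate simplification: some Unicode case mappings change the string length, which would break the position correspondence.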

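  10. Modified BWT
      The MBWT permutes the symbols of the input text by the suffix array of the unified text.
      Input text: AbcABC → unify → abcabc$

      Suffixes of the unified text:
        0 abcabc$
        1 bcabc$
        2 cabc$
        3 abc$
        4 bc$
        5 c$

      After sorting (index, suffix, preceding symbol of the input text):
        3 abc$     c
        0 abcabc$  C
        4 bc$      A
        1 bcabc$   A
        5 c$       B
        2 cabc$    b

      Suffix array: 3 0 4 1 5 2
      MBWTed text: c C A A B b
      Unifying the MBWTed text gives c c a a b b; sorted, this is a a b b c c (the "unify" and "reverse BWT" steps on the slide).
      The MBWT maps the input text to the MBWTed text; the reverse MBWT maps it back.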
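The forward transformation on this slide can be sketched in Python as follows. The sketch assumes `str.lower` as the unification function and a `$` terminator that sorts below every text symbol, as in the slide's example; the function name `mbwt` and its parameters are illustrative, and the naive sort stands in for a real suffix sorting algorithm. Only the forward direction is shown.

```python
def mbwt(text, unify=str.lower, sentinel="$"):
    """Return (suffix_array, mbwt_text) for the forward Modified BWT.

    Assumption: the sentinel compares lower than every symbol of the
    unified text (true for '$' against ASCII letters).
    """
    unified = unify(text) + sentinel
    n = len(text)
    # Sort positions 0..n-1 by the suffixes of the UNIFIED text.
    suffix_array = sorted(range(n), key=lambda i: unified[i:])
    # Emit the ORIGINAL symbol that cyclically precedes each sorted suffix,
    # so the case information survives in the output.
    out = "".join(text[i - 1] for i in suffix_array)
    return suffix_array, out


sa, out = mbwt("AbcABC")
print(sa)   # [3, 0, 4, 1, 5, 2]
print(out)  # cCAABb
```

The ordering comes from the unified text while the emitted symbols come from the original text, which is the difference from the original BWT on the earlier slide.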

  11. Compression Ratio and Speed
      HTML files (total 90 Mbytes); block size: 9 Mbytes

      unification func.    comp. ratio    comp. time (s)
      identical (BWT)      1.743          363.58
      normal (MBWT)        1.764          363.41
      LSB4                 2.523          443.89
      MSB4                 2.707          438.04
      zero (no BWT)        5.772          411.74

      • The difference between BWT and MBWT is small.
      • MBWT provides case-insensitive searches.
