
A Simple Optimal Representation for Balanced Parentheses

A Simple Optimal Representation for Balanced Parentheses. Richard Geary, Naila Rahman, Rajeev Raman (University of Leicester, UK) and Venkatesh Raman (Institute for Mathematical Sciences, Chennai, India). A Parentheses Data Structure. Given: Balanced string of 2 n parentheses.


Presentation Transcript


  1. A Simple Optimal Representation for Balanced Parentheses Richard Geary, Naila Rahman, Rajeev Raman (University of Leicester, UK) and Venkatesh Raman (Institute for Mathematical Sciences, Chennai, India) CPM 2004

  2. A Parentheses Data Structure • Given: a balanced string of 2n parentheses, e.g. ( ( ( ( ) ) ) ( ) ( ) ). • Support operations: • ENCLOSE(i) • FINDCLOSE(i), FINDOPEN(i) • EXCESS(i) • Applications to suffix trees, ordinal trees and stack-sortable permutations.
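The four operations can be illustrated with a naive linear-time sketch. This is not the data structure from the talk, only a reference for what each operation returns. Positions are 0-indexed here, and the definition of EXCESS (opening minus closing parentheses up to and including position i) is an assumption, since the slide does not spell it out.

```python
def excess(s, i):
    """Assumed definition: number of '(' minus number of ')' in s[0..i]."""
    return sum(1 if c == "(" else -1 for c in s[: i + 1])


def findclose(s, i):
    """Position of the ')' matching the '(' at position i."""
    depth = 0
    for j in range(i, len(s)):
        depth += 1 if s[j] == "(" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced string")


def findopen(s, i):
    """Position of the '(' matching the ')' at position i."""
    depth = 0
    for j in range(i, -1, -1):
        depth += 1 if s[j] == ")" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced string")


def enclose(s, i):
    """Opening position of the closest pair enclosing the pair opened at i.

    Returns None for an outermost pair.
    """
    depth = 0
    for j in range(i - 1, -1, -1):
        depth += 1 if s[j] == ")" else -1
        if depth < 0:  # found a '(' that is still open at position i
            return j
    return None


# The example string from the slide, written without spaces.
EXAMPLE = "(((()))()())"
```

With this string, `findclose(EXAMPLE, 1)` is 6 and `enclose(EXAMPLE, 7)` is 0, the outermost opening parenthesis.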

  3. Parentheses Representation • 2n bits, O(n) time. • Θ(n lg n ) bits, O(1) time. • O(n) bits, O(1) time. [Jacobson, `89] • 2n+o(n) bits, O(1) time. [Munro, Raman, `01] • 2n+o(n) bits, O(1) time. New data structure. • Our new DS • is simpler (no perfect hash tables), • smaller o(n) term, • uniform o(n) time and space construction algorithm. • Implemented and shown to be quite practical • far more compact than D/S using naïve representation, • speed comparable to D/S using naïve representation.

  4. XML • XML: eXtensible Markup Language • de facto standard for electronic data interchange. • Document Object Model (DOM) standard API for manipulating XML documents • holds all data in memory, • large memory usage.

  5. Example XML document • <person> <name> <first>Bill</first> <surname>Bloggs</surname> </name> <dob> <day>1</day> <month>April</month> <year>1961</year> </dob> </person> • Tree figure: root person with children name and dob; name has children first and surname; dob has children day, month and year. • DOM NODE interface has methods PARENT(x), NEXTSIB(x), PREVSIB(x), LASTCHILD(x), FIRSTCHILD(x).

  6. Obvious representation • 2n pointers • DOM: 3n. • Ω(n log n) bits.

  7. Using parentheses • <person> <name> <first>Bill</first> <surname>Bloggs</surname> </name> <dob> <day>1</day> <month>April</month> <year>1961</year> </dob> </person> • Nodes numbered in document order: 1 person, 2 name, 3 first, 4 surname, 5 dob, 6 day, 7 month, 8 year. • Parentheses representation: ( ( ( ) ( ) ) ( ( ) ( ) ( ) ) ), where the k-th opening parenthesis corresponds to node k. • 2n + o(n) bits for tree structure.
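The mapping from the XML tree to the parentheses string can be sketched as a depth-first walk that emits "(" on entering an element and ")" on leaving it. The function name `to_parens` is hypothetical, and Python's standard `xml.etree.ElementTree` parser stands in for the DOM implementation used in the talk.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<person>"
    "<name><first>Bill</first><surname>Bloggs</surname></name>"
    "<dob><day>1</day><month>April</month><year>1961</year></dob>"
    "</person>"
)


def to_parens(elem):
    """Emit '(' on entering an element and ')' on leaving it (hypothetical helper)."""
    return "(" + "".join(to_parens(child) for child in elem) + ")"


root = ET.fromstring(DOC)
parens = to_parens(root)

# Tags in document order; the k-th tag belongs to the k-th '(' in parens.
tags = [e.tag for e in root.iter()]
```

For the example document this yields the 16-character string shown on the slide, one pair of parentheses per element.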

  8. Node interface ops using Parentheses DS • Node interface → Parentheses DS: • PARENT → ENCLOSE • NEXTSIB → FINDCLOSE • PREVSIB → FINDOPEN • LASTCHILD → FINDCLOSE, FINDOPEN
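A minimal sketch of how the node operations might be expressed through the parentheses operations, assuming a node is identified by the 0-indexed position of its opening parenthesis. The class name `ParenTree` is hypothetical, and the matches are precomputed with a stack here rather than answered by the succinct structure.

```python
class ParenTree:
    """Hypothetical wrapper: a node is the position of its '('."""

    def __init__(self, s):
        self.s = s
        self.match = [0] * len(s)    # answers FINDCLOSE / FINDOPEN
        self.encl = [None] * len(s)  # answers ENCLOSE for opening positions
        stack = []
        for i, c in enumerate(s):
            if c == "(":
                self.encl[i] = stack[-1] if stack else None
                stack.append(i)
            else:
                j = stack.pop()
                self.match[i], self.match[j] = j, i

    def parent(self, x):
        # PARENT is ENCLOSE.
        return self.encl[x]

    def nextsib(self, x):
        # The next sibling opens right after FINDCLOSE(x), if anything opens there.
        j = self.match[x] + 1
        return j if j < len(self.s) and self.s[j] == "(" else None

    def prevsib(self, x):
        # The previous sibling closes right before x; FINDOPEN gives its start.
        return self.match[x - 1] if x > 0 and self.s[x - 1] == ")" else None

    def lastchild(self, x):
        # The last child closes right before FINDCLOSE(x); FINDOPEN gives its start.
        c = self.match[x]
        return self.match[c - 1] if self.s[c - 1] == ")" else None


# Parentheses string of the person/name/dob example tree.
tree = ParenTree("((()())(()()()))")
```

In this encoding position 1 is the name node and position 7 is the dob node, so `tree.nextsib(1)` is 7 and `tree.lastchild(7)` is 12 (the year node).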

  9. Succinct DOM • Succinct DOM: • uses far less space than standard DOM, • performance competitive with DOM. • Node interface implemented by natural parentheses ops. • Operations supported by parentheses data structures • Jacobson `89, • Munro and Raman `01, • Our new data structure.

  10. Our new D/S • Input: balanced string of 2n parentheses. • Assume a recursive data structure that stores a balanced string of 2N ≤ 2n parentheses. • If N is O(n / lg² n), store answers explicitly for every pair of parentheses. • Otherwise, divide into blocks of size B = O(lg N). • Number of blocks: β = ⌈2N / B⌉.

  11. FINDCLOSE(x) • Example string: ( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) ) • FINDCLOSE(3)? • Matching parenthesis inside the same block: near parenthesis. • A pre-computed table stores the position of the matching parenthesis for all near parentheses. • O(1) time if near parenthesis. • Table size is [formula missing from transcript].
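The table lookup for near parentheses can be sketched as follows. The block size of 4, the function names and the bit encoding (1 for "(", 0 for ")") are assumptions made for illustration; the slide does not give them. Positions are 0-indexed, so the slide's position 3 is index 2 here.

```python
def build_near_table(B):
    """table[pattern][k] = offset of the match of offset k within a block,
    or -1 if the match lies outside the block (a far parenthesis)."""
    table = []
    for pattern in range(1 << B):
        # Most significant bit first: 1 encodes '(', 0 encodes ')'.
        bits = [(pattern >> (B - 1 - k)) & 1 for k in range(B)]
        match = [-1] * B
        stack = []
        for k, bit in enumerate(bits):
            if bit:
                stack.append(k)
            elif stack:
                j = stack.pop()
                match[j], match[k] = k, j
        table.append(match)
    return table


def near_match(s, p, B, table):
    """Match of position p if it is near, else None.

    Assumes len(s) is a multiple of B. A real implementation would read the
    block as a machine word; building the pattern from characters here is
    only for readability and is not constant time.
    """
    start = (p // B) * B
    block = s[start : start + B]
    pattern = int("".join("1" if c == "(" else "0" for c in block), 2)
    offset = table[pattern][p - start]
    return None if offset < 0 else start + offset


# The 20-parenthesis example string from the slide, without spaces.
EXAMPLE = "((()((()))()(())()))"
TABLE = build_near_table(4)
```

With the assumed block size, `near_match(EXAMPLE, 2, 4, TABLE)` returns 3 (a near parenthesis), while `near_match(EXAMPLE, 4, 4, TABLE)` returns None because the match lies in a later block.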

  12. Pioneer Parentheses • Example string: ( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) ) • FINDCLOSE(5)? • Matching parenthesis outside the block: far parenthesis. • b(p) = block number of the parenthesis at position p. • μ(p) = position of the match of p. • q is the first far parenthesis before p. • p is a pioneer if b(μ(p)) ≠ b(μ(q)). • At most 2β-3 open pioneers. Similarly, at most 2β-3 close pioneers.
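A small sketch of how opening pioneers could be identified on the example string. It assumes a block size of 4 (not stated on the slide), treats the first far opening parenthesis as a pioneer, and only handles opening pioneers; the function names are hypothetical and positions are 0-indexed.

```python
def match_table(s):
    """match[i] = position of the parenthesis matching position i."""
    match = [0] * len(s)
    stack = []
    for i, c in enumerate(s):
        if c == "(":
            stack.append(i)
        else:
            j = stack.pop()
            match[i], match[j] = j, i
    return match


def open_pioneers(s, block_size):
    """Far '(' whose match lies in a different block than the match of the
    previous far '(' (or that has no previous far '(')."""
    match = match_table(s)

    def block(p):
        return p // block_size

    pioneers = []
    prev_far = None
    for p, c in enumerate(s):
        if c != "(" or block(p) == block(match[p]):
            continue  # closing parenthesis, or a near opening parenthesis
        if prev_far is None or block(match[prev_far]) != block(match[p]):
            pioneers.append(p)
        prev_far = p
    return pioneers


EXAMPLE = "((()((()))()(())()))"
pioneers = open_pioneers(EXAMPLE, 4)
matches = match_table(EXAMPLE)

# The opening pioneers together with their matches, read left to right.
family_positions = sorted(pioneers + [matches[p] for p in pioneers])
family = "".join(EXAMPLE[p] for p in family_positions)
```

Under these assumptions the opening pioneers sit at indices 0 and 4 (positions 1 and 5 on the slide), and together with their matches they form the short balanced string `(())`.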

  13. Pioneer Family • Example string: ( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) ) • Pioneer family: set of all opening and closing pioneers along with their matching parentheses. • Balanced string of size at most 4β-6. • Pioneer family of the example: ( ( ) )

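  14. Our D/S • Top level: string of 2N parentheses, e.g. ( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) ). • NND marks the positions of the pioneer family within this string. • Next level: pioneer family of size O(N / lg N), e.g. ( ( ) ). • Two levels of recursion. • When the pioneer family has size O(N / lg² N) we store explicit answers.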

  15. Space usage • NND uses O(N lg lg N / lg N) bits. • Tables use O(N lg lg N / lg N) bits. • S(n) = 2n + O(n lg lg n / lg n) = 2n + o(n) bits.

  16. Pseudo-pioneers • Near blocks: blocks which have no pioneers. • Insert pseudo-pioneers at the start and end of every near block. • Pseudo-pioneers do not affect FINDOPEN(x), FINDCLOSE(x), ENCLOSE(x). • Gap between pioneers is now at most 2B = O(lg N).

  17. NND • Example string: ( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) ) • NND bit vector for the example: 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 • 2n-bit vector used to find the pioneer for a far parenthesis. • If there is a pioneer at position i in the parentheses string, then there is a 1 at position i in the NND. • Operations we need: • Find the address of the most recent 1 at or before position i: r = Rank(i), p = Select(r). • Find the ith 1 in the bit vector: p = Select(i). • We want a succinct representation. • D/S should be simple and fast.
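The two NND queries can be illustrated with a naive linear-time sketch on the bit vector from the slide. Positions are 1-indexed to match the slide, and the function names are hypothetical. This shows only what Rank and Select compute, not the succinct representation.

```python
# Bit vector from the slide: 1s at positions 1, 5, 10 and 20.
NND = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]


def rank(bits, i):
    """Number of 1s among positions 1..i (1-indexed, inclusive)."""
    return sum(bits[:i])


def select(bits, r):
    """Position (1-indexed) of the r-th 1."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        seen += b
        if b and seen == r:
            return pos
    raise ValueError("fewer than r ones in the bit vector")


def most_recent_one(bits, i):
    """Position of the last 1 at or before position i: Select(Rank(i))."""
    return select(bits, rank(bits, i))
```

For a far parenthesis at position 7, `most_recent_one(NND, 7)` returns 5, the position of the nearest marked parenthesis to its left.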

  18. NND • Bit vector of length M with N 1s. • Gap between 1s at most (lg M)^c. • t = lg M / (2c lg lg M).

  19. Select(i) • Find the ith 1 in the bit vector. • Array A1 stores the position of every tth 1. • Space is [formula missing from transcript]. • Array A2 stores the gaps between consecutive 1s. • Space is O(N lg lg M) or O(M lg lg M / lg M) bits. • Table T1 allows us to look up the sum of up to t gaps. • Space is [formula missing from transcript]. • SELECT(i): i′ = [missing from transcript]; i″ = (i+1) mod t; y = concatenation of A2[i′+1], .., A2[i′+i″]; return A1[x] + T1[y].
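The idea of sampling every tth 1 and storing gaps can be sketched as below. This is an illustration under stated assumptions, not the layout from the talk: indices are 0-based, the class name `SampledSelect` is hypothetical, and the table T1 is replaced by a direct sum over at most t-1 gaps, which gives the same answer without the constant-time lookup.

```python
class SampledSelect:
    """Hypothetical two-array select: samples (A1) plus gaps (A2)."""

    def __init__(self, bits, t):
        ones = [pos for pos, b in enumerate(bits) if b]
        self.t = t
        # A1[k] = position of the (k*t)-th 1, counting from 0.
        self.A1 = ones[::t]
        # A2[j] = gap between the (j-1)-th and the j-th 1; A2[0] is unused.
        self.A2 = [0] + [ones[j] - ones[j - 1] for j in range(1, len(ones))]

    def select(self, i):
        """Position of the i-th 1 (0-indexed)."""
        k, r = divmod(i, self.t)
        first = k * self.t + 1
        # In the talk, this sum would come from a T1 lookup on the
        # concatenated gaps; here it is computed directly.
        return self.A1[k] + sum(self.A2[first : first + r])


# Same bit vector as the NND example, with 1s at 0-indexed positions 0, 4, 9, 19.
BITS = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
sel = SampledSelect(BITS, t=2)
```

With t = 2, `sel.select(3)` reads the sample A1[1] = 9 and adds the single gap 10 to reach position 19.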

  20. Rank(i) • Prefix sum at position i. • Need two more arrays and tables of size at most O(M lg lg M / lg M) bits.

  21. Implementation Details • C++ on Sun UltraSparc-III and Pentium 4. • Implemented new and optimised Jacobson D/S. • CenterPoint XML for DOM. • Sample of 12 XML documents of varying sizes and node counts. • Block sizes 32, 64, 128 and 256. • Test was a depth-first tree walk, counting nodes of a given XML type.

  22. Space usage and performance • Space usage for tree structure • Std DOM: 96 bits per node. • Jacobson: 3.3 – 16 bits per node. • New D/S: 2.9 – 12.8 bits per node. • Avg performance for succinct D/S relative to std DOM • UltraSparc: 1 to 2.5 times slower. • Pentium 4: 1.7 to 4 times slower.

  23. Conclusions and Future work • Conceptually simple succinct representation for balanced parentheses with O(1) time ops. • o(n) time and space construction algorithm. • Improved lower-order term for space bound. • Relative performance very good on UltraSparc but poorer on Pentium 4, which has a small cache. • Cache optimisation is an interesting problem. • Complete set of D/S for succinct DOM.
