Search Algorithms Winter Semester 2004/2005 8 Nov 2004 4th Lecture

1. Search Algorithms, Winter Semester 2004/2005, 8 Nov 2004, 4th Lecture
Christian Schindelhauer, schindel@upb.de

2. Chapter II: Searching in Compressed Text (08 Nov 2004)

3. Searching in Compressed Text (Overview)
• What is Text Compression
• Definition
• The Shannon Bound
• Huffman Codes
• The Kolmogorov Measure
• Searching in Non-adaptive Codes
• KMP in Huffman Codes
• Searching in Adaptive Codes
• The Lempel-Ziv Codes
• Pattern Matching in Z-Compressed Files
• Adapting Compression for Searching

4. What is Text Compression?
• First approach:
  • Given a text s ∈ Σ^n
  • Find a compressed version c ∈ Σ^m with m < n
  • such that s can be derived from c
• Formal: a compression function f : Σ* → Σ* that is one-to-one (injective) and efficiently invertible
• Fact: most texts are incompressible
• Proof:
  • There are (|Σ|^(m+1) - 1)/(|Σ| - 1) strings of length at most m
  • There are |Σ|^n strings of length n
  • Of these, at most (|Σ|^(m+1) - 1)/(|Σ| - 1) can be compressed to length at most m
  • This is a fraction of at most |Σ|^(m-n+1)/(|Σ| - 1)
  • E.g. for |Σ| = 256 and m = n - 10 this is about 8.3 × 10^(-25),
  • so only a fraction of 8.3 × 10^(-25) of all files of n bytes can be compressed by 10 bytes
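
The counting argument is easy to check numerically. A minimal Python sketch (the function name compressible_fraction and the sample value n = 100 are mine, not from the slides):

from fractions import Fraction

# Fraction of all length-n strings that can be compressed to length at most m:
# (number of strings of length <= m) / (number of strings of length n).
def compressible_fraction(sigma: int, n: int, m: int) -> Fraction:
    shorter = (sigma ** (m + 1) - 1) // (sigma - 1)   # strings of length <= m
    return Fraction(shorter, sigma ** n)

# The slide's example: byte alphabet, compressing n-byte files by 10 bytes.
print(float(compressible_fraction(256, 100, 90)))     # ~8.3e-25, independent of n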

5. Why does Text Compression work?
• Texts usually use letters with different frequencies
• Relative frequencies of letters in general English plain text (from Cryptological Mathematics by Robert Edward Lewand):
  • e: 12%, t: 10%, a: 8%, i: 7%, n: 7%, o: 7%, ...
  • k: 0.4%, x: 0.2%, j: 0.2%, q: 0.09%, z: 0.06%
• Special characters like $, %, # occur even less frequently
• Some character codes are (nearly) unused, e.g. byte 0 of ASCII
• Text obeys many rules:
  • Words are (usually) drawn from a fixed vocabulary (collected in dictionaries)
  • Not all words can be used in combination
  • Sentences are structured (grammar)
  • Program code uses code words
• Digitally encoded pictures have smooth areas where colors change gradually
• Patterns repeat

6. Information Theory: The Shannon Bound
• C. E. Shannon, in his 1948 paper "A Mathematical Theory of Communication", derives his definition of entropy
• The entropy rate of a data source is the average number of bits per symbol needed to encode it
• Example text: ababababab
  • Entropy: 1 bit per symbol
  • Encoding: use 0 for a, use 1 for b
  • Code: 0101010101
• Huffman codes are one way to approach the Shannon bound (for sufficiently long texts)
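
As an illustration, the empirical (order-0) entropy can be computed directly from the letter frequencies. A minimal Python sketch (the function name entropy is mine):

from collections import Counter
from math import log2

# Order-0 entropy: the average number of bits per symbol that any
# symbol-by-symbol code needs, by Shannon's bound.
def entropy(text: str) -> float:
    n = len(text)
    return -sum(c / n * log2(c / n) for c in Counter(text).values())

print(entropy("ababababab"))   # 1.0, so one bit per symbol suffices (a -> 0, b -> 1)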

7. Huffman Code
• A Huffman code is adapted to each text (but does not change within the text)
• It consists of
  • a dictionary, which maps each letter of the text to a binary string, and
  • the code, given as a prefix-free binary encoding
• A prefix-free code uses strings s1, s2, ..., sm of variable length such that no string si is a prefix of another string sj
• Example of a Huffman encoding (codes from the tree on the next slide):
  • Text: m a n a m a n a p a t i p i t i p i
  • Encoding: 001 01 111 01 001 01 111 01 000 01 110 10 000 10 110 10 000 10

8. Computing Huffman Codes
• Compute the letter frequencies
• Build root nodes labeled with the frequencies
• repeat
  • Build a node connecting the two least frequent unlinked nodes
  • Mark the two edges with 0 and 1
  • The father node carries the sum of the frequencies
• until one tree is left
• The path to each letter spells out its code (see the sketch below)
[Figure: Huffman tree for the example text; leaf frequencies a:5, i:4, p:3, m:2, t:2, n:2; internal nodes 5 (p+m), 10 (5+a), 4 (t+n), 8 (4+i), root 18; resulting codes a=01, i=10, p=000, m=001, t=110, n=111]
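
A minimal Python sketch of this construction, using a heap to pick "the two least frequent unlinked nodes" (the heap-based formulation and its tie-breaking are mine, so the concrete codewords may differ from the figure, while the codeword lengths agree):

import heapq
from collections import Counter

# Repeatedly merge the two least frequent trees; the edge to the first
# subtree is labelled 0, the edge to the second 1.
def huffman_code(text: str) -> dict:
    heap = [(f, j, {ch: ""}) for j, (ch, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    j = len(heap)
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)   # least frequent tree
        f1, _, c1 = heapq.heappop(heap)   # second least frequent tree
        merged = {ch: "0" + code for ch, code in c0.items()}
        merged.update({ch: "1" + code for ch, code in c1.items()})
        heapq.heappush(heap, (f0 + f1, j, merged))   # father carries the sum
        j += 1
    return heap[0][2]

print(huffman_code("manamanapatipitipi"))   # codeword lengths: a,i -> 2; p,m,t,n -> 3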

9. Searching in Huffman Codes
• Let u be the size (in bits) of the compressed text
• Let v be the size (in bits) of the pattern, Huffman-encoded according to the text's dictionary
• KMP can search in Huffman codes in time O(u + v + m)
  • Encoding the pattern takes O(v + m) steps
  • Building the prefix function takes time O(v)
  • Searching the text on the bit level takes time O(u + v)
• Problems:
  • This algorithm works bit-wise, not byte-wise
  • Exercise: develop a byte-wise strategy
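
A minimal Python sketch of the bit-level search: text and pattern are encoded with the same dictionary (here the code table reconstructed above) and standard KMP runs over the bit strings. Representing bits as '0'/'1' characters is a simplification of mine; note also that a real matcher must verify that a bit-level match starts on a codeword boundary:

# Standard KMP, run on '0'/'1' strings instead of letters.
def kmp_search_bits(code_bits: str, pattern_bits: str):
    pi = [0] * len(pattern_bits)          # failure function of the bit pattern
    k = 0
    for i in range(1, len(pattern_bits)):
        while k > 0 and pattern_bits[i] != pattern_bits[k]:
            k = pi[k - 1]
        if pattern_bits[i] == pattern_bits[k]:
            k += 1
        pi[i] = k
    k = 0
    matches = []
    for i, b in enumerate(code_bits):     # scan the compressed text bit by bit
        while k > 0 and b != pattern_bits[k]:
            k = pi[k - 1]
        if b == pattern_bits[k]:
            k += 1
        if k == len(pattern_bits):
            matches.append(i - k + 1)     # bit position of the match
            k = pi[k - 1]
    return matches

dictionary = {"a": "01", "i": "10", "p": "000", "m": "001", "t": "110", "n": "111"}
def enc(s): return "".join(dictionary[ch] for ch in s)
# Caution: a genuine occurrence must start on a codeword boundary.
print(kmp_search_bits(enc("manamanapatipitipi"), enc("mana")))   # [0, 10]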

10. The Downside of Huffman Codes
• Example: consider the 128-byte text
  abbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabba
  abbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabba
• It will be encoded using 16 bytes (plus an extra byte for the dictionary) as
  0110011001100110011001100110011001100110011001100110011001100110
  0110011001100110011001100110011001100110011001100110011001100110
• This does not exploit the full compression potential of this text
  • E.g. the description (abba)^32 needs only 9 bytes
• The perfect code:
  • A self-extracting program for a string x is a program that, started without input, produces the output x and then halts
  • So the smallest self-extracting program is the ultimate encoding
  • The Kolmogorov complexity K(x) of a string x denotes the length of such a (shortest) self-extracting program for x

11. Kolmogorov Complexity
• Does the Kolmogorov complexity depend on the programming language?
  • No, as long as the programming language is universal, i.e. it can simulate any Turing machine
  • Lemma: Let K1(x) and K2(x) denote the Kolmogorov complexity with respect to two arbitrary universal programming languages. Then for a constant c and all strings x: K1(x) ≤ K2(x) + c
• Is the Kolmogorov complexity useful?
  • No. Theorem: K(x) is not recursive (not computable).

12. Ziv-Lempel-Welch (LZW) Codes
• From the Lempel-Ziv family: LZ77, LZSS, LZ78, LZW, LZMW, LZAP
• Literature:
  • LZW: Terry A. Welch, "A Technique for High-Performance Data Compression", IEEE Computer, vol. 17, no. 6, June 1984, pp. 8-19
  • LZ77: J. Ziv, A. Lempel, "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, pp. 337-343
  • LZ78: J. Ziv, A. Lempel, "Compression of Individual Sequences via Variable-Rate Coding", IEEE Transactions on Information Theory, pp. 530-536
• Known from the Unix command "compress"
• Uses tries

13. Trie = "reTRIEval tree"
• The name is taken from "reTRIEval"
• A tree for storing/encoding text, with efficient search for equal prefixes
• Structure:
  • Edges are labelled with letters
  • Nodes are numbered
• Mapping:
  • Every node encodes a word of the text
  • The text of a node can be read off the path from the root to the node
  • Example: node 1 = "m", node 6 = "at"
• Inverse direction: every word uniquely points to a node (or at least some prefix of it points to a leaf)
  • "it" = node 11
  • "manaman" points with "m" to node 1
• Encoding of "manamanatapitipitipi": 1,2,3,4,5,6,7,8,9,10,11,12 or 1,5,4,5,6,7,11,10,11,10,8
• Decoding of 5,11,2: "an", "it", "a" = "anita"
[Figure: trie with root 0, first-level nodes for m, a, n, i, t, and deeper nodes numbered up to 12]
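
A minimal Python sketch of such a trie with numbered nodes (the class layout is mine; node numbers are assigned in insertion order, so they need not match the figure):

# A trie with numbered nodes; children maps (node, letter) -> node, node 0 is the root.
class Trie:
    def __init__(self):
        self.children = {}
        self.size = 1                  # node 0 encodes the empty word

    def insert(self, word: str) -> int:
        u = 0
        for ch in word:
            if (u, ch) not in self.children:
                self.children[(u, ch)] = self.size
                self.size += 1
            u = self.children[(u, ch)]
        return u                       # the node that encodes the word

    def word_of(self, v: int) -> str:
        if v == 0:
            return ""
        u, ch = next(k for k, w in self.children.items() if w == v)
        return self.word_of(u) + ch    # read the edge labels from root to v

t = Trie()
for w in ["m", "a", "n", "i", "t", "an", "at", "it"]:
    t.insert(w)
print(t.insert("it"), t.word_of(t.insert("an")))   # node numbers depend on insertion order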

14. How LZW builds a Trie
• LZW works byte-wise
• It starts with the 256-leaf trie whose leaves "a", "b", ... are numbered "a", "b", ...

LZW-Trie-Builder(T)
  n ← length(T)
  i ← 1
  TRIE ← start-TRIE
  m ← number of nodes in TRIE
  u ← root(TRIE)
  while i ≤ n do
    if no edge with label T[i] under u then
      m ← m + 1
      append leaf m to u with edge label T[i]
      u ← root(TRIE)
    else
      u ← node under u with edge label T[i]
    fi
    i ← i + 1
  od

• Example: nanananananana
[Figure: the start trie with leaves a, ..., z; after scanning "na", a leaf "na" is appended under "n"]

15. How LZW builds a Trie (continued)
• (LZW-Trie-Builder as on the previous slide.)
• Example: nanananananana
[Figure: the scanned prefix of the text grows phrase by phrase; after each new leaf the scan restarts at the root on the residual part of the text]
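
A direct Python transcription of LZW-Trie-Builder (a sketch of mine; for brevity the start trie holds only the 26 lowercase letters instead of all 256 bytes):

# Direct transcription of LZW-Trie-Builder; children maps (node, letter) -> node.
def lzw_trie_builder(T: str):
    children = {(0, ch): j + 1 for j, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
    m = 26                        # number of nodes so far (root not counted)
    u = 0                         # current node: the root
    for c in T:                   # i runs over the text, one letter per step
        if (u, c) not in children:
            m += 1
            children[(u, c)] = m  # append leaf m with edge label T[i]
            u = 0                 # restart at the root
        else:
            u = children[(u, c)]
    return children

trie = lzw_trie_builder("nanananananana")
print(len(trie) - 26)             # 5 new nodes: "na", "nan", "an", "ana", "nana"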

16. How does it continue?
• Exercise:
  • Consider the text "nananana...na" of length 2n
  • Describe the LZW trie
  • How many nodes are in the final trie?
  • Compute the asymptotic compression ratio, i.e. size of the LZW encoding / length of the text
  • Compare this result with the Huffman encoding and the Shannon bound
  • Is the LZW trie algorithm optimal for words of this kind? Prove it!

17. How LZW produces the encoding

LZW-Encoder(T)
  n ← length(T)
  i ← 1
  TRIE ← start-TRIE
  m ← number of nodes in TRIE
  u ← root(TRIE)
  while i ≤ n do
    if no edge with label T[i] under u then
      output (m, u, T[i])
      m ← m + 1
      append leaf m to u with edge label T[i]
      u ← root(TRIE)
    else
      u ← node under u with edge label T[i]
    fi
    i ← i + 1
  od
  if u ≠ root(TRIE) then output (u) fi

• start-TRIE = 256-leaf trie with the bytes encoded as 0, 1, 2, ..., 255
• The output component m is predictable: 256, 257, 258, ...
  • Therefore output only (u, T[i])
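
A Python sketch of the encoder in the reduced output format (u, T[i]) (again with a 26-letter start trie of mine, so new nodes are numbered from 27 rather than 256; lowercase input is assumed):

# Encoder: like the trie builder, but each new leaf emits (u, T[i]); node
# numbers are predictable (27, 28, ... here), so they are left out of the output.
def lzw_encode(T: str):
    children = {(0, ch): j + 1 for j, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
    name = {j + 1: ch for j, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
    m, u, out = 26, 0, []
    for c in T:
        if (u, c) not in children:
            out.append((name[u], c))   # output (u, T[i])
            m += 1
            children[(u, c)] = m
            name[m] = m                # later nodes are referred to by number
            u = 0
        else:
            u = children[(u, c)]
    if u != 0:
        out.append((name[u],))         # u is not the root at the end of the text
    return out

print(lzw_encode("manamanatapitipitipi"))
# [('m','a'), ('n','a'), (27,'n'), ('a','t'), ('a','p'), ('i','t'), ('i','p'), (32,'i'), ('p','i')]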

18. An Example Encoding
• (Encoder pseudocode as on the previous slide.)
• Encoding of m a n a m a n a t a p i t i p i t i p i:
  (m,a) (n,a) (256,n) (a,t) (a,p) (i,t) (i,p) (261,i) (p,i)
• The new nodes 256 257 258 259 260 261 262 263 264 encode the phrases
  ma na (ma)n at ap it ip (it)i pi
[Figure: the LZW trie after encoding, with nodes 256-264 appended below the start trie]

19. The Decoder

LZW-Decoder(Code)
  TRIE ← start-TRIE
  m ← 255
  for i ← 0 to 255 do C(i) ← "i" od
  while not end of file do
    (u, c) ← read-next-two-symbols(Code)
    if c exists then
      output (C(u), c)
      m ← m + 1
      append leaf m to u with edge label c
      C(m) ← (C(u), c)
    else
      output (C(u))
    fi
  od

• If the last part of the code did not produce a new node in the trie, then output the corresponding string
• Example: the encoding of m a n a m a n a t a p i t i p i t i p i from the previous slide
[Figure: the same trie with nodes 256-264]
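
A matching Python sketch of the decoder for the pair format above (same 26-letter start trie as in the encoder sketch, so node 27 plays the role of 256 and node 32 the role of 261):

# Decoder for the pair format: C[u] is the string node u stands for; the trie
# itself is only needed implicitly, via C.
def lzw_decode(pairs):
    C = {ch: ch for ch in "abcdefghijklmnopqrstuvwxyz"}   # start-trie leaves
    m, out = 26, []
    for pair in pairs:
        if len(pair) == 2:             # (u, c): output C(u)c and grow the trie
            u, c = pair
            out.append(C[u] + c)
            m += 1
            C[m] = C[u] + c            # C(m) <- (C(u), c)
        else:                          # trailing (u): no new node
            out.append(C[pair[0]])
    return "".join(out)

pairs = [("m","a"), ("n","a"), (27,"n"), ("a","t"), ("a","p"),
         ("i","t"), ("i","p"), (32,"i"), ("p","i")]
print(lzw_decode(pairs))               # manamanatapitipitipi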

20. Performance of LZW
• Encoding can be performed in time O(n), where n is the length of the given text
• Decoding can be performed in time O(n), where n is the length of the uncompressed output
• The memory consumption is linear in the size of the compressed code
• LZW can be implemented nicely in hardware
• There is no software patent, so it is very popular; see "compress" for UNIX
• LZW can be compressed further using Huffman codes
  • Every second output symbol is a plain copy of a text character!
• Searching in LZW is difficult
  • The encoding is embedded in the text (adaptive encoding)
  • For one search pattern there is a linear number of possible encodings in the text (exercise)

21. The Algorithm of Amir, Benson & Farach: "Let Sleeping Files Lie"
• Ideas:
  • Build the trie, but do not decode
  • Use the KMP matcher on the nodes of the LZW trie
  • Prepare a data structure based on the pattern
  • Then scan the text and update this data structure
• Goal: running time O(n + f(m))
  • n is the code length
  • f(m) is a small polynomial in the pattern length m
  • For well-compressed codes and f(m) < n this is faster than decoding followed by a text search

22. Searching in LZW Codes: Inside a Node
• Example: search for "tapioca"
• If "tapioca" lies inside the text of a node, we have found it
• For all nodes u of the trie, set is_inside[u] = 1 if the text of u contains the pattern
[Figure: a node whose text "...abtapiocaab..." contains the pattern]

23. Searching in LZW Codes: Torn Apart
• Example: search for "tapioca"
• An occurrence may be torn apart, its parts hidden in a sequence of nodes:
  • The start lies somewhere inside a node (a suffix of that node's text)
  • All middle parts are texts of nodes of the LZW trie
  • The end is the start of another node (a prefix of that node's text)
[Figure: the pattern "tapioca" split across several consecutive nodes]

24. Finding the Start: longest_prefix (Suffix of Node = Prefix of Pattern)
• Classify all nodes of the trie:
  • Is some suffix of the node's text a prefix of the pattern, and if yes, how long is the longest such suffix?
  • For a very long text encoded by a node, only the last m letters matter
  • This can be computed with the KMP matcher while building the trie (see the sketch below)
• Example: pattern "manamana"
  • Node text "amanaplanacanalpamana": the last four letters "mana" are the first four letters of the pattern; result: 4
  • Node text "mamapapa": no suffix is a prefix of the pattern; result: 0
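
A minimal Python sketch of this computation: the KMP failure function yields an automaton transition delta, and feeding a node's text through it gives the length of the longest node suffix that is a pattern prefix. Per trie node this is a single delta step from the parent, i.e. longest_prefix[v] = delta(longest_prefix[u], c). The function names are mine:

# The KMP automaton of the pattern: delta(q, c) = length of the longest prefix
# of P that is a suffix of P[0:q] + c.
def failure_function(P: str):
    pi = [0] * len(P)
    k = 0
    for i in range(1, len(P)):
        while k > 0 and P[i] != P[k]:
            k = pi[k - 1]
        if P[i] == P[k]:
            k += 1
        pi[i] = k
    return pi

def delta(P: str, pi, q: int, c: str) -> int:
    while q > 0 and (q == len(P) or P[q] != c):
        q = pi[q - 1]
    return q + 1 if P[q] == c else q

P, pi = "manamana", failure_function("manamana")
q = 0
for c in "amanaplanacanalpamana":       # feed a node's text through the automaton
    q = delta(P, pi, q, c)
print(q)   # 4: the last four letters "mana" are the first four of the pattern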

25. Is the Node Inside the Pattern?
• Find the positions where the text of the node occurs inside the pattern
  • Several occurrences are possible, e.g. for a single letter
• There are at most m(m+1)/2 such substrings (pairs start ≤ end)
  • For every substring there is at most one node that fits
• Define a table inside_node of size O(m²):
  • inside_node[start, end] := the node that encodes the pattern piece P[start]..P[end]
  • From inside_node[start, end] one can derive inside_node[start, end+1] as soon as the corresponding node is created
• To find all occurrences quickly, a pointer next_inside_occurrence(start, end) indicates the next position where the substring occurs
  • It is initialized for start = end with the next occurrence of the letter
• Example: pattern "manamana"
  • "ana" could lie at positions 2-4 or positions 6-8 of the pattern
  • "anam": result (2,5)
  • "orthogonal": result (0,0), it is not in the pattern
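
For illustration, a naive Python sketch that lists, for every substring of the pattern, the (start, end) positions it can occupy (mine; the real algorithm fills inside_node incrementally while the trie grows, instead of precomputing all substrings):

# All substrings of the pattern with the (start, end) positions they can occupy
# (1-indexed, as on the slide).
def substring_positions(P: str):
    pos = {}
    for start in range(1, len(P) + 1):
        for end in range(start, len(P) + 1):
            pos.setdefault(P[start - 1:end], []).append((start, end))
    return pos

pos = substring_positions("manamana")
print(pos["ana"])              # [(2, 4), (6, 8)]: positions 2-4 or 6-8
print(pos.get("orthogonal"))   # None: not inside the pattern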

26. Finding the End: longest_suffix (Prefix of Node = Suffix of Pattern)
• Classify all nodes of the trie:
  • Is some prefix of the node's text a suffix of the pattern, and if yes, does it complete the pattern when i letters have already been found?
  • For a very long text encoded by a node, only the first m letters matter
  • Since text is appended on the right side, this property can be derived from the ancestor
• Example: pattern "manamana"
  • Node text "ananimal": both 3 ("ana") and 1 ("a") are solutions; we take 3, because 1 can be derived from 3 with the technique of the KMP matcher (the failure function π on the reversed string)
  • Node text "manamanamana...": result 8
  • Node text "panamacanal...": result 0

27. How does it fit?
• On the left side we have the maximal prefix of the pattern found so far
• On the right side (the current node) we have a maximal suffix of the pattern
• Example: pattern "mamapamana", an 8-letter prefix found on the left, a 6-letter suffix found on the right
  • Prefix and suffix do not simply concatenate to the pattern
  • Yet the pattern can still be inside, because the end of the prefix and the start of the suffix overlap
• Solution: define a prefix-suffix table
  • PS-T(p, s) = 1 if the p-letter prefix of the pattern followed by its s-letter suffix contains the pattern
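
A naive Python sketch of the table (mine; it uses the running pattern "manamana" instead of the slide's example, and the brute-force containment test is good enough for a sketch):

# PS_T[p][s] = 1 iff the p-letter prefix of the pattern followed by its
# s-letter suffix contains the pattern (0 <= p, s <= m).
def prefix_suffix_table(P: str):
    m = len(P)
    return [[1 if P in P[:p] + P[m - s:] else 0 for s in range(m + 1)]
            for p in range(m + 1)]

PS_T = prefix_suffix_table("manamana")
# 7-letter prefix "manaman" + 5-letter suffix "amana" = "manamanamana",
# which contains the pattern even though neither part does on its own.
print(PS_T[7][5])   # 1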

28. The Matcher

ABF-LZW-Matcher(LZW code C, uncompressed pattern P)
  n ← length(C), m ← length(P)
  Init_DS(P)
  TRIE ← start-TRIE
  v ← 255
  prefix ← 0
  for i ← 0 to 255 do C(i) ← "i" od
  for l ← 1 to n do
    (u, c) ← read-next-two-symbols(Code)
    v ← v + 1
    Update_DS()
    Check_for_Occurrence()
  od

Update_DS()
  length[v] ← length[u] + 1
  C[v] ← C[u]·c
  is_inside[v] ← is_inside[u]
  if longest_prefix[u] < m and P[longest_prefix[u] + 1] = c then
    longest_prefix[v] ← longest_prefix[u] + 1
  fi
  if length[u] < m then
    for all entries (start, end) of u in inside_node do
      if P[end + 1] = c then
        inside_node[start, end + 1] ← v
        link the entry of v
      fi
    od
  fi
  if longest_suffix[u] < length[u] or P[length[v]] ≠ c then
    longest_suffix[v] ← longest_suffix[u]
  else
    longest_suffix[v] ← 1 + longest_suffix[u]
    if longest_suffix[v] = m then is_inside[v] ← 1 fi
  fi

29. Check for Occurrences

Check_for_Occurrence()
  if is_inside[v] = 1 then
    report "pattern found at l"
    prefix ← longest_prefix[v]
  else if prefix = 0 then
    prefix ← longest_prefix[v]
  else if prefix + length[v] < m then
    while prefix ≠ 0 and inside_node[prefix + 1, prefix + length[v]] ≠ v do
      prefix ← π(prefix)
    od
    if prefix = 0 then prefix ← longest_prefix[v]
    else prefix ← prefix + length[v] fi
  else
    suffix ← longest_suffix[v]
    if PS-T[prefix, suffix] = 1 then
      report "pattern found at l"
    fi
    prefix ← longest_prefix[v]
  fi

• The while loop possibly needs time m for each code symbol
• Amortized analysis will not heal this

30. Thanks for your attention. End of the 4th lecture.
• Next lecture: Mo 15 Nov 2004, 11.15 am, FU 116
• Next exercise class: Mo 15 Nov 2004, 1.15 pm, F0.530, or We 17 Nov 2004, 1.00 pm, E2.316
