Algorithm Programming 1, 89-210: Some Topics in Compression. Bar-Ilan University, 2007-2008 (תשס"ח), by Moshe Fresko.
Huffman Coding
• Variable-length encoding
• Works on the probabilities of symbols (characters, words, etc.)
• Build a tree
  • Take the two least frequent symbols/nodes
  • Join them under a parent node
  • The parent node's frequency is the sum of its children's frequencies
  • Continue until all symbols and nodes are joined into a single tree
• The path from the root to a leaf gives that symbol's code
• Frequent symbols end up near the root, which gives them short codes
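The tree-building steps above can be sketched in Python as follows. This is a minimal illustration, not the course's reference solution: the function name `huffman_codes`, the use of `heapq` as the priority queue, and the choice of "0" for the left branch and "1" for the right are all assumptions made here.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(text):
    """Return a dict mapping each symbol of text to its Huffman bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    tie = count()  # unique tie-breaker so the heap never compares tree nodes
    heap = [(f, next(tie), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A single distinct symbol still needs a one-bit code (assumption).
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        # Take the two least frequent nodes and join them under a parent
        # whose frequency is the sum of the children's frequencies.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))

    codes = {}

    def walk(node, path):
        # Internal nodes are (left, right) tuples; leaves are symbols.
        if isinstance(node, tuple):
            walk(node[0], path + "0")
            walk(node[1], path + "1")
        else:
            codes[node] = path

    walk(heap[0][2], "")
    return codes


codes = huffman_codes("aaaabbc")
```

For the input "aaaabbc" the most frequent symbol, a, receives a one-bit code while b and c receive two-bit codes. The exact bit patterns depend on how ties are broken.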
LZ77
• Introduced in 1977 by Abraham Lempel and Jacob Ziv
• Dictionary based
• Works over a sliding window of size n
• Decoding is easy and fast (encoding is not)
• Produces a list of tuples (Pos, Len, C)
  • Pos: position, counted backwards from the current position
  • Len: number of symbols to copy
  • C: the next character
LZ77
• Based on strings that repeat themselves

  An outcry in Spain is an outcry in vain
  An outcry in Spa(6,3)is a(22,12)v(21,3)

  aaaaaaaaaa
  a(1,9)
LZ77 - Example
• Window size: 5
• Input: ABBABCABBBBC

  Next sequence   Code
  A               (0,0,A)
  B               (0,0,B)
  BA              (1,1,A)
  BC              (3,1,C)
  ABB             (3,2,B)
  BBC             (2,2,C)
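The example above can be reproduced with a short Python sketch. The function names `lz77_encode` and `lz77_decode` are hypothetical, and two choices are assumptions made here rather than anything the slides state: the encoder uses a brute-force search of the window, and when several matches have the same length it keeps the farthest one. The slides do not name a tie-breaking rule, but this one yields the tuples in the table.

```python
def lz77_encode(text, window=5):
    """Encode text as a list of (Pos, Len, C) tuples."""
    out = []
    pos = 0
    while pos < len(text):
        best_off, best_len = 0, 0
        start = max(0, pos - window)
        # Scan candidates from farthest to nearest; the strict ">" below
        # keeps the farthest candidate when lengths tie (assumption).
        for cand in range(start, pos):
            length = 0
            # Always leave one symbol for the literal C. A match may run
            # past pos, i.e. overlap the text being encoded.
            while (pos + length < len(text) - 1
                   and text[cand + length] == text[pos + length]):
                length += 1
            if length > best_len:
                best_off, best_len = pos - cand, length
        out.append((best_off, best_len, text[pos + best_len]))
        pos += best_len + 1
    return out


def lz77_decode(tuples):
    """Rebuild the text from (Pos, Len, C) tuples."""
    out = []
    for off, length, ch in tuples:
        # Copy symbol by symbol so overlapping copies work.
        for _ in range(length):
            out.append(out[-off])
        out.append(ch)
    return "".join(out)


tuples = lz77_encode("ABBABCABBBBC", window=5)
restored = lz77_decode(tuples)
```

The decoder is only a copy loop, which is why decoding is fast. The encoder has to search the window at every position.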
LZ77 - Some Variations
• LZSS - Uses a flag bit to distinguish pointers from literal items.
• LZR - No limit on the pointer size.
• LZH - Compresses the pointers with Huffman coding.
LZ78
• Instead of a window onto previously seen text, a dictionary of phrases is built
• Both encoding and decoding are simple
• From the current position in the text, find the longest phrase that is in the dictionary
• Output the pair (Index, NextChar)
  • Index: the index of that dictionary phrase
  • NextChar: the next character after that phrase
• Add the new phrase (the matched phrase with the next character appended) to the dictionary
LZ78 - Example
• Input: ABBABCABBBBC

  Input   Output     Add to dictionary
  A       (0,A)      1 = “A”
  B       (0,B)      2 = “B”
  BA      (2,A)      3 = “BA”
  BC      (2,C)      4 = “BC”
  AB      (1,B)      5 = “AB”
  BB      (2,B)      6 = “BB”
  BC      (4,EOLN)

• Dictionary size
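A minimal Python sketch of the LZ78 scheme, checked against the example above. The function names are hypothetical, and representing the slide's EOLN marker as `None` is an assumption made here for illustration.

```python
def lz78_encode(text):
    """Encode text as a list of (Index, NextChar) pairs."""
    dictionary = {}  # phrase -> index; index 0 stands for the empty phrase
    out = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            # Keep extending while the phrase is still in the dictionary.
            phrase += ch
        else:
            out.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:
        # Input ended inside a known phrase: no next character (EOLN).
        out.append((dictionary[phrase], None))
    return out


def lz78_decode(pairs):
    """Rebuild the text from (Index, NextChar) pairs."""
    phrases = [""]  # phrases[0] is the empty phrase
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + (ch or "")
        out.append(phrase)
        if ch is not None:
            phrases.append(phrase)
    return "".join(out)


pairs = lz78_encode("ABBABCABBBBC")
restored = lz78_decode(pairs)
```

The decoder rebuilds the same dictionary in the same order as the encoder, so the dictionary never has to be transmitted.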
LZW
• Produces only a list of dictionary entry indexes
• Encoding
  1. Start with an initial dictionary (for example, all possible ASCII characters, 0..255)
  2. From the input, find the longest string that exists in the dictionary
  3. Output this string's index in the dictionary
  4. Append the next input character to that string and add the result to the dictionary
  5. Continue from that character, repeating from step (2)
LZW - Example
• Input: ABBABCABBBBC
• Initial dictionary: 0 = “A”, 1 = “B”, 2 = “C”

  Input   NextChar   Output   Add to dictionary
  A       B          0        3 = “AB”
  B       B          1        4 = “BB”
  B       A          1        5 = “BA”
  AB      C          3        6 = “ABC”
  C       A          2        7 = “CA”
  AB      B          3        8 = “ABB”
  BB      B          4        9 = “BBB”
  B       C          1        10 = “BC”
  C       -          2        -

• Dictionary size : ?
LZW – Encoding Example
• T = ababcbababaaaaaaa
• Initial dictionary entries: 1 = a, 2 = b, 3 = c

  Input   Output   NextSymbol   Add to dictionary
  a       1        b            4 = ab
  b       2        a            5 = ba
  ab      4        c            6 = abc
  c       3        b            7 = cb
  ba      5        b            8 = bab
  bab     8        a            9 = baba
  a       1        a            10 = aa
  aa      10       a            11 = aaa
  aaa     11       a            12 = aaaa
  a       1        -            -
LZW – Encoding Algorithm

    w = Empty
    while ( read next symbol k ) {
        if wk exists in the dictionary {
            w = wk
        } else {
            add wk to the dictionary
            output the code for w
            w = k
        }
    }
    if w is not Empty
        output the code for w
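The encoding loop can be sketched in Python as below. The function name `lzw_encode` and the parameters `alphabet` and `first_code` are hypothetical. They exist only so the same sketch can reproduce both worked examples, one of which numbers its dictionary from 0 and the other from 1. The sketch assumes every input symbol appears in `alphabet`.

```python
def lzw_encode(text, alphabet, first_code=0):
    """Return the list of dictionary indexes LZW emits for text."""
    # Initial dictionary: one entry per single symbol.
    dictionary = {sym: first_code + i for i, sym in enumerate(alphabet)}
    next_code = first_code + len(alphabet)
    out = []
    w = ""
    for k in text:
        if w + k in dictionary:
            w += k                        # keep extending the match
        else:
            out.append(dictionary[w])     # emit the longest match
            dictionary[w + k] = next_code
            next_code += 1
            w = k
    if w:
        out.append(dictionary[w])         # flush the final pending match
    return out


codes_a = lzw_encode("ABBABCABBBBC", "ABC")               # dictionary from 0
codes_b = lzw_encode("ababcbababaaaaaaa", "abc", first_code=1)
```

With these inputs `codes_a` is 0 1 1 3 2 3 4 1 2 and `codes_b` is 1 2 4 3 5 8 1 10 11 1, matching the Output columns of the two example tables.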
LZW – Decoding Algorithm

    read a code k
    w = dictionary entry for k
    output w
    while ( read a code k ) {
        entry = dictionary entry for k
        output entry
        add w + entry[0] to the dictionary
        w = entry
    }
LZW – Decoding
• There is a special-case problem with the previous algorithm
• It is likely to come up whenever a large file is decoded
• It is the case where the index read is not in the dictionary yet
• Example: ABABABA
  • Initially: A = 1, B = 2
  • Output = 1 2 3 5
  • When decoding, the algorithm above will not find the dictionary entry ABA = 5
• A small additional check solves the problem
• Be careful to handle it in Exercise 3
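One common way to handle the special case is sketched below in Python. This illustrates the idea and is not the expected Exercise 3 solution; the name `lzw_decode` and the parameters `alphabet` and `first_code` are hypothetical. The check relies on one observation: a code that is not yet in the dictionary can only be the entry the decoder is about to add, and that entry must be w followed by the first symbol of w.

```python
def lzw_decode(codes, alphabet, first_code=0):
    """Rebuild the text from a list of LZW dictionary indexes."""
    dictionary = {first_code + i: sym for i, sym in enumerate(alphabet)}
    next_code = first_code + len(alphabet)
    if not codes:
        return ""
    w = dictionary[codes[0]]
    out = [w]
    for k in codes[1:]:
        if k in dictionary:
            entry = dictionary[k]
        elif k == next_code:
            # Special case: the encoder used the entry it had just
            # created, so it must be w plus the first symbol of w.
            entry = w + w[0]
        else:
            raise ValueError("invalid LZW code: %r" % (k,))
        out.append(entry)
        dictionary[next_code] = w + entry[0]
        next_code += 1
        w = entry
    return "".join(out)


restored = lzw_decode([1, 2, 3, 5], "AB", first_code=1)
```

For the example above, code 5 arrives when the dictionary only goes up to 4. At that point w is AB, so the entry is reconstructed as AB + A = ABA and the output is ABABABA.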
LZW – Dictionary Length
• Dictionary length
  • Typically: 14 bits = 16384 entries (the first 256 of them are single bytes)
• What to do when the dictionary is full
  • Stop adding to the dictionary
  • Delete the whole dictionary (this will be used in the exercise)
  • LRU: discard the entries that have not been used recently
  • Monitor performance, and flush the dictionary when performance is poor
  • Double the dictionary size
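The "delete the whole dictionary" policy can be sketched in Python as below. This is one possible arrangement, not the exercise's specification. The function names and the `max_entries` parameter are hypothetical, and resetting immediately before the entry that would overflow is an assumption made here. The point that matters is that the decoder resets at exactly the same step as the encoder, so both sides keep identical dictionaries.

```python
def lzw_encode_flush(text, alphabet, max_entries):
    """LZW encoding that resets the dictionary once it is full."""
    initial = {sym: i for i, sym in enumerate(alphabet)}
    dictionary = dict(initial)
    out, w = [], ""
    for k in text:
        if w + k in dictionary:
            w += k
            continue
        out.append(dictionary[w])
        if len(dictionary) >= max_entries:
            dictionary = dict(initial)      # delete the whole dictionary
        dictionary[w + k] = len(dictionary)
        w = k
    if w:
        out.append(dictionary[w])
    return out


def lzw_decode_flush(codes, alphabet, max_entries):
    """Mirror of lzw_encode_flush; resets at the same points."""
    initial = dict(enumerate(alphabet))     # index -> symbol
    dictionary = dict(initial)
    if not codes:
        return ""
    w = dictionary[codes[0]]
    out = [w]
    for k in codes[1:]:
        if len(dictionary) >= max_entries:
            dictionary = dict(initial)      # same reset as the encoder
        # A code not in the dictionary is the entry about to be added.
        entry = dictionary[k] if k in dictionary else w + w[0]
        out.append(entry)
        dictionary[len(dictionary)] = w + entry[0]
        w = entry
    return "".join(out)


codes = lzw_encode_flush("ababcbababaaaaaaa", "abc", max_entries=5)
restored = lzw_decode_flush(codes, "abc", max_entries=5)
```

The tiny `max_entries` value is only there to force several resets on a short input. A realistic setting would be 16384, the 14-bit size mentioned above.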