The LZ family • LZ77 • LZR • LZSS • LZB • LZH – used by zip and unzip • LZ78 • LZW – Unix compress • LZC – Unix compress • LZT • LZMW • LZJ • LZFG
Overview of LZ family • To demonstrate: • take a simple alphabet containing only two letters, a and b • create a sample stream of text from it
LZ family overview • Rule: split the stream of characters into pieces so that each piece is the shortest string of characters that we have not seen so far.
Sender: The Compressor • Before compression, the pieces of text from the breaking-down process are indexed from 1 to n.
LZ • indices are used to number the pieces of data. • The empty string (start of text) has index 0. • The piece indexed by 1 is a. It consists of the empty string (index 0) followed by the new character a, so it is coded as 0a. • Piece 2, aa, is coded as 1a, because it consists of a, whose index is 1, followed by the new character a.
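The splitting rule and the index-plus-character coding can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own implementation: the function name `lz78_encode` and the 24-character sample stream are assumptions chosen for the example.

```python
def lz78_encode(text):
    """Split text into (index, char) pairs: index of a known piece plus one new character."""
    dictionary = {"": 0}   # the empty string has index 0
    pairs = []
    current = ""           # longest already-seen prefix of the piece being built
    for ch in text:
        if current + ch in dictionary:
            current += ch  # still a known piece, keep extending it
        else:
            pairs.append((dictionary[current], ch))
            dictionary[current + ch] = len(dictionary)  # next free index
            current = ""
    if current:
        # Input ended inside a known piece: emit its index with no new character.
        pairs.append((dictionary[current], ""))
    return pairs


sample = "aaababbbaaabaaaaaaabaabb"  # assumed sample stream, not taken from the slides
print(" ".join(f"{i}{c}" for i, c in lz78_encode(sample)))
# prints: 0a 1a 0b 1b 3b 2a 3a 6a 2b 9b
```

The first two codes, 0a and 1a, match the pieces a and aa described above.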
LZ • the process of renaming pieces of text starts to pay off. • Small integers replace what were once long strings of characters. • We can now discard the original stream of text and send the encoded information to the receiver.
Bit Representation of Coded Information • Now we want to calculate the number of bits needed • each chunk is an integer index and a letter • the number of bits for the index depends on the maximum dictionary size permitted • every character will occupy 8 bits because it will be represented in US ASCII format
Compression good? • in a long string of text, the number of bits needed to transmit the coded information is small compared to the actual length of the text. • example: 12 bits (a 4-bit index plus an 8-bit character) to transmit the code 2b instead of the 24 bits (8 + 8 + 8) needed for the actual text aab.
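The bit count can be worked through in a short sketch. It assumes fixed-width indices just wide enough for the largest index, 8-bit characters, and a hypothetical 24-character stream that splits into 10 pieces; the helper names `index_bits` and `encoded_bits` are made up for this illustration.

```python
def index_bits(num_pieces):
    """Bits needed for a fixed-width index covering 0 .. num_pieces - 1."""
    return max(1, (num_pieces - 1).bit_length())


def encoded_bits(num_pieces, char_bits=8):
    """Total bits when every chunk is one index plus one character."""
    return num_pieces * (index_bits(num_pieces) + char_bits)


pieces = 10      # assumed number of pieces in the sample stream
raw_chars = 24   # assumed length of the sample stream in characters

print(index_bits(pieces) + 8)   # bits per chunk: 12
print(encoded_bits(pieces))     # coded size: 120 bits
print(raw_chars * 8)            # raw size: 192 bits
```

On a stream this short the saving is modest. It grows with longer text because each index then stands for a longer piece.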
Receiver: The Decompressor (Implementation) • receiver knows exactly where boundaries are, so no problem in reconstructing the stream of text. • Preferable to decompress the file in one pass; otherwise, we will encounter a problem with temporary storage.
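A one-pass receiver can be sketched as follows. It rebuilds the same numbered list of pieces as it reads each (index, character) pair, so nothing but the pairs needs to be sent. The function name `lz78_decode` and the input pairs are assumptions for illustration.

```python
def lz78_decode(pairs):
    """Rebuild the text from (index, char) pairs in a single pass."""
    pieces = [""]   # index 0 is the empty string
    out = []
    for index, ch in pairs:
        piece = pieces[index] + ch   # known piece plus the new character
        pieces.append(piece)         # it gets the next free index
        out.append(piece)
    return "".join(out)


# Assumed input: the codes 0a 1a 0b 1b 3b 2a 3a 6a 2b 9b
pairs = [(0, "a"), (1, "a"), (0, "b"), (1, "b"), (3, "b"),
         (2, "a"), (3, "a"), (6, "a"), (2, "b"), (9, "b")]
print(lz78_decode(pairs))   # prints: aaababbbaaabaaaaaaabaabb
```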
Lempel-Ziv applet • See • http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic23/#JavaApplet
Lempel-Ziv-Welch (LZW) • previous methods worked only on characters • LZW works by encoding strings • some strings are replaced by a single codeword • for now assume codeword is fixed (12 bits) • for 8 bit characters, first 256 (or fewer) entries in table are reserved for the characters • rest of table (256-4095) represent strings
LZW compression • trick is that string-to-codeword mapping is created dynamically by the encoder • also recreated dynamically by the decoder • need not pass the code table between the two • is a lossless compression algorithm • degree of compression hard to predict • depends on data, but gets better as codeword table contains more strings
LZW encoder
Initialize table with single character strings
STRING = first input character
WHILE not end of input stream
    CHARACTER = next input character
    IF STRING + CHARACTER is in the string table
        STRING = STRING + CHARACTER
    ELSE
        Output the code for STRING
        Add STRING + CHARACTER to the string table
        STRING = CHARACTER
END WHILE
Output the code for STRING
Demonstrations • Another animated LZ algorithm … • http://www.data-compression.com/lempelziv.html
LZW encoder example • compress the string BABAABAAA
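The example can be traced with a small Python version of the encoder. This is a sketch under stated assumptions: the name `lzw_encode` is made up, every input character is assumed to have a code point below 256, and the 12-bit table limit is not enforced.

```python
def lzw_encode(text):
    """Return the list of LZW codes for text (characters assumed to be 8-bit)."""
    if not text:
        return []
    table = {chr(i): i for i in range(256)}   # codes 0-255: single characters
    codes = []
    string = text[0]
    for ch in text[1:]:
        if string + ch in table:
            string += ch                      # keep growing the known string
        else:
            codes.append(table[string])       # output the code for STRING
            table[string + ch] = len(table)   # new entry gets the next code (256, 257, ...)
            string = ch
    codes.append(table[string])               # flush the final string
    return codes


print(lzw_encode("BABAABAAA"))   # prints: [66, 65, 256, 257, 65, 260]
```

While encoding this input, the table gains BA = 256, AB = 257, BAA = 258, ABA = 259 and AA = 260.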
Lempel-Ziv compression • a lossless compression algorithm • All encodings have the same length • But may represent more than one character • Uses a “dictionary” approach – keeps track of characters and character strings already encountered
LZW decoder example • decompress the string <66><65><256><257><65><260>
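A matching decoder sketch is shown here, again with a made-up name (`lzw_decode`) and no 12-bit limit. The decoder's table is always one entry behind the encoder's, so a code can arrive that is not in the table yet. This happens with the final code 260 in this example. In that case the string must be the previous string plus its own first character.

```python
def lzw_decode(codes):
    """Rebuild the text from LZW codes, recreating the string table on the fly."""
    if not codes:
        return ""
    table = {i: chr(i) for i in range(256)}   # codes 0-255: single characters
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # Code not defined yet: it must be previous + first char of previous.
            entry = previous + previous[0]
        else:
            raise ValueError(f"invalid code {code}")
        out.append(entry)
        table[len(table)] = previous + entry[0]   # the entry the encoder added one step earlier
        previous = entry
    return "".join(out)


print(lzw_decode([66, 65, 256, 257, 65, 260]))   # prints: BABAABAAA
```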
LZW Issues • compression better as the code table grows • what happens when all 4096 locations in string table are used? • A number of options, but encoder and decoder must agree to do the same thing • do not add any more entries to table (as is) • clear codeword table and start again • clear codeword table and start again with larger table/longer codewords (GIF format)
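The first option (stop adding entries once the table is full) can be sketched as below. The names `lzw_encode_capped`, `lzw_decode_capped` and the `max_codes` parameter are assumptions for this illustration. The point is that encoder and decoder apply the same "is there room?" test, so their tables stay in step without being transmitted.

```python
def lzw_encode_capped(text, max_codes=4096):
    """LZW encoder that stops adding entries once the table holds max_codes codes."""
    table = {chr(i): i for i in range(256)}
    codes, string = [], ""
    for ch in text:
        if string + ch in table:
            string += ch
        else:
            codes.append(table[string])
            if len(table) < max_codes:        # only grow while there is room
                table[string + ch] = len(table)
            string = ch
    if string:
        codes.append(table[string])
    return codes


def lzw_decode_capped(codes, max_codes=4096):
    """Decoder using the same cap, so both sides freeze the table at the same point."""
    table = {i: chr(i) for i in range(256)}
    out, previous = [], ""
    for code in codes:
        if code in table:
            entry = table[code]
        elif previous and code == len(table) and len(table) < max_codes:
            entry = previous + previous[0]    # code the encoder has only just defined
        else:
            raise ValueError(f"invalid code {code}")
        if previous and len(table) < max_codes:
            table[len(table)] = previous + entry[0]
        out.append(entry)
        previous = entry
    return "".join(out)


# A tiny cap (two string entries) makes the freeze visible on a short input.
codes = lzw_encode_capped("BABAABAAA", max_codes=258)
print(codes)                                      # prints: [66, 65, 256, 257, 65, 65, 65]
print(lzw_decode_capped(codes, max_codes=258))    # prints: BABAABAAA
```

With the cap at 258 the output is one code longer than the uncapped encoding of the same input, which shows the cost of a frozen table.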
LZW advantages/disadvantages • advantages • simple, fast and good compression • can do compression in one pass • dynamic codeword table built for each file • decompression recreates the codeword table so it does not need to be passed • disadvantages • not the optimum compression ratio • actual compression hard to predict
Entropy methods • all previous methods are lossless and entropy based • lossless methods are essential for computer data (zip, gzip, etc.) • the combination of run-length encoding and Huffman coding is a standard tool • they are often used as a subroutine by lossy methods (JPEG, MPEG)