Learn how to optimize data compression using Huffman coding, explore encoding rules, constructing binary trees, and achieving minimal bit length for files. Practice with examples and understand the Huffman algorithm.
A simple example
• Suppose we have a message of 10 symbols drawn from 5 distinct symbols, e.g. [►♣♣♠☻►♣☼►☻]
• How can we code this message using 0/1 so that the coded message has minimum length (for transmission or storage)?
• 5 distinct symbols need at least 3 bits each in a fixed-length code (2 bits give only 4 codewords)
• For this simple encoding, the length of the coded message is 10*3 = 30 bits
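The fixed-length count can be checked with a few lines of Python. This is only a sketch; the variable names are illustrative and not part of the slides.

```python
message = "►♣♣♠☻►♣☼►☻"
distinct = set(message)

# Smallest k with 2**k >= number of distinct symbols.
# (n - 1).bit_length() computes ceil(log2(n)) without floating point.
bits_per_symbol = (len(distinct) - 1).bit_length()
total_bits = bits_per_symbol * len(message)

print(len(distinct), bits_per_symbol, total_bits)  # 5 3 30
```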
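A simple example – cont.
• Intuition: symbols that are more frequent should have shorter codes. Since the codes then differ in length, there must be a way to tell where each code ends
• In a Huffman code for this message, ► and ♣ (3 occurrences each) and ☻ (2 occurrences) get 2-bit codes, while ♠ and ☼ (1 occurrence each) get 3-bit codes
• Length of the encoded message ►♣♣♠☻►♣☼►☻ = 3*2 + 3*2 + 2*2 + 1*3 + 1*3 = 22 bits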
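As a rough check of the variable-length total, the sketch below derives Huffman code lengths for the example message with Python's `heapq` and sums frequency times length, which comes to 22 bits. The function name `huffman_code_lengths` and the tie-breaking counter are assumptions made for this illustration. Different tie-breaking can change individual codes but not the total.

```python
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Return {symbol: code length} for a Huffman code over freqs."""
    # Heap entries: (weight, tie_breaker, symbols under this node).
    heap = [(w, i, [s]) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1  # every symbol under the merged node sinks one level
        heapq.heappush(heap, (w1 + w2, next_id, s1 + s2))
        next_id += 1
    return lengths

message = "►♣♣♠☻►♣☼►☻"
freqs = Counter(message)
lengths = huffman_code_lengths(freqs)
total_bits = sum(freqs[s] * lengths[s] for s in freqs)
print(total_bits)  # 22
```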
Another Example
• A = 0, B = 100, C = 1010, D = 1011, R = 11
• ABRACADABRA = 01001101010010110100110
• This is eleven letters in 23 bits
• A fixed-width encoding would require 3 bits for five different letters, or 33 bits for 11 letters
• Notice that the encoded bit string can be decoded!
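The ABRACADABRA example can be reproduced with a small encoder and decoder. This is a minimal sketch: the helper names `encode` and `decode` are illustrative, and the decoder relies on the code being prefix-free, so the first codeword that matches is the only possible one.

```python
code = {"A": "0", "B": "100", "C": "1010", "D": "1011", "R": "11"}

def encode(text, code):
    # Encoding is plain concatenation of codewords.
    return "".join(code[ch] for ch in text)

def decode(bits, code):
    inverse = {word: ch for ch, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # a complete codeword has been read
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)

bits = encode("ABRACADABRA", code)
print(bits, len(bits))     # 01001101010010110100110 23
print(decode(bits, code))  # ABRACADABRA
```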
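Huffman codes
• Binary character code: each character is represented by a unique binary string.
• A data file of 100 characters with frequencies a:45, b:13, c:12, d:16, e:9, f:5 can be coded in two ways:
• The first way (fixed-length, 3 bits per character) needs 100*3 = 300 bits.
• The second way (variable-length) needs 45*1 + 13*3 + 12*3 + 16*3 + 9*4 + 5*4 = 224 bits.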
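The two totals can be verified directly. The sketch assumes the frequencies belong to the characters a to f as labeled in the tree figure (a:45, b:13, c:12, d:16, e:9, f:5) and that the variable-length codewords have lengths 1, 3, 3, 3, 4, 4 in that order. The variable-length total works out to 224 bits.

```python
freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
var_len = {"a": 1, "b": 3, "c": 3, "d": 3, "e": 4, "f": 4}

fixed_bits = sum(freq.values()) * 3                      # 3 bits per character
variable_bits = sum(freq[c] * var_len[c] for c in freq)  # frequency * code length

print(fixed_bits, variable_bits)  # 300 224
```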
Variable-length code
• Reading a variable-length code takes some care.
• Example: 001011101 with codewords a=0, b=00, c=01, d=11.
• Where to cut? 00 can be read as either aa or b.
• Prefixes of 0011: 0, 00, 001, and 0011.
• Prefix codes: no codeword is a prefix of any other codeword (prefix-free).
• Prefix codes are simple to encode and decode.
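The prefix-free property and the ambiguity of the code a=0, b=00, c=01, d=11 can both be tested mechanically. In this sketch the helpers `is_prefix_free` and `parses` are hypothetical names, and the second code is the six-character prefix code implied by the encode/decode example (a=0, b=101, c=100, d=111, e=1101, f=1100).

```python
def is_prefix_free(codewords):
    words = sorted(codewords)
    # In sorted order, a codeword that prefixes another one also prefixes
    # its immediate successor, so checking neighbours is enough.
    return not any(b.startswith(a) for a, b in zip(words, words[1:]))

def parses(bits, code):
    """Return every decoding of bits as a sequence of codewords."""
    if not bits:
        return [""]
    results = []
    for symbol, word in code.items():
        if bits.startswith(word):
            results += [symbol + rest for rest in parses(bits[len(word):], code)]
    return results

ambiguous = {"a": "0", "b": "00", "c": "01", "d": "11"}
prefix_code = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}

print(is_prefix_free(ambiguous.values()))    # False
print(is_prefix_free(prefix_code.values()))  # True
print(parses("00", ambiguous))               # ['aa', 'b']
```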
Using the codewords in the table to encode and decode
• Encode: abc = 0.101.100 = 0101100 (just concatenate the codewords: a=0, b=101, c=100)
• Decode: 001011101 = 0.0.101.1101 = aabe (e=1101)
[Figure: two binary code trees over the leaves a:45, b:13, c:12, d:16, e:9, f:5, with edges labeled 0 and 1 and each internal node labeled with the sum of the frequencies below it (root 100). Left: tree for the fixed-length codeword. Right: tree for the variable-length codeword, which is the tree used for decoding.]
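Decoding with a code tree means walking from the root, going one way on 0 and the other on 1, and emitting a character whenever a leaf is reached. The sketch below assumes left edges are 0 and right edges are 1. It also assumes the variable-length tree has the shape implied by the codewords a=0, c=100, b=101, f=1100, e=1101, d=111. The nested-tuple representation and the helper names are illustrative choices.

```python
# Internal nodes are (left, right) pairs; leaves are single characters.
tree = ("a", (("c", "b"), (("f", "e"), "d")))

def codewords(node, prefix=""):
    """Read the code table off the tree: 0 for left, 1 for right."""
    if isinstance(node, str):
        return {node: prefix}
    left, right = node
    return {**codewords(left, prefix + "0"), **codewords(right, prefix + "1")}

def decode(bits, tree):
    out, node = [], tree
    for bit in bits:
        node = node[int(bit)]      # follow the edge labeled with this bit
        if isinstance(node, str):  # reached a leaf: emit it, restart at the root
            out.append(node)
            node = tree
    if node is not tree:
        raise ValueError("bit string ends in the middle of a codeword")
    return "".join(out)

table = codewords(tree)
encoded = "".join(table[ch] for ch in "abc")
print(encoded)                    # 0101100
print(decode("001011101", tree))  # aabe
```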
Binary tree
• Every nonleaf node has two children.
• Why?
• The fixed-length code in our example is not optimal.
• The total number of bits required to encode a file is $B(T) = \sum_{c \in C} f(c)\, d_T(c)$
• f(c): the frequency (number of occurrences) of c in the file
• d_T(c): the depth of c's leaf in the tree T
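One way to see why a node with a single child cannot occur in an optimal tree is to remove such a node and recompute the cost. The sketch assumes, as the fixed-length tree figure suggests, that e and f sit under a node that is its parent's only child. Under the usual assignment this corresponds to the fixed codewords e=100 and f=101. Splicing that node out lifts e and f by one level and saves f(e) + f(f) = 14 bits.

```python
freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}

def cost(depth):
    # B(T): sum over characters of frequency * depth of the character's leaf.
    return sum(freq[c] * depth[c] for c in freq)

fixed_depth = dict.fromkeys(freq, 3)              # every leaf at depth 3
trimmed_depth = {**fixed_depth, "e": 2, "f": 2}   # single-child node removed

print(cost(fixed_depth), cost(trimmed_depth))  # 300 286
```

The trimmed tree still defines a prefix code, so the fixed-length tree cannot have been optimal.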
Constructing an optimal coding scheme
• Formal definition of the problem:
• Input: a set of characters C = {c1, c2, …, cn}, where each c ∈ C has frequency f[c].
• Output: a binary tree representing codewords so that the total number of bits required for the file is minimized.
• Huffman proposed a greedy algorithm to solve the problem.
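[Figure: the steps of Huffman's algorithm on the frequencies f:5, e:9, c:12, b:13, d:16, a:45. Each step merges the two lowest-frequency trees; the two edges below each new node are labeled 0 and 1.]
(a) Initial queue, sorted by frequency: f:5, e:9, c:12, b:13, d:16, a:45
(b) Merge f:5 and e:9 into a node of frequency 14
(c) Merge c:12 and b:13 into a node of frequency 25
(d) Merge the 14 node and d:16 into a node of frequency 30
(e) Merge the 25 node and the 30 node into a node of frequency 55
(f) Merge a:45 and the 55 node into the root, of frequency 100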
HUFFMAN(C)
1 n := |C|
2 Q := C
3 for i := 1 to n-1 do
4     z := ALLOCATE_NODE()
5     x := left[z] := EXTRACT_MIN(Q)
6     y := right[z] := EXTRACT_MIN(Q)
7     f[z] := f[x] + f[y]
8     INSERT(Q, z)
9 return EXTRACT_MIN(Q)
The Huffman Algorithm
• This algorithm builds the tree T corresponding to the optimal code in a bottom-up manner.
• C is a set of n characters, and each character c in C has a defined frequency f[c].
• Q is a min-priority queue, keyed on f, used to identify the two least-frequent objects to merge together.
• The result of a merger is a new object (internal node) whose frequency is the sum of the frequencies of the two merged objects.
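The procedure described above might be written in Python roughly as follows, with `heapq` standing in for the priority queue Q. This is a sketch under assumptions, not the course's reference implementation. Nodes are nested (left, right) tuples, characters are single strings, a counter breaks ties so the heap never has to compare nodes, and left edges are read as 0.

```python
import heapq
from itertools import count

def huffman(freqs):
    """Build a Huffman tree for a non-empty {character: frequency} dict."""
    tie = count()  # tie-breaker for equal frequencies
    q = [(f, next(tie), c) for c, f in freqs.items()]
    heapq.heapify(q)                          # Q := C
    for _step in range(len(freqs) - 1):       # n - 1 merges
        fx, _tx, x = heapq.heappop(q)         # x := EXTRACT_MIN(Q)
        fy, _ty, y = heapq.heappop(q)         # y := EXTRACT_MIN(Q)
        heapq.heappush(q, (fx + fy, next(tie), (x, y)))  # new node z, f[z] = f[x] + f[y]
    return q[0][2]                            # the one remaining node is the root

def codewords(node, prefix=""):
    """Read codewords off the tree: 0 for the left edge, 1 for the right."""
    if isinstance(node, str):
        return {node: prefix}
    left, right = node
    return {**codewords(left, prefix + "0"), **codewords(right, prefix + "1")}

freqs = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
tree = huffman(freqs)
codes = codewords(tree)
print(codes)
# {'a': '0', 'c': '100', 'b': '101', 'f': '1100', 'e': '1101', 'd': '111'}
```

With these frequencies no ties occur, so the merges are forced and the result matches the codewords used in the encode/decode example.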
Time complexity
• Initializing Q in Line 2 takes O(n) time if a heap is built from the n characters.
• Lines 4-8 are executed n-1 times.
• Each heap operation in Lines 4-8 takes O(lg n) time.
• Total time required is O(n lg n).
Note: The details of heap operations will not be tested. The time complexity O(n lg n) should be remembered.
A Complete Example
Scan the original text: Eerie eyes seen near lake.
• What characters are present? E e r i space y s n a l k .
Building a Tree
Scan the original text: Eerie eyes seen near lake.
• What is the frequency of each character in the text?
Char   Freq.   Char   Freq.   Char   Freq.
E      1       y      1       k      1
e      8       s      2       .      1
r      2       n      2
i      1       a      2
space  4       l      1
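The frequency scan is a one-liner with `collections.Counter`. The sketch below is only a convenience check of the table; uppercase E and lowercase e are counted separately, as in the slides.

```python
from collections import Counter

text = "Eerie eyes seen near lake."
freq = Counter(text)

# List characters from least to most frequent.
for ch, n in sorted(freq.items(), key=lambda kv: (kv[1], kv[0])):
    print(repr(ch), n)

print(len(freq), sum(freq.values()))  # 12 26
```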
Building a Tree
• The array after inserting all nodes, ordered by frequency: E:1, i:1, y:1, l:1, k:1, .:1, r:2, s:2, n:2, a:2, sp:4, e:8
[Figure sequence: at each step the two lowest-frequency nodes are dequeued, made the children of a new node whose frequency is their sum, and the new node is enqueued. The merges shown are:]
1. E:1 + i:1 = 2
2. y:1 + l:1 = 2
3. k:1 + .:1 = 2
4. r:2 + s:2 = 4
5. n:2 + a:2 = 4
6. 2 (E, i) + 2 (y, l) = 4
7. 2 (k, .) + sp:4 = 6
8. 4 (r, s) + 4 (n, a) = 8
9. 4 (E, i, y, l) + 6 (k, ., sp) = 10
10. e:8 + 8 (r, s, n, a) = 16
11. 10 + 16 = 26 (the root)
• What is happening to the characters with a low number of occurrences?
• After enqueueing the last node (26), there is only one node left in the priority queue.
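The length of the encoded text can be obtained without storing the tree at all, because each merge pushes every character beneath the new node one level deeper. Summing the merged frequencies therefore gives the total bit count. The sketch below is an illustration of that shortcut rather than something the slides spell out, and it reports 84 bits for this text. For comparison, a fixed-length code needs 4 bits for 12 distinct characters.

```python
import heapq
from collections import Counter

text = "Eerie eyes seen near lake."
freq = Counter(text)

heap = list(freq.values())
heapq.heapify(heap)
huffman_bits = 0
while len(heap) > 1:
    merged = heapq.heappop(heap) + heapq.heappop(heap)
    huffman_bits += merged  # one extra bit for each occurrence under this node
    heapq.heappush(heap, merged)

fixed_bits = len(text) * (len(freq) - 1).bit_length()  # 26 characters * 4 bits
print(huffman_bits, fixed_bits)  # 84 104
```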
Using heap:
[Figure sequence: the characters a-f with frequencies a:45, b:13, c:12, d:16, e:9, f:5 are stored as nodes with P, L, and R fields (apparently parent, left, and right pointers) and arranged into a heap keyed on frequency. In the final frames a new node g with frequency 14 is created, with f as its left child and e as its right child, and is added to the heap.]
CS3335 Design and Analysis of Algorithms/WANG Lusheng
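The first round of heap operations can be imitated with Python's `heapq`: extract the two minima f and e, then insert a new node g with their combined frequency. This sketch tracks only (frequency, name) pairs. It does not model the P/L/R pointer fields from the figure, and the array layout `heapq` chooses may differ from the one drawn on the slides.

```python
import heapq

heap = [(45, "a"), (13, "b"), (12, "c"), (16, "d"), (9, "e"), (5, "f")]
heapq.heapify(heap)      # build a min-heap keyed on frequency

x = heapq.heappop(heap)  # least frequent node
y = heapq.heappop(heap)  # second least frequent node
g = (x[0] + y[0], "g")   # new internal node with children x and y
heapq.heappush(heap, g)

print(x, y, g)   # (5, 'f') (9, 'e') (14, 'g')
print(heap[0])   # (12, 'c')
```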