
Data Compression

• Suppose we have a 1,000,000,000 (1G) character data file that
  we wish to include in an email.
• Suppose the file only contains the 26 letters {a, …, z}.
• Suppose each letter c in {a, …, z} occurs with frequency fc.
• Suppose we encode each letter by a binary code.
• If we use a fixed-length code, we need 5 bits for each
  character.
• The resulting message length is 5(fa + fb + … + fz) bits.

• Can we do better?
Huffman Codes

• Most character code systems (ASCII, Unicode) use
  fixed-length encoding
• If frequency data is available and there is a wide
  variety of frequencies, variable-length encoding can
  save 20% to 90% of the space
• Which characters should we assign shorter codes, and
  which characters should get longer codes?
Data Compression: A Smaller Example
• Suppose the file only has 6 letters {a, b, c, d, e, f}
  with frequencies

  letter            a     b     c     d     e     f
  frequency        .45   .13   .12   .16   .09   .05
  fixed length     000   001   010   011   100   101
  variable length  0     101   100   111   1101  1100

• Fixed length: 3 bits/character × 1G characters = 3G = 3,000,000,000 bits
• Variable length: (.45·1 + .13·3 + .12·3 + .16·3 + .09·4 + .05·4) × 1G = 2.24G bits
How to decode?

• At first it is not obvious how decoding
  will happen, but it is possible if we
  use prefix codes
Prefix Codes
• No codeword of one character can be a prefix of the
  longer codeword of another character; for example,
  we could not encode t as 01 and x as 01101, since 01
  is a prefix of 01101
• By using a binary tree representation (letters at the
  leaves, 0 for a left branch and 1 for a right branch),
  we will generate prefix codes, provided all letters are leaves
Prefix Codes
• A message can be decoded uniquely.

• Follow the tree from the root until you reach a leaf,
  output that letter, and then repeat from the root!

• Draw a few more trees and produce the codes!


Some Properties
• Prefix codes allow easy decoding
  – Given a: 0, b: 101, c: 100, d: 111, e: 1101, f: 1100
  – Decode 001011101 going left to right:
    0|01011101 → a, then 0|1011101 → a|a, then 101|1101 → a|a|b,
    then 1101 → a|a|b|e
• An optimal code must correspond to a full binary tree (a tree
  where every internal node has two children)
• For |C| leaves there are |C| − 1 internal nodes
• The number of bits to encode a file is

  B(T) = Σc∈C f(c)·dT(c)

  where f(c) is the frequency of c and dT(c) is the tree depth of
  c, which corresponds to the code length of c
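
To make the decoding step concrete, here is a minimal Python sketch using the code table above; it relies only on the prefix property, so the first codeword that matches is the right one:

    def decode(bits, code):
        # Invert the code table: codeword -> letter.
        inv = {w: ch for ch, w in code.items()}
        out, buf = [], ""
        for b in bits:
            buf += b
            if buf in inv:          # prefix-freeness makes this match unambiguous
                out.append(inv[buf])
                buf = ""
        assert buf == "", "leftover bits: not a valid encoding"
        return "".join(out)

    code = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}
    print(decode("001011101", code))   # -> "aabe", matching the slide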
Optimal Prefix Coding Problem

• Input: Given a set of n letters (c1, …, cn) with
  frequencies (f1, …, fn).

• Construct a full binary tree T to define a prefix
  code that minimizes the average code length

  Average(T) = Σi=1..n fi · lengthT(ci)
Greedy Algorithms
• Many optimization problems can be solved more
quickly using a greedy approach
– The basic principle is that locally optimal decisions
  may be used to build an optimal solution
– But the greedy approach may not always lead to an
optimal solution overall for all problems
– The key is knowing which problems will work with
this approach and which will not
• We will study
– The problem of generating Huffman codes
Greedy algorithms
• A greedy algorithm always makes the choice that
looks best at the moment
– The hope: a locally optimal choice will lead to a
globally optimal solution
– For some problems, it works
• Greedy algorithms tend to be easier to code
David Huffman’s idea

• Build the tree (code) bottom-up in a greedy fashion
Building the Encoding Tree
(figures omitted: step-by-step construction of the encoding tree)
The Algorithm

• An appropriate data structure is a binary min-heap
• Rebuilding the heap takes O(lg n) time, and n − 1 merges
  (extractions) are made, so the complexity is O(n lg n)
• The encoding is NOT unique; other encodings may
  work just as well, but none will work better
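
A minimal sketch of the algorithm in Python, assuming a {letter: frequency} dict as input and using the standard-library heapq module as the binary min-heap; ties are broken by an insertion counter, so the resulting (still valid and optimal) code table may differ from the one shown on the earlier slide:

    import heapq
    from itertools import count

    def huffman_code(freq):
        # Each heap entry is (frequency, tiebreak, node); a node is either a
        # letter (leaf) or a (left, right) tuple (internal node).
        tiebreak = count()
        heap = [(f, next(tiebreak), ch) for ch, f in freq.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)    # two lowest-frequency nodes
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
        _, _, root = heap[0]

        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):          # internal node: 0 = left, 1 = right
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:                                # leaf: record the codeword
                codes[node] = prefix or "0"
        walk(root, "")
        return codes

    freq = {"a": 0.45, "b": 0.13, "c": 0.12, "d": 0.16, "e": 0.09, "f": 0.05}
    print(huffman_code(freq))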
Correctness of Huffman’s Algorithm
Lemma A:
Let x and y be the two letters of lowest frequency. Then there is an
optimal tree in which x and y are sibling leaves of maximum depth.

Idea: take any optimal tree T and let a and b be sibling leaves of
maximum depth in T. Swap x with a (giving T'), then y with b (giving T'').
Since each swap does not increase the cost, the resulting tree T'' is
also an optimal tree.
Proof of Lemma A
• Without loss of generality, assume f[a] ≤ f[b] and
  f[x] ≤ f[y]
• The cost difference between T and T’ is
  B(T) − B(T')
    = Σc∈C f(c)·dT(c) − Σc∈C f(c)·dT'(c)
    = f[x]·dT(x) + f[a]·dT(a) − f[x]·dT'(x) − f[a]·dT'(a)
    = f[x]·dT(x) + f[a]·dT(a) − f[x]·dT(a) − f[a]·dT(x)
    = (f[a] − f[x])·(dT(a) − dT(x))
    ≥ 0
By the same argument for the second swap, B(T'') ≤ B(T'), so B(T'') ≤ B(T).
But T is optimal, so B(T) ≤ B(T''). Hence B(T'') = B(T).
Therefore T’’ is an optimal tree in which x and y
appear as sibling leaves of maximum depth
Correctness of Huffman’s Algorithm
Lemma B:
Let z be a new letter with f[z] = f[x] + f[y], and let C' = (C − {x, y}) ∪ {z}.
If T' is an optimal tree for C', then the tree T obtained from T' by replacing
the leaf z with an internal node whose children are x and y is optimal for C.

• Observation: B(T) = B(T') + f[x] + f[y], i.e. B(T') = B(T) − f[x] − f[y]
  – For each c ∈ C − {x, y}: dT(c) = dT'(c), so f[c]·dT(c) = f[c]·dT'(c)
  – dT(x) = dT(y) = dT'(z) + 1
  – f[x]·dT(x) + f[y]·dT(y) = (f[x] + f[y])·(dT'(z) + 1) = f[z]·dT'(z) + (f[x] + f[y])
  – Therefore B(T') = B(T) − f[x] − f[y]

Example (merging the leaves with frequencies 5 and 9 into z:14):
  B(T)  = 45·1 + 12·3 + 13·3 + 5·4 + 9·4 + 16·3
  B(T') = 45·1 + 12·3 + 13·3 + (5+9)·3 + 16·3
        = B(T) − 5 − 9
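
A quick numeric check of the observation in Python, assuming the depths implied by the terms above (the leaves with frequencies 5 and 9 sit at depth 4 in T, and are replaced by z:14 at depth 3 in T'):

    # Leaf frequency -> depth, read off from the two sums on the slide.
    depth_T      = {45: 1, 12: 3, 13: 3, 5: 4, 9: 4, 16: 3}   # original tree T
    depth_Tprime = {45: 1, 12: 3, 13: 3, 14: 3, 16: 3}        # T': 5 and 9 merged into z:14

    B_T      = sum(f * d for f, d in depth_T.items())         # 224
    B_Tprime = sum(f * d for f, d in depth_Tprime.items())    # 210

    assert B_Tprime == B_T - 5 - 9   # B(T') = B(T) - f[x] - f[y]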
Proof of Lemma B
• Prove by contradiction.
• Suppose that T does not represent an optimal prefix code
  for C. Then there exists a tree T'' such that B(T'') < B(T).
• Without loss of generality (by Lemma A), T'' has x and y
  as siblings. Let T''' be the tree T'' with the common
  parent of x and y replaced by a leaf z with frequency
  f[z] = f[x] + f[y]. Then
• B(T''') = B(T'') − f[x] − f[y] < B(T) − f[x] − f[y] = B(T')
  – T''' is better than T', contradicting the
    assumption that T' is an optimal prefix code for C'
