Huffman Coding by Akas
1 FEASIBILITY STUDY
In computer science and information theory, Huffman coding is an entropy encoding algorithm
used for lossless data compression. The term refers to the use of a variable-length code table for
encoding a source symbol (such as a character in a file) where the variable-length code table has
been derived in a particular way based on the estimated probability of occurrence for each
possible value of the source symbol. It was developed by David A. Huffman at MIT, and
published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".
Huffman coding uses a specific method for choosing the representation for each symbol, resulting in a prefix code (sometimes called a "prefix-free code"): the bit string representing one symbol is never a prefix of the bit string representing any other symbol. The code expresses the most common characters using shorter strings of bits than are used for less common source symbols. Huffman was able to design the most efficient compression method of this type: no other mapping of individual source symbols to unique strings of bits will produce a smaller average output size when the actual symbol frequencies agree with those used to create the code.
1.1 History:
In 1951, David A. Huffman and his MIT information theory classmates were given the choice of
a term paper or a final exam. The professor, Robert M. Fano, assigned a term paper on the
problem of finding the most efficient binary code. Huffman, unable to prove any codes were the
most efficient, was about to give up and start studying for the final when he hit upon the idea of
using a frequency-sorted binary tree and quickly proved this method the most efficient. In doing
so, the student outdid his professor, who had worked with information theory inventor Claude
Shannon to develop a similar code.
The construction algorithm uses a priority queue in which the node with the lowest probability is given the highest priority. This priority queue is used to build the Huffman tree, which assigns fewer bits to the more frequent symbols, while symbols that occur less frequently take up more bits. Once the codes have been generated, the text is scanned again and a new file is created using the Huffman code.
Problem definition: "To design a system that will allow the user to enter the total number of characters with their frequencies at the terminal and then display the Huffman codes on the terminal in an interactive manner."
The main aim of the feasibility study activity is to determine whether it would be financially and technically feasible to develop the product. After thoroughly analyzing the problem definition and the Huffman coding algorithm from various standard books on information theory and from the internet, several strategies for solving the problem were analyzed, and finally the algorithm based on a priority queue (singly linked list) was chosen.
We need a C++ compiler and a computer for our Huffman algorithm, both of which are available. The personnel involved with the project should be well versed in the basic concepts of C, C++ and data structures.
Economic feasibility depends on the complexity of the problem and the number of personnel involved with the project. Huffman coding is a problem of moderate complexity and thus needs at least 4 group members. Our team comprises 4 engineering students who are familiar with data structures, and thus the project is economically feasible.
This project is developed for information theory professionals, and the proposed system provides fast and efficient operations, so it will be acceptable to a large extent. The proposed system is therefore operationally feasible.
The goal of the requirements gathering activity is to collect from the customer all relevant information regarding the product to be developed, with a view to clearly understanding the customer requirements. To thoroughly understand the problem, a study of graph theory was first conducted using standard textbooks (refer to the Bibliography) to understand the Huffman tree. Then the study of Huffman coding was taken up from various relevant sources such as the internet and books on data compression. Finally, the method used by Huffman coding to compress data was studied to understand the motive behind the project.
The goal of the requirements analysis activity is to weed out the incompleteness and inconsistencies in the gathered requirements. The data collected from various sources, including a group of users, usually contain several contradictions, ambiguities and inconsistencies, since each user typically has only a partial and incomplete view of the system. In the case of the Huffman coding algorithm the main requirement is the frequency distribution table, which needs to be checked for contradictions, ambiguities and incompleteness.
For example, the user should not enter the same symbol twice, as this results in an inconsistency and would lead to errors, because each symbol has exactly one Huffman code. If the user enters only one symbol, this leads to incompleteness; in that case the algorithm is not used and we conventionally assign 0 or 1 to the symbol. Similarly, the number of symbols cannot be greater than 94.
ASCII codes 64-127 (decimal, hexadecimal, character):
Dec Hex Char   Dec Hex Char   Dec Hex Char   Dec Hex Char
64  40  @      80  50  P      96  60  `      112 70  p
65  41  A      81  51  Q      97  61  a      113 71  q
66  42  B      82  52  R      98  62  b      114 72  r
67  43  C      83  53  S      99  63  c      115 73  s
68  44  D      84  54  T      100 64  d      116 74  t
69  45  E      85  55  U      101 65  e      117 75  u
70  46  F      86  56  V      102 66  f      118 76  v
71  47  G      87  57  W      103 67  g      119 77  w
72  48  H      88  58  X      104 68  h      120 78  x
73  49  I      89  59  Y      105 69  i      121 79  y
74  4A  J      90  5A  Z      106 6A  j      122 7A  z
75  4B  K      91  5B  [      107 6B  k      123 7B  {
76  4C  L      92  5C  \      108 6C  l      124 7C  |
77  4D  M      93  5D  ]      109 6D  m      125 7D  }
78  4E  N      94  5E  ^      110 6E  n      126 7E  ~
79  4F  O      95  5F  _      111 6F  o      127 7F  (DEL)
As can be seen from the table above, the number of valid symbols that can be entered cannot be greater than 94: standard ASCII reserves 33 of its 128 codes (0-31 and 127) for control purposes, and the space character is not entered as a symbol, which leaves 94 printable symbols. This incompleteness needs to be removed by prompting with a proper error message. Huffman coding has the important property of being "prefix free", i.e., no codeword is a prefix of any other codeword, so no ambiguity can occur among the Huffman codewords and the message is decoded correctly.
The customer requirements identified during the requirements gathering and analysis activity are organized into an SRS document. The important components of this document are:
Purpose:
a) Data compression: The main purpose of the software is to generate the Huffman codes used for data compression.
1. The user provides a symbol and its corresponding frequency as input. The frequency has to be a positive integer.
2. The symbol and its corresponding frequency are inserted into a node of the priority queue.
3. Once we have inserted all the symbols and their corresponding frequencies into the priority queue, we build the Huffman tree for the symbols.
4. Once the complete tree is created, we determine the Huffman code of each symbol by traversing the tree.
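A minimal sketch of these steps is given below. It is only an illustration: it uses the C++ standard library's std::priority_queue rather than the singly linked list chosen for this project, the frequency table is hard-coded as an assumed example, and it does not follow the tie-breaking and child-ordering conventions described later in this document.
/* Illustrative sketch of the four steps above using std::priority_queue
   (not the project's singly linked list implementation). */
#include <iostream>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct Node {
    char sym;                  // symbol ('\0' for composite nodes)
    int freq;                  // frequency (sum of the children for composite nodes)
    Node *left;
    Node *right;
    Node(char s, int f, Node *l = 0, Node *r = 0) : sym(s), freq(f), left(l), right(r) {}
};

struct LowestFirst {           // the node with the lowest frequency gets the highest priority
    bool operator()(const Node *a, const Node *b) const { return a->freq > b->freq; }
};

// Step 4: traverse the tree, appending 0 for a left edge and 1 for a right edge.
void assignCodes(const Node *n, const std::string &code,
                 std::vector<std::pair<char, std::string> > &out) {
    if (n == 0) return;
    if (n->left == 0 && n->right == 0) { out.push_back(std::make_pair(n->sym, code)); return; }
    assignCodes(n->left, code + "0", out);
    assignCodes(n->right, code + "1", out);
}

int main() {
    // Step 1: the symbol/frequency pairs (assumed example values).
    std::pair<char, int> table[] = { std::make_pair('a', 3), std::make_pair('b', 3),
                                     std::make_pair('c', 4), std::make_pair('d', 4),
                                     std::make_pair('e', 5), std::make_pair('f', 5) };
    const int n = sizeof(table) / sizeof(table[0]);

    // Step 2: insert every symbol into the priority queue.
    std::priority_queue<Node*, std::vector<Node*>, LowestFirst> pq;
    for (int i = 0; i < n; ++i) pq.push(new Node(table[i].first, table[i].second));

    // Step 3: repeatedly merge the two lowest-frequency nodes until one root remains.
    while (pq.size() > 1) {
        Node *a = pq.top(); pq.pop();
        Node *b = pq.top(); pq.pop();
        pq.push(new Node('\0', a->freq + b->freq, a, b));
    }

    // Step 4: generate and print the Huffman code of each symbol.
    std::vector<std::pair<char, std::string> > codes;
    assignCodes(pq.top(), "", codes);
    for (std::size_t i = 0; i < codes.size(); ++i)
        std::cout << codes[i].first << " : " << codes[i].second << "\n";
    return 0;
}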
To compile and run the program with GCC:
1. Download GCC; on Unix, type "sudo apt-get install build-essential" at the terminal.
2. Make minor changes to run on a POSIX platform (Unix or Mac): e.g., replace clrscr() with the corresponding Linux command, clear, used for the same purpose (see the sketch after this list).
3. GCC has been adopted as the standard compiler by most modern Unix-like computer operating systems, including GNU/Linux, the BSD family and Mac OS X.
4. The command for compiling on Unix after the installation of the GCC components is "g++ -Wall -W -Werror huff.cpp -o huff".
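For example, a small compatibility wrapper of the following kind (a sketch, not part of the original listing) keeps the screen-clearing call portable between Turbo C++ and a POSIX terminal:
/* Sketch of a portable clear-screen wrapper (illustrative only). */
#ifdef __TURBOC__
#include <conio.h>
void clear_screen() { clrscr(); }              /* Turbo C++: conio.h */
#else
#include <cstdlib>
void clear_screen() { std::system("clear"); }  /* POSIX (Linux, Mac OS X): run the clear command */
#endif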
Goals of implementation:
Huffman coding today is often used as a "back-end" to some other compression method.
DEFLATE (PKZIP's algorithm) and multimedia codecs such as JPEG and MP3 have a front-end
model and quantization followed by Huffman coding.
Example implementations:
DEFLATE (a combination of LZ77 and Huffman coding) – used by ZIP, gzip and PNG
files
JPEG (image compression using a discrete cosine transform, then quantization, then
Huffman coding)
MPEG (audio and video compression standards family in wide use, using DCT and
motion-compensated prediction for video)
o MP3 (a part of the MPEG-1 standard for sound and music compression, using
subbanding and MDCT, perceptual modeling, quantization, and Huffman coding)
o AAC (part of the MPEG-2 and MPEG-4 audio coding specifications, using
MDCT, perceptual modeling, quantization, and Huffman coding)
During the software design phase, the designer transforms the SRS document into the design document. The design document produced at the end of the design phase should be implemented using a programming language in the coding phase.
The items taken into consideration in the design phase are the different modules which constitute the system. Control relationships and interfaces among the different modules are identified. Suitable data structures for the data to be stored need to be properly designed and documented.
One of the basic steps in the design process is a graphical representation of the main problem. We use DFDs for this graphical representation.
Level 0 DFD (context diagram)
The Huffman software takes the frequency distribution table as input and computes the corresponding Huffman codes for each symbol.
Level 1 DFD
[Level 1 DFD: process 0.1 "Take input from user" passes valid input to process 0.2 "Insert into priority queue node"; process 0.3 "Insert into composite node of tree" builds composite tree nodes from priority queue nodes; process 0.4 "Generate codes" takes a symbol and produces its Huffman code.]
In the Level 1 DFD we provide a character and its corresponding frequency as input; if the input is valid, it is inserted into the priority queue. Once we have inserted all the symbols and their corresponding frequencies, we dequeue the two nodes of lowest frequency from the priority queue and form a composite node of type tree, which is reinserted into the priority queue if the queue is not yet empty. Once the complete tree is created, we determine the Huffman code of each symbol by traversing the tree.
[Level 2 DFD for process 0.1: sub-process 0.1.1 "Take number of symbols" passes the valid count to sub-process 0.1.2 "Enter the symbol", which passes each valid symbol to sub-process 0.1.3 "Enter frequency".]
In this DFD we take the number of symbols as input; the user may enter only integers from 1 to 94. If the user enters any other value, e.g. characters, an error message is displayed and the user must enter a valid input. After the valid number of characters has been obtained, the user must enter each symbol for which a Huffman code is required. After that input has been validated, we take the frequency of the character; only integer values are accepted here, and if the user provides anything other than an integer an error message is displayed.
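A sketch of this validation is shown below; the prompt texts and function name are illustrative assumptions, but the checks (digits only, range 1 to 94) follow the description above.
/* Illustrative sketch of validating the number of symbols (1 to 94). */
#include <iostream>
#include <string>
#include <cctype>
#include <cstdlib>

int readSymbolCount() {
    std::string buf;
    for (;;) {
        std::cout << "Enter the number of symbols (1-94): ";
        std::cin >> buf;
        bool allDigits = !buf.empty();
        for (std::size_t i = 0; i < buf.size(); ++i)
            if (!std::isdigit(static_cast<unsigned char>(buf[i]))) { allDigits = false; break; }
        if (!allDigits) {                       // rejects characters and negative signs
            std::cout << "Please enter a positive integer.\n";
            continue;
        }
        int n = std::atoi(buf.c_str());
        if (n < 1 || n > 94) {                  // rejects 0 and values above 94
            std::cout << "Please enter a value between 1 and 94.\n";
            continue;
        }
        return n;
    }
}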
[Level 2 DFD for process 0.2: sub-process 0.2.1 "Calculate location in priority queue" takes the valid symbol or composite tree node and determines its position; sub-process 0.2.2 "Insert node in priority queue" places the node at that position in the queue.]
In this DFD, the proper location for the valid symbol or composite node is determined and the node is placed at that location. If the front of the priority queue is null, the node is inserted at the beginning of the priority queue; otherwise the node is placed at a specific position depending on its frequency.
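A sketch of this insertion logic over a singly linked list is given below. The node layout is simplified and the names are illustrative; a node whose frequency equals an existing one is placed ahead of it, following the convention stated later in the testing section.
/* Illustrative sketch of sorted insertion into a singly linked list priority queue. */
struct PQNode {
    char symbol[95];      // symbol, or concatenated symbols for a composite node
    int freq;             // frequency; a lower value means a higher priority
    PQNode *next;
};

PQNode *front = 0;        // front (highest-priority node) of the priority queue

// Insert a node so that the queue stays ordered by increasing frequency.
// A node whose frequency equals an existing one is placed ahead of it.
void enqueue(PQNode *node) {
    if (front == 0 || node->freq <= front->freq) {   // empty queue, or new front
        node->next = front;
        front = node;
        return;
    }
    PQNode *cur = front;                             // find position (the fp module's job)
    while (cur->next != 0 && cur->next->freq < node->freq)
        cur = cur->next;
    node->next = cur->next;
    cur->next = node;
}

// Remove and return the lowest-frequency node (the current front).
PQNode *dequeue() {
    PQNode *node = front;
    if (front != 0) front = front->next;
    return node;
}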
[Level 2 DFD for process 0.4: the input symbol flows into "Calculate path from root to leaf", and sub-process 0.4.2 "Traverse path assigning 0 to left and 1 to right child" outputs the code.]
Here the input is the symbol for which we have to generate the Huffman code. While traversing from the root to the leaf (the symbol), we emit a 0 each time we move to a left child and a 1 each time we move to a right child.
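A sketch of this traversal is shown below. It assumes, as in the design above, that every composite node stores the concatenation of the symbols beneath it, so membership in the left subtree can be tested with strchr; the structure and names are illustrative.
/* Illustrative sketch: print the Huffman code of one symbol, 0 = left edge, 1 = right edge.
   Assumes composite nodes store the concatenated symbols of their subtree. */
#include <iostream>
#include <cstring>

struct TreeNode {
    char a[95];           // symbol, or concatenated symbols for a composite node
    int freq;
    TreeNode *left;
    TreeNode *right;
};

void printCode(const TreeNode *root, char symbol) {
    const TreeNode *node = root;
    while (node != 0 && (node->left != 0 || node->right != 0)) {
        if (node->left != 0 && std::strchr(node->left->a, symbol) != 0) {
            std::cout << '0';      // the symbol lies in the left subtree
            node = node->left;
        } else {
            std::cout << '1';      // otherwise it lies in the right subtree
            node = node->right;
        }
    }
    std::cout << '\n';
}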
Data Dictionary
Name                          Description
Composite node                Tree or priority queue node
Tree                          Symbol, its frequency and pointers to the left and right children
Priority queue                Symbol, its frequency and a pointer to the next node in the queue
Frequency distribution table  Symbols and their frequencies
Huffman code                  Bit pattern used to represent a symbol
Root                          Root node of the Huffman tree
Leaf                          A symbol entered by the user (having no children)
Path                          Unique sequence of edges from the root to a symbol
The main operations that the data structures must support are as follows:
1. Priority queue:
A priority queue node holds the symbol, its frequency and a pointer to the next node in the queue. This data structure is used to create the nodes of a priority queue ordered by increasing frequency. The main functions that manipulate this priority queue are finding the position for a node in the queue based on its frequency, and dequeuing the first node of the priority queue for composite node creation.
2. Binary tree:
A binary tree node holds the symbol, the frequency of the character, and pointers to the right and left children. This data structure is used to create the nodes of the Huffman tree. Finally, the Huffman tree is traversed to generate the Huffman codes.
[Structure chart: MAIN calls WELCOME, INPUT, INSERT, DISPLAY and ENCODE; INPUT calls ENQUE, INSERT calls DEQUE, and ENQUE calls FP (find position).]
The main module calls the input module, which in turn calls the enque module; enque inserts the nodes into the priority queue at the proper position, based on frequency, by calling the find position (fp) module. The main module then calls the insert module. The insert module calls the deque module, which returns the lowest-frequency (highest-priority) node. This information is used to create a composite node of type tree. The sum of the frequencies and the concatenation of the symbols of the two dequeued nodes are then used as parameters to create a composite node of type priority queue, which is inserted back into the queue if the queue is not empty. The main module may call the display module to display the symbols in in-order manner for debugging purposes. Finally, the encode module is used to generate the Huffman code for each symbol.
3.2.3.1 Flowcharts:
[Flowcharts for the main modules. In outline: INPUT loops until all symbols have been entered; FP searches for the position of a node in the queue; DEQUE sets a pointer to the front node, advances front to front->next and returns the pointer; DISPLAY displays the tree recursively while the root is not null; INSERT loops while the front of the priority queue is not null, building composite nodes and inserting each composite node back into the priority queue when the queue is not empty; ENCODE compares the symbol with the current node, printing 0 and moving to the left child when the symbol is found in the left child, otherwise printing 1 and moving to the right child, until the symbol's node is reached.]
[Step-wise algorithms for the modules, in outline: ENCODE prints 0 and repeats from step 2 when the symbol lies in the left child, otherwise prints 1 and repeats from step 2; DISPLAY recursively calls Display(root) with root set to the left child and then with root set to the right child; DEQUE returns the node having the lowest frequency from the priority queue.]
2 CODING
/* HUFFMAN CODING (Mini Project) */
/*
Implemented By :
*/
/* Header files (Turbo C++ style; see the porting notes above for GCC): */
#include<iostream.h>
#include<string.h>
#include<math.h>
#include<stdlib.h>
#include<conio.h>
#include<ctype.h>
#include<graphics.h>
/* Global declarations */
int n;                        /* number of symbols */
char b[94][2];                /* symbols entered by the user */
/* Structure specifications */
struct tree                   /* Huffman tree node */
{
char a[94];                   /* symbol, or concatenated symbols for a composite node */
int s;                        /* frequency (sum of the children for a composite node) */
struct tree *left,*right;     /* pointers to the left and right children (see data dictionary) */
}*root=NULL,*tt[47]={NULL},*temp,*temp2,*t2,*ri,*le;
struct pqu                    /* priority queue node */
{
int info;                     /* frequency (lower value = higher priority) */
char a[94];                   /* symbol, or concatenated symbols for a composite node */
struct pqu *next;             /* pointer to the next node in the queue (see data dictionary) */
}*front=NULL,*t,*par,*t1,*p1,*p2;
/* Module prototypes (function bodies are omitted in this listing) */
void welcome();
void input();
void insert();
void disp(struct tree *);
void encode(char *);
/* main program */
int main()
{
int i;
welcome();                    /* welcome screen */
input();                      /* read the frequency distribution table */
insert();                     /* build the Huffman tree */
//disp(root);                 /* debugging: in-order display of the tree */
clrscr();
for(i=0;i<n;i++)
encode(b[i]);                 /* print the Huffman code of each symbol */
cout<<"\t";
getch();
return 0;
}
3 TESTING
Our Huffman coding software consists of different modules. An error in any one of these modules will result in a system error, and if these errors are not debugged they will produce a defective system. This may lead to rejection by the customer during the acceptance testing phase and thus to project failure, so testing is an important activity. We need not carry out unit testing, as our program is simple with few modules; however, we do carry out phased testing of our code to verify that it works properly. The modules under test perform the functions of inputting the frequency table, enqueuing the symbols into the priority queue at their specific positions according to frequency (the task of the find position (fp) module), dequeuing the symbols from the priority queue, and inserting them into the binary tree.
At this phase we include a debugging module named display to show the symbols of the complete binary tree by in-order traversal. If the result of the display module is not what is expected, the errors in the code are identified and corrected. We have adopted some conventions, as specified in the SRS document:
1. We assign 0 to the left child and 1 to the right child in the tree.
2. When a new composite node has the same frequency as a node already present in the priority queue, the node that enters the priority queue later is placed at the higher priority.
3. We assign the symbol with the higher frequency as the left child and the one with the lower frequency as the right child in the tree.
4. We assign the code 1 to the string or character when only one character is entered.
Taking all the above conventions into consideration, we conducted integration testing using the phased approach because of the small number of modules.
Input:
{ a=3,b=3,c=4,d=4,e=5,f=5 }
ffeefebadcbbaabadcddcc
Result:
Verification successful
Iteration (1): the two lowest-frequency nodes b=3 and a=3 are dequeued from the priority queue and merged into the composite tree node ba=6 (tt[0]), which is inserted back into the priority queue.
Iteration (2): c=4 and d=4 are dequeued and merged into the composite node dc=8 (tt[1]), which is inserted back into the priority queue.
Iteration (3): e=5 and f=5 are dequeued and merged into the composite node fe=10 (tt[2]), which is inserted back into the priority queue.
Iteration (4): ba=6 and dc=8 are dequeued and merged into the composite node badc=14, which is inserted back into the priority queue.
Iteration (5): fe=10 and badc=14 are dequeued and merged into the root node febadc=24, and the priority queue becomes empty.
This was the expected output; thus the routines involved in the creation of the priority queue and the Huffman tree are verified to be correct, which completes the phased integration testing.
We first designed various test cases with which the software was then tested. The test cases used were:
CASE (1)
Input:
no. of symbols = a
Result: Success
For test case (1) we incorporated lines of code that check whether the input for the number of symbols is a number or not. This validation was performed using isdigit().
CASE (2)
Input:
no. of symbols = 0
Result: Success
We simply check whether the input for the number of characters is 0, and if so display a message to enter a non-zero number; the program then continues and takes new input.
CASE (3)
Input:
no. of symbols = -1
tn[0] contains the negative sign, so isdigit() returns false and the message to enter a valid positive integer is displayed again.
CASE (4)
Input:
no. of symbols = 1
Expected: error message
Result: Success
With a single symbol we do not need to execute the Huffman algorithm, because it is not necessary and would not result in any data compression; we therefore simply exit the program.
CASE (5)
Input:
enter symbol = (a string longer than one character)
Result: Success
strlen() is used first to calculate the length of the string entered, and if it is more than 1 a message is displayed to enter one symbol only.
CASE (6)
Input:
no. of symbols = 98
Result: Success
We check the condition n > 94; if the number of symbols specified is more than 94, we display a message asking the user to enter a number not greater than 94.
CASE (7)
Input:
enter symbol = aa
Result: Success
To eliminate redundancy and ambiguity we include a loop that runs over the symbols entered so far, comparing the newly input symbol with the previous symbols stored in b[]; if a match is found, a message is displayed notifying the user that the symbol has already been entered, and the current loop iteration is exited.
After all the above validations we performed regression testing of all the previous cases and verified the results. All the cases behaved as expected, and thus system testing is complete.
Future scope:
The Huffman coding that we have considered is simple binary Huffman coding, but many variations of Huffman coding exist.
The n-ary Huffman algorithm uses the {0, 1, ..., n − 1} alphabet to encode messages and builds an n-ary tree. This approach was considered by Huffman in his original paper. The same algorithm applies as for binary (n = 2) codes, except that the n least probable symbols are taken together, instead of just the 2 least probable. Note that for n greater than 2, not all sets of source words can properly form an n-ary tree for Huffman coding. In this case, additional 0-probability placeholders must be added. If the number of source words is congruent to 1 modulo n − 1, then the set of source words will form a proper Huffman tree.
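As a small worked illustration (not part of the project), the number of 0-probability placeholders required can be computed as follows:
/* Number of 0-probability placeholders needed so that m source words can
   form a proper n-ary Huffman tree, i.e. so that (m - 1) is divisible by (n - 1). */
#include <iostream>

int placeholders(int m, int n) {
    int r = (m - 1) % (n - 1);
    return r == 0 ? 0 : (n - 1) - r;
}

int main() {
    // Example: 6 source words with a ternary (n = 3) code need 1 placeholder,
    // because 6 + 1 = 7 and (7 - 1) is divisible by (3 - 1).
    std::cout << placeholders(6, 3) << "\n";   // prints 1
    return 0;
}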
A variation called adaptive Huffman coding calculates the probabilities dynamically based on
recent actual frequencies in the source string. This is somewhat related to the LZ family of
algorithms.
Most often, the weights used in implementations of Huffman coding represent numeric probabilities, but the algorithm given above does not require this; it requires only a way to order weights and to add them. The Huffman template algorithm enables one to use any kind of weights (costs, frequencies, etc.).
Length-limited Huffman coding is a variant where the goal is still to achieve a minimum
weighted path length, but there is an additional restriction that the length of each codeword must
be less than a given constant. The package-merge algorithm solves this problem with a simple
greedy approach very similar to that used by Huffman's algorithm. Its time complexity is O(nL),
where L is the maximum length of a codeword. No algorithm is known to solve this problem in
linear or linearithmic time, unlike the presorted and unsorted conventional Huffman problems,
respectively.
In the standard Huffman coding problem, it is assumed that each symbol in the set from which the code words are constructed has an equal cost to transmit: a code word whose length is N digits will always have a cost of N, no matter how many of those digits are 0s and how many are 1s. When working under this assumption, minimizing the total cost of the message and minimizing the total number of digits are the same thing. Huffman coding with unequal letter costs is the generalization without this assumption: the letters of the encoding alphabet may have non-uniform lengths (costs), so minimizing the number of digits and minimizing the total cost are no longer equivalent.
Moreover, we can extend the range of the Huffman coding software to incorporate Unicode, which will require an interfacing module that interprets a particular key in different languages based on the option selected. In that case the maximum number of symbols will be of the order of 2^32.