Data Compression (RCS 087)
Unit-I
Introduction
• Data Compression
• Compression Techniques
• Lossless compression
• Lossy Compression
• Measures of performance
• Modeling and coding
• Mathematical Preliminaries for Lossless compression
• A brief introduction to information theory
• Models: Physical models
• Probability models
• Markov models
• composite source model
• Coding: uniquely decodable codes
• Prefix codes
Data Compression
• The table below shows the difference between lossless and lossy data
compression:
Difference between lossless and lossy data compression
S.No  Lossless data compression            Lossy data compression
1.    In lossless data compression,        In lossy data compression, there
      there is no loss of any data         is a loss of quality and data,
      or quality.                          and the lost data is not
                                           recoverable.
• Character        Frequency
• c 12
• d 13
• Internal Node 14
• e 16
• f 45
• Step 3: Extract the two minimum-frequency nodes from the heap. Add a
new internal node with frequency 12 + 13 = 25.
• Now the min-heap contains 4 nodes: 2 nodes are roots of trees with a
single element each, and 2 heap nodes are roots of trees with more
than one node.
• Character        Frequency
• Internal Node 14
• e 16
• Internal Node 25
• f 45
• Step 4: Extract two minimum frequency nodes. Add a new internal
node with frequency 14 + 16 = 30
• Now min heap contains 3 nodes.
• Character        Frequency
• Internal Node 25
• Internal Node 30
• f 45
• Step 5: Extract two minimum frequency nodes. Add a new internal
node with frequency 25 + 30 = 55
• Now min heap contains 2 nodes.
• Character        Frequency
• f 45
• Internal Node 55
• Step 6: Extract two minimum frequency nodes. Add a new internal
node with frequency 45 + 55 = 100
• Now min heap contains only one node.
• Character        Frequency
• Internal Node 100
• Since the heap contains only one node, the algorithm stops here.
Steps to print codes from Huffman Tree:
Traverse the tree formed starting from the root. Maintain an auxiliary
array. While moving to the left child, write 0 to the array. While
moving to the right child, write 1 to the array. Print the array when a
leaf node is encountered.
• The codes are as follows:
• Character        Codeword
• f 0
• c 100
• d 101
• a 1100
• b 1101
• e 111
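The steps above can be reproduced with a short Python sketch using a min-heap (`heapq`). Note that this excerpt begins partway through the example, so the frequencies of `a` and `b` are not shown; the values `a = 5` and `b = 9` used below are assumptions, chosen to be consistent with the internal node of frequency 14 and the code table above.

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build Huffman codes for a {symbol: frequency} map."""
    tick = count()  # tie-breaker so heap entries always compare
    # Each heap entry: (frequency, tiebreak, tree); a tree is either a
    # symbol (leaf) or a [left, right] pair (internal node).
    heap = [(f, next(tick), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # smaller frequency -> bit 0
        f2, _, right = heapq.heappop(heap)   # larger frequency  -> bit 1
        heapq.heappush(heap, (f1 + f2, next(tick), [left, right]))
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, list):
            walk(tree[0], prefix + "0")      # left child: append 0
            walk(tree[1], prefix + "1")      # right child: append 1
        else:
            codes[tree] = prefix             # leaf: record the codeword
    walk(heap[0][2], "")
    return codes

codes = huffman_codes({"a": 5, "b": 9, "c": 12, "d": 13, "e": 16, "f": 45})
```

With these assumed frequencies, the resulting codewords match the table above (f = 0, c = 100, d = 101, a = 1100, b = 1101, e = 111).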
LZW (Lempel–Ziv–Welch)
• There are two categories of compression techniques, lossy and
lossless. While each uses different techniques to compress files, both
have the same aim: to find repeated data (in the graphic, in the case
of GIF with LZW) and use a much more compact data representation.
Lossless compression reduces bits by identifying and eliminating
statistical redundancy; no information is lost in lossless
compression. On the other hand, lossy compression reduces bits by
removing unnecessary or less important information. So we need data
compression mainly because:
• Uncompressed data can take up a lot of space, which is not good for limited
hard drive space and internet download speeds.
• While hardware gets better and cheaper, algorithms that reduce data size
also help technology evolve.
• Example: one minute of uncompressed HD video can be over 1 GB. How
can we fit a two-hour film on a 25 GB Blu-ray disc?
• Lossy compression methods include DCT (Discrete Cosine Transform),
Vector Quantisation and Transform Coding, while lossless compression
methods include RLE (Run Length Encoding), string-table compression,
LZW (Lempel–Ziv–Welch) and zlib. Several compression algorithms
exist, but we are concentrating on LZW.
• The LZW algorithm is a very common compression technique. It is
typically used in GIF, optionally in PDF and TIFF, and in Unix's
'compress' command, among other uses. It is lossless, meaning no data
is lost when compressing. The algorithm is simple to implement and has
the potential for very high throughput in hardware implementations.
The idea relies on recurring patterns to save data space. LZW is a
foremost technique for general-purpose data compression due to its
simplicity and versatility. It is the basis of many PC utilities that
claim to “double the capacity of your hard drive”.
How LZW (Lempel–Ziv–Welch) Works
• LZW compression works by reading a sequence of symbols, grouping
the symbols into strings, and converting the strings into codes.
Because the codes take up less space than the strings they replace, we
get compression. Characteristic features of LZW include:
• LZW compression uses a code table, with 4096 as a common choice for the
number of table entries. Codes 0-255 in the code table are always assigned to
represent single bytes from the input file.
• When encoding begins the code table contains only the first 256 entries, with
the remainder of the table being blanks. Compression is achieved by using
codes 256 through 4095 to represent sequences of bytes.
• As the encoding continues, LZW identifies repeated sequences in the data,
and adds them to the code table.
• Decoding is achieved by taking each code from the compressed file and
translating it through the code table to find what character or characters it
represents.
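The decoding step described in the last bullet can be sketched in Python. This is a minimal illustration, not a production implementation (it assumes an initial 256-entry single-byte table and unbounded table growth); the one subtlety is the case where the decoder receives a code it has not yet added to its own table:

```python
def lzw_decode(codes):
    """Rebuild the original text from a list of LZW codes."""
    table = {i: chr(i) for i in range(256)}   # codes 0-255: single bytes
    next_code = 256                           # first free table entry
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:
            # Special case: the encoder emitted a code it had only just
            # created, so the entry must be the previous string plus
            # its own first character.
            entry = prev + prev[0]
        out.append(entry)
        table[next_code] = prev + entry[0]    # mirror the encoder's table
        next_code += 1
        prev = entry
    return "".join(out)
```

For example, the codes `[65, 66, 256, 258]` (produced by encoding `"ABABABA"`) decode back to `"ABABABA"`, and the last code exercises the special case.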
• Example: ASCII code. Typically, every character is stored with 8
binary bits, allowing up to 256 unique symbols for the data. This
algorithm tries to extend the library to 9- to 12-bit codes. The new
unique symbols are made up of combinations of symbols that occurred
previously in the string. It does not always compress well, especially
with short, diverse strings, but it is good for compressing redundant
data, and it does not have to store the new dictionary with the data:
the method can both compress and uncompress data. There are excellent
articles written up already; for more depth, Mark Nelson's article is
commendable.
• Implementation
• The idea of the compression algorithm is the following: as the input
data is being processed, a dictionary keeps a correspondence between
the longest encountered words and a list of code values. The words are
replaced by their corresponding codes and so the input file is
compressed. Therefore, the efficiency of the algorithm increases as the
number of long, repetitive words in the input data increases.
LZW ENCODING
• PSEUDOCODE
    initialize table with single-character strings
    P = first input character
    WHILE not end of input stream
        C = next input character
        IF P + C is in the string table
            P = P + C
        ELSE
            output the code for P
            add P + C to the string table
            P = C
    END WHILE
    output the code for P
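The pseudocode above can be rendered as a minimal Python sketch (assuming an initial 256-entry ASCII table and no limit on table size):

```python
def lzw_encode(text):
    """LZW-encode a string, following the pseudocode above."""
    table = {chr(i): i for i in range(256)}   # single-character strings
    next_code = 256                           # next free code
    p = ""
    out = []
    for c in text:
        if p + c in table:
            p = p + c                         # extend the current match
        else:
            out.append(table[p])              # output the code for P
            table[p + c] = next_code          # add P + C to the table
            next_code += 1
            p = c
    if p:
        out.append(table[p])                  # output code for the last P
    return out
```

For example, `lzw_encode("ABABABA")` yields `[65, 66, 256, 258]`: the single characters A and B, then the newly created codes for "AB" and "ABA".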
Problems:
• Suppose we want to encode the Oxford Concise English dictionary, which contains about 159,000
entries. Why not just transmit each word as an 18-bit number (2^18 = 262,144 > 159,000)?
• Problems:
• Too many bits,
• everyone needs a dictionary,
• only works for English text.
• Solution: Find a way to build the dictionary adaptively.
• The original methods are due to Ziv and Lempel in 1977 and 1978. Terry Welch improved the scheme in
1984 (the result is called LZW compression).
• It is used in UNIX compress -- a 1D token stream (similar to below).
• It is used in GIF compression -- 2D window tokens (treating the image as in the Huffman coding example above).
The LZW Compression Algorithm can be
summarised as follows:
    w = NIL;
    while ( read a character k )
    {
        if wk exists in the dictionary
            w = wk;
        else
        {
            output the code for w;
            add wk to the dictionary;
            w = k;
        }
    }
    output the code for w;
• Original LZW used dictionary with 4K entries, first 256 (0-255) are ASCII codes.
Example:
• Input string is "^WED^WE^WEE^WEB^WET".
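The codes for this string can be checked with a compact, self-contained version of the encoder (assuming the 0-255 single-byte initial table described earlier; codes from 256 upward are assigned in order of creation):

```python
def lzw_encode(text):
    table = {chr(i): i for i in range(256)}   # initial single-char table
    p, out = "", []
    for c in text:
        if p + c in table:
            p += c                            # keep extending the match
        else:
            out.append(table[p])              # emit code for w
            table[p + c] = len(table)         # next free code: 256, 257, ...
            p = c
    out.append(table[p])                      # emit the final code
    return out

print(lzw_encode("^WED^WE^WEE^WEB^WET"))
# -> [94, 87, 69, 68, 256, 69, 260, 261, 257, 66, 260, 84]
```

Here 94, 87, 69, 68, 66 and 84 are the ASCII codes for `^`, `W`, `E`, `D`, `B` and `T`, while 256, 257, 260 and 261 are the dictionary entries created for `^W`, `WE`, `^WE` and `E^` during encoding.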