
DATA COMPRESSION (RCS 087)

Unit-I
Introduction
• Data Compression
• Compression Techniques
• Lossless compression
• Lossy Compression
• Measures of performance
• Modeling and coding
• Mathematical Preliminaries for Lossless compression
• A brief introduction to information theory
• Models: Physical models
• Probability models
• Markov models
• Composite source model
• Coding: uniquely decodable codes
• Prefix codes
Data Compression

• Data compression is also referred to as bit-rate reduction or source coding. This technique is used to reduce the size of large files.
• Data compression (DC) is a digital signal processing technique in which the data to be transmitted is compressed to reduce the number of bits it occupies in storage. In other words, data takes up less storage space than usual after compression is applied. Data compression greatly reduces the required storage space and transmission capacity. Database management systems, backup utilities, and similar software use data compression widely. There are many file compression formats, but ZIP and ARC are among the best known.
Data Compression
• Data compression is the process of modifying, encoding, or converting the bit structure of data in such a way that it consumes less space on disk.
• It reduces the storage size of one or more data instances or elements. Data compression is also known as source coding or bit-rate reduction.
• The advantage of data compression is that it saves disk space and reduces data transmission time.
Data Compression
Compression Techniques

There are mainly two types of data compression techniques –


1.Lossless Data Compression
2.Lossy Data Compression
Compression Techniques
Lossless data compression

• Lossless data compression is used to compress files without losing the original file's quality or data. Simply put, in lossless data compression the file size is reduced, but the quality of the data remains the same.
• The main advantage of lossless data compression is that we can restore the original data in its original form after decompression.
• Lossless data compression is mainly used for sensitive documents and confidential information, and in file formats such as PNG, RAW, GIF, and BMP:
• GIF (Graphics Interchange Format)
• Bitmap 
Lossless data compression
Some of the most important lossless data compression techniques are:
1.Run Length Encoding (RLE)
2.Lempel Ziv - Welch (LZW)
3.Huffman Coding
4.Arithmetic Coding
Lossy data compression

• Lossy data compression is used to compress larger files into smaller ones. In this compression technique, a certain amount of data and quality is removed (lost) from the original file. The compressed file takes up less memory than the original because some of the original data and quality is discarded. This technique is generally useful when the quality of the data is not our first priority.
• Lossy data compression is most widely used in JPEG images, MPEG video, and MP3 audio formats.
Lossy data compression
Some important lossy data compression techniques are:
1.Transform coding
2.Discrete Cosine Transform (DCT)
3.Discrete Wavelet Transform (DWT)
Difference between lossless and lossy data compression

• As we know, both lossless and lossy data compression techniques are used to reduce data from its original size. The main difference is that losslessly compressed data can be restored to its original form after decompression, whereas lossy-compressed data cannot.
• The table below shows the differences between lossless and lossy data compression:
Difference between lossless and lossy data compression

S.No | Lossless data compression | Lossy data compression
1. | There is no loss of any data or quality. | There is a loss of quality and data, which cannot be recovered.
2. | The file is restored in its original form. | The file is not restored in its original form.
3. | Algorithms: Run Length Encoding, Huffman encoding, Shannon-Fano encoding, Arithmetic encoding, Lempel-Ziv-Welch encoding, etc. | Algorithms: Transform coding, Discrete Cosine Transform, Discrete Wavelet Transform, fractal compression, etc.
4. | Mainly used to compress text, sound, and images. | Mainly used to compress audio, video, and images.
5. | Compared with lossy compression, it retains more of the data. | Compared with lossless compression, it retains less of the data.
6. | File quality is high. | File quality is lower.
7. | Mainly supports RAW, BMP, PNG, WAV, FLAC, and ALAC file types. | Mainly supports JPEG, GIF, MP3, MP4, MKV, and OGG file types.
Run Length Encoding (RLE)

• Run-length encoding (RLE) is a form of lossless data compression in which runs of data (sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run. It is most efficient on data that contains many such runs, for example simple graphic images such as icons, line drawings, Conway's Game of Life, and animations. For files that do not have many runs, RLE can actually increase the file size.
Example

• Consider a screen containing plain black text on a solid white background. There will be many long runs of white pixels in the blank space, and many short runs of black pixels within the text. A hypothetical scan line, with B representing a black pixel and W representing white, might read as follows:
• WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW
Example
• With a run-length encoding (RLE) data compression algorithm applied
to the above hypothetical scan line, it can be rendered as follows:
• 12W1B12W3B24W1B14W
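
For illustration, here is a minimal Python sketch of a run-length encoder in the count-then-symbol style used above (the function name rle_encode is ours, not from the slides); it reproduces the encoded scan line exactly:

    def rle_encode(data: str) -> str:
        # Encode runs as <count><symbol>, e.g. "WWWB" -> "3W1B".
        if not data:
            return ""
        out = []
        run_char, run_len = data[0], 1
        for ch in data[1:]:
            if ch == run_char:
                run_len += 1                        # extend the current run
            else:
                out.append(f"{run_len}{run_char}")  # close the finished run
                run_char, run_len = ch, 1
        out.append(f"{run_len}{run_char}")          # close the last run
        return "".join(out)

    line = "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW"
    print(rle_encode(line))  # prints: 12W1B12W3B24W1B14W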
Huffman Coding
• Huffman coding is a lossless data compression algorithm. The idea is
to assign variable-length codes to input characters, lengths of the
assigned codes are based on the frequencies of corresponding
characters. The most frequent character gets the smallest code and the
least frequent character gets the largest code.
The variable-length codes assigned to input characters are Prefix Codes, meaning the codes (bit sequences) are assigned in such a way that the code assigned to one character is never a prefix of the code assigned to any other character. This is how Huffman Coding makes sure that there is no ambiguity when decoding the generated bitstream.
Huffman Coding
• Let us understand prefix codes with a counterexample. Let there be
four characters a, b, c and d, and their corresponding variable length
codes be 00, 01, 0 and 1. This coding leads to ambiguity because code
assigned to c is the prefix of codes assigned to a and b. If the
compressed bit stream is 0001, the de-compressed output may be
“cccd” or “ccb” or “acd” or “ab”.
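
To make the ambiguity concrete, here is a tiny brute-force sketch (ours, not from the slides) that enumerates every possible parse of the bitstream 0001 under these codes; it finds the decodings listed above plus a fifth one ("cad"):

    def parses(bits, codes, prefix=()):
        # Yield every way to split `bits` into a sequence of code words.
        if not bits:
            yield prefix
            return
        for sym, code in codes.items():
            if bits.startswith(code):
                yield from parses(bits[len(code):], codes, prefix + (sym,))

    codes = {"a": "00", "b": "01", "c": "0", "d": "1"}
    print(sorted("".join(p) for p in parses("0001", codes)))
    # ['ab', 'acd', 'cad', 'ccb', 'cccd'] -- five decodings, so the code is ambiguous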
There are mainly two major parts in Huffman Coding
1.Build a Huffman Tree from input characters.
2.Traverse the Huffman Tree and assign codes to characters.
Steps to build Huffman Tree
• Input is an array of unique characters along with their frequencies of occurrence; output is the corresponding Huffman Tree.
1.Create a leaf node for each unique character and build a min heap of all leaf nodes. (A min heap is used as a priority queue; the value of the frequency field is used to compare two nodes. Initially, the least frequent character is at the root.)
2.Extract the two nodes with the minimum frequency from the min heap.
3.Create a new internal node with a frequency equal to the sum of the two nodes' frequencies. Make the first extracted node its left child and the other extracted node its right child. Add this new node to the min heap.
4.Repeat steps #2 and #3 until the heap contains only one node. The remaining node is the root node and the tree is complete.
Huffman Coding
• Let us understand the algorithm with an example:
• character Frequency
• a 5
• b 9
• c 12
• d 13
• e 16
• f 45
Huffman Coding
• Step 1: Build a min heap that contains 6 nodes, where each node represents the root of a tree with a single node.
• Step 2: Extract the two minimum-frequency nodes (a and b) from the min heap. Add a new internal node with frequency 5 + 9 = 14.
Huffman Coding
• Now the min heap contains 5 nodes, where 4 nodes are roots of trees with a single element each, and one heap node is the root of a tree with 3 elements.

• character Frequency
• c 12
• d 13
• Internal Node 14
• e 16
• f 45
• Step 3: Extract the two minimum-frequency nodes from the heap. Add a new internal node with frequency 12 + 13 = 25.
• Now the min heap contains 4 nodes, where 2 nodes are roots of trees with a single element each, and two heap nodes are roots of trees with more than one node.
• character Frequency
• Internal Node 14
• e 16
• Internal Node 25
• f 45
• Step 4: Extract the two minimum-frequency nodes. Add a new internal node with frequency 14 + 16 = 30.
• Now the min heap contains 3 nodes.
• character Frequency
• Internal Node 25
• Internal Node 30
• f 45
• Step 5: Extract the two minimum-frequency nodes. Add a new internal node with frequency 25 + 30 = 55.
• Now the min heap contains 2 nodes.
• character Frequency
• f 45
• Internal Node 55
• Step 6: Extract the two minimum-frequency nodes. Add a new internal node with frequency 45 + 55 = 100.
• Now the min heap contains only one node.
• character Frequency
• Internal Node 100
• Since the heap contains only one node, the algorithm stops here.
Steps to print codes from Huffman Tree:
Traverse the tree starting from the root, maintaining an auxiliary array. While moving to the left child, write 0 to the array; while moving to the right child, write 1. Print the array whenever a leaf node is encountered.
• The codes are as follows:
• character code-word
• f 0
• c 100
• d 101
• a 1100
• b 1101
• e 111
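
As an illustration, here is a minimal Python sketch (ours, not part of the original slides) of the whole procedure: build the tree with a min heap, then walk it to print the codes. With this particular tie-breaking it reproduces the code table above; other valid tie-breaking choices yield different but equally optimal codes.

    import heapq

    def huffman_codes(freqs):
        # Min heap of (frequency, tiebreak, tree) entries; a tree is either
        # a bare symbol (leaf) or a (left, right) pair (internal node).
        heap = [(f, i, sym) for i, (sym, f) in enumerate(sorted(freqs.items()))]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)    # minimum-frequency node
            f2, _, right = heapq.heappop(heap)   # next minimum-frequency node
            heapq.heappush(heap, (f1 + f2, count, (left, right)))
            count += 1                           # new internal node, freq f1 + f2
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):
                walk(node[0], prefix + "0")      # left edge: write 0
                walk(node[1], prefix + "1")      # right edge: write 1
            else:
                codes[node] = prefix or "0"      # leaf: record the code word
        walk(heap[0][2], "")
        return codes

    print(huffman_codes({"a": 5, "b": 9, "c": 12, "d": 13, "e": 16, "f": 45}))
    # {'f': '0', 'c': '100', 'd': '101', 'a': '1100', 'b': '1101', 'e': '111'}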
LZW (Lempel–Ziv–Welch)
• There are two categories of compression techniques, lossy and
lossless. Whilst each uses different techniques to compress files, both
have the same aim: To look for duplicate data in the graphic (GIF for
LZW) and use a much more compact data representation. Lossless
compression reduces bits by identifying and eliminating statistical
redundancy. No information is lost in lossless compression. On the
other hand, Lossy compression reduces bits by removing unnecessary
or less important information. So we need Data Compression mainly
because:
LZW (Lempel–Ziv–Welch)
• Uncompressed data can take up a lot of space, which is not good for limited hard drive space and internet download speeds.
• While hardware gets better and cheaper, algorithms that reduce data size also help technology evolve.
• Example: one minute of uncompressed HD video can be over 1 GB. How can we fit a two-hour film on a 25 GB Blu-ray disc?
• Lossy compression methods include DCT (Discrete Cosine Transform), Vector Quantisation, and Transform Coding, while lossless compression methods include RLE (Run Length Encoding), string-table compression, LZW (Lempel–Ziv–Welch), and zlib. Several compression algorithms exist, but here we are concentrating on LZW.
LZW (Lempel–Ziv–Welch)
• The LZW algorithm is a very common compression technique. It is typically used in GIF, optionally in PDF and TIFF, and in Unix's 'compress' command, among other uses. It is lossless, meaning no data is lost when compressing. The algorithm is simple to implement and has the potential for very high throughput in hardware implementations. The idea relies on recurring patterns to save data space. LZW is a foremost technique for general-purpose data compression due to its simplicity and versatility. It is the basis of many PC utilities that claim to "double the capacity of your hard drive".
LZW (Lempel–Ziv–Welch) how to work
• LZW compression works by reading a sequence of symbols, grouping the symbols into strings, and converting the strings into codes. Because the codes take up less space than the strings they replace, we get compression. Characteristic features of LZW include:
• LZW compression uses a code table, with 4096 as a common choice for the
number of table entries. Codes 0-255 in the code table are always assigned to
represent single bytes from the input file.
• When encoding begins the code table contains only the first 256 entries, with
the remainder of the table being blanks. Compression is achieved by using
codes 256 through 4095 to represent sequences of bytes.
• As the encoding continues, LZW identifies repeated sequences in the data,
and adds them to the code table.
• Decoding is achieved by taking each code from the compressed file and
translating it through the code table to find what character or characters it
represents.
• Example: ASCII code. Typically, every character is stored with 8 binary bits, allowing up to 256 unique symbols for the data. This algorithm tries to extend the library to 9- to 12-bit codes. The new unique symbols are made up of combinations of symbols that occurred previously in the string. It does not always compress well, especially with short, diverse strings, but it is good for compressing redundant data, and it does not have to save the new dictionary with the data: this method can both compress and uncompress data.
Excellent articles on LZW have already been written; for a more in-depth treatment, Mark Nelson's article is commendable.
• Implementation
• The idea of the compression algorithm is the following: as the input
data is being processed, a dictionary keeps a correspondence between
the longest encountered words and a list of code values. The words are
replaced by their corresponding codes and so the input file is
compressed. Therefore, the efficiency of the algorithm increases as the
number of long, repetitive words in the input data increases.
LZW ENCODING
• PSEUDOCODE
• Initialize the table with single-character strings
• P = first input character
• WHILE not end of input stream
•     C = next input character
•     IF P + C is in the string table
•         P = P + C
•     ELSE
•         output the code for P
•         add P + C to the string table
•         P = C
• END WHILE
• output the code for P
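
Here is a minimal runnable Python sketch of this encoder (our illustration; the function name lzw_encode and the 256-entry initial table are assumptions matching the description above):

    def lzw_encode(data: str) -> list:
        # Table starts with all single-byte strings (codes 0-255).
        table = {chr(i): i for i in range(256)}
        next_code = 256
        p = ""
        out = []
        for c in data:
            if p + c in table:
                p = p + c                  # keep extending the current match
            else:
                out.append(table[p])       # emit code for the longest known string
                table[p + c] = next_code   # learn the new string P + C
                next_code += 1
                p = c
        if p:
            out.append(table[p])           # flush the final match
        return out

    print(lzw_encode("^WED^WE^WEE^WEB^WET"))
    # [94, 87, 69, 68, 256, 69, 260, 261, 257, 66, 260, 84]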
Problems:
• The LZW algorithm is a very common compression technique.
• Suppose we want to encode the Concise Oxford English Dictionary, which contains about 159,000 entries. Why not just transmit each word as an 18-bit number (2^18 = 262,144, enough for every entry)?
• Problems:
• too many bits per word,
• everyone needs a copy of the dictionary,
• it only works for English text.
• Solution: find a way to build the dictionary adaptively.
• The original methods are due to Ziv and Lempel in 1977 and 1978; Terry Welch improved the scheme in 1984 (the result is called LZW compression).
• It is used in UNIX compress -- a 1D token stream (similar to below).
• It is used in GIF compression -- 2D window tokens (treating the image as with Huffman Coding above).
The LZW Compression Algorithm can be summarised as follows:
• w = NIL;
• while ( read a character k )
• {
•     if wk exists in the dictionary
•         w = wk;
•     else
•     {
•         output the code for w;
•         add wk to the dictionary;
•         w = k;
•     }
• }
• output the code for w;
• The original LZW used a dictionary with 4K entries; the first 256 (0-255) are the ASCII codes.
Example:
• Input string is "^WED^WE^WEE^WEB^WET".

      w     k     output  index  symbol
      ---------------------------------
      NIL   ^
      ^     W     ^       256    ^W
      W     E     W       257    WE
      E     D     E       258    ED
      D     ^     D       259    D^
      ^     W
      ^W    E     256     260    ^WE
      E     ^     E       261    E^
      ^     W
      ^W    E
      ^WE   E     260     262    ^WEE
      E     ^
      E^    W     261     263    E^W
      W     E
      WE    B     257     264    WEB
      B     ^     B       265    B^
      ^     W
      ^W    E
      ^WE   T     260     266    ^WET
      T     EOF   T

(The rows with an empty output column are steps where wk was already in the dictionary, so the encoder only extended w.)
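
The decoder rebuilds the same dictionary while reading codes, one step behind the encoder. A minimal Python sketch (ours; lzw_decode is an assumed name, paired with the lzw_encode sketch above):

    def lzw_decode(codes: list) -> str:
        # Start from the same 256 single-byte entries as the encoder.
        table = {i: chr(i) for i in range(256)}
        next_code = 256
        prev = table[codes[0]]
        out = [prev]
        for code in codes[1:]:
            if code in table:
                entry = table[code]
            else:
                entry = prev + prev[0]          # code not yet in table: the cScSc corner case
            out.append(entry)
            table[next_code] = prev + entry[0]  # learn what the encoder learned
            next_code += 1
            prev = entry
        return "".join(out)

    print(lzw_decode([94, 87, 69, 68, 256, 69, 260, 261, 257, 66, 260, 84]))
    # ^WED^WE^WEE^WEB^WET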
