IICT Unit One
Lossless Coding
Introduction to Information Theory.
This introductory topic covers the foundational concepts that underpin the entire field of information theory. Key points include:
• Historical context and the key figures who developed the field (e.g., Claude Shannon).
• Communication Systems: How information theory helps design efficient and reliable communication channels.
• Machine Learning: Using information theory for model selection, feature extraction, and other tasks.
• Other Fields: Exploring the applications of information theory in areas like linguistics, biology, and physics.
Let's start with the first topic in Unit-I: Uncertainty & Information.
In information theory, uncertainty and information are closely intertwined. The more
uncertain we are about something, the more information we gain when we learn about it.
Let's explore how we can measure uncertainty and relate it to information content.
Measuring Uncertainty
• Example: Imagine a bag with 100 balls, and only one of them is red. If you randomly
pick a ball, there's a high degree of uncertainty about whether you'll get the red one.
But if 99 balls were red, there's much less uncertainty.
The information content of an event is inversely related to its probability. The less
likely an event is, the more information it carries.
• Example: If you receive a message saying "the sun will rise tomorrow," it carries very
little information because it's a highly probable event. But a message saying "it will
snow in the Sahara desert tomorrow" carries much more information because it's a
very improbable event.
Mathematical Representation
We can express the information content I(E) of an event E with probability P(E) using the following formula:
I(E) = log₂(1 / P(E)) = -log₂ P(E)
• This formula shows that as the probability of an event decreases, its information
content increases.
• Bits: When using base-2 logarithm (log₂), information is measured in bits. This is the
most common unit, especially in computer science and digital communications, as it
aligns with binary code (0s and 1s).
• Nats: When using the natural logarithm (ln), information is measured in nats. Nats
are often used in theoretical derivations and mathematical analysis due to the
properties of the natural logarithm.
The general formula for information content I(E) of an event E with probability P(E) is:
I(E) = -log_b P(E)
where the base b of the logarithm determines the unit (b = 2 for bits, b = e for nats).
Conversions:
You can convert between these units using the following relationships:
• 1 nat = log₂(e) ≈ 1.443 bits, and 1 bit = ln(2) ≈ 0.693 nats.
• 1 hartley (base-10 logarithm) = log₂(10) ≈ 3.322 bits.
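These relationships can be checked with a short Python sketch (an illustrative addition, not part of the original notes); the probability value is just an assumed example:
import math

def information_bits(p):
    # information content -log2(p), in bits, of an event with probability p
    return -math.log2(p)

def information_nats(p):
    # information content -ln(p), in nats, of the same event
    return -math.log(p)

p = 0.01                      # an assumed, fairly improbable event
bits = information_bits(p)
nats = information_nats(p)
print(f"I = {bits:.3f} bits = {nats:.3f} nats")
print(f"check: {nats * math.log2(math.e):.3f} bits (1 nat = log2(e) bits)")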
Mutual Information
Imagine you have two random variables, X and Y. Mutual information, denoted as I(X;Y), measures the reduction in uncertainty about X when we know Y (or vice versa).
• Example: Let X represent the weather (sunny, rainy) and Y represent whether you
carry an umbrella (yes, no). Intuitively, there's some shared information between X
and Y. Knowing it's raining (Y=yes) tells you something about the weather (X=rainy),
and vice versa. Mutual information quantifies this relationship.
Mathematical Definition
I(X;Y) = H(X) - H(X|Y)
where:
• H(X) is the entropy of X (the uncertainty about X on its own), and
• H(X|Y) is the conditional entropy of X given Y (uncertainty about X after knowing Y).
Interpretation
• I(X;Y) > 0: This means X and Y share some information. Knowing one reduces
uncertainty about the other.
• I(X;Y) = 0: This indicates X and Y are independent. Knowing one tells you nothing
about the other.
• Symmetry: I(X;Y) = I(Y;X) - the information shared is the same regardless of which
variable you consider first.
• Units: Mutual information is typically measured in bits (or nats, hartleys, as discussed
earlier).
It makes sense to discuss entropy before going further into mutual information, as it forms the foundation for understanding mutual information. Let's explore entropy in detail.
Entropy
Understanding Entropy
• High Entropy: If a random variable has many equally likely outcomes, it has high
entropy. This means there's a lot of uncertainty about what the outcome will be.
o Example: A fair coin toss has high entropy because both heads and tails are
equally likely.
• Low Entropy: If a random variable has one highly probable outcome and other less
likely outcomes, it has low entropy. There's less uncertainty because one outcome is
more likely.
o Example: A biased coin that almost always lands on heads has low entropy.
Calculating Entropy
For a discrete random variable X with possible outcomes x₁, x₂, ..., xₙ and
corresponding probabilities P(x₁), P(x₂), ..., P(xₙ), the entropy H(X) is calculated as:
H(X) = - Σ P(xᵢ) log₂ P(xᵢ), summed over i = 1, ..., n
where:
• log₂ is the base-2 logarithm (you can use other bases for different units like nats).
Interpretation
• Units: Entropy is typically measured in bits, reflecting the average number of bits
needed to represent the information content of the random variable.
Example
For a fair coin with P(Heads) = P(Tails) = 0.5:
H(X) = -[0.5 log₂(0.5) + 0.5 log₂(0.5)]
= -[(-0.5) + (-0.5)]
= 1 bit
This means the entropy of a fair coin toss is 1 bit, reflecting the maximum uncertainty
for a binary outcome.
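As a small illustrative sketch (assumed code, not from the original notes), the entropy formula can be computed directly in Python; the fair-coin case reproduces the 1-bit result above:
import math

def entropy_bits(probs):
    # H(X) = -sum p*log2(p), skipping zero-probability outcomes
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy_bits([0.9, 0.1]))   # biased coin: about 0.469 bits
print(entropy_bits([0.25] * 4))   # four equally likely outcomes: 2.0 bits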
Let's revisit Mutual Information now that we've refreshed our understanding of entropy.
Recall the definition: I(X;Y) = H(X) - H(X|Y)
where:
• H(X): This is the entropy of X, representing the uncertainty about X before we know anything about Y.
• H(X|Y): This is the conditional entropy of X given Y, representing the uncertainty about X that remains after we know Y.
Interpretation
• I(X;Y) = 0: If the mutual information is zero, it means H(X) = H(X|Y). In other words,
knowing Y doesn't reduce the uncertainty about X at all. This implies X and Y are
independent.
• I(X;Y) > 0: If the mutual information is positive, it means H(X) > H(X|Y). Knowing Y
reduces the uncertainty about X. The larger the mutual information, the more
information X and Y share.
Mutual information can also be written in terms of the joint entropy:
I(X;Y) = H(X) + H(Y) - H(X,Y)
where:
• H(X,Y) is the joint entropy of X and Y, representing the uncertainty about both X and Y taken together.
Example
• X: A binary random variable representing whether it's raining (1) or not (0).
• Y: A binary random variable representing whether you carry an umbrella (1) or not
(0).
• P(X=0, Y=0) = 0.3 (not raining and you don't carry an umbrella)
If you have conditional probabilities or the joint distribution, you can calculate the joint probability P(X, Y).
If you have the conditional probability P(Y|X), which is the probability of Y given X, you can use the chain rule:
P(X, Y) = P(Y|X) · P(X)   (and, symmetrically, P(X, Y) = P(X|Y) · P(Y))
Example:
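For instance, with the assumed (purely illustrative) values P(X=1) = 0.4 and P(Y=1|X=1) = 0.8, the joint probability is P(X=1, Y=1) = P(Y=1|X=1) · P(X=1) = 0.8 × 0.4 = 0.32.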
If you have the joint distribution, which specifies the probabilities for all possible
combinations of X and Y, then you directly have the joint probabilities.
Example: for two binary variables, the joint distribution can be laid out as a 2×2 table with rows X = 0, 1 and columns Y = 0, 1, one probability per cell.
Okay, let's proceed with the next topic in Unit-I: Average Mutual Information.
While mutual information I(X;Y) quantifies the information shared between two
random variables for a specific pair of outcomes, average mutual information
extends this concept to consider the average shared information over all possible
outcomes of X and Y.
Mathematical Definition
I(X;Y) = Σ P(x,y) log₂ [ P(x,y) / (P(x) P(y)) ], summed over all pairs (x, y)
where:
• P(x,y) is the joint probability of X=x and Y=y.
• P(x) and P(y) are the marginal probabilities of X=x and Y=y, respectively.
Interpretation
• It tells us, on average, how much information is shared between the two variables
across all possible outcomes.
Average mutual information can be seen as the weighted average of the mutual
information for each pair of outcomes, where the weights are the joint probabilities.
Okay, let's work through an example to calculate average mutual information. We'll use the rain and umbrella scenario again, with a joint probability distribution laid out as a 2×2 table over X ∈ {0, 1} and Y ∈ {0, 1}; a short computational sketch with an assumed distribution follows below.
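Here is a small Python sketch of that calculation; the joint probabilities are assumed values chosen only for illustration, since the original table is not reproduced in these notes:
import math

# assumed joint probabilities P(x, y) for rain (X) and umbrella (Y)
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# marginal probabilities P(x) and P(y)
px = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
py = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}

# I(X;Y) = sum of P(x,y) * log2( P(x,y) / (P(x) P(y)) ) over all pairs
mi = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)
print(f"I(X;Y) = {mi:.3f} bits")   # about 0.278 bits for this distribution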
Entropy
Recall that for a discrete random variable X with possible outcomes x₁, x₂, ..., xₙ and corresponding probabilities P(x₁), P(x₂), ..., P(xₙ), the entropy H(X) is calculated as:
H(X) = - Σ P(xᵢ) log₂ P(xᵢ), summed over i = 1, ..., n
where:
• log₂ is the base-2 logarithm (you can use other bases for different units like nats).
Properties of Entropy
1. Non-negativity: H(X) ≥ 0. Entropy is always non-negative. It's zero only when the
random variable has a single outcome with probability 1 (no uncertainty).
2. Maximum Entropy: For a discrete random variable with n possible outcomes, the
maximum entropy occurs when all outcomes are equally likely. In this case, H(X) =
log₂(n).
Let's calculate the entropy of a biased coin that lands heads with probability p and
tails with probability (1-p).
• Outcomes: Heads (H) and Tails (T), with probabilities p and (1 - p)
• Entropy: H(X) = -p log₂(p) - (1 - p) log₂(1 - p)
This is the binary entropy function: it is 0 when p = 0 or p = 1 (no uncertainty) and reaches its maximum of 1 bit at p = 0.5, consistent with the maximum-entropy property above.
Okay, let's move on to the next topic in Unit-I: Relative Entropy (Kullback-Leibler
Divergence).
Relative entropy, also called the Kullback-Leibler (KL) divergence and denoted D(P||Q), measures how much a probability distribution P diverges from a reference distribution Q.
• Asymmetry: It's important to note that relative entropy is not symmetric. That is, D(P||Q) is generally not equal to D(Q||P). This means the "distance" from P to Q is not the same as the "distance" from Q to P.
Mathematical Definition
For discrete probability distributions P and Q, the relative entropy (or KL divergence)
from Q to P, denoted as D(P||Q), is calculated as:
D(P||Q) = Σ P(x) log₂ [ P(x) / Q(x) ], summed over all outcomes x
where:
• P(x) and Q(x) are the probabilities assigned to outcome x by P and Q, respectively.
Interpretation
• D(P||Q) ≥ 0: Relative entropy is always non-negative. It's zero only when P and Q are
identical.
• Units: Relative entropy is typically measured in bits (or nats, depending on the base
of the logarithm).
Applications
• Information theory: Quantifying coding inefficiency when using a code optimized for
one distribution to encode data from another distribution.
Okay, let's solve the example of calculating relative entropy (KL divergence) for the
coin toss scenario.
We want to calculate D(P||Q), the relative entropy from Q to P. This will tell us how
much information is lost when we use the biased coin model (Q) to approximate the
fair coin (P).
Calculation
Suppose P is the fair coin with P(H) = P(T) = 0.5, and Q is a biased coin with Q(H) = 0.8 and Q(T) = 0.2.
Expanding the summation for the two possible outcomes (Heads and Tails):
D(P||Q) = 0.5 log₂(0.5 / 0.8) + 0.5 log₂(0.5 / 0.2)
≈ -0.3390 + 0.6610
≈ 0.3219 bits
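The same calculation in a short Python sketch (illustrative code, not from the original notes):
import math

def kl_divergence_bits(p, q):
    # D(P||Q) = sum p*log2(p/q) over outcomes with p > 0
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(f"{kl_divergence_bits([0.5, 0.5], [0.8, 0.2]):.4f} bits")  # about 0.3219
print(f"{kl_divergence_bits([0.8, 0.2], [0.5, 0.5]):.4f} bits")  # a different value: D is not symmetric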
Unit-I: Extension of an Information Source and Markov Source.
So far, we've mostly dealt with single random variables or events. However, in many
real-world scenarios, we encounter sequences of symbols or events generated by an
information source.
Example:
Consider a simple source that produces binary symbols (0 and 1) with equal
probability.
• One Symbol: Each symbol carries H(X) = 1 bit of information.
• Two Symbols: If we consider two consecutive symbols, there are four possible outcomes (00, 01, 10, 11), each with probability 1/4. The entropy of this two-symbol sequence is 2 bits.
In general, the n-th extension of a memoryless source (blocks of n symbols treated as single super-symbols) has entropy n·H(X).
Markov Source
A Markov source is a special type of information source where the probability of generating a
particular symbol depends only on the previous symbol (or a finite number of previous
symbols). This "memory" property makes Markov sources suitable for modeling various real-
world phenomena.
• Order: The order of a Markov source refers to the number of previous symbols that
influence the current symbol. A first-order Markov source depends only on the
immediately preceding symbol.
• State Diagram: Markov sources can be represented using state diagrams, where each
state represents a possible symbol, and the transitions between states represent the
conditional probabilities of generating one symbol given the previous one.
Applications
Markov sources are used to model sources with memory, such as natural-language text (where letter probabilities depend on preceding letters), speech, and other sequential data, and they underpin many predictive compression schemes.
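Here is a minimal Python sketch of a first-order binary Markov source (the transition probabilities are assumed values chosen only for illustration); the entropy rate weights each state's conditional entropy by that state's stationary probability:
import math

# assumed transition matrix: T[i][j] = P(next symbol = j | current symbol = i)
T = [[0.9, 0.1],
     [0.4, 0.6]]

# stationary distribution pi, found by repeatedly applying pi <- pi T
pi = [0.5, 0.5]
for _ in range(1000):
    pi = [sum(pi[i] * T[i][j] for i in range(2)) for j in range(2)]

def h_bits(row):
    # entropy of one row of conditional probabilities
    return -sum(p * math.log2(p) for p in row if p > 0)

entropy_rate = sum(pi[i] * h_bits(T[i]) for i in range(2))
print("stationary distribution:", [round(x, 3) for x in pi])   # about [0.8, 0.2]
print(f"entropy rate: {entropy_rate:.3f} bits/symbol")          # below 1 bit: memory makes the source more predictable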
Okay, let's move on to the next topic in Unit-I: Maximum Entropy Principle.
The Maximum Entropy Principle states that when we have some prior information
about a probability distribution (e.g., constraints on its moments or expected values),
the distribution that best represents our current state of knowledge is the one with
the maximum entropy among all distributions that satisfy the constraints.
• Maximum Entropy: Among all the possible distributions that satisfy our prior
information, the Maximum Entropy Principle suggests choosing the one with the
highest entropy.
• Justification: This principle is based on the idea that maximizing entropy minimizes
the amount of "hidden" information or assumptions we're making about the
distribution. It's a way to avoid bias and ensure we're not imposing any unnecessary
structure on the data.
Mathematical Formulation
Maximize H(X) = - Σ P(x) log₂ P(x)
subject to:
Σ P(x) = 1, plus any additional constraints from the prior information (for example, a known mean Σ x · P(x) = m)
where:
• P(x) are the probabilities to be determined.
Example
Let's say we know the mean of a discrete random variable X, but we don't know the
full distribution. The Maximum Entropy Principle suggests finding the distribution
that maximizes entropy while satisfying the constraint on the mean. This often leads
to exponential distributions or other distributions with maximum entropy properties.
Applications
The principle is used, for example, in statistical mechanics, natural language processing, spectral estimation, and Bayesian inference, where it provides a principled way to choose a distribution that commits to nothing beyond the stated constraints.
Okay, let's work through a solved example to illustrate the Maximum Entropy
Principle.
Suppose we have a six-sided die, but we don't know if it's fair or biased. However, we
know that the mean (expected value) of the die rolls is 4.5. We want to find the
probability distribution that maximizes entropy while satisfying this constraint.
To solve this constrained optimization problem, we can use the method of Lagrange multipliers. This involves introducing Lagrange multipliers (λ and μ) and forming the Lagrangian function:
L = - Σ P(x) log₂ P(x) + λ (Σ P(x) - 1) + μ (Σ x · P(x) - 4.5)
We take the partial derivatives of L with respect to each P(x), λ, and μ, and set them
to zero:
∂L/∂P(x) = -log₂(P(x)) - 1/ln(2) + λ + μx = 0
∂L/∂λ = Σ P(x) - 1 = 0
∂L/∂μ = Σ x · P(x) - 4.5 = 0
Solving these equations simultaneously can be a bit involved, but it leads to the
following solution:
P(x) = (1/Z) * exp(μx), for x = 1, 2, ..., 6
where the normalization constant Z and the multiplier μ are chosen so that:
• Σ P(x) = 1
• Σ x * P(x) = 4.5
This involves solving some algebraic equations, which can be done numerically.
After solving for μ and Z, we get the probability distribution that maximizes entropy while satisfying the mean constraint. Because the target mean 4.5 is above the fair-die mean of 3.5, the multiplier μ is positive and the distribution is non-uniform, with probabilities that increase monotonically toward the larger faces. A numerical sketch of this solution follows below.
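A numerical sketch of that final step (illustrative Python using plain bisection; the bracketing interval for μ is an assumption that happens to contain the solution):
import math

def mean_for(mu):
    # mean of P(x) = exp(mu*x)/Z over the faces x = 1..6
    weights = [math.exp(mu * x) for x in range(1, 7)]
    z = sum(weights)
    return sum(x * w / z for x, w in zip(range(1, 7), weights))

lo, hi = 0.0, 5.0              # the mean is increasing in mu, so bisection works
for _ in range(100):
    mid = (lo + hi) / 2
    if mean_for(mid) < 4.5:
        lo = mid
    else:
        hi = mid

mu = (lo + hi) / 2
weights = [math.exp(mu * x) for x in range(1, 7)]
Z = sum(weights)
P = [w / Z for w in weights]
print(f"mu = {mu:.4f}")
print("P(x):", [round(p, 4) for p in P])   # probabilities increase toward x = 6
print("mean:", round(sum(x * p for x, p in zip(range(1, 7), P)), 3))   # 4.5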
Okay, let's move on to the next topic in Unit-I: Information Measure of Continuous
Random Variables.
So far, we've focused on discrete random variables, where the outcomes can take on
a finite or countable number of values. However, many real-world phenomena
involve continuous random variables, which can take on any value within a given
range.
Differential Entropy
For a continuous random variable X with probability density function (PDF) f(x), the
differential entropy h(X) is defined as:
h(X) = - ∫ f(x) log₂ f(x) dx
where:
• f(x) is the probability density function of X, and the integral is taken over the support of f(x).
Unlike discrete entropy, differential entropy can be negative.
Properties of Differential Entropy
1. Translation Invariance: h(X + c) = h(X) for any constant c. Shifting the distribution
doesn't change its differential entropy.
2. Scaling: h(aX) = h(X) + log₂(|a|) for any non-zero constant a. Scaling the variable
affects the differential entropy by a logarithmic factor.
3. Maximum Entropy: For a given variance, the Gaussian distribution has the maximum
differential entropy among all continuous distributions.
Example
Let's calculate the differential entropy of a uniform distribution over the interval [0, a], where f(x) = 1/a on that interval:
h(X) = - ∫ from 0 to a of (1/a) log₂(1/a) dx
= log₂(a) bits
Let's work through a solved example to illustrate the concept of differential entropy.
Consider a continuous random variable X that follows a uniform distribution over the
interval [a, b]. This means that X can take on any value between a and b with equal
probability.
f(x) = 1 / (b - a) for a ≤ x ≤ b
0 otherwise
Calculation
Using the formula for differential entropy:
h(X) = - ∫ f(x) log₂ f(x) dx
We need to integrate over the range where f(x) is non-zero, which is [a, b]:
h(X) = - ∫ from a to b of (1/(b - a)) log₂(1/(b - a)) dx
Since the PDF is constant within the interval, we can take it out of the integral:
h(X) = - (1/(b - a)) log₂(1/(b - a)) · (b - a)
Simplifying:
h(X) = - log₂(1/(b - a)) = log₂(b - a) bits
Interpretation
The differential entropy of a uniform distribution over [a, b] is log₂(b - a) bits. This
makes intuitive sense:
• Wider Interval: As the interval [a, b] gets wider, the uncertainty about the value of X
increases, and so does the differential entropy.
• Units: The units are bits, which is consistent with the base-2 logarithm used in the
formula.
This means that a uniform distribution over [0, 1] has zero differential entropy. This
doesn't mean there's no uncertainty, but it serves as a reference point for comparing
with other distributions.
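A tiny Python sketch (illustrative, with assumed interval endpoints) makes the same point numerically:
import math

for a, b in [(0.0, 8.0), (0.0, 1.0), (0.0, 0.25)]:
    print(f"uniform on [{a}, {b}]: h = {math.log2(b - a):+.3f} bits")
# wider interval -> larger h; width 1 -> 0 bits; width < 1 -> negative h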
Okay, let's move on to the next topic in Unit-I: Maximum Entropy Principle (for
Continuous Variables).
Just as we have a Maximum Entropy Principle for discrete random variables, there's
an analogous principle for continuous random variables. It states that given some
constraints on a continuous probability distribution (e.g., constraints on its moments
or expected values), the distribution that maximizes differential entropy while
satisfying those constraints is the "best" representation of our current knowledge.
Understanding the Principle
• Maximum Differential Entropy: Among all continuous distributions that satisfy the
given constraints, the Maximum Entropy Principle suggests choosing the one with
the highest differential entropy.
• Justification: This principle is based on the idea that maximizing differential entropy
minimizes the amount of additional information or bias we introduce into the
distribution beyond what's specified by the constraints.
Mathematical Formulation
Maximize h(X) = - ∫ f(x) log₂ f(x) dx
subject to:
∫ f(x) dx = 1, ∫ x f(x) dx = μ, and ∫ (x - μ)² f(x) dx = σ² (when the mean and variance are the known constraints)
where:
• μ is the mean.
• σ² is the variance.
Other Examples: with only a bounded-support constraint, the uniform distribution has maximum differential entropy; with a fixed mean on [0, ∞), the exponential distribution does.
Example
Suppose we have a continuous random variable X, and we know its mean (μ) is 2 and
its variance (σ²) is 1. We want to find the probability distribution that maximizes
differential entropy while satisfying these constraints.
To solve this constrained optimization problem, we can use the calculus of variations.
This involves finding the function f(x) that maximizes the functional h(X) subject to
the constraints.
We set the functional derivative of the Lagrangian to zero, ∂L/∂f(x) = 0, where L is the Lagrangian function, which incorporates the differential entropy and the constraints using Lagrange multipliers.
Solving the Euler-Lagrange equation for this problem (which involves some calculus and algebraic manipulation) leads to the following solution:
f(x) = (1 / √(2πσ²)) exp( -(x - μ)² / (2σ²) ) = (1 / √(2π)) exp( -(x - 2)² / 2 )
This is the PDF of a Gaussian (normal) distribution with mean μ = 2 and variance σ² = 1.
Interpretation
The Gaussian distribution with mean 2 and variance 1 is the maximum entropy
distribution among all continuous distributions that satisfy the given mean and
variance constraints. This means that if we know the mean and variance of a random
variable, but we don't have any other information about its distribution, the
Gaussian distribution is the most unbiased and "spread out" distribution we can
assume.
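For reference, the Gaussian's differential entropy has the closed form h(X) = (1/2) log₂(2πeσ²); the short Python sketch below (an illustrative addition, not from the original notes) evaluates it for the σ² = 1 case:
import math

def gaussian_h_bits(sigma2):
    # differential entropy of a Gaussian with variance sigma2, in bits
    return 0.5 * math.log2(2 * math.pi * math.e * sigma2)

print(f"{gaussian_h_bits(1.0):.3f} bits")   # about 2.047 bits
# any other distribution with variance 1 has differential entropy at most this value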
Absolutely! Let's briefly discuss the Lagrange function and Lagrange multipliers, as they're
essential tools for solving constrained optimization problems like the Maximum Entropy
Principle.
Imagine you want to find the maximum or minimum value of a function f(x, y) subject to a
constraint g(x, y) = c. This is a constrained optimization problem.
Lagrange Function
The Lagrange function, denoted as L, combines the objective function f and the constraint g using a Lagrange multiplier (λ):
L(x, y, λ) = f(x, y) + λ (g(x, y) - c)
Lagrange Multipliers
The Lagrange multiplier (λ) is an auxiliary variable introduced to incorporate the constraint
into the optimization problem.
To find the optimal solution, we take the partial derivatives of the Lagrange function with
respect to x, y, and λ, and set them to zero:
∂L/∂x = 0
∂L/∂y = 0
∂L/∂λ = 0
Solving these equations simultaneously gives us the values of x, y, and λ that satisfy the
constraint and optimize the objective function.
Interpretation
• The Lagrange multiplier (λ) represents the sensitivity of the optimal value of f to changes in the constraint constant c.
The Lagrange multiplier method is based on the idea that at the optimal solution, the
gradient of the objective function f is parallel to the gradient of the constraint function g.
This condition is captured by the partial derivative equations mentioned above.
Example
Suppose we want to maximize f(x, y) = x + y subject to the constraint x² + y² = 1 (points on the unit circle). The Lagrange function is:
L(x, y, λ) = x + y + λ(x² + y² - 1)
∂L/∂x = 1 + 2λx = 0
∂L/∂y = 1 + 2λy = 0
∂L/∂λ = x² + y² - 1 = 0
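Solving these equations completes the example: the first two give x = y = -1/(2λ), and substituting into the constraint gives 2 · 1/(4λ²) = 1, so λ = ±1/√2 and x = y = ±1/√2. The maximum of f(x, y) = x + y on the unit circle is therefore attained at x = y = 1/√2 ≈ 0.707, where f = √2, and the minimum at x = y = -1/√2.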
Okay, let's move on to the next topic in Unit-I: Jensen's Inequality.
Jensen's Inequality
• Convex Function: A function is convex if, for any two points on the graph of the
function, the line segment connecting those points lies above or on the graph.
• The Inequality: Jensen's inequality states that for a convex function φ(x) and a
random variable X:
φ(E[X]) ≤ E[φ(X)]
where:
• E[X] is the expected value of X, and
• E[φ(X)] is the expected value of φ(X).
• Interpretation: This inequality means that applying the convex function to the
expected value of X is less than or equal to the expected value of the function
applied to X.
Example: Jensen's Inequality with x²
Let's consider the convex function φ(x) = x² and a discrete random variable X taking the values x = 1, 2, 3 with some probability distribution (the probability row of the original table is not reproduced here).
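For a concrete check, suppose (purely for illustration, since the original probability row is not reproduced) that the three values are equally likely, P(X=1) = P(X=2) = P(X=3) = 1/3. Then E[X] = 2, so φ(E[X]) = 2² = 4, while E[φ(X)] = E[X²] = (1 + 4 + 9)/3 = 14/3 ≈ 4.67. Indeed 4 ≤ 4.67, as Jensen's inequality requires.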
Jensen's inequality is used in various proofs and derivations in information theory, such as showing that relative entropy is non-negative (D(P||Q) ≥ 0) and that conditioning never increases entropy (H(X|Y) ≤ H(X)).
Okay, let's move on to the next topic in Unit-I: Fano's Inequality.
Fano's Inequality
Fano's inequality is a fundamental result in information theory that relates the probability of
error in decoding a message to the conditional entropy of the transmitted message given the
received signal. It provides a lower bound on the probability of error, showing that it cannot
be arbitrarily small if there's uncertainty about the transmitted message.
• Probability of Error: Let P(e) denote the probability of error, i.e., the probability that
the decoded message X̂ is not equal to the transmitted message X.
Fano's inequality states:
H(X|Y) ≤ H(P(e)) + P(e) log₂(M - 1)
where:
• H(P(e)) is the binary entropy of the error probability P(e),
• M is the number of possible values the transmitted message X can take, and
• H(X|Y) is the conditional entropy of X given the received signal Y.
Interpretation
• Lower Bound on Error: Fano's inequality provides a lower bound on the probability
of error. It shows that even with optimal decoding, the error probability cannot be
smaller than this bound.
• Relationship with Uncertainty: The inequality highlights the connection between the
probability of error and the uncertainty about the transmitted message given the
received signal. If there's a lot of uncertainty (high H(X|Y)), the error probability will
be higher.
Consider a binary symmetric channel (BSC) with crossover probability p. This means that
each bit transmitted has a probability p of being flipped (0 becomes 1 or 1 becomes 0).
Let X be the transmitted bit (0 or 1), equally likely, and Y be the received bit. Using the BSC's properties, the conditional entropy is H(X|Y) = -p log₂(p) - (1 - p) log₂(1 - p), the binary entropy of the crossover probability.
Since M = 2, the log₂(M - 1) term vanishes and Fano's inequality reduces to H(P(e)) ≥ H(X|Y), which gives the lower bound P(e) ≥ p (for p, P(e) ≤ 1/2).
This gives a lower bound on the error probability for the BSC: no decoder can do better than the channel's own flip probability.
Applications
Fano's inequality has applications in proving the converse parts of coding theorems (showing that reliable communication is impossible above channel capacity) and in deriving lower bounds on error probability in estimation and hypothesis-testing problems.
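As a small illustration (assumed helper code, not from the original notes), the Fano lower bound on P(e) can be found numerically as the smallest error probability at which the right-hand side of the inequality reaches H(X|Y):
import math

def hb(p):
    # binary entropy in bits
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def fano_lower_bound(h_x_given_y, M):
    # smallest P(e) in [0, 1 - 1/M] with hb(P(e)) + P(e)*log2(M-1) >= H(X|Y);
    # the left side is increasing on that interval, so plain bisection applies
    # (assumes 0 <= H(X|Y) <= log2(M))
    f = lambda pe: hb(pe) + pe * math.log2(M - 1)
    lo, hi = 0.0, 1.0 - 1.0 / M
    for _ in range(100):
        mid = (lo + hi) / 2
        if f(mid) < h_x_given_y:
            lo = mid
        else:
            hi = mid
    return hi

# binary source over a BSC with crossover p = 0.1, so H(X|Y) = hb(0.1)
print(f"P(e) >= {fano_lower_bound(hb(0.1), 2):.3f}")   # about 0.1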
Okay, let's move on to the next topic in Unit-I: Introduction to Lossless Coding.
Lossless coding, also known as lossless compression or reversible coding, is a type of data
compression where the original data can be perfectly reconstructed from the compressed
data. No information is lost during the compression process.
• Where it matters: Lossless coding is essential for data that must be reconstructed exactly, such as:
o Text documents
o Program code
o Medical images
o Scientific data
• Trade-off: The trade-off in lossless coding is between compression ratio (how much
the data is compressed) and computational complexity (how much processing power
is required).
Common Techniques
• Huffman Coding: Assigning shorter codes to more frequent symbols and longer
codes to less frequent symbols.
• Shannon-Fano Coding: Similar to Huffman coding, but the codes are assigned based
on a top-down approach.
• Lempel-Ziv Algorithms (LZ77, LZ78, LZW): Building a dictionary of previously
encountered sequences and replacing repeated occurrences with references to the
dictionary.
• Run-Length Encoding (RLE): Replacing a run of repeated symbols with a count followed by the symbol. For example, the sequence
AAABBBCCCDDDD
is encoded as
3A3B3C4D
This represents three consecutive 'A's, followed by three 'B's, and so on. The compressed
representation is shorter than the original sequence.
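A minimal Python sketch of run-length encoding and decoding (illustrative; it assumes runs of fewer than 10 symbols so that single-digit counts suffice, as in the example above):
from itertools import groupby

def rle_encode(s):
    # AAABBBCCCDDDD -> 3A3B3C4D
    return "".join(f"{len(list(group))}{ch}" for ch, group in groupby(s))

def rle_decode(s):
    # 3A3B3C4D -> AAABBBCCCDDDD
    return "".join(int(s[i]) * s[i + 1] for i in range(0, len(s), 2))

original = "AAABBBCCCDDDD"
encoded = rle_encode(original)
print(encoded)                          # 3A3B3C4D
print(rle_decode(encoded) == original)  # True -> no information lost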
Applications
• File Compression: ZIP, RAR, and other archive formats use lossless compression to
reduce file sizes.
• Image Compression: PNG and GIF are image formats that use lossless compression.
• Text Compression: Text files and documents can be compressed using lossless
techniques.
• Data Transmission: Lossless compression can reduce the amount of data transmitted
over a network, saving bandwidth.
Okay, let's move on to the next topic in Unit-I: Source Coding Theorem.
The Source Coding Theorem, also known as Shannon's Source Coding Theorem or the
Noiseless Coding Theorem, is a fundamental result in information theory that establishes the
limits of lossless data compression. It states that the average number of bits required to
represent a source symbol cannot be less than the entropy of the source.
• Entropy as a Lower Bound: The entropy of a source, denoted as H(X), represents the
average amount of information or uncertainty associated with each source symbol.
The Source Coding Theorem states that it's impossible to compress the source using
fewer bits than its entropy without losing some information.
• Achievability: The theorem also states that it's possible to achieve an average code
length arbitrarily close to the entropy by using sufficiently long codewords. This
means that we can approach the theoretical limit of compression, but we can never
go below it.
• Optimal Coding: The Source Coding Theorem motivates the search for optimal
coding schemes that minimize the average code length and approach the entropy
bound. Huffman coding and other techniques we'll discuss later are examples of such
schemes.
Mathematical Formulation
L ≥ H(X)
where:
• L is the average codeword length (in bits per source symbol) of any uniquely decodable code for the source, and
• H(X) is the entropy of the source.
Interpretation
• Compression Limit: The theorem sets a fundamental limit on how much we can
compress a source without losing information.
• Practical Implications: The Source Coding Theorem guides the design of efficient
compression algorithms and helps us understand the trade-offs involved in lossless
compression.
Example
Consider a source that emits three symbols A, B, and C with probabilities:
• P(A) = 0.5
• P(B) = 0.25
• P(C) = 0.25
Its entropy is H(X) = -(0.5 log₂ 0.5 + 0.25 log₂ 0.25 + 0.25 log₂ 0.25) = 0.5 + 0.5 + 0.5 = 1.5 bits.
The Source Coding Theorem tells us that any lossless coding scheme for this source must use
at least 1.5 bits on average to represent each symbol. We can try to approach this bound,
but we can never go below it.
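A quick Python check of this example (illustrative; the prefix code A→0, B→10, C→11 is an assumed choice that happens to be optimal here):
import math

probs = {"A": 0.5, "B": 0.25, "C": 0.25}
code = {"A": "0", "B": "10", "C": "11"}

H = -sum(p * math.log2(p) for p in probs.values())      # source entropy
L = sum(probs[s] * len(code[s]) for s in probs)          # average code length
print(f"H(X) = {H:.2f} bits, average length L = {L:.2f} bits")   # both 1.50
# L equals H(X) here because every probability is a power of 1/2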
Okay, let's move on to the next topic in Unit-I: Block Codes and their Properties.
Block Codes
In the realm of coding theory, a block code is a type of error-correcting code where a fixed-
length block of input data, consisting of k bits, is encoded into a longer fixed-length block of
n bits (n > k). The resulting n-bit block is called a codeword.
• Code Rate: The ratio k/n is known as the code rate, representing the proportion of
information bits to total bits in a codeword.
Block codes exhibit several key properties that determine their efficiency and error-
correcting capabilities:
1. Error Detection and Correction: Block codes are designed to detect and correct
errors that may occur during data transmission or storage. The ability to correct
errors is a crucial characteristic of these codes.
2. Hamming Distance: The Hamming distance between two codewords is the number
of positions in which they differ. For example, the codewords 10110 and 11101 have
a Hamming distance of 3. The minimum Hamming distance (d) of a code is the
smallest distance between any two distinct codewords.
3. Error Detection Capability: A block code with a minimum Hamming distance d can
detect up to d-1 errors. This means that if up to d-1 bits are flipped during
transmission, the receiver can detect that an error has occurred.
4. Error Correction Capability: A block code with a minimum Hamming distance d can correct up to ⌊(d-1)/2⌋ errors (the floor of (d-1)/2). This means that if at most that many bits are flipped, the receiver can correctly determine the original transmitted codeword.
5. Codeword Space: The set of all valid codewords forms the codeword space: 2^k codewords embedded among the 2^n possible n-bit words, as determined by the code's parameters (n and k).
There are various types of block codes, each with its own encoding and decoding algorithms
and characteristics:
• Linear Block Codes: These codes have the property that the sum of any two
codewords is also a codeword. Examples include Hamming codes and Reed-Solomon
codes.
• Cyclic Codes: A special type of linear code where any cyclic shift of a codeword is also
a codeword.
• Other Codes: There are many other types of block codes, each with its own unique
properties and applications.
The Hamming (7,4) code is a classic example of a linear block code. It encodes 4 information
bits into a 7-bit codeword. It has a minimum Hamming distance of 3, allowing it to detect up
to 2 errors and correct 1 error.
Okay, let's delve into solved examples for each type of block code mentioned above:
Hamming (7,4) Code (a linear block code)
• Encoding:
o Suppose we want to transmit 4 information bits. The Hamming (7,4) code adds 3 parity bits to these information bits, producing a 7-bit codeword; in this example the transmitted codeword is 1010101.
• Decoding:
o Suppose a single bit is flipped during transmission, so the received word differs from 1010101 in its third bit.
o The decoder performs parity checks on different subsets of the received bits.
o Based on the specific parity check failures, the decoder can pinpoint the location of the error and correct it.
o In this case, the decoder would correct the third bit, restoring the original codeword: 1010101
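For concreteness, here is a sketch of one common systematic Hamming (7,4) construction in Python with NumPy (an illustrative addition; its bit ordering need not match the 1010101 codeword quoted above):
import numpy as np

# generator matrix: codeword layout [d1 d2 d3 d4 p1 p2 p3]
G = np.array([[1, 0, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])
# parity-check matrix H, chosen so that G @ H.T = 0 (mod 2)
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])

def encode(data4):
    return (np.array(data4) @ G) % 2

def decode(received7):
    r = np.array(received7).copy()
    syndrome = (H @ r) % 2
    if syndrome.any():
        # for a single error, the syndrome equals the column of H at the error position
        for pos in range(7):
            if np.array_equal(H[:, pos], syndrome):
                r[pos] ^= 1
                break
    return r[:4]          # systematic code: the first four bits are the data

data = [1, 0, 1, 1]
cw = encode(data)
corrupted = cw.copy()
corrupted[2] ^= 1          # flip one bit in transit
print("codeword:", cw, "-> decoded data:", decode(corrupted))   # recovers [1 0 1 1]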
Cyclic Code Example
• Encoding:
o Suppose the data block is 1101011011 and the generator polynomial has degree 3.
o The data is appended with zeros (equal to the degree of the generator polynomial), becoming: 1101011011000
o This bit string is divided (modulo-2) by the generator polynomial, and the remainder replaces the appended zeros to form the transmitted codeword.
• Decoding:
o The receiver divides the received message by the same generator polynomial.
o A zero remainder means no error was detected; a non-zero remainder indicates that the received block was corrupted.
o Depending on the specific code and application, the receiver might attempt to correct the error or request retransmission.
Okay, I'll provide solved examples for the remaining topics in Unit-I in one go. Here we go:
1. Instantaneous (Prefix) Codes
• Example: In the code {0, 10, 110, 111}, no codeword is a prefix of any other, so a receiver can decode each symbol the moment its last bit arrives.
• Properties: Every prefix code is uniquely decodable, and its codeword lengths must satisfy the Kraft-McMillan inequality discussed next.
2. Kraft-McMillan Inequality
• Statement: Codeword lengths l₁, l₂, ..., lₙ can be realized by an instantaneous (prefix) binary code if and only if:
Σ 2^(-lᵢ) ≤ 1
• Example:
o Consider the codeword lengths {1, 2, 3, 3}: Σ 2^(-lᵢ) = 1/2 + 1/4 + 1/8 + 1/8 = 1 ≤ 1.
o Since the inequality holds, an instantaneous code with these lengths exists (e.g., {0, 10, 110, 111}).
• Interpretation: The inequality essentially checks if the given codeword lengths "fit"
within the available code space without violating the prefix condition.
3. Huffman Coding
• Algorithm: Repeatedly merge the two least probable symbols (or groups) into a single node until one node remains, then read the codewords off the resulting binary tree.
• Example:
o Consider the symbols {A, B, C, D} with probabilities {0.4, 0.3, 0.2, 0.1}.
o One possible Huffman code is A = 0, B = 10, C = 110, D = 111, giving an average code length of 0.4(1) + 0.3(2) + 0.2(3) + 0.1(3) = 1.9 bits, compared with a source entropy of about 1.846 bits.
o This code minimizes the average code length, making it efficient for compression. A short implementation sketch follows below.
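A compact Python sketch of the Huffman procedure (illustrative; tie-breaking may differ from a hand derivation, but the codeword lengths, and hence the 1.9-bit average, come out the same):
import heapq

def huffman_code(probs):
    # each heap entry: (probability, tie-breaker, {symbol: partial codeword})
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)     # two least probable groups
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}
code = huffman_code(probs)
avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(code)                                    # e.g. A -> 0, B -> 10, C -> 111, D -> 110
print(f"average length = {avg_len:.1f} bits")  # 1.9 bits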
4. Shannon-Fano Coding
• Algorithm: Shannon-Fano coding is another variable-length coding algorithm that
assigns codes based on a top-down approach, dividing the symbols into two sets with
roughly equal probabilities and recursively assigning codes.
• Example:
o For the same symbols {A, B, C, D} with probabilities {0.4, 0.3, 0.2, 0.1}, the first split separates {A} (probability 0.4) from {B, C, D} (probability 0.6), and recursing on the second set yields codeword lengths 1, 2, 3, 3.
o In general, Shannon-Fano coding might produce a slightly different (and sometimes less efficient) code than Huffman coding, but it also aims to minimize the average code length.
5. Lempel-Ziv Algorithm
• Algorithm (LZ78 variant): Parse the input into phrases that have not been seen before; encode each phrase as a pair (index of its longest previously seen prefix, next symbol), adding the new phrase to the dictionary.
• Example:
o The string ABABABA is parsed into the phrases A, B, AB, ABA and encoded as (0, A), (0, B), (1, B), (3, A). A short implementation sketch follows below.
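A minimal LZ78-style Python sketch (illustrative code, not from the original notes) that reproduces the parse above:
def lz78_encode(s):
    dictionary = {}            # phrase -> index (1-based; index 0 means "empty phrase")
    output = []
    phrase = ""
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch       # keep extending the current match
        else:
            output.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:                 # flush a trailing match that ends the input
        output.append((dictionary.get(phrase, 0), ""))
    return output

print(lz78_encode("ABABABA"))
# [(0, 'A'), (0, 'B'), (1, 'B'), (3, 'A')]  -> phrases A, B, AB, ABA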