Advanced Probability & Statistics (23CST-286)
Compiled by : Subhayu
Contents:
Unit 1
Unit 2
Unit 3
MST 1 and 2 Solutions
Sample Questions
Unit 1: Random Variable and Distribution Function
Random Variable
A random variable is a function that maps each outcome in a sample space to a real number.
It provides a numerical description of the outcome of a statistical experiment.
The cumulative distribution function (CDF) of X is F_X(x) = P(X ≤ x).
Properties of CDF:
Non-decreasing
lim_{x→−∞} F_X(x) = 0
lim_{x→∞} F_X(x) = 1
Right-continuous
Marginal Distributions
For discrete random variables, the marginal PMFs are obtained by summing the joint PMF:
For X: P_X(x) = Σ_y P(X = x, Y = y)
For Y: P_Y(y) = Σ_x P(X = x, Y = y)
For continuous random variables with joint density f(x, y), the marginal PDFs are obtained by integrating out the other variable.
Marginal PDF of X:
f_X(x) = ∫_{−∞}^{∞} f(x, y) dy
Marginal PDF of Y:
f_Y(y) = ∫_{−∞}^{∞} f(x, y) dx
Properties: each marginal is itself a valid density, i.e., f_X(x) ≥ 0 and ∫_{−∞}^{∞} f_X(x) dx = 1.
Example (Continuous):
If f(x, y) = 2 for 0 < x < y < 1, and 0 elsewhere, then f_X(x) = ∫_x^1 2 dy = 2(1 − x) for 0 < x < 1.
This gives the individual distribution of each variable regardless of the other.
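As a quick illustration of marginalization (a sketch, not from the original notes; the joint table below is made up), the marginal PMFs can be computed by summing a joint probability table along each axis:

```python
import numpy as np

# Hypothetical joint PMF of X (rows) and Y (columns); entries sum to 1.
joint = np.array([[0.10, 0.20],
                  [0.25, 0.15],
                  [0.05, 0.25]])

p_x = joint.sum(axis=1)   # P_X(x) = sum over y of P(X = x, Y = y)
p_y = joint.sum(axis=0)   # P_Y(y) = sum over x of P(X = x, Y = y)

print("P_X:", p_x)        # [0.3 0.4 0.3]
print("P_Y:", p_y)        # [0.4 0.6]
```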
Conditional Distributions
The conditional density of X given Y = y, and of Y given X = x:
f_{X|Y}(x|y) = f(x, y) / f_Y(y), if f_Y(y) > 0
f_{Y|X}(y|x) = f(x, y) / f_X(x), if f_X(x) > 0
Independence of Random Variables
For discrete random variables:
P(X = x, Y = y) = P(X = x) · P(Y = y) for all x, y
For continuous random variables:
f(x, y) = f_X(x) · f_Y(y) for all x, y
Independence implies that knowing the value of one variable gives no information about the other.
Bivariate Distribution
A bivariate distribution describes the joint distribution of two random variables X and Y. It can be expressed through the joint PMF (discrete case), the joint PDF (continuous case), or the joint CDF F(x, y) = P(X ≤ x, Y ≤ y).
The bivariate distribution includes all information about the behavior and relationship
between two variables.
Correlation
Correlation measures the strength and direction of a linear relationship between two
random variables.
r_XY = Cov(X, Y) / (σ_X σ_Y)
Properties:
−1 ≤ r_XY ≤ 1
Example:
If X is the number of study hours and Y is exam score, a positive correlation is expected
(higher study hours lead to higher scores).
Correlation only captures linear dependency. Nonlinear relationships may exist even when
correlation is zero.
1. Function of a Random Variable
If Y = g(X), where g is strictly monotonic and differentiable, then
f_Y(y) = f_X(x) · |dx/dy|, where x = g^{−1}(y)
Example:
Let X ∼ U(0, 1) and Y = X². Then x = √y and dx/dy = 1/(2√y), so f_Y(y) = 1/(2√y) for 0 < y < 1.
For a transformation of two variables (U, V) = g(X, Y) with a differentiable inverse,
f_{U,V}(u, v) = f_{X,Y}(x, y) · |J|
where
J = ∂(x, y)/∂(u, v)
is the Jacobian of the inverse transformation.
2. Distribution of the Difference, Product, Quotient of Two Random Variables
Difference: For Z = X − Y with X and Y independent, the density is:
f_Z(z) = ∫_{−∞}^{∞} f_X(z + y) f_Y(y) dy
Product: For Z = XY, the PDF is more complex and is often derived using transformation techniques or convolution in the log-domain.
Quotient: For Z = X/Y with X and Y independent, the density is:
f_Z(z) = ∫_{−∞}^{∞} |y| f_X(zy) f_Y(y) dy
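As a sanity check of the quotient formula (a sketch assuming SciPy; the standard-normal choice is an illustration, not from the notes): for independent X, Y ∼ N(0, 1), the integral should reproduce the standard Cauchy density 1/(π(1 + z²)):

```python
import numpy as np
from scipy import integrate

def phi(u):
    """Standard normal pdf."""
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def f_Z(z):
    """Density of Z = X/Y via f_Z(z) = integral of |y| f_X(zy) f_Y(y) dy."""
    val, _ = integrate.quad(lambda y: abs(y) * phi(z * y) * phi(y),
                            -np.inf, np.inf)
    return val

z = 1.5
print(f_Z(z), 1 / (np.pi * (1 + z**2)))   # both ≈ 0.0979
```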
3. Expectation
Definition:
The expected value or mean of a random variable X gives its long-run average value.
For discrete X:
E[X] = Σ_x x · P(X = x)
For continuous X:
E[X] = ∫_{−∞}^{∞} x f_X(x) dx
Properties:
E[aX + b] = a E[X] + b (linearity)
E[X + Y] = E[X] + E[Y]
4. Moments
r-th moment about the origin:
μ′_r = E[X^r]
r-th central moment (about the mean):
μ_r = E[(X − μ)^r]
Special Cases:
μ_2 = σ² (variance)
μ_3 relates to skewness
μ_4 relates to kurtosis
Joint expectation:
E[XY] = Σ_x Σ_y x y · P(X = x, Y = y)
Covariance:
Cov(X, Y) = E[XY] − E[X] E[Y]
Correlation coefficient:
ρ_XY = Cov(X, Y) / (σ_X σ_Y)
5. Law of Large Numbers
Definition:
The law states that as the number of trials increases, the sample mean approaches the population mean.
Weak Law:
P(|(1/n) Σ_{i=1}^n X_i − μ| ≥ ε) → 0 as n → ∞
Strong Law:
The sample average converges to the population mean almost surely (with probability 1).
Implication:
Justifies using sample averages to estimate expected values in practical applications.
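A minimal simulation of the law (an illustrative sketch; the fair-die setup is assumed, not from the notes): the running mean of die rolls drifts toward the population mean 3.5:

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)              # fair six-sided die
running_mean = np.cumsum(rolls) / np.arange(1, rolls.size + 1)

for n in (10, 100, 10_000, 100_000):
    print(n, running_mean[n - 1])                     # approaches 3.5
```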
Unit 2
1. Conditional Expectation
Discrete Case:
Let X and Y be discrete random variables. The conditional expectation of X given Y = y is:
E[X|Y = y] = Σ_x x · P(X = x | Y = y)
Continuous Case:
If X and Y are continuous random variables with joint density function f_{X,Y}(x, y), then:
E[X|Y = y] = ∫_{−∞}^{∞} x · f_{X|Y}(x|y) dx
where f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y)
2. Conditional Variance
Definition:
The conditional variance of X given Y = y is the variance of X when Y = y is known:
Var(X|Y = y) = E[X²|Y = y] − (E[X|Y = y])²
Law of Total Expectation (Tower Property):
E[X] = E[E[X|Y]]
3. Moment Generating Function
Definition:
The moment generating function of a random variable X is defined as:
M_X(t) = E[e^{tX}]
For discrete:
M_X(t) = Σ_x e^{tx} P(X = x)
For continuous:
M_X(t) = ∫_{−∞}^{∞} e^{tx} f_X(x) dx
Properties:
M_X^{(n)}(0) = E[X^n], i.e., the n-th derivative of the MGF at t = 0 gives the n-th moment.
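For instance (a sketch assuming SymPy; the exponential distribution with its standard MGF λ/(λ − t) is used as the example), differentiating the MGF at t = 0 recovers the moments:

```python
import sympy as sp

t, lam = sp.symbols('t lam', positive=True)
M = lam / (lam - t)                         # MGF of Exponential(lam), t < lam

mean = sp.diff(M, t, 1).subs(t, 0)          # M'(0)  = E[X]   -> 1/lam
second = sp.diff(M, t, 2).subs(t, 0)        # M''(0) = E[X^2] -> 2/lam^2
print(mean, sp.simplify(second - mean**2))  # mean 1/lam, variance 1/lam^2
```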
4. Chebyshev’s Inequality
Statement:
For any random variable X with finite mean μ and variance σ², and for any k > 0:
P(|X − μ| ≥ kσ) ≤ 1/k²
This inequality gives an upper bound on the probability that the value of a random variable
deviates from its mean by more than k standard deviations.
Use:
Chebyshev’s inequality is used to prove the Weak Law of Large Numbers and to give non-
parametric bounds on dispersion.
5. Weak Law of Large Numbers (WLLN)
Statement:
Let X_1, X_2, ..., X_n be i.i.d. random variables with finite mean μ. Then the sample mean X̄_n = (1/n) Σ_{i=1}^n X_i satisfies:
lim_{n→∞} P(|X̄_n − μ| ≥ ε) = 0 for any ε > 0
This means the sample average converges in probability to the expected value.
Proof sketch (via Chebyshev's inequality):
P(|X̄_n − μ| ≥ ε) ≤ σ²/(nε²) → 0 as n → ∞
6. Central Limit Theorem (CLT)
Statement:
Let X_1, X_2, ..., X_n be i.i.d. random variables with mean μ and variance σ². Then, as n → ∞, the standardized sum
Z_n = (Σ_{i=1}^n X_i − nμ) / (σ√n) → N(0, 1)
In other words, the distribution of Zn tends to the standard normal distribution.
Implication:
No matter the original distribution of Xi , the sample sum or mean becomes approximately
normal for large n. This is the basis of many statistical inference techniques.
Example:
If a biased coin (probability of heads = 0.6) is tossed n times, the distribution of the number of heads (a binomial variable) will resemble a normal curve as n becomes large.
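The coin example can be checked numerically (a sketch; the specific n, trial count, and seed are arbitrary): standardized binomial counts behave like draws from N(0, 1):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, trials = 1000, 0.6, 50_000
heads = rng.binomial(n, p, size=trials)           # heads in n biased tosses

z = (heads - n * p) / np.sqrt(n * p * (1 - p))    # standardize each count
print(z.mean(), z.std())                          # ≈ 0 and ≈ 1
# A histogram of z closely follows the standard normal bell curve.
```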
Unit 3
1. Probability vs Likelihood
Probability is used when the parameters are known, and we compute the chance of observing a particular data point or set of data.
Example: If we know a coin is fair (P (H) = 0.5), the probability of getting 2 heads in 2
tosses is 0.5 × 0.5 = 0.25.
Likelihood is used when data is observed, and we want to estimate the unknown
parameters that best explain the data.
Example: If we toss a coin twice and get two heads, the likelihood of different values of p
(probability of heads) can be evaluated by L(p) = p2 . The value of p that maximizes this
likelihood is taken as the estimate.
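This maximization can be seen directly by evaluating L(p) = p² on a grid (a minimal sketch of the two-heads example above):

```python
import numpy as np

p = np.linspace(0, 1, 1001)
L = p**2                          # likelihood of p after observing HH

print(p[np.argmax(L)])            # 1.0: L(p) = p^2 is maximized at p = 1
```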
2. Parameter Space
The parameter space is the set of all possible values that a parameter can take.
For example, if p is the probability of success in a Bernoulli trial, then the parameter
space is 0 ≤ p ≤ 1.
For a normal distribution N(μ, σ²), the parameter space is μ ∈ ℝ, σ² > 0.
3. Characteristics of Estimators
Unbiasedness: E[θ̂] = θ, i.e., the estimator is correct on average.
Consistency: θ̂ → θ in probability as sample size n → ∞.
Efficiency: Among all unbiased estimators, the one with the smallest variance is called
efficient.
Sufficiency: An estimator is sufficient if it uses all the information in the sample about
the parameter.
Minimum Variance Unbiased Estimator (MVUE): An unbiased estimator that has the
smallest variance among all unbiased estimators.
4. Method of Maximum Likelihood
Given a sample X_1, X_2, ..., X_n and probability density function f(x; θ), the likelihood is:
L(θ) = ∏_{i=1}^n f(X_i; θ)
Take the log-likelihood log L(θ) = Σ_{i=1}^n log f(X_i; θ) and set its derivative to zero:
(d/dθ) log L(θ) = 0
Solve for θ to get the MLE.
Example: For a normal sample X_1, ..., X_n with density f(x; μ, σ²) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)), maximizing the log-likelihood gives:
μ̂_MLE = (1/n) Σ_{i=1}^n X_i
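A numerical check (a sketch assuming SciPy; the simulated data and starting values are arbitrary): minimizing the negative log-likelihood of a normal sample recovers the sample mean as μ̂:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=500)       # sample with true mu = 5

def neg_log_lik(params):
    mu, log_sigma = params                         # log-sigma keeps sigma > 0
    sigma = np.exp(log_sigma)
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + (x - mu)**2 / (2 * sigma**2))

res = minimize(neg_log_lik, x0=[0.0, 0.0])
print(res.x[0], x.mean())                          # numerical MLE ≈ sample mean
```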
5. Method of Moments
Based on the idea that population moments can be estimated using sample moments.
If μ′_1, μ′_2, ..., μ′_k are the first k population moments expressed as functions of the parameters, equate each to the corresponding sample moment:
μ′_j = (1/n) Σ_{i=1}^n X_i^j, for j = 1, 2, ..., k
and solve for the parameters.
Example: For an exponential distribution f(x; λ) = λe^{−λx}, the first moment is μ = 1/λ. Equating it with the sample mean X̄:
X̄ = 1/λ̂  ⇒  λ̂ = 1/X̄
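Numerically (a sketch; the true rate λ = 2 is an arbitrary choice for the simulation), the method-of-moments estimator 1/X̄ recovers λ:

```python
import numpy as np

rng = np.random.default_rng(3)
true_lam = 2.0
x = rng.exponential(scale=1 / true_lam, size=10_000)   # Exp(lambda) sample

lam_hat = 1 / x.mean()            # method of moments: lambda-hat = 1 / X-bar
print(lam_hat)                    # close to 2.0
```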
6. Method of Least Squares
Commonly used in regression analysis, where the aim is to minimize the sum of squared deviations between observed and predicted values:
S(β₀, β₁) = Σ_{i=1}^n (Y_i − β₀ − β₁ X_i)²
Minimizing S with respect to β₀ and β₁ gives:
β̂₁ = Σ(X_i − X̄)(Y_i − Ȳ) / Σ(X_i − X̄)²,  β̂₀ = Ȳ − β̂₁ X̄
This method yields the best linear unbiased estimates (BLUE) under the standard assumptions.
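The closed-form estimates can be computed directly (a sketch with made-up data points):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)             # intercept and slope of the fitted line
```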
Sampling
Types of Sampling
1. Simple Random Sampling:
Every unit in the population has an equal chance of being selected.
Example: Drawing names at random from a complete list of students.
2. Stratified Sampling:
The population is divided into subgroups (strata) based on a particular characteristic (like
age, gender), and samples are randomly drawn from each subgroup.
Example: Surveying both male and female participants separately.
3. Systematic Sampling:
Every kth unit from a list is selected after a random starting point.
Example: Selecting every 10th student from a roll list.
4. Cluster Sampling:
The population is divided into clusters (groups), some clusters are randomly selected,
and then all or some elements within those clusters are studied.
Example: Selecting a few classrooms and interviewing all students within those classes.
5. Convenience Sampling:
Sample is taken from a group that is easy to access. This is non-probabilistic and may
involve biases.
Example: Interviewing people in a nearby park.
6. Quota Sampling:
The population is segmented, and a quota is set for each segment. Within each quota,
sampling is done conveniently or randomly.
Example: Choosing 10 people from each age group for a survey.
Algorithms Using Regression
1. Gradient Descent
The idea is to update the parameters (e.g., weights θ) in the opposite direction of the gradient of the cost function with respect to the parameters.
Cost function (squared error):
J(θ) = (1/2m) Σ_{i=1}^m (h_θ(x^{(i)}) − y^{(i)})², where h_θ(x) = θ₀ + θ₁x
Update Rule:
θ_j := θ_j − α ∂J(θ)/∂θ_j
For simple linear regression this gives:
θ₀ := θ₀ − α (1/m) Σ_{i=1}^m (h_θ(x^{(i)}) − y^{(i)})
θ₁ := θ₁ − α (1/m) Σ_{i=1}^m (h_θ(x^{(i)}) − y^{(i)}) x^{(i)}
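A minimal sketch of these updates (the toy data, learning rate, and iteration count are assumptions for illustration):

```python
import numpy as np

# Toy data lying roughly on y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

theta0, theta1, alpha, m = 0.0, 0.0, 0.05, len(x)
for _ in range(2000):
    h = theta0 + theta1 * x                        # predictions h_theta(x)
    grad0 = (1 / m) * np.sum(h - y)                # dJ/d(theta0)
    grad1 = (1 / m) * np.sum((h - y) * x)          # dJ/d(theta1)
    theta0 -= alpha * grad0                        # simultaneous update
    theta1 -= alpha * grad1

print(theta0, theta1)                              # close to 1 and 2
```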
2. Locally Weighted Regression (LWR)
Locally Weighted Regression is a non-parametric algorithm that fits multiple linear models locally to subsets of the data.
Instead of using a single global model for the entire dataset, LWR fits a model at a target query point x, weighting each training example by
w^{(i)} = exp(−(x^{(i)} − x)² / (2τ²))
where τ is the bandwidth parameter controlling how quickly the weights fall off with distance.
A separate θ is computed for each query point using weighted linear regression.
LWR is computationally expensive at test time but offers flexibility and good local fitting.
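A compact sketch of LWR (the sine-curve data and τ value are illustrative assumptions): a weighted least-squares fit is solved afresh for each query point:

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Predict y at x_query from a locally weighted linear fit."""
    w = np.exp(-(X - x_query) ** 2 / (2 * tau ** 2))   # Gaussian weights
    A = np.column_stack([np.ones_like(X), X])          # design matrix [1, x]
    W = np.diag(w)
    # Weighted normal equations: theta = (A^T W A)^{-1} A^T W y
    theta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    return theta[0] + theta[1] * x_query

X = np.linspace(0, 6, 40)
y = np.sin(X) + 0.1 * np.random.default_rng(4).normal(size=X.size)
print(lwr_predict(3.0, X, y))     # near sin(3.0) ≈ 0.141
```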
3. Logistic Regression
Logistic Regression is used for binary classification problems, where the output is 0 or 1.
h_θ(x) = 1 / (1 + e^{−θᵀx})
This is a sigmoid function that maps any real-valued number into the (0, 1) interval.
Cost Function:
J(θ) = −(1/m) Σ_{i=1}^m [y^{(i)} log(h_θ(x^{(i)})) + (1 − y^{(i)}) log(1 − h_θ(x^{(i)}))]
The cost function is convex, allowing gradient descent to converge to the global minimum, with update rule:
θ_j := θ_j − α (1/m) Σ_{i=1}^m (h_θ(x^{(i)}) − y^{(i)}) x_j^{(i)}
Logistic regression is interpretable and works well when the data is linearly separable.
For multi-class classification, softmax regression (multinomial logistic regression) is
used.
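A minimal training loop for logistic regression (a sketch; the six-point dataset and hyperparameters are made up for illustration):

```python
import numpy as np

# Tiny 1-D binary dataset: label 1 for larger x values.
x = np.array([0.5, 1.0, 1.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 0, 1, 1, 1])
X = np.column_stack([np.ones_like(x), x])      # prepend intercept feature

theta = np.zeros(2)
alpha, m = 0.1, len(y)
for _ in range(5000):
    h = 1 / (1 + np.exp(-X @ theta))           # sigmoid hypothesis
    theta -= alpha * (1 / m) * (X.T @ (h - y)) # gradient of the log-loss

print(np.round(1 / (1 + np.exp(-X @ theta)), 2))  # probs near 0,0,0,1,1,1
```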
MST 1 Solutions
1. Define a random variable and its types
A random variable (RV) is a function that assigns a real number to each outcome in a sample
space of a random experiment.
Types:
Discrete Random Variable: Takes a finite or countably infinite set of values.
Example: The number of heads in three coin tosses.
Continuous Random Variable: Takes any value in a given interval (an uncountably infinite set).
Example: Time taken to run a race.
2. Compute the value of k for the following probability distribution of a random variable
X, P(X = 1) = 0.5, P(X = 2) = 0.3 and P(X = 3) = k
We know that the sum of all probabilities in a probability distribution must equal 1.
P (X = 1) + P (X = 2) + P (X = 3) = 1
0.5 + 0.3 + k = 1
k = 1 − 0.8 = 0.2
Answer: k = 0.2
3. Define the Karl Pearson coefficient of correlation
It is defined as:
r = Cov(X, Y) / (σ_X σ_Y)
Where:
Cov(X, Y) is the covariance of X and Y.
σ_X and σ_Y are the standard deviations of X and Y.
Range: −1 ≤ r ≤ 1
r = 1: Perfect positive linear correlation
r = −1: Perfect negative linear correlation
r = 0: No linear correlation
4. Define the cumulative distribution function (CDF) and state its properties
A Cumulative Distribution Function (CDF) of a random variable X gives the probability that
X will take a value less than or equal to x:
F (x) = P (X ≤ x)
Properties of CDF:
Non-decreasing
lim_{x→−∞} F(x) = 0
lim_{x→∞} F(x) = 1
F (1) = P (X ≤ 1) = 0.2
5. When are two random variables said to be independent?
Two random variables X and Y are independent if the occurrence of one does not affect the probability distribution of the other.
Mathematical condition:
For all values of x and y ,
P (X = x and Y = y) = P (X = x) ⋅ P (Y = y)
Example:
Let P(X = 1) = 0.5, P(Y = 1) = 0.4, and P(X = 1, Y = 1) = 0.2.
Then X and Y satisfy the independence condition for this pair, since P(X = 1) · P(Y = 1) = 0.5 × 0.4 = 0.2 = P(X = 1, Y = 1).
If this condition fails for even one pair, X and Y are not independent.
6. Calculate the Pearson correlation coefficient for the following data
X (Father's height in inches): 65, 66, 67, 67, 68, 69, 70, 72
Y (Son's height in inches): 61, 68, 65, 68, 72, 72, 64, 71
Step 1: Compute the required sums (n = 8)
X    Y    X²     Y²     XY
65   61   4225   3721   3965
66   68   4356   4624   4488
67   65   4489   4225   4355
67   68   4489   4624   4556
68   72   4624   5184   4896
69   72   4761   5184   4968
70   64   4900   4096   4480
72   71   5184   5041   5112
ΣX = 544, ΣY = 541, ΣX² = 37028, ΣY² = 36699, ΣXY = 36820
Step 2: Plug into the formula
r = [n ΣXY − (ΣX)(ΣY)] / √([n ΣX² − (ΣX)²][n ΣY² − (ΣY)²])
= [8(36820) − (544)(541)] / √([8(37028) − 544²][8(36699) − 541²])
= (294560 − 294304) / √([296224 − 295936][293592 − 292681])
= 256 / √(288 × 911) = 256 / √262368 ≈ 256 / 512.22 ≈ 0.50
Answer: r ≈ 0.5, indicating a moderate positive linear correlation between fathers' and sons' heights.
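The hand computation can be verified in one line (a sketch using NumPy's built-in correlation):

```python
import numpy as np

X = np.array([65, 66, 67, 67, 68, 69, 70, 72])
Y = np.array([61, 68, 65, 68, 72, 72, 64, 71])

print(round(np.corrcoef(X, Y)[0, 1], 4))   # ≈ 0.4998, matching r ≈ 0.5
```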
7. The length of time (in minutes) that a certain lady speaks on the telephone is found to be a random phenomenon with a probability function specified by the pdf f(x) = A e^{−x/5} for x ≥ 0, and 0 otherwise.
(a) Find the value of A that makes f(x) a valid pdf:
∫_0^∞ A e^{−x/5} dx = A[−5e^{−x/5}]_0^∞ = A(0 + 5) = 1 ⇒ A = 1/5 = 0.2
(b) Calculate the probability that the number of minutes she will talk over phone is more
than 10 minutes
P(X > 10) = ∫_{10}^∞ 0.2 e^{−x/5} dx = 0.2 · [−5e^{−x/5}]_{10}^∞ = 0.2 · (0 + 5e^{−2}) = e^{−2} ≈ 0.1353
MST 2 Solutions
1. Define the probability density function (pdf) of a random variable Y when Y = g(X),
where g(X) is a continuous function.
Solution:
If Y = g(X), where g is continuous, differentiable, and strictly monotonic with inverse g^{−1}(y), then the probability density function (pdf) of Y is given by:
f_Y(y) = f_X(g^{−1}(y)) · |(d/dy) g^{−1}(y)|
Here, f_X(x) is the pdf of the original variable X, and the formula is valid wherever g^{−1}(y) is differentiable.
2. State the formula for the expectation of the product of two independent random
variables.
Solution:
If X and Y are two independent random variables, then:
E[XY] = E[X] · E[Y]
3. Describe the process of finding the expected value of a discrete random variable X
given that X takes values 0, 1, and 2 with probabilities 0.3, 0.4, and 0.3 respectively.
Solution:
The expected value E[X] of a discrete random variable is calculated as:
E[X] = Σ x_i · P(x_i)
Here:
E[X] = 0(0.3) + 1(0.4) + 2(0.3) = 0 + 0.4 + 0.6 = 1.0
4. Define the moment generating function (mgf) of a random variable.
Solution:
The moment generating function (MGF) of a random variable X is defined as:
M_X(t) = E[e^{tX}]
It is used to generate the moments (like mean, variance) of the distribution by differentiating
the MGF with respect to t and evaluating at t = 0. That is:
E[X^n] = M_X^{(n)}(0)
5. Compute the mgf of a continuous random variable with the probability density
function f_X(x) = 2x for 0 ≤ x ≤ 1.
Solution:
The moment generating function is given by:
M_X(t) = ∫_0^1 e^{tx} · 2x dx
Using integration by parts with u = 2x ⇒ du = 2 dx and dv = e^{tx} dx ⇒ v = e^{tx}/t:
M_X(t) = [2x e^{tx}/t]_0^1 − ∫_0^1 (2e^{tx}/t) dx
= 2e^t/t − (2/t) ∫_0^1 e^{tx} dx
= 2e^t/t − (2/t²)(e^t − 1)
So:
M_X(t) = 2e^t/t − 2(e^t − 1)/t², for t ≠ 0
This is the MGF of X for the given pdf.
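The integration by parts can be double-checked symbolically (a sketch assuming SymPy):

```python
import sympy as sp

x, t = sp.symbols('x t', positive=True)
M = sp.integrate(2 * x * sp.exp(t * x), (x, 0, 1))    # E[e^{tX}] for f(x) = 2x
closed_form = 2 * sp.exp(t) / t - 2 * (sp.exp(t) - 1) / t**2
print(sp.simplify(M - closed_form))                   # 0 -> the results agree
```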
6. A random variable X has the distribution:
X: −3, 6, 9
P(X = x): 1/6, 1/2, 1/3
Calculate E(X), E(X²) and, using the laws of expectation, evaluate E((2X + 1)²) for the given probability distribution of X.
Solution:
E(X) = Σ x · P(X = x) = (−3)(1/6) + 6(1/2) + 9(1/3) = −0.5 + 3 + 3 = 5.5
E(X²) = Σ x² · P(X = x) = 9(1/6) + 36(1/2) + 81(1/3) = 1.5 + 18 + 27 = 46.5
(2X + 1)² = 4X² + 4X + 1, so by linearity of expectation:
E((2X + 1)²) = 4E(X²) + 4E(X) + 1 = 4(46.5) + 4(5.5) + 1 = 186 + 22 + 1 = 209
Final Answers:
E(X) = 5.5
E(X²) = 46.5
E((2X + 1)²) = 209
7. Compute the Moment Generating Function (MGF) about origin of a Normal
Distribution.
Let X ∼ N(μ, σ²). We need to find the moment generating function (MGF), defined by:
M_X(t) = E[e^{tX}]
Solution:
The pdf of the normal distribution is:
f(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))
So M_X(t) = ∫_{−∞}^{∞} e^{tx} f(x) dx. Completing the square in the exponent and using the fact that a normal density integrates to 1 gives:
M_X(t) = exp(μt + (1/2)σ²t²)
Final Answer:
M_X(t) = exp(μt + (1/2)σ²t²)
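A quick Monte Carlo check of the result (a sketch; the μ, σ, and t values are arbitrary):

```python
import numpy as np

mu, sigma, t = 1.0, 2.0, 0.3
rng = np.random.default_rng(5)
x = rng.normal(mu, sigma, size=1_000_000)

empirical = np.mean(np.exp(t * x))                 # estimate of E[e^{tX}]
theoretical = np.exp(mu * t + 0.5 * sigma**2 * t**2)
print(empirical, theoretical)                      # both ≈ 1.616
```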
Sample Questions
5. Write the general formula for the Maximum Likelihood Estimator (MLE).
8. A sample contains values: 4, 7, 9. Estimate the population mean using the method of
moments.
9. In simple random sampling, what is the probability that the first item selected is the
largest in the population of size n?
11. Write one key difference between gradient descent and locally weighted regression.
12. What does the term “independent identically distributed” (i.i.d) mean?
2. A sample of 5 values: 2, 3, 5, 7, 9. Estimate the population variance using the method of
moments.
3. Find the MLE of p for a binomial distribution based on the observation: 3 successes in 5
trials.
4. Use gradient descent (one iteration) to update weights for minimizing the function
J(θ) = (θx − y)2 , where x = 2, y = 5, θ = 1, and learning rate α = 0.1.
5. Describe stratified and systematic sampling with appropriate examples.
1. A random sample of size 10 from a normal distribution yielded the following values:
6, 8, 9, 10, 7, 9, 11, 12, 8, 10.
(a) Estimate the population mean and variance using the method of moments.
(b) Derive the maximum likelihood estimators for the same parameters.
2. Perform two iterations of gradient descent for minimizing the cost function J(θ) = (1/m) Σ_{i=1}^m (θx_i − y_i)², using data: (x₁ = 1, y₁ = 2), (x₂ = 2, y₂ = 3), initial θ = 0, learning rate α = 0.1.
3. Given the logistic regression hypothesis h_θ(x) = 1/(1 + e^{−θᵀx}), derive the cost function and explain how gradient descent is used to minimize it.
4. A random variable X follows an exponential distribution with pdf f(x; λ) = λe^{−λx}, x ≥ 0. Derive the MLE for λ based on a sample of size n.
5. Explain the following sampling methods, including relevant examples and illustrations:
(c) Cluster sampling
(d) Systematic sampling