Probability - Statistics - Class Notes

The document provides an overview of probability concepts relevant to machine learning, including disjoint and joint events, independent and dependent probabilities, and the Naive Bayes model. It covers various distributions such as discrete, continuous, binomial, normal, and chi-squared distributions, along with their properties and functions like the PMF, PDF, and CDF. Additionally, it discusses statistical measures such as mean, variance, standard deviation, skewness, kurtosis, and methods for visualising data distributions.

Probability for ML

Sum of probabilities (A or B)
Disjoint events: P(A ∪ B) = P(A) + P(B)
Joint (overlapping) events: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Independent, dependent and conditional probabilities


For independent events, i.e. when one outcome does not affect the other:
P(A ∩ B) = P(A) ⋅ P(B)
For non-independent events:
P(A ∩ B) = P(A) ⋅ P(B ∣ A)
Conversely, the probability of an event not happening together with another is
P(¬A ∩ B) = P(¬A) ⋅ P(B ∣ ¬A)
The probability of the other event is given by
P(B) = P(A ∩ B) + P(¬A ∩ B) = P(A) ⋅ P(B ∣ A) + P(¬A) ⋅ P(B ∣ ¬A)
and the converse conditional probability by Bayes' theorem:

P(A ∣ B) = P(A ∩ B) / P(B) = P(A) ⋅ P(B ∣ A) / (P(A) ⋅ P(B ∣ A) + P(¬A) ⋅ P(B ∣ ¬A))

For example, for a medical test:

P(sick ∣ diagnosis) = P(sick) ⋅ P(diagnosis ∣ sick) / (P(sick) ⋅ P(diagnosis ∣ sick) + P(healthy) ⋅ P(diagnosis ∣ healthy))
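A minimal numeric sketch of the diagnosis example above, in Python; the prevalence and test accuracies are illustrative assumptions, not values from the notes:

    # Bayes' theorem for the sick/diagnosis example; all inputs are assumed values.
    p_sick = 0.01                  # P(sick), assumed prior
    p_diag_given_sick = 0.95       # P(diagnosis | sick), assumed sensitivity
    p_diag_given_healthy = 0.05    # P(diagnosis | healthy), assumed false-positive rate

    numerator = p_sick * p_diag_given_sick
    denominator = numerator + (1 - p_sick) * p_diag_given_healthy
    print(numerator / denominator)   # P(sick | diagnosis) ≈ 0.161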

Naive Bayes model


Complicated models where probabilities are calculated based on an array of observed events B1, ..., Bn would have the following conditional probability:

P(A ∣ B1, ..., Bn) = P(A) ⋅ P(B1, ..., Bn ∣ A) / (P(A) ⋅ P(B1, ..., Bn ∣ A) + P(¬A) ⋅ P(B1, ..., Bn ∣ ¬A))

However, P(B1, ..., Bn ∣ ¬A) could be 0, which makes the probability 1. Therefore, the naive assumption is made that the observed events are independent of each other:

P(A ∣ B1, ..., Bn) = P(A) ⋅ P(B1 ∣ A) ⋅ ... ⋅ P(Bn ∣ A) / (P(A) ⋅ P(B1 ∣ A) ⋅ ... ⋅ P(Bn ∣ A) + P(¬A) ⋅ P(B1 ∣ ¬A) ⋅ ... ⋅ P(Bn ∣ ¬A))
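A minimal sketch of the naive Bayes update above; the prior and the per-event probabilities are made-up illustrative numbers:

    from math import prod

    p_a = 0.3                              # P(A), assumed prior
    p_b_given_a = [0.8, 0.6, 0.9]          # P(B_i | A), assumed
    p_b_given_not_a = [0.2, 0.5, 0.4]      # P(B_i | not A), assumed

    num = p_a * prod(p_b_given_a)
    den = num + (1 - p_a) * prod(p_b_given_not_a)
    print(num / den)                       # P(A | B1, ..., Bn) ≈ 0.822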
Discrete distributions
The distribution of discrete probabilities is given by the probability mass function (PMF): p_X(x) or P(X = x);
where the value is always between 0 and 1: 0 ⩽ p_X(x) ⩽ 1; and
the sum of probabilities is 1: ∑ p_X(x) = 1.

Continuous distributions
The distribution of the total probability 1 over an interval between a and b of possible values is given by the probability density function (PDF): f_X(x);
where x is defined for all numbers x ∈ ℝ;
the value is always non-negative: f_X(x) ⩾ 0 (it may exceed 1); and
the total area under the curve is 1: ∫ₐᵇ f_X(t) dt = 1.

Cumulative distribution function


The function giving the probability that X is smaller than or equal to x is the cumulative distribution function (CDF): F_X(x) = P(X ⩽ x);
where x is defined for all numbers x ∈ ℝ;
the value is always between 0 and 1: 0 ⩽ F_X(x) ⩽ 1;
the left endpoint is 0: F_X(a) = 0 or lim (x→−∞) F_X(x) = 0;
the right endpoint is 1: F_X(b) = 1 or lim (x→+∞) F_X(x) = 1; and
it never decreases.

For a discrete variable, F_X(x) = ∑ (i=a to x) p_X(i), with F_X(0) = 0 and F_X(n) = 1.

Binomial coefficient
The number of different arrangements in which k identical events (successes) can be arranged out of n events is:

(n choose k) = n! / (k! ⋅ (n−k)!) = (n choose n−k)

where n! is the total number of ways of ordering the n events,
k! is the number of arrangements of the successes, and
(n−k)! is the number of arrangements of the failures.
Binomial and Bernoulli distributions
The probability of exactly x successes in n events is given by the PMF of the binomial distribution:

p_X(x) = (n choose x) ⋅ p^x ⋅ (1−p)^(n−x)

In such circumstances it is said that the random variable follows a binomial distribution with parameters n and p: X ~ Binomial(n, p).
A binomial distribution with n = 1, where the random variable takes the value 1 with probability p and the value 0 with probability q = 1 − p, is called a Bernoulli distribution with parameter p.
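A short sketch of evaluating the binomial PMF and CDF, assuming SciPy is available; n = 10 and p = 0.5 are illustrative values:

    from scipy.stats import binom

    n, p = 10, 0.5
    for k in range(n + 1):
        print(k, binom.pmf(k, n, p))   # P(X = k) = C(n, k) * p^k * (1-p)^(n-k)
    print(binom.cdf(3, n, p))          # P(X <= 3)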

Uniform distribution
When each event (or each interval of the same length) has equal probability, the variable follows a uniform distribution.

Discrete uniform over n equally likely values: p_X(x) = 1/n.

Continuous uniform with parameters a and b, X ~ Uniform(a, b):

f_X(x) = 1/(b−a) for a < x < b, and 0 for x ∉ (a, b)

F_X(x) = 0 for x < a; (x−a)/(b−a) for a ⩽ x < b; 1 for x ⩾ b
Normal distribution

When a continuous distribution (approximating a binomial distribution with very large n) with mean μ and standard deviation σ follows the curve

f_X(x) = 1/(σ√(2π)) ⋅ e^(−½((x−μ)/σ)²)

then it is said that X ~ Normal(μ, σ). The CDF of the standard normal distribution is

Φ(x) = 1/√(2π) ⋅ ∫ (−∞ to x) e^(−t²/2) dt,

and the CDF of X is Φ((x−μ)/σ).
Chi-squared distribution
The square of a standard normally distributed random variable X follows the chi-squared distribution. Its CDF is given by:

F_W(w) = P(W ⩽ w) = P(X² ⩽ w) = P(|X| ⩽ √w) = P(−√w ⩽ X ⩽ √w),

i.e. the area under the standard normal PDF curve between −√w and √w. When the squares of k independent such variables are summed, k is called the degrees of freedom, and the chi-squared PDF and CDF are parameterised by it.

Sampling from a distribution


To generate values that follow a particular distribution, first pick numbers at uniform intervals (or generate uniform random numbers) between 0 and 1. Then look up those numbers in the inverse CDF (CDF⁻¹, the quantile function) of the desired distribution. The resulting values will follow the target distribution.
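A minimal sketch of this inverse-transform sampling idea, assuming NumPy and SciPy; the target distribution here is the standard normal:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    u = rng.uniform(0.0, 1.0, size=10_000)   # uniform numbers in (0, 1)
    samples = norm.ppf(u)                    # look them up in the inverse CDF of N(0, 1)
    print(samples.mean(), samples.std())     # close to 0 and 1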
Mean

The point of balance of a collection of datapoints is called the mean (average): μ = (1/n) ⋅ ∑ (i=1 to n) x_i.

The mean of a function over an interval [a, b] is given by μ = 1/(b−a) ⋅ ∫ₐᵇ f(x) dx.

When outliers bias the data, the median (the middle value for an odd number of datapoints, or the average of the two middle values for an even number) can be a better measure of the middle. The median of a function over [a, b] is its value at the midpoint of the interval, f((a+b)/2).

To find the value of the variable that maximises the probability function, the mode, i.e. the most frequently occurring value, is used. For a function f(x), the equation df/dx = 0 gives the mode (as long as there is one maximum and d²f/dx² is negative, i.e. the curve is concave down).

Expected value
The expected value of any probability distribution is its mean: E[X] = μ.

For PMFs: E[X] = ∑ (i=1 to n) x_i ⋅ p(x_i). For PDFs: E[X] = ∫ (−∞ to ∞) x ⋅ f(x) dx.

The sum of expectations is E[X_1 + ... + X_m] = ∑ (i=1 to m) E[X_i].

The kth moment of a distribution is ∑ (i=1 to n) x_i^k ⋅ p(x_i).

Variance
A fundamental measure of the spread of some data is the centred second moment of a distribution, which is called the variance: Var[X] = (1/n) ⋅ ∑ (i=1 to n) (x_i − μ)².

The variance formula can be simplified thus:
Var[X] = E[(X − μ)²]
= E[X² − 2μX + μ²]
= E[X²] − E[2μX] + E[μ²]
= E[X²] − 2μ⋅E[X] + μ²
= E[X²] − 2μ⋅μ + μ²
= E[X²] − μ² = E[X²] − E[X]²

For PMFs: Var[X] = ∑ (i=1 to n) (x_i − μ)² ⋅ p(x_i). For PDFs: Var[X] = ∫ (−∞ to ∞) (x − μ)² ⋅ f(x) dx.

The sum of variances is Var[X_1 + ... + X_m] = ∑ (i=1 to m) Var[X_i] (as long as the variables are independent).
Standard deviation
The unit of variance is the square of the unit of the measured values (e.g. m²). The square root of the variance, i.e. the standard deviation, yields the original unit (m):

σ = √Var[X] = √(E[X²] − E[X]²)

In a normal distribution, the standard deviation gives the proportion of the datapoints within a certain range:
68.2% of the data is between μ−σ and μ+σ;
95.4% is between μ−2σ and μ+2σ; and
99.7% is between μ−3σ and μ+3σ.
Standardisation is the process of making the mean of a dataset equal 0 and the standard deviation of the dataset equal 1:
Centring: x − μ; Scaling: (x − μ)/σ

Normalisation and denormalisation


Training and test data often has to be normalised, in which case the prediction must be denormalised:

X_normalised = (X − x̄) / S_X          Y_denormalised = Y ⋅ S_Y + ȳ

The sum of normal distributions


Given X_1 ~ N(μ_1, σ_1) and X_2 ~ N(μ_2, σ_2), what are the parameters of the distribution of X_1 + X_2?
As long as the variables are independent, the mean is μ = E[X_1] + E[X_2], while the standard deviation is σ = √(Var(X_1) + Var(X_2)).

In general, w_1 X_1 + ... + w_n X_n ~ N(w_1 μ_1 + ... + w_n μ_n, √(w_1² σ_1² + ... + w_n² σ_n²)).

Skewness and kurtosis

The standardised third moment of a distribution, E[((X − μ)/σ)³], measures the skewness (lack of symmetry). E.g. a lottery vs car insurance, where both μ and σ are the same: their difference can be detected by calculating the skewness.
The standardised fourth moment of a distribution, E[((X − μ)/σ)⁴], measures the kurtosis (the tails). Thick tails at the two ends of a distribution can be detected by calculating it.
Charts and plots
The value at which the distribution reaches n% of its total probability is called the n% quantile.
The value at which the distribution reaches (n × 25)% is called the nth quartile.
Therefore q_0.25 = Q1, q_0.5 = Q2 (the median), and q_0.75 = Q3.

In a box plot the following points are defined:


lower whisker: max(Q1 – 1.5 x IQR, Xlowest)
box, i.e. IQR: between Q1 and Q3
middle of box (median): Q2
upper whisker min(Q3 + 1.5 x IQR, Xhighest)
Values below the lower whisker and above the upper whisker are outliers and are not represented.

For kernel density estimation, the sample values are plotted along the X axis and a small normal curve is drawn around each. The heights of all these curves are then summed at each point x to give an estimate of the PDF curve.

A violin plot is a kernel density estimation superimposed with a box plot.

A QQ plot is one where the X axis represents standard normal quantiles whereas the Y axis
represents quantiles calculated from the standardised sample. If the sample quantiles are around the
y = x line, then the distribution is normal.
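A minimal plotting sketch of the box plot, violin plot and QQ plot described above, on an illustrative normal sample; it assumes Matplotlib and SciPy are installed:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(1)
    sample = rng.normal(loc=10.0, scale=2.0, size=500)

    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    axes[0].boxplot(sample)                            # whiskers at 1.5 * IQR by default
    axes[1].violinplot(sample)                         # kernel density estimate mirrored around the data
    stats.probplot(sample, dist="norm", plot=axes[2])  # QQ plot against the standard normal
    plt.tight_layout()
    plt.show()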

Joint probabilities
For discrete distributions, the PMF for 2 variables is P XY (x , y) = P (X=x , Y = y ); for independent
variables, this is equal to P X (x)⋅PY ( y ).
For continuous distributions, the PDF for 2 variables is f XY (x , y )=f (X =x ,Y = y); for independent
variables, this is equal to f X (x )⋅f Y ( y ).

For discrete distributions, joint probability tables are summarised by marginal probabilities along both axes:

            x_α            x_β            x_γ            x_δ            marginal probability P_Y(y)
y_Α         P(x_α, y_Α)    P(x_β, y_Α)    P(x_γ, y_Α)    P(x_δ, y_Α)    P(y_Α)
y_Β         P(x_α, y_Β)    P(x_β, y_Β)    P(x_γ, y_Β)    P(x_δ, y_Β)    P(y_Β)
y_Γ         P(x_α, y_Γ)    P(x_β, y_Γ)    P(x_γ, y_Γ)    P(x_δ, y_Γ)    P(y_Γ)
marginal probability P_X(x):   P(x_α)    P(x_β)    P(x_γ)    P(x_δ)

The sum of marginal probabilities along an axis is 1:


P(x_α) + P(x_β) + P(x_γ) + P(x_δ) = 1 and P(y_Α) + P(y_Β) + P(y_Γ) = 1.
However, a single column or row in the table does not add up to 1:
the sum P(x_β, y_Α) + P(x_β, y_Β) + P(x_β, y_Γ) = P(x_β) ≠ 1.
Normalisation, i.e. division by the corresponding marginal probability P(x_β), is needed.
Similarly, P(x_α, y_Β) + P(x_β, y_Β) + P(x_γ, y_Β) + P(x_δ, y_Β) = P(y_Β) ≠ 1.
Each cell in the table can be expressed as the product of a marginal and a conditional probability:

P(x_γ, y_Β) = P(x_γ) ⋅ P(y_Β ∣ X = x_γ) and P(x_γ, y_Β) = P(y_Β) ⋅ P(x_γ ∣ Y = y_Β)

The corresponding conditional probabilities are therefore:

P(y_Β ∣ X = x_γ) = P(x_γ, y_Β) / P(x_γ) and P(x_γ ∣ Y = y_Β) = P(x_γ, y_Β) / P(y_Β).

In general, P_Y∣X(y ∣ x) = P_X,Y(x, y) / P_X(x).

For continuous distributions, a conditional probability distribution can be obtained by taking a cross-section of the joint density landscape and normalising the curve of that cross-section:

f_Y∣X(y ∣ x) = f_X,Y(x, y) / f_X(x).

Covariance
The measure of the tendency of multiple variables to change together is the covariance. Formally, variance is a special case of covariance where X and Y are the same variable.

Cov[X, Y] = (1/n) ⋅ ∑ (i=1 to n) (x_i − μ_x)(y_i − μ_y), which can also be expressed as E[XY] − E[X]⋅E[Y].

When Cov[X, Y] > 0: positive correlation;
Cov[X, Y] < 0: negative correlation;
Cov[X, Y] ≈ 0: no correlation.

The covariance of a probability distribution can be written the same way only if the probabilities are all equal, i.e. p_XY(x, y) is constant. If the probabilities differ, the covariance is ∑ (i=1 to n) p_XY(x_i, y_i)(x_i − μ_x)(y_i − μ_y).
For multiple variables, the variances and covariances are organised in the covariance matrix.

With 2 variables:
Σ = [ Var[X]      Cov[X, Y]
      Cov[Y, X]   Var[Y]    ]

With k variables:
Σ = [ Var[X_1]        Cov[X_1, X_2]   ⋯   Cov[X_1, X_k]
      Cov[X_2, X_1]   Var[X_2]        ⋯   Cov[X_2, X_k]
      ⋮               ⋮                ⋱   ⋮
      Cov[X_k, X_1]   Cov[X_k, X_2]   ⋯   Var[X_k]       ]
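A short sketch of estimating such a covariance matrix (and its standardised counterpart, see the next section) from data with NumPy; the generated data are illustrative only:

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(0.0, 1.0, size=1000)
    y = 0.8 * x + rng.normal(0.0, 0.5, size=1000)   # y correlated with x

    print(np.cov(x, y))        # 2x2 matrix [[Var[X], Cov[X, Y]], [Cov[Y, X], Var[Y]]]
    print(np.corrcoef(x, y))   # the corresponding correlation coefficients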

Correlation coefficient
For comparing correlations, covariances are not suitable, as they may take any value. They have to be standardised to obtain correlation coefficients between −1 and 1:

R_X,Y = Cov[X, Y] / (σ_x ⋅ σ_y)

Correlation coefficients can be organised in matrices:

R = [ R_X1,X1 = 1   R_X1,X2       ⋯   R_X1,Xk
      R_X2,X1       R_X2,X2 = 1   ⋯   R_X2,Xk
      ⋮             ⋮              ⋱   ⋮
      R_Xk,X1       R_Xk,X2       ⋯   R_Xk,Xk = 1 ]
Multivariate normal distributions
For a single variable, the normal distribution has the following PDF:

f_X(x) = 1/(σ√(2π)) ⋅ e^(−½((x−μ)/σ)²)

For two independent variables, the joint PDF is the product of the two single-variable PDFs:

f_X1X2(x_1, x_2) = 1/(σ_1 σ_2 ⋅ 2π) ⋅ e^(−½(((x_1−μ_1)/σ_1)² + ((x_2−μ_2)/σ_2)²))

However, the exponent can be rewritten in matrix form:

((x_1−μ_1)/σ_1)² + ((x_2−μ_2)/σ_2)²
= ([x_1 x_2] − [μ_1 μ_2]) ⋅ [ 1/σ_1²   0
                              0        1/σ_2² ] ⋅ ([x_1; x_2] − [μ_1; μ_2])
= ([x_1; x_2] − [μ_1; μ_2])ᵀ ⋅ [ σ_1²   0
                                 0      σ_2² ]⁻¹ ⋅ ([x_1; x_2] − [μ_1; μ_2])

Notice that the middle term of the last row is just the inverse of the covariance matrix. Therefore, the PDF can be rewritten as

f_X1X2(x_1, x_2) = 1/(2π ⋅ √(det Σ)) ⋅ e^(−½ ([x_1; x_2] − Μ)ᵀ ⋅ Σ⁻¹ ⋅ ([x_1; x_2] − Μ))

The formula is valid for dependent variables as well; only the covariance matrix will change.

For independent variables:
Σ = [ Var[X_1]   0          ⋯   0
      0          Var[X_2]   ⋯   0
      ⋮          ⋮           ⋱   ⋮
      0          0          ⋯   Var[X_k] ]

whereas for dependent variables (with 2 variables):
Σ = [ σ_1²            Cov[X_1, X_2]
      Cov[X_2, X_1]   σ_2²          ]

The bell-shaped landscape of dependent variables is also more elongated (e.g. along the line x_1 = x_2) compared to the independent landscape.

We can generalise further to k variables:

f_X(x) = 1/(√|Σ| ⋅ (2π)^(k/2)) ⋅ e^(−½ (x − μ)ᵀ ⋅ Σ⁻¹ ⋅ (x − μ))

where x – the vector [x_1, x_2, ..., x_k];
μ – the vector [μ_1, μ_2, ..., μ_k];
(x − μ)ᵀ ⋅ Σ⁻¹ ⋅ (x − μ) – a generalised dot product representing the standardised sum of squares;
|Σ| – the determinant of the covariance matrix (the spread of the landscape); and
Σ⁻¹ – division by the covariance matrix for standardisation.
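A minimal sketch of evaluating and sampling this multivariate normal PDF with SciPy, using an illustrative mean vector and covariance matrix:

    import numpy as np
    from scipy.stats import multivariate_normal

    mu = np.array([0.0, 1.0])
    sigma = np.array([[2.0, 0.8],
                      [0.8, 1.0]])            # covariance matrix of dependent variables

    rv = multivariate_normal(mean=mu, cov=sigma)
    print(rv.pdf([0.5, 1.5]))                 # density at the point (0.5, 1.5)
    print(rv.rvs(size=3, random_state=0))     # a few samples from the distribution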
Population and sample

             Population                              Sample
size         N                                       n
mean         μ                                       x̄
proportion   p = x/N                                 p̂ = x/n
variance     σ² = (1/N) ⋅ ∑ (i=1 to N) (x_i − μ)²     s² = (1/(n−1)) ⋅ ∑ (i=1 to n) (x_i − x̄)²

Central Limit Theorem


Independently of the distribution of a population, the average of a sufficiently large number of independent, random samples approximates the normal distribution.
The mean of the sample averages is the population mean: μ_x̄ = μ; and
the variance of the sample averages is the population variance divided by the sample size: σ²_x̄ = σ²/n.
If the sample size is sufficiently large, then the standardised mean of the sample means has a standard normal distribution:

X̄ ~ N(μ_X, σ²_X / n)   or   √n ⋅ (X̄ − μ) / σ_X ~ N(0, 1).
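A small simulation sketch of the theorem, averaging samples from a decidedly non-normal (exponential) population; the sizes are illustrative:

    import numpy as np

    rng = np.random.default_rng(3)
    n, repeats = 50, 10_000

    # Each row is one sample of size n from an exponential population with mean 1.
    samples = rng.exponential(scale=1.0, size=(repeats, n))
    sample_means = samples.mean(axis=1)

    print(sample_means.mean())   # close to the population mean, 1.0
    print(sample_means.var())    # close to sigma^2 / n = 1 / 50 = 0.02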

Point estimation – Maximum likelihood estimation (MLE)

Pick a model (e.g. the hypothetical bias of a coin) that fits the data.

Bernoulli distributions:
Maximise the likelihood of seeing a pattern of k successes in n observations, X ~ Bernoulli(p), e.g. 7 heads (k) in 10 tosses (n):

L(p; k, n) = p^k ⋅ (1−p)^(n−k) = p⁷ ⋅ (1−p)³

Take the log: log-likelihood = k ⋅ log p + (n−k) ⋅ log(1−p) = 7 ⋅ log p + 3 ⋅ log(1−p)

Then the derivative: dL/dp = k/p − (n−k)/(1−p) = (k − pk − pn + pk)/(p − p²) = (k − pn)/(p(1−p)), which is 0 when pn = k, i.e.

p̂ = k/n = x̄ = 0.7
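A quick numerical check of this result: maximising the log-likelihood over a grid of candidate biases recovers p̂ = k/n (a sketch, assuming NumPy):

    import numpy as np

    k, n = 7, 10                                  # 7 heads in 10 tosses
    p_grid = np.linspace(0.001, 0.999, 999)
    log_likelihood = k * np.log(p_grid) + (n - k) * np.log(1 - p_grid)
    print(p_grid[np.argmax(log_likelihood)])      # ≈ 0.7 = k / n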
Normal distributions:
The most likely among candidate models (~ N) for a sample is the one that best fits the μ and the σ of the sample. For maximum likelihood estimation, the σ² is calculated with a factor of 1/N, while the sample variance s² is calculated with 1/(n−1).

Linear regression:
To find the ideal line that best fits a sample of points, for each point (x_i, y_i) consider a normal curve centred on the line, where μ = m ⋅ x_i.
The task is to find the line that maximises the PDF values of all these normal curves. This is a case of joint probability, i.e. a product of probabilities:

∏ (i=1 to n) 1/√(2π) ⋅ e^(−½(x_i − μ)²)

Since 1/√(2π) is just a constant, it is sufficient to find the maximum of ∏ (i=1 to n) e^(−½(x_i − μ)²), which is the maximisation of −½ ⋅ ∑ (i=1 to n) (x_i − μ)², or the minimisation of ∑ (i=1 to n) (x_i − μ)² (i.e. the least squares error).
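A minimal sketch of this least-squares view for a line through the origin, y = m·x, on illustrative data; minimising the squared error has the closed-form slope shown below:

    import numpy as np

    rng = np.random.default_rng(4)
    x = np.linspace(0.0, 10.0, 50)
    y = 2.5 * x + rng.normal(0.0, 1.0, size=x.size)   # true slope 2.5 plus noise

    m_hat = np.sum(x * y) / np.sum(x * x)             # minimises sum((y - m*x)^2)
    print(m_hat)                                      # close to 2.5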

Regularisation
When candidate models differ in complexity (i.e. in the degree of the formula), regularisation is necessary to prevent over-fitting the model to the sample or training data. This is done by applying to each model a penalty, which corresponds to the likelihood of a complicated model, i.e. P(model) in the Bayesian equation P(observation ∩ model) = P(observation ∣ model) ⋅ P(model).
The likelihood of a model, P(model), is given by the joint probability of the coefficients w of the model's expression w_m x^m + w_(m−1) x^(m−1) + ... + w_1 x + w_0 (except for the constant) under a standard normal distribution (N(0, 1)) of possible coefficient values. This joint probability can be expressed as

∏ (j=1 to m) 1/√(2π) ⋅ e^(−½(w_j − μ)²)

Since μ is 0, and 1/√(2π) as well as −½ are just constants, the expression to be minimised becomes ∑ (j=1 to m) w_j². (When taking the logarithm of the equation, the multiplication becomes addition; therefore the terms are added, not multiplied together.)
To adjust the weight of the total penalty, a constant λ called the regularisation parameter is applied.
The total score of each model is therefore ∑ (i=1 to n) (x_i − μ)² + λ ⋅ ∑ (j=1 to m) w_j².
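A small sketch of this penalised score, comparing two hypothetical fits; the data, predictions and coefficients are made up for illustration, and λ = 1 is an arbitrary choice:

    import numpy as np

    def penalised_score(y_true, y_pred, weights, lam=1.0):
        # least squares error plus lambda times the sum of squared coefficients
        return np.sum((y_true - y_pred) ** 2) + lam * np.sum(np.asarray(weights) ** 2)

    y = np.array([1.0, 2.1, 2.9, 4.2])
    simple_pred = np.array([1.0, 2.0, 3.0, 4.0])    # e.g. y = x, coefficients [1.0]
    complex_pred = y.copy()                         # a perfect but more complex fit
    print(penalised_score(y, simple_pred, [1.0]))              # small error + small penalty
    print(penalised_score(y, complex_pred, [3.0, -2.5, 0.7]))  # zero error + large penalty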
Bayesian inference – Maximum a Posteriori estimation (MAP)
For any parameter Θ, i.e. p if X ~ Bernoulli(p),
(μ, σ) if X ~ N(μ, σ),
b if X ~ Uniform(0, b),
the likelihood of the parameter, given a sample, can be approximated by iterations of estimations involving Bayes' theorem

P(A ∣ B) = P(B ∣ A) ⋅ P(A) / P(B)

where A = the parameter to be estimated and B = the observed samples.
The probability formulas for Bayesian parameter estimation are:

for a discrete parameter and discrete data: p_Θ∣X=x(θ) = p_X∣Θ=θ(x) ⋅ p_Θ(θ) / p_X(x);
for a continuous parameter and discrete data: f_Θ∣X=x(θ) = p_X∣Θ=θ(x) ⋅ f_Θ(θ) / p_X(x);
for a discrete parameter and continuous data: p_Θ∣X=x(θ) = f_X∣Θ=θ(x) ⋅ p_Θ(θ) / f_X(x); and
for a continuous parameter and continuous data: f_Θ∣X=x(θ) = f_X∣Θ=θ(x) ⋅ f_Θ(θ) / f_X(x)

where X – the sample vector;
Θ – the model's parameter(s) to be estimated;
p_Θ∣X=x(θ) – the posterior estimate, i.e. how likely the model is, given the sample;
p_X∣Θ=θ(x) – the likelihood, i.e. how likely the sample is, given the model;
p_Θ(θ) – the prior estimate, i.e. how likely the model is; and
p_X(x) – the normalising constant, i.e. how likely the sample is.
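A minimal sketch of Bayesian estimation for a coin's bias p (a continuous parameter with discrete data), using a grid over p and a flat prior; the 7-heads-in-10-tosses data reuses the MLE example above, so the MAP estimate coincides with the MLE here:

    import numpy as np

    k, n = 7, 10
    p_grid = np.linspace(0.001, 0.999, 999)

    prior = np.ones_like(p_grid)                       # flat prior p_Theta(theta)
    likelihood = p_grid**k * (1 - p_grid)**(n - k)     # p_{X|Theta=theta}(x)
    posterior = likelihood * prior
    posterior /= posterior.sum()                       # normalising constant p_X(x)

    print(p_grid[np.argmax(posterior)])                # MAP estimate, ≈ 0.7 here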
Interval estimation
It follows from the central limit theorem that sample means follow a normal distribution. A confidence interval is a range of estimates for an unknown parameter that theoretically contains the true value of said parameter with a certain confidence level: lower estimate < μ < upper estimate.
The confidence level indicates the proportion of confidence intervals, computed from repeated similar samples, that would contain the true parameter. It is defined by first picking a significance level α (commonly 0.05) and calculating 1 − α (0.95 or 95%), i.e. the percentage of the area under the curve between the two limits:

x̄ − z_α/2 ⋅ σ/√n  and  x̄ + z_α/2 ⋅ σ/√n

where z_α/2 is the corresponding standard normal critical value (1.96 for α = 0.05), and σ/√n is called the standard error.
Each half of the confidence interval below and above the sample mean x̄ is called the margin of error (MOE). All else being equal, increasing the sample size n decreases the standard error σ/√n and thus shrinks the confidence interval, while increasing the confidence level widens it.

Calculation of a confidence interval (a sketch in Python follows at the end of this section):
1. Find the sample mean x̄.
2. Pick a significance level α, e.g. 0.05 (and thus a confidence level 1 − α, e.g. 0.95).
3. Given α, look up the critical value z_α/2, e.g. 1.96.
4. Multiply it by the standard error σ/√n.
5. Add/subtract the result z_α/2 ⋅ σ/√n to/from the sample mean x̄.

This is valid if the sample is a simple random sample and n > 30 or the population is approximately normal.

Calculation of the sample size needed for a given margin of error: n ⩾ (z_α/2 ⋅ σ / MOE)².
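A minimal sketch of these steps for an illustrative sample, using the z-based interval with an assumed known population σ:

    import numpy as np
    from scipy import stats

    sample = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7])
    sigma = 0.3                                  # assumed known population standard deviation
    alpha = 0.05

    x_bar = sample.mean()
    z = stats.norm.ppf(1 - alpha / 2)            # critical value, ≈ 1.96
    moe = z * sigma / np.sqrt(sample.size)       # margin of error
    print(x_bar - moe, x_bar + moe)              # the 95% confidence interval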

Student’s t-distribution
When σ is unknown, σ/√n is replaced by s/√n. The standardised variable then follows a Student’s t-distribution (~ t_ν), and a t-score is used instead of a z-score: x̄ ± t_α/2 ⋅ s/√n. The t-score depends not only on the significance level α but also on the degrees of freedom ν = n − 1. With increasing ν, the distribution approximates the normal distribution. With ν being its only parameter, the μ of a t-distribution is always 0.

Confidence interval for proportions


The sample proportion is the number of counts per sample size: p̂ = x/n.

The confidence interval is then p̂ ± z_α/2 ⋅ SE, but the standard error is different: SE = √(p̂(1−p̂)/n).
Hypothesis testing
When contrasting two mutually exclusive binary hypotheses, the null hypothesis H0 should denote the safer option and the alternative hypothesis H1 the less safe one. Given enough evidence, H0 can be rejected in favour of H1, but failing to reject H0 does not mean that H0 is true, only that there is a lack of evidence against it. When making such decisions, two types of error can happen:
Type I error: false positive, when H0 is rejected erroneously; and
Type II error: false negative, when H0 is retained (H1 is rejected) erroneously.
Type II is considered much less detrimental than type I. The maximum probability of a type I error that is tolerated is called the significance level α (typically 0.05):
α = 0: no type I error is tolerated, H0 is never rejected;
α = 1: a type I error is always tolerated, H0 is always rejected.
The test statistic is a function whose value shows how closely the observed statistics match the distribution expected under H0 of a statistical test. When testing, for instance, the difference between a population mean μ and a baseline μ_0, H0 is that there is no difference between μ and μ_0, whereas H1 is the hypothesis that the observed statistic indicates a population difference. A right-tailed test is one where H1 suggests an increase of the statistic; a left-tailed test is one where H1 suggests a decrease; a two-tailed test is one where H1 suggests simply that the statistic is different.
In case of a right-tailed test:
Type I error: conclude μ > μ_0 and accept H1 falsely;
Type II error: conclude μ is not > μ_0 and fail to reject H0 falsely.
In case of a left-tailed test:
Type I error: conclude μ < μ_0 and accept H1 falsely;
Type II error: conclude μ is not < μ_0 and fail to reject H0 falsely.
In case of a two-tailed test:
Type I error: conclude μ ≠ μ_0 and accept H1 falsely;
Type II error: conclude μ = μ_0 and fail to reject H0 falsely.

Under H0, the distribution is ~ N(μ_0, σ²) (when σ is known). The question is: how likely is the sample if H0 is true? The goal is to keep α, i.e. the chance of a type I error, below 0.05. This is the same as finding a z-score of this distribution that leaves a p-value (i.e. the remaining area under the curve) of less than 0.05. The p-value is the probability, assuming H0 is true, that the test statistic X takes a value at least as extreme as the observed value x.
If p < α: reject H0 and accept H1; otherwise do not reject H0.

Standardisation allows for the use of the Z-statistic: first calculate z = (x̄ − μ_0) / (σ/√n).

In case of a right-tailed test: p = P(T(X) > t ∣ H0); if z > z_α → reject H0.
In case of a left-tailed test: p = P(T(X) < t ∣ H0); if z < −z_α → reject H0.
In case of a two-tailed test: p = P(|T(X) − μ_0| > |t − μ_0| ∣ H0); if z < −z_α/2 or z > z_α/2 → reject H0.
where T(X) is the test statistic and t is its observed value.
If σ is not known, S = √( (1/(n−1)) ⋅ ∑ (i=1 to n) (X_i − X̄)² ) must be used instead and the T-statistic has to be calculated:

t = (x̄ − μ_0) / (S / √n) ~ t_(n−1).

For calculating proportions, z = (p̂ − p_0) / √(p_0(1−p_0)/n) = (p̂ − p_0)/√(p_0(1−p_0)) ⋅ √n ~ N(0, 1).

This holds only if N ⩾ 20⋅n (for sample independence);
all individuals can be categorised unambiguously (success or failure); and
n⋅p_0 > 10 and n⋅(1−p_0) > 10, so H0 can be tested with the normal distribution.

The critical value is the most extreme value of the test statistic that would still lead to rejecting H0; it is the value at which the tail probability equals the significance level α.
The probability β = P(do not reject H0 ∣ H0 is false) of a type II error can be calculated by finding the area of the non-rejection region under the probability distribution that assumes H1 is true, ~ N(μ_H1, σ²).

The power of a test is the complement of β: 1 − β = P(reject H0 ∣ H0 is false).

As α = P(type I error) increases, the power of the test increases and the probability of a type II error decreases. Both types of error can be decreased by increasing the sample size.

Hypothesis testing procedure
1. Form the hypotheses: define H0, H1 and the significance level α.
2. Design the test: choose a test statistic (e.g. the mean).
3. Check whether the population standard deviation can (z-test) or cannot (t-test) be used.

p-value method (a sketch in Python follows after these steps):
4. Compute the observed statistic t or z from the formulae above.
5. Look up the p-value in a z-table, t-table or with a z-test / t-test calculator.
6. Reach a conclusion: if the p-value < α, reject H0.

Critical value method:
4. Compute the critical range with a z-table, t-table or a critical value calculator.
5. Compute the observed statistic t or z from the formulae above.
6. Reach a conclusion: if the observed statistic is in the critical range, reject H0.
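A minimal sketch of the p-value method for a one-sample, two-tailed t-test against a baseline mean μ0, on illustrative data (SciPy's t-test is used in place of table look-up):

    import numpy as np
    from scipy import stats

    sample = np.array([5.3, 5.1, 5.6, 5.4, 5.8, 5.2, 5.5, 5.7])
    mu0, alpha = 5.0, 0.05

    t_stat, p_value = stats.ttest_1samp(sample, popmean=mu0)   # two-tailed by default
    print(t_stat, p_value)
    print("Reject H0" if p_value < alpha else "Fail to reject H0")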
Two-sample tests
For a two-sample t-test of means, consider X̄ = (1/n_x) ⋅ ∑ (i=1 to n_x) X_i and Ȳ = (1/n_y) ⋅ ∑ (i=1 to n_y) Y_i.

Then the test statistic is X̄ − Ȳ ~ N(μ_x − μ_y, √(σ_x²/n_x + σ_y²/n_y)), which can be standardised to

z = ((X̄ − Ȳ) − (μ_x − μ_y)) / √(σ_x²/n_x + σ_y²/n_y) ~ N(0, 1)

or, when σ is unknown,

T = ((X̄ − Ȳ) − (μ_x − μ_y)) / √(s_x²/n_x + s_y²/n_y) ~ t_ν.

For two samples, the degrees of freedom is given by

ν = (s_x²/n_x + s_y²/n_y)² / ( (s_x²/n_x)²/(n_x − 1) + (s_y²/n_y)²/(n_y − 1) ).

Finally, H0 states that μ_x = μ_y.
For a right-tailed test, H1: μ_x > μ_y; for a left-tailed test, H1: μ_x < μ_y; and for a two-tailed test, H1: μ_x ≠ μ_y.

To test proportions, with the pooled proportion p̂ = (X + Y)/(n_x + n_y),

z = (p̂_x − p̂_y − 0) / √( p̂ (1 − p̂) (1/n_x + 1/n_y) ) ~ N(0, 1), and H0: p_x = p_y.

For a right-tailed test, H1: p_x > p_y; for a left-tailed test, H1: p_x < p_y; and for a two-tailed test, H1: p_x ≠ p_y.

A paired t-test is for non-independent data, e.g. the same individuals in different circumstances. The sample mean of the differences D = X − Y is D̄ = (1/n) ⋅ ∑ (i=1 to n) D_i ~ N(μ_D, σ_D) if X ~ N and Y ~ N.

The mean can be standardised as (D̄ − μ_D) / (σ_D / √n); but when σ is unknown, S_D = √( ∑ (i=1 to n) (D_i − D̄)² / (n−1) ) and a t-distribution with n−1 degrees of freedom are used. The p-value, as usual, is the probability that,
for a right-tailed test: P(T > t ∣ μ_D = 0);
for a left-tailed test: P(T < t ∣ μ_D = 0); and
for a two-tailed test: P(|T − μ_D| > |t − μ_D| ∣ μ_D = 0).

A/B Testing
A framework for two-sample tests:
• Propose variations A and B.
• Randomly split the sample into an A group and a B group.
• Measure outcomes and determine a metric to use.
• Run a statistical analysis (e.g. a t-test or z-test; see the sketch below) to make a decision.
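A minimal sketch of that last step, assuming the chosen metric is roughly normal per group; Welch's two-sample t-test is used and the group data are simulated for illustration:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    group_a = rng.normal(loc=0.30, scale=0.10, size=200)   # metric for the A group (simulated)
    group_b = rng.normal(loc=0.33, scale=0.10, size=200)   # metric for the B group (simulated)

    t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)   # Welch's t-test
    print(t_stat, p_value)
    print("Reject H0 (means differ)" if p_value < 0.05 else "Fail to reject H0")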
