Probability - Statistics - Class Notes
Sum of probabilities (A or B)
Disjoint events: P(A∪B) = P(A) + P(B) (the probabilities simply add up: Σ pᵢ).
Overlapping events: P(A∪B) = P(A) + P(B) − P(A∩B); for independent events the subtracted term is the product P(A)·P(B) (Σ pᵢ − Π pᵢ).
Continuous distributions
The distribution of the total probability 1 over an interval between a and b of possible values is
given by the probability density function (PDF): fX(x);
where x is defined for all numbers x ∈ ℝ ;
the value is never negative: 0 ⩽ f_X(x); and
the total area under the curve is 1: ∫ₐᵇ f_X(t) dt = 1.
The cumulative distribution function (CDF), F_X(x) = Σᵢ₌ₐˣ p_X(i) for a discrete variable, never decreases: it starts at F_X(0) = 0 and reaches F_X(n) = 1.
Binomial coefficient
The number of different arrangements in which k identical events (successes) can be arranged
out of n events is the binomial coefficient:
    C(n, k) = n! / (k!·(n−k)!)
where n! is the total number of arrangements of all n events,
k! is the number of arrangements of the successes, and
(n−k)! is the number of arrangements of the failures.
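As a quick numerical check, here is a minimal Python sketch (the values n = 10 and k = 7 are illustrative) comparing the factorial formula with the standard library's math.comb:
    import math

    n, k = 10, 7
    # C(n, k) = n! / (k! * (n - k)!)
    manual = math.factorial(n) // (math.factorial(k) * math.factorial(n - k))
    print(manual, math.comb(n, k))   # both print 120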
Binomial and Bernoulli distributions
The distribution of the total probability 1 among the possible numbers x of successes in n events is given by the PMF of the binomial
distribution:
    p_X(x) = C(n, x) · pˣ · (1−p)ⁿ⁻ˣ
In such circumstances it is said that the random variable follows a Binomial distribution with
parameters p and n: X ~ Binomial(n, p).
A binomial distribution with n = 1 where the random variable takes the value 1 with probability p
and takes the value 0 with probability q = 1 - p is called a Bernoulli distribution with parameter p.
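A short sketch of evaluating this PMF (the parameters n = 10, p = 0.5, x = 7 are illustrative); the scipy line is an optional cross-check and assumes scipy is installed:
    import math
    from scipy.stats import binom   # optional cross-check

    n, p, x = 10, 0.5, 7
    pmf_manual = math.comb(n, x) * p**x * (1 - p)**(n - x)
    print(pmf_manual)            # 0.1171875
    print(binom.pmf(x, n, p))    # same value from scipy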
Uniform distribution
When each event (or each interval of the same length) has equal probability, the variable follows a
Uniform distribution: a discrete one over n equally likely outcomes, or a continuous one with parameters a and b (the ends of the interval).
Discrete PMF:      p_X(x) = 1/n
Continuous PDF:    f_X(x) = 1/(b−a) for a < x < b;   f_X(x) = 0 for x ∉ (a, b)
Continuous CDF:    F_X(x) = 0 for x < a;   F_X(x) = (x−a)/(b−a) for a ⩽ x < b;   F_X(x) = 1 for x ⩾ b
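A minimal sketch of these formulas in Python (the interval (0, 4) and the evaluation points are illustrative):
    def uniform_pdf(x, a, b):
        # constant density 1/(b - a) inside (a, b), zero outside
        return 1.0 / (b - a) if a < x < b else 0.0

    def uniform_cdf(x, a, b):
        # accumulates linearly from 0 at a to 1 at b
        if x < a:
            return 0.0
        return (x - a) / (b - a) if x < b else 1.0

    print(uniform_pdf(2.0, 0.0, 4.0))   # 0.25
    print(uniform_cdf(3.0, 0.0, 4.0))   # 0.75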
Expected value
The expected value for any probability distribution is the mean: E [X ]= μ .
For PMFs: E[X] = Σᵢ₌₁ⁿ xᵢ·p(xᵢ).    For PDFs: E[X] = ∫ x·f(x) dx, integrating from −∞ to ∞.
The sum of expectations is E[X₁ + ... + Xₘ] = Σᵢ₌₁ᵐ E[Xᵢ].
The kth moment of a distribution is expressed by Σᵢ₌₁ⁿ xᵢᵏ·p(xᵢ).
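A tiny worked example of the PMF formula, using a fair six-sided die as an illustrative distribution:
    # Expected value of a discrete distribution: sum of value times probability
    values = [1, 2, 3, 4, 5, 6]
    probs  = [1 / 6] * 6

    expectation = sum(x * p for x, p in zip(values, probs))
    print(expectation)   # 3.5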
Variance
A fundamental measure of the spread of some data is the centred second moment of a distribution,
which is called the variance: Var[X] = (1/n)·Σᵢ₌₁ⁿ (xᵢ − μ)².
The sum of variances is Var[X₁ + ... + Xₘ] = Σᵢ₌₁ᵐ Var[Xᵢ] (as long as the variables are independent).
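A short sketch computing the variance of a small, illustrative data set in the two equivalent ways (the centred second moment, and E[X²] − E[X]² from the next section):
    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # illustrative values
    n = len(data)
    mu = sum(data) / n

    var_centred = sum((x - mu) ** 2 for x in data) / n       # (1/n) * sum of squared deviations
    var_moments = sum(x ** 2 for x in data) / n - mu ** 2    # E[X^2] - E[X]^2
    print(var_centred, var_moments)                          # both 4.0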
Standard deviation
The unit of variance is the square of the unit of the measured values (e.g. m²). The square root of the
variance, i.e. the standard deviation, however, yields the original unit (m):
σ = √Var[X] = √(E[X²] − E[X]²).
In a normal distribution, the standard deviation gives the proportion of the data points within a certain
range around the mean:
68.2% of the data is between −σ and σ;
95.4% is between −2σ and 2σ; and
99.7% is between −3σ and 3σ.
Standardisation is the process of making the mean of a dataset equal 0 and the standard deviation
of the dataset equal 1:
Centring: subtract the mean (x − μ)    Scaling: divide by the standard deviation ((x − μ) / σ)
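A minimal standardisation sketch using the Python standard library (the data values are illustrative):
    import statistics

    data = [12.0, 15.0, 11.0, 18.0, 14.0]
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)                    # population standard deviation

    standardised = [(x - mu) / sigma for x in data]
    print(round(statistics.mean(standardised), 10))    # 0.0
    print(round(statistics.pstdev(standardised), 10))  # 1.0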
For kernel density estimation, the sample values are plotted along the X axis and a small normal curve is
drawn around each of them. The heights of these curves at each point x are then summed up to give
an estimate of the PDF curve.
A QQ plot is one where the X axis represents standard normal quantiles whereas the Y axis
represents quantiles calculated from the standardised sample. If the sample quantiles lie close to the
y = x line, then the sample is approximately normally distributed.
Joint probabilities
For discrete distributions, the PMF for 2 variables is P XY (x , y) = P (X=x , Y = y ); for independent
variables, this is equal to P X (x)⋅PY ( y ).
For continuous distributions, the PDF for 2 variables is f XY (x , y )=f (X =x ,Y = y); for independent
variables, this is equal to f X (x )⋅f Y ( y ).
For discrete distributions, joint probability tables are summarised by marginal probabilities along
both axes:
            xα           xβ           xγ           xδ           marginal probability P_Y(y)
yΑ          P(xα, yΑ)    P(xβ, yΑ)    P(xγ, yΑ)    P(xδ, yΑ)    P(yΑ)
yΒ          P(xα, yΒ)    P(xβ, yΒ)    P(xγ, yΒ)    P(xδ, yΒ)    P(yΒ)
yΓ          P(xα, yΓ)    P(xβ, yΓ)    P(xγ, yΓ)    P(xδ, yΓ)    P(yΓ)
marginal    P(xα)        P(xβ)        P(xγ)        P(xδ)
probability P_X(x)
(row sums give the marginal probabilities of Y; column sums give the marginal probabilities of X)
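A small numpy sketch of the same idea with illustrative numbers; row sums give P_Y(y) and column sums give P_X(x):
    import numpy as np

    # Joint probability table: rows are y values, columns are x values (illustrative numbers)
    P = np.array([[0.10, 0.05, 0.05, 0.10],
                  [0.05, 0.10, 0.10, 0.05],
                  [0.10, 0.10, 0.10, 0.10]])

    marginal_y = P.sum(axis=1)   # P_Y(y), one entry per row
    marginal_x = P.sum(axis=0)   # P_X(x), one entry per column
    print(marginal_y, marginal_x, P.sum())   # the whole table sums to 1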
Covariance
The measure of the tendency of multiple variables to change together is the covariance. Formally,
variance is a special case of covariance where X and Y are the same.
Cov[X, Y] = (1/n)·Σᵢ₌₁ⁿ (xᵢ − μ_x)(yᵢ − μ_y), which can also be expressed by E[XY] − E[X]·E[Y].
when Cov[X, Y] > 0   positive correlation
     Cov[X, Y] < 0   negative correlation
     Cov[X, Y] ≈ 0   no correlation
The covariance of probabilities can be written the same way only if the probabilities are the same,
i.e. p_XY(x, y) is constant. If the probabilities are different, then the covariance is
Σᵢ₌₁ⁿ p_XY(xᵢ, yᵢ)·(xᵢ − μ_x)(yᵢ − μ_y).
For multiple variables, the variances and covariances are organised in the covariance matrix:
with k variables:
    [ Var[X₁]       Cov[X₁, X₂]   ⋯   Cov[X₁, Xₖ] ]
    [ Cov[X₂, X₁]   Var[X₂]       ⋯   Cov[X₂, Xₖ] ]
    [ ⋮             ⋮             ⋱   ⋮           ]
    [ Cov[Xₖ, X₁]   Cov[Xₖ, X₂]   ⋯   Var[Xₖ]     ]
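A numpy sketch of building such a matrix from data (the sample values are illustrative); bias=True matches the 1/n convention used above, while numpy's default divides by n − 1:
    import numpy as np

    # Each row is one observation, each column one variable
    data = np.array([[2.1,  8.0],
                     [2.5, 10.1],
                     [3.6, 12.3],
                     [4.0, 14.2]])

    cov_matrix = np.cov(data, rowvar=False, bias=True)   # variances on the diagonal
    print(cov_matrix)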
Correlation coefficient
For comparing correlations, covariances are not suitable as they may take any value. They
have to be standardised to obtain correlation coefficients, which lie between −1 and 1:
R_X,Y = Cov[X, Y] / (σ_x·σ_y)
Correlation coefficients can be organised in matrices:
    [ R_X₁,X₁ = 1   R_X₁,X₂       ⋯   R_X₁,Xₖ     ]
    [ R_X₂,X₁       R_X₂,X₂ = 1   ⋯   R_X₂,Xₖ     ]
    [ ⋮             ⋮             ⋱   ⋮           ]
    [ R_Xₖ,X₁       R_Xₖ,X₂       ⋯   R_Xₖ,Xₖ = 1 ]
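The corresponding numpy sketch (illustrative data, same row-per-observation layout as above):
    import numpy as np

    data = np.array([[2.1,  8.0, 1.0],
                     [2.5, 10.1, 0.4],
                     [3.6, 12.3, 0.9],
                     [4.0, 14.2, 0.2]])

    corr = np.corrcoef(data, rowvar=False)   # diagonal is 1, all entries between -1 and 1
    print(corr)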
Multivariate normal distributions
For a single variable, the normal distribution has the following PDF:
    f_X(x) = 1/(σ·√(2π)) · e^(−(1/2)·((x − μ)/σ)²)
For two independent variables, we get the product of the two univariate PDFs:
    f(x₁, x₂) = 1/(2π·σ₁·σ₂) · e^(−(1/2)·(((x₁ − μ₁)/σ₁)² + ((x₂ − μ₂)/σ₂)²)).
However, part of the exponent can be rewritten in matrix form. Writing x = [x₁ x₂]ᵀ and μ = [μ₁ μ₂]ᵀ:
    ((x₁ − μ₁)/σ₁)² + ((x₂ − μ₂)/σ₂)²
        = (x − μ)ᵀ · [ 1/σ₁²   0     ] · (x − μ)
                     [ 0       1/σ₂² ]
        = (x − μ)ᵀ · [ σ₁²   0   ]⁻¹ · (x − μ)
                     [ 0     σ₂² ]
Notice that the middle term of the last row is just the inverse of the covariance matrix.
Therefore, the PDF can be rewritten as
    f_X₁X₂(x₁, x₂) = 1 / (2π·√(det Σ)) · e^(−(1/2)·(x − M)ᵀ·Σ⁻¹·(x − M)),
where x = [x₁ x₂]ᵀ and M = [μ₁ μ₂]ᵀ.
The formula is valid for dependent variables as well, only the covariance matrix will change.
For independent variables the covariance matrix is diagonal:
    Σ = [ Var[X₁]   0          ⋯   0        ]
        [ 0         Var[X₂]    ⋯   0        ]
        [ ⋮         ⋮          ⋱   ⋮        ]
        [ 0         0          ⋯   Var[Xₖ]  ]
whereas the formula generalises to k (possibly dependent) variables as
    f_X(x) = 1 / ( |Σ|^(1/2) · (2π)^(k/2) ) · e^(−(1/2)·(x − μ)ᵀ·Σ⁻¹·(x − μ)).
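A short numpy/scipy sketch evaluating the bivariate formula and cross-checking it against scipy.stats.multivariate_normal (the mean, covariance and evaluation point are illustrative):
    import numpy as np
    from scipy.stats import multivariate_normal

    mu = np.array([1.0, 2.0])
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
    x = np.array([1.5, 1.5])

    k = len(mu)
    diff = x - mu
    exponent = -0.5 * diff @ np.linalg.inv(Sigma) @ diff
    pdf_manual = np.exp(exponent) / np.sqrt(np.linalg.det(Sigma) * (2 * np.pi) ** k)

    print(pdf_manual)
    print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))   # should match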
                Population                                Sample
size            N                                         n
mean            μ                                         x̄
proportion      p = ? / N                                 p̂ = ? / n
variance        σ² = (1/N)·Σᵢ₌₁ᴺ (xᵢ − μ)²                s² = (1/(n−1))·Σᵢ₌₁ⁿ (xᵢ − x̄)²
Maximum likelihood estimation
Pick a model (e.g. the hypothetical bias of a coin) that fits the data.
Bernoulli distributions:
Maximise the likelihood of seeing a pattern of k successes in n observations X ~ Bernoulli(p):
e.g. 7 Heads (k) in 10 tosses (n):
    L(p; k, n) = pᵏ·(1−p)ⁿ⁻ᵏ = p⁷·(1−p)³
take the log:    log-likelihood = k·log p + (n−k)·log(1−p) = 7·log p + 3·log(1−p)
then the derivative:    dL/dp = k/p − (n−k)/(1−p) = (k − pk − pn + pk)/(p − p²) = (k − pn)/(p − p²),
which is 0 when pn = k, i.e.
    p̂ = k/n = x̄ = 0.7
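A numerical sanity check of the same maximisation (a sketch; the optimiser and bounds are one reasonable choice, not the only one):
    import numpy as np
    from scipy.optimize import minimize_scalar

    k, n = 7, 10   # 7 heads in 10 tosses

    def neg_log_likelihood(p):
        # negative of k*log(p) + (n - k)*log(1 - p), so minimising it maximises the likelihood
        return -(k * np.log(p) + (n - k) * np.log(1 - p))

    result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
    print(result.x)   # ~0.7, i.e. p_hat = k / n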
Normal distributions:
The most likely among candidate models (~ N) for a sample is the one that best fits the μ and the σ of the
sample. For maximum likelihood estimation, the variance is calculated with the 1/n factor (like the population
formula), while the sample variance s² is calculated with 1/(n−1).
Linear regression:
To find the ideal line that best fits a sample of points, consider for each point (xᵢ, yᵢ) a normal
curve centred on the line, i.e. at y = m·xᵢ.
The task is to find the line that maximises the joint PDF of all these normal curves. This is the
product Πᵢ₌₁ⁿ 1/√(2π) · e^(−(1/2)·(yᵢ − m·xᵢ)²), and maximising it amounts to minimising the sum of the
squared distances of the points from the line (least squares).
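A minimal least-squares sketch (illustrative data); under the normal-noise assumption above this least-squares line is also the maximum-likelihood line:
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

    m, b = np.polyfit(x, y, deg=1)   # slope and intercept minimising the squared residuals
    print(m, b)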
Regularisation
When candidates differ in simplicity (i.e. in the degree of the formula), regularisation is necessary
to prevent over-fitting the model to the sample or training data. This is done by applying to each
model a penalty, which corresponds to the prior probability of the model, i.e. P(model) in
the Bayesian equation P(observation ∩ model) = P(observation | model)·P(model).
The prior probability of a model, P(model), is given by the joint probability of the coefficients w of the
model's expression wₘ·xᵐ + wₘ₋₁·xᵐ⁻¹ + ... + w₁·x + w₀ (except for the constant) under a standard normal
distribution (N(0, 1)) over the possible values of w. This joint probability can be expressed as
    Πⱼ₌₁ᵐ 1/√(2π) · e^(−(1/2)·(wⱼ − μ)²).
Since μ is 0, and since 1/√(2π) as well as −1/2 are just constants, the expression to be
minimised becomes Σⱼ₌₁ᵐ wⱼ². (When taking the logarithm of the equation, the multiplication becomes
addition; therefore the terms are added and not multiplied together.)
To adjust the weight of the total penalty, a constant λ called the regularisation parameter is applied.
The total score of each model is therefore Σᵢ₌₁ⁿ (xᵢ − μ)² + λ·Σⱼ₌₁ᵐ wⱼ².
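A small sketch of such a regularised (ridge-style) score; the function name and the example numbers are illustrative:
    import numpy as np

    def regularised_score(y, y_pred, weights, lam):
        residual_term = np.sum((y - y_pred) ** 2)              # fit to the data
        penalty_term = lam * np.sum(np.asarray(weights) ** 2)  # punishes large coefficients
        return residual_term + penalty_term

    y      = np.array([2.0, 4.1, 6.2])
    y_pred = np.array([2.1, 4.0, 6.0])
    print(regularised_score(y, y_pred, weights=[1.9, 0.1], lam=0.5))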
Bayesian inference – Maximum a Posteriori Estimation
For any parameter Θ, e.g.  p       if X ~ Bernoulli(p),
                           (μ, σ)  if X ~ N(μ, σ),
                           b       if X ~ Uniform(0, b),
the likelihood of the parameter, given a sample, can be approximated by iterations of estimations
involving Bayes' theorem:    P(A | B) = P(B | A)·P(A) / P(B)
where A = the parameter to be estimated and
B = the observed samples.
The probability formulas for Bayesian parameter estimation are:
for discrete parameter and discrete data:        pΘ|X=x(θ) = pX|Θ=θ(x)·pΘ(θ) / pX(x);
for continuous parameter and discrete data:      fΘ|X=x(θ) = pX|Θ=θ(x)·fΘ(θ) / pX(x);
for discrete parameter and continuous data:      pΘ|X=x(θ) = fX|Θ=θ(x)·pΘ(θ) / fX(x); and
for continuous parameter and continuous data:    fΘ|X=x(θ) = fX|Θ=θ(x)·fΘ(θ) / fX(x)
where X – sample vector
Θ – the model’s parameter(s) to be estimated
pΘ|X=x(θ) – the posterior estimate, how likely the model, given the sample
pX|Θ=θ(x) – the likelihood, i.e. how likely the sample, given the model
pΘ(θ) – the prior estimate, i.e. how likely the model is
pX(x) – the normalizing constant, i.e. how likely the sample is
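A tiny sketch of the discrete-parameter, discrete-data case: updating a uniform prior over a few candidate coin biases after observing 7 heads in 10 tosses (the candidate values are illustrative):
    import numpy as np
    from scipy.stats import binom

    thetas = np.array([0.3, 0.5, 0.7, 0.9])      # candidate parameter values
    prior  = np.array([0.25, 0.25, 0.25, 0.25])  # p_Theta(theta), uniform prior

    likelihood = binom.pmf(7, 10, thetas)        # p_{X|Theta=theta}(x)
    posterior = likelihood * prior
    posterior /= posterior.sum()                 # divide by the normalising constant p_X(x)
    print(posterior)                             # the posterior concentrates around theta = 0.7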
Interval estimation
It follows from the central limit theorem that sample means follow an approximately normal distribution. A confidence
interval is a range of estimates around x̄ for an unknown parameter that theoretically contains the true value
of said parameter with a certain confidence level: lower estimate < μ < upper estimate.
The confidence level indicates how frequently confidence intervals constructed from similar samples would
contain the true mean. It is defined by first picking a significance level α (commonly 0.05) and
calculating 1 − α (0.95 or 95%), i.e. the percentage of the area under the curve between the two
limits:
    x̄ − zα/2·σ/√n   and   x̄ + zα/2·σ/√n
where zα/2 is the critical value of the standard normal distribution (1.96 for α = 0.05), and
σ/√n is called the standard error.
Both halves of the confidence interval below and above the sample mean x̄ are called the margin of
error (MOE). All else being equal, increasing the sample size n decreases the standard error σ/√n and
thus shrinks the confidence interval; increasing the confidence level 1 − α, by contrast, widens it.
Calculation of confidence interval:
1. Find the sample mean x̄
2. Pick a significance level α, e.g. 0.05 (and thus a confidence level 1 − α, e.g. 0.95)
3. Given α, look up the critical value zα/2, e.g. 1.96
4. Multiply by the standard error σ/√n
5. Add/subtract the result zα/2·σ/√n to/from the sample mean x̄
These steps hold if the sample is a simple random sample and either n > 30 or the population is approximately normal.
Calculation of sample size:    n ⩾ ( zα/2·σ / MOE )²
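A short sketch of both calculations with scipy (all numbers are illustrative):
    import numpy as np
    from scipy.stats import norm

    x_bar, sigma, n = 52.8, 4.0, 36
    alpha = 0.05

    z = norm.ppf(1 - alpha / 2)          # critical value, ~1.96
    moe = z * sigma / np.sqrt(n)         # margin of error
    print(x_bar - moe, x_bar + moe)      # 95% confidence interval

    target_moe = 1.0
    print(np.ceil((z * sigma / target_moe) ** 2))   # minimum sample size for that MOE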
Student’s t-distribution
When σ is unknown, σ/√n is replaced by s/√n. The variable will then follow a
Student's t-distribution (~ t_ν) with a t-score instead of a z-score: x̄ ± tα/2·s/√n. The t-score depends
not only on the significance level α but also on the degrees of freedom ν = n − 1. With increasing ν, the
distribution approximates the normal distribution. With ν being the only parameter, the μ value of a
t-distribution is always 0.
For proportions, the confidence interval is p̂ ± zα/2·SE, but the standard error is different:
    SE = √( p̂·(1 − p̂) / n )
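A minimal t-based confidence interval sketch (illustrative sample values):
    import numpy as np
    from scipy.stats import t

    data = np.array([12.1, 14.3, 13.8, 12.9, 15.2, 13.5])
    n = len(data)
    x_bar = data.mean()
    s = data.std(ddof=1)                     # sample standard deviation, divides by n - 1

    alpha = 0.05
    t_crit = t.ppf(1 - alpha / 2, df=n - 1)  # critical t value with nu = n - 1
    moe = t_crit * s / np.sqrt(n)
    print(x_bar - moe, x_bar + moe)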
Hypothesis testing
When contrasting two mutually exclusive binary hypotheses, the null hypothesis H0 should denote
the safer option and the alternative hypothesis H1 the less safe one. Given enough evidence, H0 can
be rejected in favour of H1, but failing to reject H0 does not mean that H0 is true, only that there is a lack of
evidence against it. When making such decisions, two types of error can happen:
Type I error false positive when H0 is rejected erroneously; and
Type II error false negative when H1 is accepted erroneously.
Type II is much less detrimental than type I. The maximum probability of type I error that is
tolerated is called the significance level α (typically 0.05):
α = 0 no type I is tolerated, H0 is never rejected;
α = 1 type I is always tolerated, H0 is always rejected
The test statistic of a statistical test is a function whose value shows how closely the observed statistics match the
distribution expected under H0. When testing, for instance, the difference between a
population mean μ and a baseline μ0, H0 is that there is no difference between μ and μ0,
whereas H1 is the hypothesis that the observed statistic reflects a real difference in the population. A right-
tailed test is one where H1 suggests an increase of the statistic; a left-tailed test is one where H1 suggests a
decrease of the statistic; and a two-tailed test is one where H1 suggests simply that the statistic is different.
In case of a right-tailed test,
Type I error: conclude μ > μ0 and accept H1 falsely;
Type II error: conclude μ is not > μ0 and fail to reject H0 falsely.
In case of a left-tailed test,
Type I error: conclude μ < μ0 and accept H1 falsely;
Type II error: conclude μ is not < μ0 and fail to reject H0 falsely.
In case of a two-tailed test,
Type I error: conclude μ ≠ μ0 and accept H1 falsely;
Type II error: conclude μ = μ0 and fail to reject H0 falsely.
H0 means that the distribution is ~N(μ0, σ²) (when σ is known). The question is: how likely is the
sample if H0 is true? The goal is to keep the chance of a type I error below α = 0.05. This is
the same as finding a z-score of this distribution that leaves the p-value (i.e. the remaining area
under the curve) less than 0.05. The p-value is the probability, assuming H0 is true, that the test
statistic X takes a value at least as extreme as the observed value x.
If p < α, reject H0 and accept H1;
otherwise, don't reject H0.
Standardisation allows for the use of the Z-statistic: first calculate z = (x̄ − μ0) / (σ/√n).
In case of a right-tailed test: P(T(X) > t | H0); if z > zα →reject H0
In case of a left-tailed test: P(T(X) < t | H0); if z < -zα →reject H0
In case of a two-tailed test: P(| T(X) − μ0 | > | t − μ0 | | H0); if z < −zα/2 or z > zα/2 → reject H0
where T(X) is the test statistic; and
t is the observed value of the function.
If σ is not known, S = √( (1/(n−1))·Σᵢ₌₁ⁿ (Xᵢ − X̄)² ) must be used instead and the T-statistic has to be
calculated: t = (x̄ − μ0) / (S/√n) ~ t_ν with ν = n − 1.
For calculating proportions, z = (p̂ − p0) / √( p0·(1−p0)/n ) = (p̂ − p0) / √( p0·(1−p0) ) · √n ~ N(0, 1).
This holds only if N ⩾ 20·n (for sample independence),
all individuals can be categorised unambiguously (success or failure), and
n·p0 > 10 and n·(1 − p0) > 10 (so H0 can be tested with the normal distribution).
The critical value is the least extreme value of the test statistic that still leads to rejecting H0; it is the value
at which the tail probability beyond it equals the significance level α.
The probability β = P(do not reject H0 | H0 is false) of a type II error can be calculated by
finding the area outside the critical (rejection) region under the probability distribution that assumes
H1 is true, ~N(μ_H1, σ²).
Hypothesis testing procedure
1. Form hypothesis and define H0, H1 and significance level α.
2. Design test, choose test statistic (e.g. mean).
3. Check if population standard deviation can (z-test) or cannot (t-test) be used.
p-value method:
4. Compute observed statistic t or z from the formulae above.
5. Look up the p-value from a z-table, t-table or with a z-test / t-test calculator
6. Reach conclusion: if p-value < α – reject H0
critical value method:
4. Compute the critical range with a z-table, t-table or with a critical value calculator
5. Compute the observed statistic t or z from the formulae above.
6. Reach conclusion: if the observed statistic is in the critical range – reject H0
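A compact sketch of the p-value method for a right-tailed z-test (all numbers are illustrative):
    import numpy as np
    from scipy.stats import norm

    mu_0, sigma = 100.0, 15.0    # baseline mean and known population sigma
    x_bar, n = 106.0, 36         # observed sample mean and sample size
    alpha = 0.05

    z = (x_bar - mu_0) / (sigma / np.sqrt(n))
    p_value = 1 - norm.cdf(z)            # right-tail area beyond the observed z
    print(z, p_value, p_value < alpha)   # True here, so H0 would be rejected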
Two-sample tests
For a two-sample t-test of means, consider X̄ = (1/n_x)·Σᵢ Xᵢ and Ȳ = (1/n_y)·Σᵢ Yᵢ (summing over the n_x and n_y observations in each sample).
The difference of sample means X̄ − Ȳ has expected value μ_x − μ_y and standard error √( σ_x²/n_x + σ_y²/n_y ), which can be standardised to
    Z = ( (X̄ − Ȳ) − (μ_x − μ_y) ) / √( σ_x²/n_x + σ_y²/n_y ) ~ N(0, 1),   or
    T = ( (X̄ − Ȳ) − (μ_x − μ_y) ) / √( s_x²/n_x + s_y²/n_y ) ~ t_ν
when σ is unknown.
For two samples, the degrees of freedom is given by
    ν = ( s_x²/n_x + s_y²/n_y )² / ( (s_x²/n_x)²/(n_x − 1) + (s_y²/n_y)²/(n_y − 1) ).
Finally, H0 states that μ_x = μ_y.
For a right-tailed test, H1: μ_x > μ_y; for a left-tailed test, H1: μ_x < μ_y; and for a two-tailed test, H1: μ_x ≠ μ_y.
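A minimal Welch two-sample t-test sketch with scipy (illustrative data); equal_var=False corresponds to the unequal-variance degrees of freedom above:
    import numpy as np
    from scipy.stats import ttest_ind

    x = np.array([20.1, 22.3, 19.8, 21.5, 23.0])
    y = np.array([18.2, 19.9, 17.8, 20.1, 18.5])

    t_stat, p_value = ttest_ind(x, y, equal_var=False)   # two-tailed p-value by default
    print(t_stat, p_value)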
To test proportions, z = (p̂_x − p̂_y − 0) / √( (X + Y)·(1 − (X + Y)/(n_x + n_y)) ) · √(n_x·n_y) ~ N(0, 1),
where X and Y are the numbers of successes in the two samples, and H0: p_x = p_y.
For a right-tailed test, H1: p_x > p_y; for a left-tailed test, H1: p_x < p_y; and for a two-tailed test, H1: p_x ≠ p_y.
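A sketch of the same test written in the equivalent pooled-proportion form (the counts are illustrative):
    import numpy as np
    from scipy.stats import norm

    X, n_x = 45, 100    # successes and sample size in group x
    Y, n_y = 30, 100    # successes and sample size in group y

    p_x, p_y = X / n_x, Y / n_y
    p_pool = (X + Y) / (n_x + n_y)                       # pooled proportion under H0

    z = (p_x - p_y) / np.sqrt(p_pool * (1 - p_pool) * (1 / n_x + 1 / n_y))
    p_value = 2 * (1 - norm.cdf(abs(z)))                 # two-tailed
    print(z, p_value)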
A paired t-test is for non-independent data, e.g. the same individuals in different circumstances. The
sample mean of differences is D̄ = (1/n)·Σᵢ₌₁ⁿ Dᵢ, where D = X − Y; if X ~ N and Y ~ N, then D̄ ~ N(μ_D, σ_D/√n),
and the test proceeds as a one-sample t-test on the differences.
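A minimal paired t-test sketch with scipy (the before/after values are illustrative):
    import numpy as np
    from scipy.stats import ttest_rel

    before = np.array([72.0, 80.5, 68.2, 75.9, 83.1])
    after  = np.array([70.1, 78.9, 67.5, 74.2, 80.8])

    t_stat, p_value = ttest_rel(before, after)   # tests H0: the mean difference is 0
    print(t_stat, p_value)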
A/B Testing
A framework for a two-sample test:
• Propose variations A and B
• Randomly split sample to A group and B group
• Measure outcomes and determine a metric to use
• Run a statistical analysis (e.g. t-test, z-test) to make a decision