Probability - Statistics - Class Notes
Sum of probabilities (A or B)
Disjoint events: P(A∪B) = P(A) + P(B) (the probabilities simply add up: Σ pᵢ).
Overlapping events: P(A∪B) = P(A) + P(B) − P(A∩B); for independent events the subtracted term is the product P(A)·P(B) (Σ pᵢ − Π pᵢ).
Continuous distributions
The distribution of the total probability 1 over an interval between a and b of possible values is
given by the probability density function (PDF): fX(x);
where x is defined for all numbers x ∈ ℝ ;
the value is never negative: 0 ⩽ f_X(x); and
the total area under the curve is 1: ∫ₐᵇ f_X(t) dt = 1.
The cumulative distribution function (CDF), F_X(x) = Σᵢ₌ₐˣ p_X(i) for a discrete variable, never decreases: it starts at F_X(0) = 0 and reaches F_X(n) = 1.
Binomial coefficient
The number of different arrangements in which k identical events (successes) can be arranged
out of n events is the binomial coefficient:
    C(n, k) = n! / (k!·(n−k)!)
where n! is the total number of arrangements of all n events,
k! is the number of arrangements of the successes, and
(n−k)! is the number of arrangements of the failures.
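As a quick numerical check, here is a minimal Python sketch (the values n = 10 and k = 7 are illustrative) comparing the factorial formula with the standard library's math.comb:
    import math

    n, k = 10, 7
    # C(n, k) = n! / (k! * (n - k)!)
    manual = math.factorial(n) // (math.factorial(k) * math.factorial(n - k))
    print(manual, math.comb(n, k))   # both print 120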
Binomial and Bernoulli distributions
The distribution of the total probability 1 among the possible numbers x of successes in n events is given by the PMF of the binomial
distribution:
    p_X(x) = C(n, x) · pˣ · (1−p)ⁿ⁻ˣ
In such circumstances it is said that the random variable follows a Binomial distribution with
parameters p and n: X ~ Binomial(n, p).
A binomial distribution with n = 1 where the random variable takes the value 1 with probability p
and takes the value 0 with probability q = 1 - p is called a Bernoulli distribution with parameter p.
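A short sketch of evaluating this PMF (the parameters n = 10, p = 0.5, x = 7 are illustrative); the scipy line is an optional cross-check and assumes scipy is installed:
    import math
    from scipy.stats import binom   # optional cross-check

    n, p, x = 10, 0.5, 7
    pmf_manual = math.comb(n, x) * p**x * (1 - p)**(n - x)
    print(pmf_manual)            # 0.1171875
    print(binom.pmf(x, n, p))    # same value from scipy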
Uniform distribution
When each event (or each interval of the same length) has equal probability, the variable follows a
Uniform distribution: a discrete one over n equally likely outcomes, or a continuous one with parameters a and b (the ends of the interval).
Discrete PMF:      p_X(x) = 1/n
Continuous PDF:    f_X(x) = 1/(b−a) for a < x < b;   f_X(x) = 0 for x ∉ (a, b)
Continuous CDF:    F_X(x) = 0 for x < a;   F_X(x) = (x−a)/(b−a) for a ⩽ x < b;   F_X(x) = 1 for x ⩾ b
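A minimal sketch of these formulas in Python (the interval (0, 4) and the evaluation points are illustrative):
    def uniform_pdf(x, a, b):
        # constant density 1/(b - a) inside (a, b), zero outside
        return 1.0 / (b - a) if a < x < b else 0.0

    def uniform_cdf(x, a, b):
        # accumulates linearly from 0 at a to 1 at b
        if x < a:
            return 0.0
        return (x - a) / (b - a) if x < b else 1.0

    print(uniform_pdf(2.0, 0.0, 4.0))   # 0.25
    print(uniform_cdf(3.0, 0.0, 4.0))   # 0.75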
Expected value
The expected value for any probability distribution is the mean: E [X ]= μ .
For PMFs: E[X] = Σᵢ₌₁ⁿ xᵢ·p(xᵢ).    For PDFs: E[X] = ∫ x·f(x) dx, integrating from −∞ to ∞.
The sum of expectations is E[X₁ + ... + Xₘ] = Σᵢ₌₁ᵐ E[Xᵢ].
The kth moment of a distribution is expressed by Σᵢ₌₁ⁿ xᵢᵏ·p(xᵢ).
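A tiny worked example of the PMF formula, using a fair six-sided die as an illustrative distribution:
    # Expected value of a discrete distribution: sum of value times probability
    values = [1, 2, 3, 4, 5, 6]
    probs  = [1 / 6] * 6

    expectation = sum(x * p for x, p in zip(values, probs))
    print(expectation)   # 3.5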
Variance
A fundamental measure of the spread of some data is the centred second moment of a distribution,
which is called the variance: Var[X] = (1/n)·Σᵢ₌₁ⁿ (xᵢ − μ)².
The sum of variances is Var[X₁ + ... + Xₘ] = Σᵢ₌₁ᵐ Var[Xᵢ] (as long as the variables are independent).
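A short sketch computing the variance of a small, illustrative data set in the two equivalent ways (the centred second moment, and E[X²] − E[X]² from the next section):
    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # illustrative values
    n = len(data)
    mu = sum(data) / n

    var_centred = sum((x - mu) ** 2 for x in data) / n       # (1/n) * sum of squared deviations
    var_moments = sum(x ** 2 for x in data) / n - mu ** 2    # E[X^2] - E[X]^2
    print(var_centred, var_moments)                          # both 4.0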
Standard deviation
The unit of variance is the square of the unit of the measured values (e.g. m²). The square root of the
variance, i.e. the standard deviation, however, yields the original unit (m):
σ = √Var[X] = √(E[X²] − E[X]²).
In a normal distribution, the standard deviation gives the proportion of the data points within a certain
range around the mean:
68.2% of the data is between −σ and σ;
95.4% is between −2σ and 2σ; and
99.7% is between −3σ and 3σ.
Standardisation is the process of making the mean of a dataset equal 0 and the standard deviation
of the dataset equal 1:
Centring: subtract the mean (x − μ)    Scaling: divide by the standard deviation ((x − μ) / σ)
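A minimal standardisation sketch using the Python standard library (the data values are illustrative):
    import statistics

    data = [12.0, 15.0, 11.0, 18.0, 14.0]
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)                    # population standard deviation

    standardised = [(x - mu) / sigma for x in data]
    print(round(statistics.mean(standardised), 10))    # 0.0
    print(round(statistics.pstdev(standardised), 10))  # 1.0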
For kernel density estimation, the sample values are plotted along the X axis and a small normal curve is
drawn around each of them. The heights of these curves at each point x are then summed up to give
an estimate of the PDF curve.
A QQ plot is one where the X axis represents standard normal quantiles whereas the Y axis
represents quantiles calculated from the standardised sample. If the sample quantiles lie close to the
y = x line, then the sample is approximately normally distributed.
Joint probabilities
For discrete distributions, the PMF for 2 variables is P XY (x , y) = P (X=x , Y = y ); for independent
variables, this is equal to P X (x)⋅PY ( y ).
For continuous distributions, the PDF for 2 variables is f XY (x , y )=f (X =x ,Y = y); for independent
variables, this is equal to f X (x )⋅f Y ( y ).
For discrete distributions, joint probability tables are summarised by marginal probabilities along
both axes:
            xα           xβ           xγ           xδ           marginal probability P_Y(y)
yΑ          P(xα, yΑ)    P(xβ, yΑ)    P(xγ, yΑ)    P(xδ, yΑ)    P(yΑ)
yΒ          P(xα, yΒ)    P(xβ, yΒ)    P(xγ, yΒ)    P(xδ, yΒ)    P(yΒ)
yΓ          P(xα, yΓ)    P(xβ, yΓ)    P(xγ, yΓ)    P(xδ, yΓ)    P(yΓ)
marginal    P(xα)        P(xβ)        P(xγ)        P(xδ)
probability P_X(x)
(row sums give the marginal probabilities of Y; column sums give the marginal probabilities of X)
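A small numpy sketch of the same idea with illustrative numbers; row sums give P_Y(y) and column sums give P_X(x):
    import numpy as np

    # Joint probability table: rows are y values, columns are x values (illustrative numbers)
    P = np.array([[0.10, 0.05, 0.05, 0.10],
                  [0.05, 0.10, 0.10, 0.05],
                  [0.10, 0.10, 0.10, 0.10]])

    marginal_y = P.sum(axis=1)   # P_Y(y), one entry per row
    marginal_x = P.sum(axis=0)   # P_X(x), one entry per column
    print(marginal_y, marginal_x, P.sum())   # the whole table sums to 1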
Covariance
The measure of the tendency of multiple variables to change together is the covariance. Formally,
variance is a special case of covariance where X and Y are the same.
Cov[X, Y] = (1/n)·Σᵢ₌₁ⁿ (xᵢ − μ_x)(yᵢ − μ_y), which can also be expressed by E[XY] − E[X]·E[Y].
when Cov[X, Y] > 0   positive correlation
     Cov[X, Y] < 0   negative correlation
     Cov[X, Y] ≈ 0   no correlation
The covariance of probabilities can be written the same way only if the probabilities are the same,
i.e. p_XY(x, y) is constant. If the probabilities are different, then the covariance is
Σᵢ₌₁ⁿ p_XY(xᵢ, yᵢ)·(xᵢ − μ_x)(yᵢ − μ_y).
For multiple variables, the variances and covariances are organised in the covariance matrix:
with k variables:
    [ Var[X₁]       Cov[X₁, X₂]   ⋯   Cov[X₁, Xₖ] ]
    [ Cov[X₂, X₁]   Var[X₂]       ⋯   Cov[X₂, Xₖ] ]
    [ ⋮             ⋮             ⋱   ⋮           ]
    [ Cov[Xₖ, X₁]   Cov[Xₖ, X₂]   ⋯   Var[Xₖ]     ]
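A numpy sketch of building such a matrix from data (the sample values are illustrative); bias=True matches the 1/n convention used above, while numpy's default divides by n − 1:
    import numpy as np

    # Each row is one observation, each column one variable
    data = np.array([[2.1,  8.0],
                     [2.5, 10.1],
                     [3.6, 12.3],
                     [4.0, 14.2]])

    cov_matrix = np.cov(data, rowvar=False, bias=True)   # variances on the diagonal
    print(cov_matrix)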
Correlation coefficient
For comparing correlations, covariances are not suitable as they may take any value. They
have to be standardised to obtain correlation coefficients, which lie between −1 and 1:
R_X,Y = Cov[X, Y] / (σ_x·σ_y)
Correlation coefficients can be organised in matrices:
    [ R_X₁,X₁ = 1   R_X₁,X₂       ⋯   R_X₁,Xₖ     ]
    [ R_X₂,X₁       R_X₂,X₂ = 1   ⋯   R_X₂,Xₖ     ]
    [ ⋮             ⋮             ⋱   ⋮           ]
    [ R_Xₖ,X₁       R_Xₖ,X₂       ⋯   R_Xₖ,Xₖ = 1 ]
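The corresponding numpy sketch (illustrative data, same row-per-observation layout as above):
    import numpy as np

    data = np.array([[2.1,  8.0, 1.0],
                     [2.5, 10.1, 0.4],
                     [3.6, 12.3, 0.9],
                     [4.0, 14.2, 0.2]])

    corr = np.corrcoef(data, rowvar=False)   # diagonal is 1, all entries between -1 and 1
    print(corr)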
Multivariate normal distributions
For a single variable, the normal distribution has the following PDF:
    f_X(x) = 1/(σ·√(2π)) · e^(−(1/2)·((x − μ)/σ)²)
For two independent variables, we get the product of the two univariate PDFs:
    f(x₁, x₂) = 1/(2π·σ₁·σ₂) · e^(−(1/2)·(((x₁ − μ₁)/σ₁)² + ((x₂ − μ₂)/σ₂)²)).
However, part of the exponent can be rewritten in matrix form. Writing x = [x₁ x₂]ᵀ and μ = [μ₁ μ₂]ᵀ:
    ((x₁ − μ₁)/σ₁)² + ((x₂ − μ₂)/σ₂)²
        = (x − μ)ᵀ · [ 1/σ₁²   0     ] · (x − μ)
                     [ 0       1/σ₂² ]
        = (x − μ)ᵀ · [ σ₁²   0   ]⁻¹ · (x − μ)
                     [ 0     σ₂² ]
Notice that the middle term of the last row is just the inverse of the covariance matrix.
Therefore, the PDF can be rewritten as
    f_X₁X₂(x₁, x₂) = 1 / (2π·√(det Σ)) · e^(−(1/2)·(x − M)ᵀ·Σ⁻¹·(x − M)),
where x = [x₁ x₂]ᵀ and M = [μ₁ μ₂]ᵀ.
The formula is valid for dependent variables as well, only the covariance matrix will change.
For independent variables the covariance matrix is diagonal:
    Σ = [ Var[X₁]   0          ⋯   0        ]
        [ 0         Var[X₂]    ⋯   0        ]
        [ ⋮         ⋮          ⋱   ⋮        ]
        [ 0         0          ⋯   Var[Xₖ]  ]
whereas the formula generalises to k (possibly dependent) variables as
    f_X(x) = 1 / ( |Σ|^(1/2) · (2π)^(k/2) ) · e^(−(1/2)·(x − μ)ᵀ·Σ⁻¹·(x − μ)).
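A short numpy/scipy sketch evaluating the bivariate formula and cross-checking it against scipy.stats.multivariate_normal (the mean, covariance and evaluation point are illustrative):
    import numpy as np
    from scipy.stats import multivariate_normal

    mu = np.array([1.0, 2.0])
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
    x = np.array([1.5, 1.5])

    k = len(mu)
    diff = x - mu
    exponent = -0.5 * diff @ np.linalg.inv(Sigma) @ diff
    pdf_manual = np.exp(exponent) / np.sqrt(np.linalg.det(Sigma) * (2 * np.pi) ** k)

    print(pdf_manual)
    print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))   # should match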
                Population                                Sample
size            N                                         n
mean            μ                                         x̄
proportion      p = ? / N                                 p̂ = ? / n
variance        σ² = (1/N)·Σᵢ₌₁ᴺ (xᵢ − μ)²                s² = (1/(n−1))·Σᵢ₌₁ⁿ (xᵢ − x̄)²
Maximum likelihood estimation
Pick a model (e.g. the hypothetical bias of a coin) that fits the data.
Bernoulli distributions:
Maximise the likelihood of seeing a pattern of k successes in n observations X ~ Bernoulli(p):
e.g. 7 Heads (k) in 10 tosses (n):
    L(p; k, n) = pᵏ·(1−p)ⁿ⁻ᵏ = p⁷·(1−p)³
take the log:    log-likelihood = k·log p + (n−k)·log(1−p) = 7·log p + 3·log(1−p)
then the derivative:    dL/dp = k/p − (n−k)/(1−p) = (k − pk − pn + pk)/(p − p²) = (k − pn)/(p − p²),
which is 0 when pn = k, i.e.
    p̂ = k/n = x̄ = 0.7
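A numerical sanity check of the same maximisation (a sketch; the optimiser and bounds are one reasonable choice, not the only one):
    import numpy as np
    from scipy.optimize import minimize_scalar

    k, n = 7, 10   # 7 heads in 10 tosses

    def neg_log_likelihood(p):
        # negative of k*log(p) + (n - k)*log(1 - p), so minimising it maximises the likelihood
        return -(k * np.log(p) + (n - k) * np.log(1 - p))

    result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
    print(result.x)   # ~0.7, i.e. p_hat = k / n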
Normal distributions:
The most likely among candidate models (~ N) for a sample is the one that best fits the μ and the σ of the
sample. For maximum likelihood estimation, the variance is calculated with the 1/n factor (like the population
formula), while the sample variance s² is calculated with 1/(n−1).
Linear regression:
To find the ideal line that best fits a sample of points, consider for each point (xᵢ, yᵢ) a normal
curve centred on the line, i.e. at y = m·xᵢ.
The task is to find the line that maximises the joint PDF of all these normal curves. This is the
product Πᵢ₌₁ⁿ 1/√(2π) · e^(−(1/2)·(yᵢ − m·xᵢ)²), and maximising it amounts to minimising the sum of the
squared distances of the points from the line (least squares).
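A minimal least-squares sketch (illustrative data); under the normal-noise assumption above this least-squares line is also the maximum-likelihood line:
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

    m, b = np.polyfit(x, y, deg=1)   # slope and intercept minimising the squared residuals
    print(m, b)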
Regularisation
When candidates differ in simplicity (i.e. in the degree of the formula), regularisation is necessary
to prevent over-fitting the model to the sample or training data. This is done by applying to each
model a penalty, which corresponds to the prior probability of the model, i.e. P(model) in
the Bayesian equation P(observation ∩ model) = P(observation | model)·P(model).
The prior probability of a model, P(model), is given by the joint probability of the coefficients w of the
model's expression wₘ·xᵐ + wₘ₋₁·xᵐ⁻¹ + ... + w₁·x + w₀ (except for the constant) under a standard normal
distribution (N(0, 1)) over the possible values of w. This joint probability can be expressed as
    Πⱼ₌₁ᵐ 1/√(2π) · e^(−(1/2)·(wⱼ − μ)²).
Since μ is 0, and since 1/√(2π) as well as −1/2 are just constants, the expression to be
minimised becomes Σⱼ₌₁ᵐ wⱼ². (When taking the logarithm of the equation, the multiplication becomes
addition; therefore the terms are added and not multiplied together.)
To adjust the weight of the total penalty, a constant λ called the regularisation parameter is applied.
The total score of each model is therefore Σᵢ₌₁ⁿ (xᵢ − μ)² + λ·Σⱼ₌₁ᵐ wⱼ².
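A small sketch of such a regularised (ridge-style) score; the function name and the example numbers are illustrative:
    import numpy as np

    def regularised_score(y, y_pred, weights, lam):
        residual_term = np.sum((y - y_pred) ** 2)              # fit to the data
        penalty_term = lam * np.sum(np.asarray(weights) ** 2)  # punishes large coefficients
        return residual_term + penalty_term

    y      = np.array([2.0, 4.1, 6.2])
    y_pred = np.array([2.1, 4.0, 6.0])
    print(regularised_score(y, y_pred, weights=[1.9, 0.1], lam=0.5))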
Bayesian inference – Maximum a Posteriori Estimation
For any parameter Θ, e.g.  p       if X ~ Bernoulli(p),
                           (μ, σ)  if X ~ N(μ, σ),
                           b       if X ~ Uniform(0, b),
the likelihood of the parameter, given a sample, can be approximated by iterations of estimations
involving Bayes' theorem:    P(A | B) = P(B | A)·P(A) / P(B)
where A = the parameter to be estimated and
B = the observed samples.
The probability formulas for Bayesian parameter estimation are:
for discrete parameter and discrete data:        pΘ|X=x(θ) = pX|Θ=θ(x)·pΘ(θ) / pX(x);
for continuous parameter and discrete data:      fΘ|X=x(θ) = pX|Θ=θ(x)·fΘ(θ) / pX(x);
for discrete parameter and continuous data:      pΘ|X=x(θ) = fX|Θ=θ(x)·pΘ(θ) / fX(x); and
for continuous parameter and continuous data:    fΘ|X=x(θ) = fX|Θ=θ(x)·fΘ(θ) / fX(x)
where X – sample vector
Θ – the model’s parameter(s) to be estimated
pΘ|X=x(θ) – the posterior estimate, how likely the model, given the sample
pX|Θ=θ(x) – the likelihood, i.e. how likely the sample, given the model
pΘ(θ) – the prior estimate, i.e. how likely the model is
pX(x) – the normalizing constant, i.e. how likely the sample is
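A tiny sketch of the discrete-parameter, discrete-data case: updating a uniform prior over a few candidate coin biases after observing 7 heads in 10 tosses (the candidate values are illustrative):
    import numpy as np
    from scipy.stats import binom

    thetas = np.array([0.3, 0.5, 0.7, 0.9])      # candidate parameter values
    prior  = np.array([0.25, 0.25, 0.25, 0.25])  # p_Theta(theta), uniform prior

    likelihood = binom.pmf(7, 10, thetas)        # p_{X|Theta=theta}(x)
    posterior = likelihood * prior
    posterior /= posterior.sum()                 # divide by the normalising constant p_X(x)
    print(posterior)                             # the posterior concentrates around theta = 0.7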
Interval estimation
It follows from the central limit theorem that sample means follow an approximately normal distribution. A confidence
interval is a range of estimates around x̄ for an unknown parameter that theoretically contains the true value
of said parameter with a certain confidence level: lower estimate < μ < upper estimate.
The confidence level indicates how frequently confidence intervals constructed from similar samples would
contain the true mean. It is defined by first picking a significance level α (commonly 0.05) and
calculating 1 − α (0.95 or 95%), i.e. the percentage of the area under the curve between the two
limits:
    x̄ − zα/2·σ/√n   and   x̄ + zα/2·σ/√n
where zα/2 is the critical value of the standard normal distribution (1.96 for α = 0.05), and
σ/√n is called the standard error.
Both halves of the confidence interval below and above the sample mean x̄ are called the margin of
error (MOE). All else being equal, increasing the sample size n decreases the standard error σ/√n and
thus shrinks the confidence interval; increasing the confidence level 1 − α, by contrast, widens it.
Calculation of confidence interval:
1. Find the sample mean x̄
2. Pick a significance level α, e.g. 0.05 (and thus a confidence level 1 − α, e.g. 0.95)
3. Given α, look up the critical value zα/2, e.g. 1.96
4. Multiply by the standard error σ/√n
5. Add/subtract the result zα/2·σ/√n to/from the sample mean x̄
These steps hold if the sample is a simple random sample and either n > 30 or the population is approximately normal.
Calculation of sample size:    n ⩾ ( zα/2·σ / MOE )²
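A short sketch of both calculations with scipy (all numbers are illustrative):
    import numpy as np
    from scipy.stats import norm

    x_bar, sigma, n = 52.8, 4.0, 36
    alpha = 0.05

    z = norm.ppf(1 - alpha / 2)          # critical value, ~1.96
    moe = z * sigma / np.sqrt(n)         # margin of error
    print(x_bar - moe, x_bar + moe)      # 95% confidence interval

    target_moe = 1.0
    print(np.ceil((z * sigma / target_moe) ** 2))   # minimum sample size for that MOE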
Student’s t-distribution
When σ is unknown, σ/√n is replaced by s/√n. The variable will then follow a
Student's t-distribution (~ t_ν) with a t-score instead of a z-score: x̄ ± tα/2·s/√n. The t-score depends
not only on the significance level α but also on the degrees of freedom ν = n − 1. With increasing ν, the
distribution approximates the normal distribution. With ν being the only parameter, the μ value of a
t-distribution is always 0.
For proportions, the confidence interval is p̂ ± zα/2·SE, but the standard error is different:
    SE = √( p̂·(1 − p̂) / n )
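A minimal t-based confidence interval sketch (illustrative sample values):
    import numpy as np
    from scipy.stats import t

    data = np.array([12.1, 14.3, 13.8, 12.9, 15.2, 13.5])
    n = len(data)
    x_bar = data.mean()
    s = data.std(ddof=1)                     # sample standard deviation, divides by n - 1

    alpha = 0.05
    t_crit = t.ppf(1 - alpha / 2, df=n - 1)  # critical t value with nu = n - 1
    moe = t_crit * s / np.sqrt(n)
    print(x_bar - moe, x_bar + moe)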
Hypothesis testing
When contrasting two mutually exclusive binary hypotheses, the null hypothesis H0 should denote
the safer option and the alternative hypothesis H1 the less safe one. Given enough evidence, H0 can
be rejected in favour of H1, but failing to reject H0 does not mean that H0 is true, only that there is a lack of
evidence against it. When making such decisions, two types of error can happen:
Type I error false positive when H0 is rejected erroneously; and
Type II error false negative when H1 is accepted erroneously.
Type II is much less detrimental than type I. The maximum probability of type I error that is
tolerated is called the significance level α (typically 0.05):
α = 0 no type I is tolerated, H0 is never rejected;
α = 1 type I is always tolerated, H0 is always rejected
The test statistic of a statistical test is a function whose value shows how closely the observed statistics match the
distribution expected under H0. When testing, for instance, the difference between a
population mean μ and a baseline μ0, H0 is that there is no difference between μ and μ0,
whereas H1 is the hypothesis that the observed statistic reflects a real difference in the population. A right-
tailed test is one where H1 suggests an increase of the statistic; a left-tailed test is one where H1 suggests a
decrease of the statistic; and a two-tailed test is one where H1 suggests simply that the statistic is different.
In case of a right-tailed test,
Type I error: conclude μ > μ0 and accept H1 falsely;
Type II error: conclude μ is not > μ0 and fail to reject H0 falsely.
In case of a left-tailed test,
Type I error: conclude μ < μ0 and accept H1 falsely;
Type II error: conclude μ is not < μ0 and fail to reject H0 falsely.
In case of a two-tailed test,
Type I error: conclude μ ≠ μ0 and accept H1 falsely;
Type II error: conclude μ = μ0 and fail to reject H0 falsely.
H0 means that the distribution is ~N(μ0, σ²) (when σ is known). The question is: how likely is the
sample if H0 is true? The goal is to keep the chance of a type I error below α = 0.05. This is
the same as finding a z-score of this distribution that leaves the p-value (i.e. the remaining area
under the curve) less than 0.05. The p-value is the probability, assuming H0 is true, that the test
statistic X takes a value at least as extreme as the observed value x.
If p < α, reject H0 and accept H1;
otherwise, don't reject H0.
Standardisation allows for the use of the Z-statistic: first calculate z = (x̄ − μ0) / (σ/√n).
In case of a right-tailed test: P(T(X) > t | H0); if z > zα →reject H0
In case of a left-tailed test: P(T(X) < t | H0); if z < -zα →reject H0
In case of a two-tailed test: P(| T(X) − μ0 | > | t − μ0 | | H0); if z < −zα/2 or z > zα/2 → reject H0
where T(X) is the test statistic; and
t is the observed value of the function.
If σ is not known, S = √( (1/(n−1))·Σᵢ₌₁ⁿ (Xᵢ − X̄)² ) must be used instead and the T-statistic has to be
calculated: t = (x̄ − μ0) / (S/√n) ~ t_ν with ν = n − 1.
For calculating proportions, z = (p̂ − p0) / √( p0·(1−p0)/n ) = (p̂ − p0) / √( p0·(1−p0) ) · √n ~ N(0, 1).
This holds only if N ⩾ 20·n (for sample independence),
all individuals can be categorised unambiguously (success or failure), and
n·p0 > 10 and n·(1 − p0) > 10 (so H0 can be tested with the normal distribution).
The critical value is the least extreme value of the test statistic that still leads to rejecting H0; it is the value
at which the tail probability beyond it equals the significance level α.
The probability β = P(do not reject H0 | H0 is false) of a type II error can be calculated by
finding the area outside the critical (rejection) region under the probability distribution that assumes
H1 is true, ~N(μ_H1, σ²).
Hypothesis testing procedure
1. Form hypothesis and define H0, H1 and significance level α.
2. Design test, choose test statistic (e.g. mean).
3. Check if population standard deviation can (z-test) or cannot (t-test) be used.
p-value method:
4. Compute observed statistic t or z from the formulae above.
5. Look up the p-value from a z-table, t-table or with a z-test / t-test calculator
6. Reach conclusion: if p-value < α – reject H0
critical value method:
4. Compute the critical range with a z-table, t-table or with a critical value calculator
5. Compute the observed statistic t or z from the formulae above.
6. Reach conclusion: if the observed statistic is in the critical range – reject H0
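A compact sketch of the p-value method for a right-tailed z-test (all numbers are illustrative):
    import numpy as np
    from scipy.stats import norm

    mu_0, sigma = 100.0, 15.0    # baseline mean and known population sigma
    x_bar, n = 106.0, 36         # observed sample mean and sample size
    alpha = 0.05

    z = (x_bar - mu_0) / (sigma / np.sqrt(n))
    p_value = 1 - norm.cdf(z)            # right-tail area beyond the observed z
    print(z, p_value, p_value < alpha)   # True here, so H0 would be rejected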
Two-sample tests
For a two-sample t-test of means, consider X̄ = (1/n_x)·Σᵢ Xᵢ and Ȳ = (1/n_y)·Σᵢ Yᵢ (summing over the n_x and n_y observations in each sample).
The difference of sample means X̄ − Ȳ has expected value μ_x − μ_y and standard error √( σ_x²/n_x + σ_y²/n_y ), which can be standardised to
    Z = ( (X̄ − Ȳ) − (μ_x − μ_y) ) / √( σ_x²/n_x + σ_y²/n_y ) ~ N(0, 1),   or
    T = ( (X̄ − Ȳ) − (μ_x − μ_y) ) / √( s_x²/n_x + s_y²/n_y ) ~ t_ν
when σ is unknown.
For two samples, the degrees of freedom is given by
    ν = ( s_x²/n_x + s_y²/n_y )² / ( (s_x²/n_x)²/(n_x − 1) + (s_y²/n_y)²/(n_y − 1) ).
Finally, H0 states that μ_x = μ_y.
For a right-tailed test, H1: μ_x > μ_y; for a left-tailed test, H1: μ_x < μ_y; and for a two-tailed test, H1: μ_x ≠ μ_y.
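A minimal Welch two-sample t-test sketch with scipy (illustrative data); equal_var=False corresponds to the unequal-variance degrees of freedom above:
    import numpy as np
    from scipy.stats import ttest_ind

    x = np.array([20.1, 22.3, 19.8, 21.5, 23.0])
    y = np.array([18.2, 19.9, 17.8, 20.1, 18.5])

    t_stat, p_value = ttest_ind(x, y, equal_var=False)   # two-tailed p-value by default
    print(t_stat, p_value)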
To test proportions, z = (p̂_x − p̂_y − 0) / √( (X + Y)·(1 − (X + Y)/(n_x + n_y)) ) · √(n_x·n_y) ~ N(0, 1),
where X and Y are the numbers of successes in the two samples, and H0: p_x = p_y.
For a right-tailed test, H1: p_x > p_y; for a left-tailed test, H1: p_x < p_y; and for a two-tailed test, H1: p_x ≠ p_y.
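A sketch of the same test written in the equivalent pooled-proportion form (the counts are illustrative):
    import numpy as np
    from scipy.stats import norm

    X, n_x = 45, 100    # successes and sample size in group x
    Y, n_y = 30, 100    # successes and sample size in group y

    p_x, p_y = X / n_x, Y / n_y
    p_pool = (X + Y) / (n_x + n_y)                       # pooled proportion under H0

    z = (p_x - p_y) / np.sqrt(p_pool * (1 - p_pool) * (1 / n_x + 1 / n_y))
    p_value = 2 * (1 - norm.cdf(abs(z)))                 # two-tailed
    print(z, p_value)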
A paired t-test is for non-independent data, e.g. the same individuals in different circumstances. The
sample mean of differences is D̄ = (1/n)·Σᵢ₌₁ⁿ Dᵢ, where D = X − Y; if X ~ N and Y ~ N, then D̄ ~ N(μ_D, σ_D/√n),
and the test proceeds as a one-sample t-test on the differences.
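A minimal paired t-test sketch with scipy (the before/after values are illustrative):
    import numpy as np
    from scipy.stats import ttest_rel

    before = np.array([72.0, 80.5, 68.2, 75.9, 83.1])
    after  = np.array([70.1, 78.9, 67.5, 74.2, 80.8])

    t_stat, p_value = ttest_rel(before, after)   # tests H0: the mean difference is 0
    print(t_stat, p_value)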
A/B Testing
A framework for a two-sample test:
• Propose variations A and B
• Randomly split sample to A group and B group
• Measure outcomes and determine a metric to use
• Run a statistical analysis (e.g. t-test, z-test) to make a decision