
Probability Theory

Maximilian Soelch

Technische Universität München

1 / 42
Probability Theory

Probability Theory is the study of uncertainty. Uncertainty is all around us.

Mathematical probability theory is based on measure theory—we do not work at this level.

Slides are mostly based on Review of Probability Theory by Arian Maleki and Tom Do.

2 / 42
Probability Theory

The basic problem that we study in probability theory:

Given a data generating process, what are the properties of the outcomes?

The basic problem of statistics (or better: statistical inference) is the inverse of probability theory:

Given the outcomes, what can we say about the process that generated the data?

Statistics uses the formal language of probability theory.

3 / 42
Basic Elements of Probability
Sample space Ω:

The set of all outcomes of a random experiment.

e.g. rolling a die: Ω = {1, 2, 3, 4, 5, 6}
e.g. rolling a die twice: Ω′ = Ω × Ω = {(1, 1), (1, 2), . . . , (6, 6)}

Set of events F (event space):

A set whose elements A ∈ F (events) are subsets of Ω.

F (σ-field) must satisfy
- ∅ ∈ F,
- A ∈ F ⇒ Ω \ A ∈ F,
- A1, A2, . . . ∈ F ⇒ ∪i Ai ∈ F.

e.g. “die outcome is even” event A = {ω ∈ Ω : ω even} = {2, 4, 6}

e.g. (smallest) σ-field that contains A: F = {∅, {1, 3, 5}, {2, 4, 6}, {1, 2, 3, 4, 5, 6}}

4 / 42
Basic Elements of Probability ctd.

Probability measure P : F → [0, 1]

with Axioms of Probability:
- P(A) ≥ 0 for all A ∈ F,
- P(Ω) = 1,
- If A1, A2, . . . are disjoint events (Ai ∩ Aj = ∅, i ≠ j), then
  P(∪i Ai) = Σi P(Ai).

e.g. for rolling a die: P(A) = |A| / |Ω|

The triple (Ω, F, P) is called a probability space.
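A minimal sketch (plain Python, no external libraries) of the die example: the uniform measure P(A) = |A| / |Ω| on a finite sample space, checked against the axioms.

omega = {1, 2, 3, 4, 5, 6}          # sample space of one die roll

def prob(event):
    # Uniform probability measure on a finite sample space: P(A) = |A| / |Omega|.
    return len(event & omega) / len(omega)

even = {2, 4, 6}                    # event "die outcome is even"
odd = omega - even

print(prob(even))                                   # 0.5
print(prob(omega))                                  # 1.0  (second axiom)
print(prob(even | odd) == prob(even) + prob(odd))   # True (additivity for disjoint events)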

5 / 42
Important Properties

The three axioms from the previous slide suffice to show:
- If A ⊆ B ⇒ P(A) ≤ P(B)
- P(A ∩ B) (≡ P(A, B)) ≤ min(P(A), P(B))
- P(A ∪ B) ≤ P(A) + P(B)
- P(Ω \ A) = 1 − P(A)
- If A1, A2, . . . , Ak are disjoint events such that ∪i Ai = Ω,
  then Σk P(Ak) = 1 (Law of total probability).

6 / 42
Conditional Probability

Let A, B ⊆ Ω be two events with P(B) ≠ 0, then:

P(A | B) := P(A ∩ B) / P(B).

P(A | B) is the probability of A conditioned on B and represents the probability of A, if it is known that B was observed.

7 / 42
Multiplication law

Let A1, . . . , An be events with P(A1 ∩ . . . ∩ An) ≠ 0. Then:

P(A1 ∩ . . . ∩ An) = ∏_{i=1}^{n} P(Ai | ∩_{j<i} Aj)
                   = P(A1) · P(A2 | A1) · P(A3 | A1 ∩ A2) · . . . · P(An | A1 ∩ . . . ∩ An−1).

8 / 42
Law of total probability (revisited)

Let B be an event and Φ a partition of Ω with P(A) > 0 for all A ∈ Φ. Then:

P(B) = Σ_{A∈Φ} P(B ∩ A) = Σ_{A∈Φ} P(A) · P(B | A).

Graphical representation for a 5-partition Φ = {A1, . . . , A5} of Ω:

9 / 42
Bayes’ rule

Let A and B be two events with P(A), P(B) ≠ 0. Then:

P(A | B) = P(B | A) · P(A) / P(B).

Bayes’ rule applies the multiplication rule twice to set P(A | B) and P(B | A) in relation:

P(B | A) · P(A) = P(A ∩ B) = P(A | B) · P(B),

where P(A ∩ B) ≡ P(A, B).

Bayes’ rule is typically used when one conditional probability, say P(B | A), is easy to calculate or given, while the reverse conditional probability P(A | B) is sought.
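A small worked sketch in Python (the numbers are hypothetical, chosen only for illustration): given P(B | A), P(B | Ω \ A), and the prior P(A), compute P(B) by the law of total probability and then P(A | B) by Bayes’ rule.

# Hypothetical numbers: A = "event of interest", B = "observed evidence".
p_A = 0.01                      # prior P(A)
p_B_given_A = 0.95              # P(B | A), assumed given
p_B_given_notA = 0.05           # P(B | Omega \ A), assumed given

# Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' rule: P(A|B) = P(B|A) P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(p_B, p_A_given_B)         # ≈ 0.059, ≈ 0.161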

10 / 42
Independence

Two events A, B are called independent if and only if

P(A, B) = P(A) P(B),

or equivalently P(A | B) = P(A). What does that mean in words?

Two events A, B are called conditionally independent given a third event C if and only if

P(A, B | C) = P(A | C) P(B | C).

11 / 42
Random variables
We are usually only interested in some aspects of a random experiment.

Random variable X : Ω → R (actually not every function is allowed . . . ).

A random variable is usually just denoted by an upper-case letter X (instead of X(ω)). The value a random variable may take is denoted by the corresponding lower-case letter.

For a discrete random variable

P(X = x ) := P({ω ∈ Ω : X (ω) = x }) .

For a continuous random variable

P(a ≤ X ≤ b) := P({ω ∈ Ω : a ≤ X (ω) ≤ b}) .

Note the usage of P here.

12 / 42
Cumulative distribution function – CDF
A probability measure P is specified by a cumulative distribution function
(CDF), a function FX : R → [0, 1]:

FX (x ) ≡ P(X ≤ x ) .

Properties:
- 0 ≤ FX(x) ≤ 1
- lim_{x→−∞} FX(x) = 0
- lim_{x→∞} FX(x) = 1
- x ≤ y ⇒ FX(x) ≤ FX(y)

Let X have CDF FX and Y have CDF FY. If FX(x) = FY(x) for all x, then P(X ∈ A) = P(Y ∈ A) for all (measurable) A.
We call X and Y identically distributed (or equal in distribution).

13 / 42
Probability density function—PDF

For some continuous random variables, the CDF FX(x) is continuous on R. The probability density function is then defined as the piecewise derivative

fX(x) = dFX(x) / dx,

and X is called continuous.

P(x ≤ X ≤ x + ∆x) ≈ fX(x) · ∆x.

Properties:
- fX(x) ≥ 0
- ∫_{x∈A} fX(x) dx = P(X ∈ A)
- ∫_{−∞}^{∞} fX(x) dx = 1

14 / 42
Probability mass function—PMF
X takes on only a countable set of possible values (discrete random
variable).

A probability mass function pX : Ω → [0, 1] is a simple way to represent the probability measure associated with X:

pX(x) = P(X = x)

(Note: We use the probability measure P on the random variable X.)

Properties:
- 0 ≤ pX(x) ≤ 1
- Σx pX(x) = 1
- Σ_{x∈A} pX(x) = P(X ∈ A)

15 / 42
Transformation of Random Variables

Given a (continuous) random variable X and a strictly monotonic (increasing or decreasing) function s, what can we say about Y = s(X)?

fY(y) = fX(t(y)) · |t′(y)|,

where t is the inverse of s.
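A minimal numerical check (assuming numpy is available): take X ∼ U(0, 1) and s(x) = −ln(x), so that Y = s(X) should be Exp(1); compare a histogram of transformed samples with fY(y) = fX(t(y)) |t′(y)| = e^(−y).

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=100_000)    # X ~ U(0, 1), so fX = 1 on (0, 1)
y = -np.log(x)                             # Y = s(X) with s(x) = -ln(x)

# Change of variables: t(y) = exp(-y), |t'(y)| = exp(-y), fX(t(y)) = 1,
# hence fY(y) = exp(-y), i.e. Y ~ Exp(1).
hist, edges = np.histogram(y, bins=50, range=(0, 5), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - np.exp(-centers))))   # small (Monte Carlo error only)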

16 / 42
Expectation

For any measurable function g : R → R, we define the expected value:

E[g(X)] = Σx g(x) pX(x)              (discrete)
E[g(X)] = ∫_{−∞}^{∞} g(x) fX(x) dx   (continuous)

Special case: E[X], i.e., g(x) = x, is called the mean of X.

Properties:
- E[a] = a for any constant a ∈ R
- E[a f(X)] = a E[f(X)] for any constant a ∈ R
- E[f(X) + g(X)] = E[f(X)] + E[g(X)]

For any A ⊆ R: E[I_A(X)] = P(X ∈ A)

17 / 42
Variance and Standard Deviation

Variance measures the concentration of a random variable’s distribution around its mean.

Var(X) = E[(X − E[X])²] = E[X²] − E[X]².

Properties:
- Var(a) = 0 for any constant a ∈ R.
- Var(a f(X)) = a² Var(f(X)) for any constant a ∈ R.

σ(X) = √Var(X) is called the standard deviation of X.
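A short sketch (plain Python, with the fair-die pmf as example): compute the mean and variance directly from the pmf and confirm Var(X) = E[X²] − E[X]².

# Fair six-sided die: pX(x) = 1/6 for x in {1, ..., 6}
pmf = {x: 1 / 6 for x in range(1, 7)}

def expect(g, pmf):
    # E[g(X)] = sum_x g(x) pX(x) for a discrete random variable.
    return sum(g(x) * p for x, p in pmf.items())

mean = expect(lambda x: x, pmf)                       # 3.5
var = expect(lambda x: (x - mean) ** 2, pmf)          # 35/12 ≈ 2.9167
alt = expect(lambda x: x ** 2, pmf) - mean ** 2       # same value: E[X^2] - E[X]^2
std = var ** 0.5                                      # standard deviation
print(mean, var, alt, std)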

18 / 42
Entropy

The Shannon entropy or just entropy of a discrete random variable X is

H[X] ≡ − Σx P(X = x) log P(X = x) = −E[log P(X)].

Given two probability mass functions p1 and p2, the Kullback-Leibler divergence (or relative entropy) between p1 and p2 is

KL(p1 ‖ p2) ≡ − Σx p1(x) log (p2(x) / p1(x))

Note that the KL divergence is not symmetric.
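A minimal sketch (plain Python with math.log; natural logarithm assumed) computing the entropy of a pmf and the KL divergence in both directions, illustrating the asymmetry.

from math import log

def entropy(p):
    # Shannon entropy H[X] = -sum_x p(x) log p(x), natural log.
    return -sum(px * log(px) for px in p.values() if px > 0)

def kl(p1, p2):
    # KL(p1 || p2) = -sum_x p1(x) log(p2(x) / p1(x)); assumes p2(x) > 0 wherever p1(x) > 0.
    return -sum(p1[x] * log(p2[x] / p1[x]) for x in p1 if p1[x] > 0)

p1 = {0: 0.5, 1: 0.5}
p2 = {0: 0.9, 1: 0.1}
print(entropy(p1), entropy(p2))        # 0.693..., 0.325...
print(kl(p1, p2), kl(p2, p1))          # 0.510... vs 0.368... -- not symmetric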

19 / 42
Bernoulli distribution

A Bernoulli-distributed random variable X ∼ Ber(µ), µ ∈ [0, 1], models the outcome of a binary experiment. It is positive with probability µ and negative with probability 1 − µ.

pX(x) = µ        if x = 1,
        1 − µ    if x = 0,
        0        else.

For calculations the following equation is more useful:

Ber(x | µ) = µ^x · (1 − µ)^(1−x)

20 / 42
Binomial distribution

A Binomial random variable X ∼ Bin(N, µ), N ≥ 1, µ ∈ [0, 1], counts the number of successes in N trials, where each trial is independent of the others. The success probability is µ.

For x ∈ {0, 1, . . . , N}:

Bin(x | N, µ) = (N choose x) · µ^x · (1 − µ)^(N−x)

21 / 42
Poisson distribution

A Binomial random variable with large N and small µ can be approximated by a Poisson random variable X ∼ Poi(λ).
For λ = Nµ and as N → ∞:

X ∼ Bin(N, µ) → X ∼ Poi(λ)

For x ∈ N0:

Poi(x | λ) = e^(−λ) · λ^x / x!
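A quick numerical illustration of the approximation (plain Python, using math.comb, math.exp, and math.factorial): for large N and µ = λ/N the Binomial pmf is close to the Poisson pmf.

from math import comb, exp, factorial

def binom_pmf(x, N, mu):
    return comb(N, x) * mu ** x * (1 - mu) ** (N - x)

def poisson_pmf(x, lam):
    return exp(-lam) * lam ** x / factorial(x)

lam, N = 3.0, 1000
mu = lam / N
# Maximum absolute difference over a few values of x -- shrinks as N grows.
print(max(abs(binom_pmf(x, N, mu) - poisson_pmf(x, lam)) for x in range(20)))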

22 / 42
Uniform distribution

A uniformly distributed random variable X ∼ U(a, b), a, b ∈ R, a < b, takes any value on the interval [a, b] with equal probability.

For x ∈ [a, b]:

U(x | a, b) = 1 / (b − a)

23 / 42
Exponential distribution

An exponentially distributed random variable X ∼ Exp(λ), λ > 0, can be interpreted as the waiting time until an event (“success”) occurs for the first time. λ corresponds to the expected number of successes in one unit of time.

For x ≥ 0:

Exp(x | λ) = λ · e^(−λx)

24 / 42
Normal/Gaussian distribution

A Normal or Gaussian random variable X ∼ N(µ, σ²), µ ∈ R, σ > 0, has approximately the same distribution as the (standardized) sum of many independent, identically but otherwise arbitrarily distributed random variables.

For x ∈ R:

N(x | µ, σ²) = 1 / (σ√(2π)) · e^(−(x−µ)² / (2σ²))

25 / 42
Beta distribution

Random variables X ∼ Beta(a, b), a, b > 0, following a Beta distribution can often be seen as the success probability of a binary event.

For x ∈ [0, 1]:

Beta(x | a, b) = Γ(a + b) / (Γ(a)Γ(b)) · x^(a−1) (1 − x)^(b−1)

26 / 42
Gamma distribution

Random variables X ∼ Gamma(a, b) following a Gamma distribution are governed by the parameters a, b > 0.

For x > 0:

Gamma(x | a, b) = 1/Γ(a) · b^a x^(a−1) e^(−bx)

Γ(a) = ∫_0^∞ t^(a−1) e^(−t) dt,   Γ(n + 1) = n! for n ∈ N0

The Gamma distribution is the conjugate prior for the precision (inverse variance) of a univariate Gaussian distribution.

27 / 42
Overview: probability distributions

Distribution  | Notation     | Param.             | Co-dom.              | PMF / PDF                                  | Mean      | Variance
Bernoulli*    | Ber(µ)       | µ ∈ [0, 1]         | x ∈ {0, 1}           | µ^x (1 − µ)^(1−x)                          | µ         | µ(1 − µ)
Binomial*     | Bin(N, µ)    | N ≥ 1, µ ∈ [0, 1]  | x ∈ {0, 1, . . . , N}| (N choose x) µ^x (1 − µ)^(N−x)             | Nµ        | Nµ(1 − µ)
Poisson*      | Poi(λ)       | λ > 0              | x ∈ N0               | e^(−λ) λ^x / x!                            | λ         | λ
Uniform       | U(a, b)      | a, b ∈ R, a < b    | x ∈ [a, b]           | 1 / (b − a)                                | (a + b)/2 | (b − a)²/12
Exponential   | Exp(λ)       | λ > 0              | x ≥ 0                | λ e^(−λx)                                  | 1/λ       | 1/λ²
Normal/Gauss  | N(µ, σ²)     | µ ∈ R, σ > 0       | x ∈ R                | 1/(σ√(2π)) exp{−(x − µ)²/(2σ²)}            | µ         | σ²
Beta          | Beta(a, b)   | a, b > 0           | x ∈ [0, 1]           | Γ(a+b)/(Γ(a)Γ(b)) x^(a−1) (1 − x)^(b−1)    | a/(a + b) | ab/((a + b)²(a + b + 1))
Gamma         | Gamma(a, b)  | a, b > 0           | x ≥ 0                | b^a/Γ(a) x^(a−1) e^(−bx)                   | a/b       | a/b²

*Discrete distributions

With the gamma function Γ(x) = ∫_0^∞ t^(x−1) e^(−t) dt, which has the property that Γ(n + 1) = n! for n ∈ N0.

28 / 42
Two random variables—Bivariate case

Two random variables X and Y can interact. We need to consider them simultaneously for statistical analysis. To this end, we introduce the joint cumulative distribution function of X and Y:

FXY(x, y) = P(X ≤ x, Y ≤ y).

FX(x) and FY(y) are the marginal cumulative distribution functions of FXY(x, y).

Properties:
- 0 ≤ FXY(x, y) ≤ 1
- lim_{x,y→−∞} FXY(x, y) = 0
- lim_{x,y→∞} FXY(x, y) = 1
- FX(x) = lim_{y→∞} FXY(x, y)

29 / 42
Two continuous random variables
Most properties can be defined analogously to the univariate case.
Joint probability density function:

fXY(x, y) = ∂²FXY(x, y) / (∂x ∂y).

Properties:
- fXY(x, y) ≥ 0
- ∫∫_A fXY(x, y) dx dy = P((X, Y) ∈ A)
- ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY(x, y) dx dy = 1

If we remove the effect of one of the random variables, we obtain the marginal probability density function, or marginal density for short:

fX(x) = ∫_{−∞}^{∞} fXY(x, y) dy.

30 / 42
Relations between fX,Y, fX, fY, FX,Y, FX and FY

31 / 42
Two discrete random variables

Joint probability mass function:

pXY(x, y) = P(X = x, Y = y).

Properties:
- 0 ≤ pXY(x, y) ≤ 1
- Σx Σy pXY(x, y) = 1

In order to get the marginal probability mass function pX(x), we need to sum out all possible y (marginalization):

pX(x) = Σy pXY(x, y).
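A small sketch (assuming numpy) with a joint pmf stored as a 2-D array: marginalize by summing over an axis, and note that all entries sum to 1.

import numpy as np

# Hypothetical joint pmf p_XY(x, y): rows index x in {0, 1}, columns index y in {0, 1, 2}.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])

assert np.isclose(p_xy.sum(), 1.0)        # sum_x sum_y p_XY(x, y) = 1

p_x = p_xy.sum(axis=1)                    # marginalization: p_X(x) = sum_y p_XY(x, y)
p_y = p_xy.sum(axis=0)                    # p_Y(y) = sum_x p_XY(x, y)
print(p_x, p_y)                           # [0.4 0.6] [0.35 0.35 0.3]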

32 / 42
Conditional distributions/Bayes’ rule

              | discrete                                  | continuous
Definition    | pY|X(y | x) = pXY(x, y) / pX(x)           | fY|X(y | x) = fXY(x, y) / fX(x)
Bayes’ rule   | pY|X(y | x) = pX|Y(x | y) pY(y) / pX(x)   | fY|X(y | x) = fX|Y(x | y) fY(y) / fX(x)
Probabilities | pY|X(y | x) = P(Y = y | X = x)            | P(Y ∈ A | X = x) = ∫_A fY|X(y | x) dy

33 / 42
Independence

Two random variables X, Y are independent if

FXY(x, y) = FX(x) FY(y) for all values x and y.

Equivalently:
- pXY(x, y) = pX(x) pY(y)
- pY|X(y | x) = pY(y)
- fXY(x, y) = fX(x) fY(y)
- fY|X(y | x) = fY(y)

34 / 42
Independent and identically distributed—i.i.d.

Two random variables X and Y are called identically distributed if the following holds:

fX(x) = fY(x),
FX(x) = FY(x).

As a consequence (among many others):

E[X] = E[Y],
Var(X) = Var(Y).

It does not mean that X = Y! X and Y following the same distribution does not imply that they always take the same values.
If X and Y are also independent, we call them independent and identically distributed (i.i.d.).

35 / 42
Expectation and covariance
Given two random variables X, Y and g : R² → R:
- E[g(X, Y)] := Σx Σy g(x, y) pXY(x, y)                        (discrete)
- E[g(X, Y)] := ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) fXY(x, y) dx dy  (continuous)

Covariance (see the sketch below):
- Cov(X, Y) := E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y].
- When Cov(X, Y) = 0, X and Y are uncorrelated.
- Pearson correlation coefficient ρ(X, Y):

  ρ(X, Y) := Cov(X, Y) / √(Var(X) Var(Y)) ∈ [−1, 1].

- E[f(X, Y) + g(X, Y)] = E[f(X, Y)] + E[g(X, Y)].
- Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).
- If X and Y are independent, then Cov(X, Y) = 0.
- If X and Y are independent, then E[f(X) g(Y)] = E[f(X)] E[g(Y)].
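A minimal sketch (assuming numpy): estimate Cov(X, Y) and ρ(X, Y) from samples and check the identity Cov(X, Y) = E[XY] − E[X]E[Y].

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50_000)
y = 2.0 * x + rng.normal(size=50_000)      # Y depends on X, so Cov(X, Y) != 0

cov_direct = np.mean(x * y) - np.mean(x) * np.mean(y)   # E[XY] - E[X]E[Y]
cov_np = np.cov(x, y, bias=True)[0, 1]                  # sample covariance matrix entry
rho = cov_np / np.sqrt(np.var(x) * np.var(y))

print(cov_direct, cov_np)                  # both ≈ 2 = Cov(X, 2X + noise)
print(rho, np.corrcoef(x, y)[0, 1])        # both ≈ 0.894 = 2 / sqrt(1 * 5)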

36 / 42
Multiple random variables—Random vectors
Generalize the previous ideas to more than two random variables by putting all these random variables together in one vector X, a random vector X : Ω → R^n. The notions of joint CDF and PDF apply equivalently, e.g.

FX1,X2,...,Xn(x1, x2, . . . , xn) = P(X1 ≤ x1, X2 ≤ x2, . . . , Xn ≤ xn).

Expectation of a continuous random vector for g : R^n → R:

E[g(X)] = ∫_{R^n} g(x1, x2, . . . , xn) fX1,X2,...,Xn(x1, x2, . . . , xn) dx1 dx2 . . . dxn.

If g : R^n → R^m, then the expected value of g is taken element-wise over the output vector:

E[g(X)] = (E[g1(X)], E[g2(X)], . . . , E[gm(X)])^T.
37 / 42
Independence of more than two random variables

The random variables X1, . . . , Xn are independent if for all subsets I = {i1, . . . , ik} ⊂ {1, . . . , n} and all (xi1, . . . , xik)

fXi1,...,Xik(xi1, . . . , xik) = fXi1(xi1) · . . . · fXik(xik),

or equivalently

FXi1,...,Xik(xi1, . . . , xik) = FXi1(xi1) · . . . · FXik(xik),

hold.

If there exists a combination of values for which the equations above do not hold, then the random variables are not independent.
For better distinction, this notion of independence is sometimes called mutual independence.

38 / 42
Covariance matrix

For a random vector X : Ω → R^n, the covariance matrix Σ is the n × n square, symmetric, positive semi-definite matrix whose entries are

Σij = Cov(Xi, Xj).

Σ = E[(X − E[X])(X − E[X])^T] = E[X X^T] − E[X] E[X]^T

39 / 42
Multinomial distribution
The multivariate version of the Binomial is called a Multinomial, X ∼ Multinomial(N, µ). We have k ≥ 1 mutually exclusive events, where event i has success probability µi (such that Σ_{i=1}^{k} µi = 1).
We draw N times independently.

pX(x1, x2, . . . , xk) = (N; x1, x2, . . . , xk) · µ1^x1 µ2^x2 . . . µk^xk,

where

Σi xi = N,
(N; x1, x2, . . . , xk) = N! / (x1! x2! . . . xk!)   (the multinomial coefficient),
E[X] = (N µ1, N µ2, . . . , N µk),
Var(Xi) = N µi (1 − µi),
Cov(Xi, Xj) = −N µi µj   (i ≠ j).

Example: An urn with n balls of k ≥ 1 different labels, drawn N ≥ 1 times with replacement and probabilities µi = #i / n, where #i is the number of balls with label i.
The marginal distribution of Xi is Bin(N, µi).
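A brief check (assuming numpy): sample from a Multinomial and compare empirical means, variances, and covariances to Nµi, Nµi(1 − µi), and −Nµiµj.

import numpy as np

rng = np.random.default_rng(2)
N, mu = 10, np.array([0.2, 0.3, 0.5])

samples = rng.multinomial(N, mu, size=200_000)    # each row sums to N

print(samples.mean(axis=0), N * mu)                              # empirical vs N*mu
print(np.cov(samples, rowvar=False)[0, 1], -N * mu[0] * mu[1])   # Cov(X1, X2) vs -N*mu1*mu2
print(samples.var(axis=0), N * mu * (1 - mu))                    # Var(Xi) vs N*mu_i*(1-mu_i)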
40 / 42
Multivariate Gaussian

The multivariate version of the Gaussian, X ∼ N(µ, Σ), is very similar to the univariate one, except that it allows for dependencies between the individual components. µ ∈ R^k is the mean vector, and the positive definite, symmetric Σ ∈ R^(k×k) is the covariance matrix:

fX(x1, x2, . . . , xk) = 1 / √((2π)^k det Σ) · exp(−(1/2) (x − µ)^T Σ^(−1) (x − µ)),

where

E[X] = µ,
Var(Xi) = Σii,
Cov(Xi, Xj) = Σij.

The marginal distribution of Xi is N(µi, Σii).
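A small sketch (assuming numpy and scipy): evaluate the density with scipy.stats.multivariate_normal and check on samples that the marginal means and variances match µ and diag(Σ).

import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

dist = multivariate_normal(mean=mu, cov=Sigma)
print(dist.pdf(mu))                     # density at the mean: 1 / (2*pi*sqrt(det(Sigma))) for k = 2

rng = np.random.default_rng(3)
x = rng.multivariate_normal(mu, Sigma, size=100_000)
print(x.mean(axis=0))                   # ≈ mu
print(x.var(axis=0))                    # ≈ diag(Sigma) = (2.0, 1.0), the marginal variances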

41 / 42
Notation in the lecture

Consider

pX(x), x ∈ R   vs.   pX(y), y ∈ R.

pX(x), fX(x), pXY(x, y), fXY(x, y) are written as p(x) or p(x, y). Likewise, pY(y) is written as p(y).

42 / 42
