ML 2
Chapter 2: Probability Distributions
Min Sun
Department of Electrical Engineering, National Tsing Hua University
2/19/23
2 Parametric Distributions
Basic building blocks: parametric distributions p(x | θ).
Need to determine θ given a data set {x₁, …, x_N}.
Representation: a point estimate θ* or a full distribution p(θ)?
3 Binary Variables (1)
Bernoulli Distribution: Bern(x | μ) = μˣ (1 − μ)^(1−x), x ∈ {0, 1}
5 Binary Variables (2)
N coin flips:
Binomial Distribution: Bin(m | N, μ) = (N choose m) μᵐ (1 − μ)^(N−m)
6 Binomial Distribution
7 Parameter Estimation (1)
Likelihood function P(D | μ)
Prior P(μ)
Posterior P(μ | D)
Conjugate prior:
The prior P(μ) is the conjugate prior for a likelihood function P(D | μ) if the prior P(μ) and the posterior P(μ | D) have the same functional form.
Example (coin-flip problem)
Prior P(μ): Beta(β₁, β₂); Likelihood P(D | μ): Binomial, ∝ μ^X (1 − μ)^(N−X)
ln P(D | μ) = X ln μ + (N − X) ln(1 − μ), where X (the number of heads) and N are the sufficient statistics.
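As a minimal numerical sketch of this conjugacy (the flip counts and the Beta hyperparameters a0, b0 below are made-up values for illustration, not taken from the slides): the Beta prior is updated by simply adding the observed numbers of heads and tails to its two exponents.

```python
import numpy as np

# Hypothetical coin-flip data: 1 = heads, 0 = tails.
rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=20)      # 20 flips from a biased coin
X, N = data.sum(), data.size              # sufficient statistics

# Beta(a0, b0) prior (assumed hyperparameters).
a0, b0 = 2.0, 2.0

# Conjugacy: Beta prior x Binomial likelihood -> Beta posterior.
aN, bN = a0 + X, b0 + (N - X)

print(f"posterior: Beta({aN:.0f}, {bN:.0f}), mean = {aN / (aN + bN):.3f}")
```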
11 Beta Distribution
Distribution over μ ∈ [0, 1]:
Beta(μ | a, b) = (Γ(a + b) / (Γ(a) Γ(b))) μ^(a−1) (1 − μ)^(b−1)
14 Prior · Likelihood = Posterior
Eq. (2.9)
15 Properties of the Posterior
16 Prediction under the Posterior
What is the probability that the next coin toss
will land heads up?
p(x = 1 | D) = ∫₀¹ p(x = 1 | μ) p(μ | D) dμ = ∫₀¹ μ p(μ | D) dμ = E[μ | D]

p(x = 1 | D) = a_N / (a_N + b_N)
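A quick check of this result (the posterior parameters aN, bN below are assumed toy values): the posterior predictive probability of heads equals the posterior mean of μ, which we can also confirm by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy Beta posterior parameters.
aN, bN = 16.0, 8.0

# Exact posterior predictive: p(x=1 | D) = E[mu | D] = aN / (aN + bN).
exact = aN / (aN + bN)

# Monte Carlo check: average mu over draws from the Beta posterior.
mc = rng.beta(aN, bN, size=100_000).mean()

print(f"exact = {exact:.4f}, Monte Carlo ≈ {mc:.4f}")
```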
17 Multinomial Variables
1-of-K coding scheme: x is a K-dimensional vector with one element x_k = 1 and all others 0, so ∑_k x_k = 1 (e.g. x = (0, 0, 1, 0, 0, 0)ᵀ).
18 ML Parameter Estimation
Given the counts m_k = ∑_n x_{nk}, maximize ∑_k m_k ln μ_k subject to ∑_k μ_k = 1 using a Lagrange multiplier λ:
∂/∂μ_k [ ∑_k m_k ln μ_k + λ(∑_k μ_k − 1) ] = m_k/μ_k + λ = 0
Using the constraint gives λ = −N, hence μ_k^ML = m_k / N.
19 The Multinomial Distribution
Mult(m₁, …, m_K | μ, N) = (N! / (m₁! ⋯ m_K!)) ∏_{k=1}^{K} μ_k^{m_k}
20 The Dirichlet Distribution
Also known as multivariate beta distribution (MBD)
Lejeune Dirichlet
∑_k μ_k = 1 (the μ_k live on a simplex)
Dir(μ | α) = (Γ(α₀) / (Γ(α₁) ⋯ Γ(α_K))) ∏_{k=1}^{K} μ_k^(α_k − 1),  where α₀ = ∑_k α_k
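A small sketch of the Dirichlet as a distribution over the simplex (the concentration parameters alpha below are assumed): every draw is a vector of non-negative μ_k that sums to 1, and the mean of μ_k is α_k / α₀.

```python
import numpy as np

rng = np.random.default_rng(14)

# Assumed Dirichlet concentration parameters for K = 3 states.
alpha = np.array([2.0, 5.0, 3.0])

samples = rng.dirichlet(alpha, size=10_000)

print("each draw sums to 1:", np.allclose(samples.sum(axis=1), 1.0))
print("sample mean of mu  :", samples.mean(axis=0).round(3))
print("alpha / alpha_0    :", (alpha / alpha.sum()).round(3))
```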
21 Bayesian Multinomial (1)
22 Bayesian Multinomial (2)
23 Continuous Variables
24 The Gaussian Distribution
N(x | μ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp{ −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) }
The quadratic form Δ² = (x − μ)ᵀ Σ⁻¹ (x − μ) is the square of the Mahalanobis distance from μ to x.
25 Central Limit Theorem
The distribution of the sum (or mean) of N i.i.d. random variables becomes increasingly Gaussian as N grows.
26 Geometry of the Multivariate Gaussian
Σ uᵢ = λᵢ uᵢ
Σ = ∑_{i=1}^{D} λᵢ uᵢ uᵢᵀ
y = Uᵀ(x − μ)   (Karhunen-Loève transform, PCA)
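A minimal sketch of this geometry (the mean and covariance below are assumed toy values): the eigendecomposition of Σ gives the principal axes, and rotating into the eigenbasis with y = Uᵀ(x − μ) decorrelates the components, whose variances are the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed 2-D Gaussian with a correlated covariance matrix.
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=5000)

# Eigendecomposition: Sigma u_i = lambda_i u_i, Sigma = sum_i lambda_i u_i u_i^T.
lam, U = np.linalg.eigh(Sigma)

# Rotate into the eigenbasis: y = U^T (x - mu).
Y = (X - mu) @ U

# In the rotated coordinates the components are (approximately) uncorrelated,
# with variances given by the eigenvalues.
print("eigenvalues:", lam)
print("sample cov :\n", np.cov(Y, rowvar=False))
```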
27 Moments of the Multivariate Gaussian (1)
z = x − μ
E[x] = μ: the term linear in z vanishes thanks to the anti-symmetry of z.
28 Moments of the Multivariate Gaussian (2)
30 Partitioned Gaussian Distributions
Σᵀ = Σ
Precision matrix: Λ ≡ Σ⁻¹, Λᵀ = Λ
31 Partitioned Conditionals and Marginals
Σ_{a|b} = Λ_{aa}⁻¹
μ_{a|b} = Σ_{a|b} { Λ_{aa} μ_a − Λ_{ab}(x_b − μ_b) }
        = μ_a − Λ_{aa}⁻¹ Λ_{ab}(x_b − μ_b)
32 Partitioned Conditionals and Marginals
Making use of the following identity for the inverse of a partitioned
matrix
[A B; C D]⁻¹ = [ M   −M B D⁻¹ ;  −D⁻¹ C M   D⁻¹ + D⁻¹ C M B D⁻¹ ]
where M = (A − B D⁻¹ C)⁻¹. M⁻¹ is known as the Schur complement of the matrix [A B; C D] with respect to the submatrix D.
33 Partitioned Conditionals and Marginals
Conditional distribution: p(x_a | x_b) = N(x_a | μ_{a|b}, Λ_{aa}⁻¹)
Marginal distribution: p(x_a) = N(x_a | μ_a, Σ_{aa})
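A short numerical sketch of these partitioned-Gaussian formulas (the 3-D mean, covariance, and observed value of x_b below are assumed): the conditional is computed from the partitioned precision matrix, the marginal by simply reading off the corresponding blocks.

```python
import numpy as np

# Assumed 3-D joint Gaussian; partition indices a = [0], b = [1, 2].
mu = np.array([0.5, -0.2, 1.0])
Sigma = np.array([[1.0, 0.6, 0.3],
                  [0.6, 2.0, 0.5],
                  [0.3, 0.5, 1.5]])
a, b = [0], [1, 2]

# Partitioned precision matrix.
Lam = np.linalg.inv(Sigma)
Lam_aa = Lam[np.ix_(a, a)]
Lam_ab = Lam[np.ix_(a, b)]

xb = np.array([0.0, 2.0])                      # observed value of x_b

# Conditional p(x_a | x_b): Sigma_{a|b} = Lam_aa^{-1},
# mu_{a|b} = mu_a - Lam_aa^{-1} Lam_ab (x_b - mu_b).
Sigma_cond = np.linalg.inv(Lam_aa)
mu_cond = mu[a] - Sigma_cond @ Lam_ab @ (xb - mu[b])

# Marginal p(x_a): pick out the corresponding blocks.
mu_marg, Sigma_marg = mu[a], Sigma[np.ix_(a, a)]

print("conditional:", mu_cond, Sigma_cond)
print("marginal   :", mu_marg, Sigma_marg)
```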
34 Partitioned Conditionals and Marginals
35 Bayes’ Theorem for Gaussian Variables (1)
Given a marginal Gaussian distribution for x and a linear Gaussian model for y given x,
p(x) = N(x | μ, Λ⁻¹)
p(y | x) = N(y | Ax + b, L⁻¹)
we have
p(y) = N(y | Aμ + b, L⁻¹ + A Λ⁻¹ Aᵀ)
p(x | y) = N(x | Σ{AᵀL(y − b) + Λμ}, Σ)
where Σ = (Λ + AᵀLA)⁻¹.
36 Bayes’ Theorem for Gaussian Variables (2)
Define z = (x, y)ᵀ. Collecting the second-order terms of ln p(z) = ln p(x) + ln p(y | x), we have the precision matrix of the joint distribution
R = [ Λ + AᵀLA   −AᵀL ;  −LA   L ]
The covariance matrix is found by taking the inverse of the precision: cov[z] = R⁻¹.
38 Bayes’ Theorem for Gaussian Variables (4)
Collecting the terms linear in x and y gives the mean of the joint distribution: E[z] = (μ, Aμ + b)ᵀ.
39 Maximum Likelihood for the Gaussian (1)
Sufficient statistics: ∑_n x_n and ∑_n x_n x_nᵀ. Maximizing the log likelihood gives μ_ML = (1/N) ∑_{n=1}^{N} x_n.
40 Maximum Likelihood for the Gaussian (2)
Similarly, Σ_ML = (1/N) ∑_{n=1}^{N} (x_n − μ_ML)(x_n − μ_ML)ᵀ.
41 Maximum Likelihood for the Gaussian (3)
E[μ_ML] = μ, but E[Σ_ML] = ((N − 1)/N) Σ, so the ML covariance estimate is biased. Hence define the unbiased estimator
Σ̃ = (1/(N − 1)) ∑_{n=1}^{N} (x_n − μ_ML)(x_n − μ_ML)ᵀ,  for which E[Σ̃] = Σ.
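A minimal sketch of these estimators on simulated data (the ground-truth mean and covariance below are assumed): the biased ML covariance and its bias-corrected version differ by the factor N/(N − 1).

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed ground-truth 2-D Gaussian.
mu_true = np.array([0.0, 3.0])
Sigma_true = np.array([[1.0, 0.4],
                       [0.4, 0.5]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=10)
N = X.shape[0]

# Maximum likelihood estimates.
mu_ml = X.mean(axis=0)
diff = X - mu_ml
Sigma_ml = diff.T @ diff / N           # biased: E[Sigma_ml] = (N-1)/N * Sigma

# Bias-corrected estimator.
Sigma_unbiased = diff.T @ diff / (N - 1)

print("mu_ml          :", mu_ml)
print("Sigma_ml       :\n", Sigma_ml)
print("Sigma_unbiased :\n", Sigma_unbiased)
```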
42 Sequential Estimation
Contribution of the Nth data point, x_N:
μ_ML^(N) = μ_ML^(N−1) + (1/N)(x_N − μ_ML^(N−1))
i.e. old estimate + correction weight (1/N) × correction given x_N.
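A minimal sketch of this sequential update on a simulated data stream (the stream's mean and variance below are assumed): processing one point at a time reproduces the batch mean.

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(loc=2.0, scale=1.0, size=1000)   # assumed toy data stream

# Sequential ML estimate of the mean: process one point at a time.
mu = 0.0
for N, x_N in enumerate(data, start=1):
    mu = mu + (x_N - mu) / N        # old estimate + (1/N) * correction

print("sequential mean:", mu)
print("batch mean     :", data.mean())
```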
43 The Robbins-Monro Algorithm (1)
Consider θ and z governed by a joint distribution p(z, θ), and define the regression function
f(θ) ≡ E[z | θ] = ∫ z p(z | θ) dz
The goal is to find the root θ* at which f(θ*) = 0.
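A sketch of the Robbins-Monro iteration on a toy problem (the regression function, its root, the noise level, and the sign convention below are my assumptions for an increasing f, not the slides' example): the root is found from noisy observations alone, using step sizes a_N = 1/N that satisfy the usual conditions (a_N → 0, ∑ a_N = ∞, ∑ a_N² < ∞).

```python
import numpy as np

rng = np.random.default_rng(6)

theta_star = 1.5          # unknown root we want to find (assumed for the demo)

def noisy_f(theta):
    """One noisy observation z of the regression function f(theta) = theta - theta_star."""
    return (theta - theta_star) + rng.normal(scale=0.5)

# Robbins-Monro iteration: theta_N = theta_{N-1} - a_N * z(theta_{N-1}).
theta = 0.0
for N in range(1, 5001):
    a_N = 1.0 / N
    theta = theta - a_N * noisy_f(theta)

print("estimated root:", theta)    # should be close to theta_star = 1.5
```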
44 The Robbins-Monro Algorithm (2)
46 Robbins-Monro for Maximum Likelihood (1)
For a general maximum likelihood problem, the ML solution is a root of the expected gradient of the log likelihood, so the Robbins-Monro procedure can be applied one data point at a time.
47 Robbins-Monro for Maximum Likelihood (2)
49 Bayesian Inference for the Gaussian (2)
50 Bayesian Inference for the Gaussian (3)
p(μ | D) = N(μ | μ_N, σ_N²), where
μ_N = (σ² / (N σ₀² + σ²)) μ₀ + (N σ₀² / (N σ₀² + σ²)) μ_ML
1/σ_N² = 1/σ₀² + N/σ²
Note: as N → ∞, the posterior mean tends to the ML estimate μ_ML and the posterior variance goes to zero.
51 Bayesian Inference for the Gaussian (4)
Example for N = 0, 1, 2, and 10: prior with mean 0, and data sampled from a Gaussian with mean 0.8 and variance 0.1.
52 Bayesian Inference for the Gaussian (5)
Sequential estimation: the posterior obtained after observing N − 1 data points becomes the prior when we observe the Nth data point.
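A minimal sketch of this sequential update for the mean of a Gaussian with known variance, mimicking the example on the previous slide (prior mean 0, data drawn from a Gaussian with mean 0.8 and variance 0.1; the prior variance used below is an assumption, not stated on the slides):

```python
import numpy as np

rng = np.random.default_rng(7)

sigma2 = 0.1                      # known data variance
mu0, sigma0_sq = 0.0, 0.1         # prior N(mu0, sigma0_sq); prior variance assumed
data = rng.normal(0.8, np.sqrt(sigma2), size=10)

mu_N, sigma_N_sq = mu0, sigma0_sq
for x in data:
    # Treat the current posterior as the prior for the next observation.
    precision = 1.0 / sigma_N_sq + 1.0 / sigma2
    mu_N = (mu_N / sigma_N_sq + x / sigma2) / precision
    sigma_N_sq = 1.0 / precision

print(f"posterior after N={data.size}: mean={mu_N:.3f}, var={sigma_N_sq:.4f}")
```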
53 Bayesian Inference for the Gaussian (6)
54 Bayesian Inference for the Gaussian (7)
55 Bayesian Inference for the Gaussian (8)
56 Bayesian Inference for the Gaussian (9)
58 Bayesian Inference for the Gaussian (11)
59 Bayesian Inference for the Gaussian (12)
Multivariate conjugate priors
μ unknown, Λ known: p(μ) Gaussian.
Λ unknown, μ known: p(Λ) Wishart.
Both μ and Λ unknown: p(μ, Λ) Gaussian-Wishart.
60 Student’s t-Distribution (1)
If we integrate out the precision of a Gaussian under a Gamma prior,
p(x | μ, a, b) = ∫ N(x | μ, τ⁻¹) Gam(τ | a, b) dτ = St(x | μ, λ, ν)
where λ = a/b and ν = 2a. λ acts like the precision (though it is not in general the inverse variance), and ν is the number of degrees of freedom.
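A minimal sketch of this scale-mixture view (the values of μ, a, b below are assumed): drawing a precision from a Gamma distribution and then sampling a Gaussian with that precision produces samples whose distribution matches a Student's t with ν = 2a.

```python
import numpy as np

rng = np.random.default_rng(8)

# Assumed parameters: mu, Gamma(a, b) prior on the precision tau.
mu, a, b = 0.0, 2.0, 2.0
n = 200_000

tau = rng.gamma(shape=a, scale=1.0 / b, size=n)      # Gamma(a, b) with rate b
x_mixture = rng.normal(mu, 1.0 / np.sqrt(tau))       # Gaussian with precision tau

# Direct draws from the corresponding Student's t (nu = 2a, scale sqrt(b/a)).
nu, scale = 2 * a, np.sqrt(b / a)
x_t = mu + scale * rng.standard_t(df=nu, size=n)

# The two samples should have matching quantiles.
q = [0.05, 0.25, 0.5, 0.75, 0.95]
print(np.quantile(x_mixture, q).round(3))
print(np.quantile(x_t, q).round(3))
```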
62 Student’s t-Distribution (3)
Robustness to outliers: Gaussian, Laplacian vs t-distribution.
63 Student’s t-Distribution (4)
The multivariate case: St(x | μ, Λ, ν), where Δ² = (x − μ)ᵀ Λ (x − μ).
Properties: E[x] = μ (for ν > 1), cov[x] = (ν/(ν − 2)) Λ⁻¹ (for ν > 2), mode[x] = μ.
64 Periodic variables
65 von Mises Distribution (1)
This requirement is satisfied by the von Mises distribution
p(θ | θ₀, m) = (1 / (2π I₀(m))) exp{ m cos(θ − θ₀) }
where I₀(m) is the zeroth-order modified Bessel function of the first kind, θ₀ is the mean, and m is the concentration parameter (playing a role like the precision of a Gaussian).
67 Maximum Likelihood for von Mises
Given a data set D = {θ₁, …, θ_N}, the log likelihood function is given by
ln p(D | θ₀, m) = −N ln(2π) − N ln I₀(m) + m ∑_n cos(θ_n − θ₀)
Maximizing with respect to θ₀ gives θ₀^ML = tan⁻¹( ∑_n sin θ_n / ∑_n cos θ_n ).
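A minimal sketch of this ML estimate on synthetic angular data (the true mean direction and concentration below are assumed): the estimator is just the arctangent of the summed sines over the summed cosines.

```python
import numpy as np

rng = np.random.default_rng(9)

# Assumed angular data (radians) from a von Mises distribution.
theta0_true, m_true = 2.0, 5.0
angles = rng.vonmises(theta0_true, m_true, size=500)

# ML estimate of the mean direction: theta0 = atan2(sum sin, sum cos).
theta0_ml = np.arctan2(np.sin(angles).sum(), np.cos(angles).sum())

print(f"true mean direction {theta0_true:.3f}, ML estimate {theta0_ml:.3f}")
```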
https://www.kaggle.com/janithwanni/old-faithful
69 Mixtures of Gaussians (2)
p(x) = ∑_{k=1}^{K} π_k N(x | μ_k, Σ_k)
N(x | μ_k, Σ_k) is the kth component; π_k is its mixing coefficient. Example: K = 3.
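A minimal sketch of such a mixture with K = 3 components (the mixing coefficients, means, and variances below are assumed, and the example is 1-D for simplicity): the density is a weighted sum of Gaussians, and samples are drawn by first choosing a component and then sampling from it.

```python
import numpy as np

rng = np.random.default_rng(10)

# Assumed 1-D mixture of K = 3 Gaussians.
pi = np.array([0.5, 0.3, 0.2])          # mixing coefficients, sum to 1
mu = np.array([-2.0, 0.0, 3.0])
var = np.array([0.5, 1.0, 0.8])

def gmm_density(x):
    """p(x) = sum_k pi_k N(x | mu_k, var_k) evaluated on an array x."""
    comps = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return comps @ pi

# Ancestral sampling: pick a component k ~ pi, then x ~ N(mu_k, var_k).
k = rng.choice(3, size=5000, p=pi)
samples = rng.normal(mu[k], np.sqrt(var[k]))

grid = np.linspace(-5, 6, 5)
print("density on a small grid   :", gmm_density(grid).round(4))
print("sample mean / mixture mean:", samples.mean().round(3), (pi @ mu).round(3))
```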
70 Mixtures of Gaussians (3)
71 Mixtures of Gaussians (4)
73 The Exponential Family
General form:
p(x | η) = h(x) g(η) exp{ ηᵀ u(x) }
η is the vector of natural parameters, and u(x) is a vector of (sufficient) statistics of x.
g(η) is the coefficient that ensures the distribution is normalized.
74 The Exponential Family (2.1)
p(x | η) = h(x) g(η) exp{ ηᵀ u(x) }
The Bernoulli distribution:
p(x | μ) = μˣ (1 − μ)^(1−x) = exp{ x ln μ + (1 − x) ln(1 − μ) } = (1 − μ) exp{ x ln(μ/(1 − μ)) }
so η = ln(μ/(1 − μ)), and inverting, μ = σ(η) = 1/(1 + e^(−η)), the logistic sigmoid.
75 The Exponential Family (2.2)
p(x | η) = h(x) g(η) exp{ ηᵀ u(x) } = σ(−η) exp(η x)
where u(x) = x, h(x) = 1, and g(η) = σ(−η).
76 The Exponential Family (3.1)
The Multinomial Distribution
p(x | μ) = ∏_{k=1}^{M} μ_k^{x_k} = exp{ ∑_k x_k ln μ_k } = exp{ ηᵀ x }
where η_k = ln μ_k, u(x) = x, h(x) = 1, and g(η) = 1.
NOTE: the μ_k parameters are not independent, since the corresponding μ_k must satisfy ∑_k μ_k = 1.
77 The Exponential Family (3.2)
Let x_M = 1 − ∑_{k=1}^{M−1} x_k and μ_M = 1 − ∑_{k=1}^{M−1} μ_k. This leads to

p(x | μ) = ∏_{k=1}^{M} μ_k^{x_k} = exp{ ∑_{k=1}^{M} x_k ln μ_k } = h(x) g(η) exp{ ηᵀ u(x) }

exp{ ∑_{k=1}^{M} x_k ln μ_k }
  = exp{ ∑_{k=1}^{M−1} x_k ln μ_k + (1 − ∑_{k=1}^{M−1} x_k) ln(1 − ∑_{k=1}^{M−1} μ_k) }
  = exp{ ∑_{k=1}^{M−1} x_k ln[ μ_k / (1 − ∑_{j=1}^{M−1} μ_j) ] + ln(1 − ∑_{k=1}^{M−1} μ_k) }
  = (1 − ∑_{k=1}^{M−1} μ_k) exp{ ∑_{k=1}^{M−1} x_k ln[ μ_k / (1 − ∑_{j=1}^{M−1} μ_j) ] }

Defining η_k = ln[ μ_k / (1 − ∑_{j=1}^{M−1} μ_j) ] and inverting gives

μ_k = exp(η_k) / (1 + ∑_{j=1}^{M−1} exp(η_j))

g(η) = 1 − ∑_{k=1}^{M−1} μ_k = 1 / (1 + ∑_{k=1}^{M−1} exp(η_k))
78 The Exponential Family (3.3)
p(x | μ) = ∏_{k=1}^{M} μ_k^{x_k} = exp{ ∑_{k=1}^{M} x_k ln μ_k } = g(η) exp{ ηᵀ x }

with u(x) = x, h(x) = 1, and (the softmax, or normalized exponential)

μ_k = exp(η_k) / (1 + ∑_{j=1}^{M−1} exp(η_j))

g(η) = 1 − ∑_{k=1}^{M−1} μ_k = 1 / (1 + ∑_{k=1}^{M−1} exp(η_k))
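A small numerical sketch of this parameterization (the multinomial parameters mu below are assumed): mapping to the M − 1 natural parameters and back through the softmax recovers the original probabilities.

```python
import numpy as np

# Assumed multinomial parameters over M = 4 states (must sum to 1).
mu = np.array([0.1, 0.2, 0.3, 0.4])

# Natural parameters of the restricted (M-1)-dimensional representation:
# eta_k = ln( mu_k / (1 - sum_{j<M} mu_j) ) = ln(mu_k / mu_M), k = 1..M-1.
eta = np.log(mu[:-1] / mu[-1])

# Inverse mapping (softmax over the first M-1 states, with mu_M as the remainder).
denom = 1.0 + np.exp(eta).sum()
mu_recovered = np.append(np.exp(eta) / denom, 1.0 / denom)

print("eta          :", eta.round(4))
print("mu recovered :", mu_recovered.round(4))   # matches mu
```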
80 The Exponential Family (4)
The Gaussian distribution:
p(x | μ, σ²) = (2πσ²)^(−1/2) exp{ −x²/(2σ²) + μx/σ² − μ²/(2σ²) }
where η = (μ/σ², −1/(2σ²))ᵀ, u(x) = (x, x²)ᵀ, h(x) = (2π)^(−1/2), g(η) = (−2η₂)^(1/2) exp( η₁² / (4η₂) ).
81 Estimate η using ML for the Exponential Family
82 ML for the Exponential Family (1)
Starting from the normalization condition g(η) ∫ h(x) exp{ ηᵀ u(x) } dx = 1 and taking the gradient ∇_η of both sides, we get
∇g(η) ∫ h(x) exp{ ηᵀ u(x) } dx + g(η) ∫ h(x) exp{ ηᵀ u(x) } u(x) dx = 0
Thus −∇ ln g(η) = E[u(x)].
83 ML for the Exponential Family (2)
Thus we have −∇ ln g(η_ML) = (1/N) ∑_n u(x_n): the ML solution depends on the data only through ∑_n u(x_n), the sufficient statistic.
84 Conjugate priors
p(x | η) = h(x) g(η) exp{ ηᵀ u(x) }
Conjugate prior: p(η | χ, ν) = f(χ, ν) g(η)^ν exp{ ν ηᵀ χ }
A noninformative prior for a scale parameter σ should assign equal probability mass to the intervals (A, B) and (A/c, B/c) for any A and B. Thus p(σ) ∝ 1/σ, and so this prior is improper too. Note that this corresponds to p(ln σ) being constant.
89 Noninformative Priors (5)
Example: for the variance of a Gaussian, σ², we have …
90 Nonparametric Methods (1)
91 Nonparametric Methods (2)
92 Kernel Density Estimation (1)
Histogram methods partition the data space into distinct bins with widths Δ_i and count the number of observations, n_i, in each bin. Thus p_i = n_i / (N Δ_i).

More generally, consider a small region R around x with volume V. The probability that K out of N observations lie inside R is Bin(K | N, P), where P = ∫_R p(x) dx. If N is large, the binomial is sharply peaked, so K ≃ NP; if R is small, p(x) is roughly constant over it, so P ≃ p(x) V. Combining these gives
p(x) ≃ K / (N V)
(1) Fix K, determine V from the data: K-nearest neighbours.
(2) Fix V, determine K from the data: kernel density estimation.
94 Parzen Window (1)
Kernel density estimation: fix V, estimate K from the data. Let R be a hypercube of side h centred on x and define the kernel function (Parzen window)
k(u) = 1 if |u_i| ≤ 1/2 for all i = 1, …, D; 0 otherwise.
It follows that the number of data points falling inside the cube is K = ∑_n k((x − x_n)/h), and hence (with fixed volume V = h^D)
p(x) = (1/N) ∑_{n=1}^{N} (1/h^D) k((x − x_n)/h)
96 Parzen Window (3)
To avoid discontinuities in p(x), use a smooth kernel, e.g. a Gaussian:
p(x) = (1/N) ∑_{n=1}^{N} (2πh²)^(−D/2) exp{ −‖x − x_n‖² / (2h²) }
h acts as a smoothing parameter (the bandwidth).
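A minimal sketch of a Gaussian Parzen-window estimate in 1-D (the bimodal toy data below is assumed, a stand-in for a data set like Old Faithful): the bandwidth h controls how smooth the estimated density is.

```python
import numpy as np

rng = np.random.default_rng(11)

# Assumed 1-D data from a bimodal mixture.
data = np.concatenate([rng.normal(-2.0, 0.5, 150), rng.normal(1.5, 0.8, 150)])

def parzen_gaussian(x, h):
    """Kernel density estimate p(x) = (1/N) sum_n N(x | x_n, h^2) on a grid x."""
    diffs = x[:, None] - data[None, :]                        # (grid, N)
    k = np.exp(-0.5 * (diffs / h) ** 2) / np.sqrt(2 * np.pi * h**2)
    return k.mean(axis=1)

grid = np.linspace(-5, 5, 7)
for h in (0.1, 0.5, 1.0):          # h controls the smoothness of the estimate
    print(f"h={h}:", parzen_gaussian(grid, h).round(3))
```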
(Note: in the figures on these Parzen-window slides, n denotes the number of data points N.)
98 Parzen Window (5)
99 Parzen Window (6)
100 Nearest Neighbour Estimation (1)
Nearest-neighbour density estimation: fix K, estimate V from the data. Consider a hypersphere centred on x and let it grow to a volume V* that includes K of the given N data points. Then
p(x) ≃ K / (N V*)
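A minimal 1-D sketch of this K-nearest-neighbour density estimate (the Gaussian toy data below is assumed): the "volume" around each query point is the length of the smallest interval containing its K nearest data points.

```python
import numpy as np

rng = np.random.default_rng(12)

# Assumed 1-D data from a standard Gaussian.
data = rng.normal(0.0, 1.0, size=2000)
N = data.size

def knn_density(x, K):
    """p(x) ~= K / (N * V), where V is the length of the smallest interval
    centred on x containing the K nearest data points (1-D case)."""
    dist = np.sort(np.abs(data[None, :] - x[:, None]), axis=1)   # (grid, N)
    radius = dist[:, K - 1]                                      # distance to Kth neighbour
    V = 2.0 * radius                                             # interval length in 1-D
    return K / (N * V)

grid = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
true = np.exp(-0.5 * grid**2) / np.sqrt(2 * np.pi)
print("K=50 estimate:", knn_density(grid, 50).round(3))
print("true density :", true.round(3))
```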
101 K-Nearest Neighbour Estimation (2)
K acts as a smoother.
102 K-Nearest Neighbour Estimation (3)
103 K-Nearest Neighbour Estimation (4)
104 Nonparametric vs. Parametric Methods
105 K-NN for Classification (1)
Given a data set with N_k data points from class C_k and ∑_k N_k = N, draw a sphere around x containing K points, K_k of which belong to class C_k. Then
p(x | C_k) = K_k / (N_k V),   p(x) = K / (N V),   p(C_k) = N_k / N
and correspondingly
p(C_k | x) = p(x | C_k) p(C_k) / p(x) = K_k / K.
107 K-NN Classifier
(Figure: two-class training data in the (x1, x2) plane, class "x" and class "o", with query points marked "+".)
108 1-NN Classifier
109 3-NN Classifier
110 5-NN Classifier
111 1-NN Classifier (3)
Voronoi diagram (cells)
112 K-NN Classifier
• K acts as a smoother.
• For N → ∞, the error rate of the 1-nearest-neighbour classifier is never more than twice the optimal error rate (that of a classifier using the true conditional class distributions).
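A minimal sketch of a K-NN classifier on toy two-class data like that shown in the earlier figures (the two Gaussian blobs and the query point below are assumptions): a query is assigned to the class holding the majority among its K nearest training points.

```python
import numpy as np

rng = np.random.default_rng(13)

# Assumed toy two-class data in the (x1, x2) plane, one Gaussian blob per class.
X0 = rng.normal([-1.0, 0.0], 0.8, size=(50, 2))    # class 0 ("o")
X1 = rng.normal([1.0, 1.0], 0.8, size=(50, 2))     # class 1 ("x")
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

def knn_predict(query, K):
    """Classify a query point by majority vote among its K nearest training points."""
    dists = np.linalg.norm(X - query, axis=1)
    nearest = np.argsort(dists)[:K]
    votes = np.bincount(y[nearest], minlength=2)
    return votes.argmax()

query = np.array([0.0, 0.5])       # a "+" point to classify
for K in (1, 3, 5):
    print(f"K={K}: predicted class {knn_predict(query, K)}")
```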