
Machine Learning

1
Chapter 2: Probability Distributions

Min Sun (孫民)
Department of Electrical Engineering, National Tsing Hua University (清華大學電機系)

2/19/23
2 Parametric Distributions
Basic building blocks: parametric densities p(x|θ).
Need to determine θ given the observed data D = {x₁, ..., x_N}.
Representation: a full distribution over θ, or a point estimate θ*?

Recall Curve Fitting

2/19/23
3 Binary Variables (1)

Coin flipping: heads = 1, tails = 0 Jacob Bernoulli

Bernoulli Distribution

2/19/23
5 Binary Variables (2)
N coin flips:

Binomial Distribution

2/19/23
6 Binomial Distribution

2/19/23
7 Parameter Estimation (1)
• Likelihood function P(D|μ)
• Prior P(μ)
• Posterior P(μ|D)

• Conjugate prior:
The prior P(μ) is the conjugate prior for a likelihood function P(D|μ) if
the prior P(μ) and the posterior P(μ|D) have the same functional form.
• Example (coin-flip problem)
  • Prior P(μ): Beta(β₁, β₂);  Likelihood P(D|μ): Binomial, ∝ μ^α₁ (1 − μ)^α₂
  • Posterior P(μ|D): Beta(α₁ + β₁, α₂ + β₂)


2/19/23
8 Parameter Estimation (2)
Bernoulli Distribution

Assuming D = {x₁, ..., x_N} i.i.d. samples, estimate μ using ML.

ln P(D|μ) = Σ_{i=1}^{N} [ xᵢ ln μ + (1 − xᵢ) ln(1 − μ) ]
          = X ln μ + (N − X) ln(1 − μ),   where X = Σ_{i=1}^{N} xᵢ  (the sufficient statistic)

d ln P(D|μ) / dμ = X/μ − (N − X)/(1 − μ) = 0
X(1 − μ) − (N − X) μ = 0
X − Xμ − Nμ + Xμ = 0
μ_ML = X / N
2/19/23
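As a quick illustration of the closed-form result μ_ML = X/N, here is a minimal NumPy sketch (the simulated data and variable names are my own, not from the slides) that draws Bernoulli samples and recovers μ by maximum likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = 0.3
x = rng.binomial(1, mu_true, size=1000)   # D = {x_1, ..., x_N}, Bernoulli samples

X = x.sum()                               # sufficient statistic X = sum_i x_i
N = x.size
mu_ml = X / N                             # closed-form ML estimate mu_ML = X / N
print(f"mu_ML = {mu_ml:.3f} (true mu = {mu_true})")
```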
10 Parameter Estimation (3)
Example: if every observed toss lands heads, then μ_ML = 1.
Prediction: all future tosses will land heads up.
This is overfitting to D; the Bayesian treatment that follows avoids it.

2/19/23
11 Beta Distribution

Distribution over μ ∈ [0, 1]:
Beta(μ | a, b) = Γ(a + b) / (Γ(a) Γ(b)) · μ^(a−1) (1 − μ)^(b−1)

2/19/23

Γ(z) is the gamma function.


12 Bayesian Bernoulli
Observing m outcomes of x = 1 (heads) among N trials:

The Beta distribution provides the conjugate prior


for the Bernoulli distribution.
2/19/23
13 Beta Distribution

2/19/23
14 Prior · Likelihood = Posterior

Eq. (2.9)

2/19/23
15 Properties of the Posterior

As the size of the data set, N, increases, the posterior becomes more sharply peaked (its variance decreases).

2/19/23
16 Prediction under the Posterior
What is the probability that the next coin toss
will land heads up?

p(x = 1 | D) = ∫₀¹ p(x = 1 | μ) p(μ | D) dμ = ∫₀¹ μ p(μ | D) dμ = E[μ | D]

p(x = 1 | D) = a_N / (a_N + b_N)

2/19/23
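A small sketch of the Beta-Bernoulli update and the predictive probability a_N/(a_N + b_N); the prior hyperparameters a0, b0 and the simulated data below are illustrative choices of mine, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.7, size=20)      # observed coin flips

a0, b0 = 2.0, 2.0                      # Beta(a0, b0) prior over mu (assumed values)
m = x.sum()                            # number of heads
l = x.size - m                         # number of tails

a_N, b_N = a0 + m, b0 + l              # posterior is Beta(a_N, b_N)
p_heads_next = a_N / (a_N + b_N)       # p(x = 1 | D) = E[mu | D]
print(f"posterior Beta({a_N:.0f}, {b_N:.0f}), p(next toss = heads) = {p_heads_next:.3f}")
```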
17 Multinomial Variables
1-of-K coding scheme:

2/19/23
18 ML Parameter Estimation
Given: m_k = Σ_n x_{nk}

To ensure Σ_k μ_k = 1, use a Lagrange multiplier λ and maximize
Σ_k m_k ln μ_k + λ ( Σ_k μ_k − 1 ).
Setting the derivative to zero gives m_k / μ_k + λ = 0, i.e. μ_k = −m_k / λ.
Enforcing Σ_k μ_k = Σ_k (−m_k / λ) = 1 gives λ = −N, hence μ_k^ML = m_k / N.
2/19/23
19 The Multinomial Distribution

2/19/23
20 The Dirichlet Distribution
Also known as multivariate beta distribution (MBD)
Lejeune Dirichlet

Dir(μ | α) = Γ(α₀) / (Γ(α₁) ⋯ Γ(α_K)) · Π_{k=1}^{K} μ_k^(α_k − 1),   α₀ = Σ_k α_k,   with Σ_k μ_k = 1

Conjugate prior for the multinomial distribution.

The constraint Σ_k μ_k = 1 confines μ to a simplex.

2/19/23
21 Bayesian Multinomial (1)

2/19/23
22 Bayesian Multinomial (2)

2/19/23
23
Continuous Variable

2/19/23
24 The Gaussian Distribution

Carl Friedrich Gauss

2/19/23
Square of Mahalanobis distance
25 Central Limit Theorem

The distribution of the mean of N i.i.d. random variables becomes increasingly Gaussian as N grows.
Example: the mean of N uniform [0, 1] random variables.

2/19/23
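A minimal simulation of the example above (my own sketch, not from the slides): the sample mean of N uniform [0, 1] variables concentrates around 0.5 with variance close to the CLT prediction (1/12)/N as N grows.

```python
import numpy as np

rng = np.random.default_rng(2)
for N in (1, 2, 10):
    # 100000 replications of the mean of N uniform [0, 1] variables
    means = rng.uniform(0.0, 1.0, size=(100_000, N)).mean(axis=1)
    # For uniforms the CLT predicts variance (1/12)/N around mean 0.5
    print(f"N={N:2d}: sample mean {means.mean():.3f}, sample var {means.var():.4f}, "
          f"CLT var {1/12/N:.4f}")
```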
26 Geometry of the Multivariate Gaussian

Δ: the Mahalanobis distance, Δ² = (x − μ)ᵀ Σ⁻¹ (x − μ)   (when Σ = I it reduces to the Euclidean distance)

Eigenvector equation: Σ uᵢ = λᵢ uᵢ
Spectral decomposition: Σ = Σ_{i=1}^{D} λᵢ uᵢ uᵢᵀ

Change of basis: y = Uᵀ (x − μ)
(Karhunen-Loève transform, PCA)
2/19/23
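The geometry above can be checked numerically. The sketch below (with an example covariance of my own, not from the slides) eigendecomposes Σ, forms y = Uᵀ(x − μ), and verifies that Δ² = Σᵢ yᵢ²/λᵢ equals (x − μ)ᵀΣ⁻¹(x − μ):

```python
import numpy as np

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])              # example covariance (assumed)
x = np.array([2.5, 1.0])

lam, U = np.linalg.eigh(Sigma)              # Sigma u_i = lambda_i u_i, columns of U are u_i
y = U.T @ (x - mu)                          # rotated coordinates y = U^T (x - mu)

delta2_eig = np.sum(y**2 / lam)             # Mahalanobis distance in the eigenbasis
delta2_dir = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
print(delta2_eig, delta2_dir)               # the two values agree
```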
27 Moments of the Multivariate Gaussian (1)

z = x − μ
The term linear in z vanishes thanks to the anti-symmetry of the integrand in z, giving E[x] = μ.

2/19/23
28 Moments of the Multivariate Gaussian (2)

[Figure: contours of constant density for a general Σ, a diagonal Σ, and an identity (isotropic) Σ.]


2/19/23
29 Conditional Gaussian

• If two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian.
• The marginal distribution of either set is also Gaussian.

2/19/23
30 Partitioned Gaussian Distributions

Σᵀ = Σ   (the covariance matrix is symmetric)
Precision matrix: Λ ≡ Σ⁻¹
Λᵀ = Λ 2/19/23
31 Partitioned Conditionals and Marginals

• The above exponent is a quadratic function of x_a, thus p(x_a | x_b) is Gaussian.

• "Completing the square" brings the quadratic into the form

−(1/2) (x_a − μ_{a|b})ᵀ Σ_{a|b}⁻¹ (x_a − μ_{a|b}) = −(1/2) x_aᵀ Σ_{a|b}⁻¹ x_a + x_aᵀ Σ_{a|b}⁻¹ μ_{a|b} + const

Comparing with the partitioned exponent
−(1/2) x_aᵀ Λ_aa x_a + x_aᵀ [ Λ_aa μ_a − Λ_ab (x_b − μ_b) ]

gives
Σ_{a|b} = Λ_aa⁻¹
μ_{a|b} = Σ_{a|b} [ Λ_aa μ_a − Λ_ab (x_b − μ_b) ]
        = μ_a − Λ_aa⁻¹ Λ_ab (x_b − μ_b) 2/19/23
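To make the partitioned formulas concrete, here is a small sketch (the 2-D example numbers are mine) that computes μ_{a|b} and Σ_{a|b} from the precision matrix and cross-checks against the equivalent covariance-block form:

```python
import numpy as np

# Joint Gaussian over (x_a, x_b) with a 1 + 1 partition (example values)
mu = np.array([0.0, 1.0])                   # (mu_a, mu_b)
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])
Lam = np.linalg.inv(Sigma)                  # precision matrix Lambda
Lam_aa, Lam_ab = Lam[:1, :1], Lam[:1, 1:]

x_b = np.array([2.0])                       # observed value of x_b
Sigma_a_b = np.linalg.inv(Lam_aa)           # Sigma_{a|b} = Lambda_aa^{-1}
mu_a_b = mu[:1] - Sigma_a_b @ Lam_ab @ (x_b - mu[1:])   # mu_{a|b}

# Equivalent form using covariance blocks: mu_a + Sigma_ab Sigma_bb^{-1} (x_b - mu_b)
mu_check = mu[:1] + Sigma[:1, 1:] @ np.linalg.inv(Sigma[1:, 1:]) @ (x_b - mu[1:])
print(mu_a_b, mu_check, Sigma_a_b)          # both mean formulas agree
```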
32 Partitioned Conditionals and Marginals
Making use of the following identity for the inverse of a partitioned matrix:

( A  B )⁻¹   (  M           −M B D⁻¹              )
( C  D )   = ( −D⁻¹ C M    D⁻¹ + D⁻¹ C M B D⁻¹ )

where M = (A − B D⁻¹ C)⁻¹; M⁻¹ is known as the Schur complement of the partitioned matrix with respect to D.
2/19/23
33 Partitioned Conditionals and Marginals

Conditional distribution:
p(x_a | x_b) = N(x_a | μ_{a|b}, Λ_aa⁻¹)
(the mean μ_{a|b} is a linear function of x_b)

Marginal distribution:
p(x_a) = N(x_a | μ_a, Σ_aa)

2/19/23
34 Partitioned Conditionals and Marginals

2/19/23
35 Bayes’ Theorem for Gaussian Variables (1)

Given a marginal Gaussian
p(x) = N(x | μ, Λ⁻¹)
and a linear Gaussian model for the conditional
p(y | x) = N(y | A x + b, L⁻¹),
we have
p(y) = N(y | A μ + b, L⁻¹ + A Λ⁻¹ Aᵀ)
p(x | y) = N(x | Σ { Aᵀ L (y − b) + Λ μ }, Σ)
where Σ = (Λ + Aᵀ L A)⁻¹.
2/19/23
36 Bayes’ Theorem for Gaussian Variables (2)

Define z = (x, y).
The log of the joint, ln p(z) = ln p(x) + ln p(y | x),
is a quadratic function of the components of z,
⇒ z is Gaussian. 2/19/23
37 Bayes’ Theorem for Gaussian Variables (3)

To find the precision of this Gaussian, we consider the second-order terms of the quadratic form; the resulting precision matrix is denoted R.
The covariance matrix is found by taking the inverse of the precision, cov[z] = R⁻¹.
2/19/23
38 Bayes’ Theorem for Gaussian Variables (4)
From the terms linear in x and y, the mean of z is given by (Eq. 2.71 in the textbook)

The mean and covariance of y are given by

And 2/19/23
39 Maximum Likelihood for the Gaussian (1)

Given N i.i.d. data points X = {x₁, ..., x_N}, the log likelihood function is given by
ln p(X | μ, Σ) = −(ND/2) ln(2π) − (N/2) ln|Σ| − (1/2) Σ_{n=1}^{N} (x_n − μ)ᵀ Σ⁻¹ (x_n − μ)

Sufficient statistics: Σ_n x_n and Σ_n x_n x_nᵀ

2/19/23
40 Maximum Likelihood for the Gaussian (2)

Setting the derivative of the log likelihood function with respect to μ to zero,
∂/∂μ ln p(X | μ, Σ) = Σ_{n=1}^{N} Σ⁻¹ (x_n − μ) = 0,
and solving, we obtain
μ_ML = (1/N) Σ_{n=1}^{N} x_n.
Similarly,
Σ_ML = (1/N) Σ_{n=1}^{N} (x_n − μ_ML)(x_n − μ_ML)ᵀ.

2/19/23
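A short sketch verifying the ML formulas numerically (the true parameters and sample size are illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[1.0, 0.5], [0.5, 2.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=5000)   # N i.i.d. samples

N = X.shape[0]
mu_ml = X.mean(axis=0)                        # mu_ML = (1/N) sum_n x_n
diff = X - mu_ml
Sigma_ml = diff.T @ diff / N                  # biased ML covariance
Sigma_unbiased = diff.T @ diff / (N - 1)      # unbiased estimator (cf. slide 41)
print(mu_ml, Sigma_ml, Sigma_unbiased, sep="\n")
```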
41 Maximum Likelihood for the Gaussian (3)

Under the true distribution, E[μ_ML] = μ, but E[Σ_ML] = ((N − 1)/N) Σ, so Σ_ML is biased.

Hence define the unbiased estimator
Σ̃ = (1/(N − 1)) Σ_{n=1}^{N} (x_n − μ_ML)(x_n − μ_ML)ᵀ
so that E[Σ̃] = Σ.
2/19/23
42 Sequential Estimation
Contribution of the Nth data point, x_N:

μ_ML^(N) = μ_ML^(N−1) + (1/N) (x_N − μ_ML^(N−1))
         = old estimate + correction weight × correction given x_N
2/19/23
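The sequential update takes only a few lines. The sketch below (my own simulated data) checks that processing points one at a time reproduces the batch ML mean:

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(loc=3.0, scale=1.5, size=1000)

mu = 0.0
for n, x_n in enumerate(data, start=1):
    mu = mu + (1.0 / n) * (x_n - mu)     # old estimate + weight * correction given x_N
print(mu, data.mean())                   # matches the batch ML estimate
```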
43 The Robbins-Monro Algorithm (1)
Consider θ and z governed by a joint distribution p(z, θ) and define the regression function
f(θ) ≡ E[z | θ] = ∫ z p(z | θ) dz.
Seek the root θ* such that f(θ*) = 0.
2/19/23
44 The Robbins-Monro Algorithm (2)

Assume we are given samples from p(z, θ), one at a time.
2/19/23
45 The Robbins-Monro Algorithm (3)
• Successive estimates of θ* are then given by
  θ^(N) = θ^(N−1) + a_{N−1} z(θ^(N−1))
• Conditions on the coefficients a_N for convergence:
  lim_{N→∞} a_N = 0,   Σ_{N=1}^{∞} a_N = ∞,   Σ_{N=1}^{∞} a_N² < ∞
2/19/23
46 Robbins-Monro for Maximum Likelihood (1)

Regarding the general maximum likelihood stationarity condition (the expected derivative of the log likelihood set to zero) as a regression function, finding its root is equivalent to finding the maximum likelihood solution θ_ML. Thus the Robbins-Monro procedure can update θ with one data point at a time.

2/19/23
47 Robbins-Monro for Maximum Likelihood (2)

Example: estimate the mean of a Gaussian sequentially.

Here z = ∂/∂μ_ML ln p(x | μ_ML, σ²) = (x − μ_ML)/σ², and the distribution of z is Gaussian with mean (μ − μ_ML)/σ².
For the Robbins-Monro update equation, choosing a_N = σ²/N recovers the usual sequential estimate of the mean.
2/19/23
48 Bayesian Inference for the Gaussian (1)

Assume σ² is known. Given i.i.d. data x = {x₁, ..., x_N}, the likelihood function for μ is given by
p(x | μ) = Π_{n=1}^{N} p(x_n | μ) = (2πσ²)^(−N/2) exp{ −(1/(2σ²)) Σ_{n=1}^{N} (x_n − μ)² }.

This has a Gaussian shape as a function of μ (but it is not a distribution over μ).

2/19/23
49 Bayesian Inference for the Gaussian (2)

Combined with a Gaussian prior over μ,
p(μ) = N(μ | μ₀, σ₀²),
this gives the posterior
p(μ | x) ∝ p(x | μ) p(μ).
Completing the square over μ, we see that p(μ | x) = N(μ | μ_N, σ_N²) …

2/19/23
50 Bayesian Inference for the Gaussian (3)

… where
μ_N = (σ² / (N σ₀² + σ²)) μ₀ + (N σ₀² / (N σ₀² + σ²)) μ_ML
1/σ_N² = 1/σ₀² + N/σ²

Note: as N → 0, μ_N → μ₀ and σ_N² → σ₀²; as N → ∞, μ_N → μ_ML and σ_N² → 0.

2/19/23
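A minimal sketch of this posterior update for μ with known σ²; the prior hyperparameters μ0, σ0² and the simulated data below are illustrative assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2 = 0.1                         # known noise variance (assumed)
mu_true = 0.8
x = rng.normal(mu_true, np.sqrt(sigma2), size=10)

mu0, sigma0_2 = 0.0, 1.0             # Gaussian prior N(mu | mu0, sigma0^2) (assumed)
N, mu_ml = x.size, x.mean()

sigmaN_2 = 1.0 / (1.0 / sigma0_2 + N / sigma2)            # 1/sigma_N^2 = 1/sigma_0^2 + N/sigma^2
muN = sigmaN_2 * (mu0 / sigma0_2 + N * mu_ml / sigma2)    # equivalent to the weighted-average form
print(f"posterior: N(mu | {muN:.3f}, {sigmaN_2:.4f})")
```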
51 Bayesian Inference for the Gaussian (4)
Example: for N = 0, 1, 2 and 10.

Prior with mean 0, and data sampled from N(mean 0.8, variance 0.1) 2/19/23
52
Bayesian Inference for the Gaussian (5)

Sequential Estimation

prior
The posterior obtained after observing N − 1 data points becomes the prior when we observe the Nth data point.
2/19/23
53 Bayesian Inference for the Gaussian (6)

Now assume μ is known but the variance is not. The likelihood function for the precision λ = 1/σ² is given by
p(x | λ) = Π_{n=1}^{N} N(x_n | μ, λ⁻¹) ∝ λ^(N/2) exp{ −(λ/2) Σ_{n=1}^{N} (x_n − μ)² }.

This has a Gamma shape as a function of λ.

2/19/23
54 Bayesian Inference for the Gaussian (7)

The Gamma distribution: Gam(λ | a, b) = (1/Γ(a)) b^a λ^(a−1) exp(−b λ)

2/19/23
55 Bayesian Inference for the Gaussian (8)

Now we combine a Gamma prior, Gam(λ | a₀, b₀), with the likelihood function for λ to obtain
p(λ | x) ∝ λ^(a₀−1) λ^(N/2) exp{ −b₀ λ − (λ/2) Σ_{n=1}^{N} (x_n − μ)² },
which we recognize as Gam(λ | a_N, b_N) with
a_N = a₀ + N/2
b_N = b₀ + (1/2) Σ_{n=1}^{N} (x_n − μ)² = b₀ + (N/2) σ²_ML

2/19/23
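A sketch of this conjugate Gamma update for the precision λ with known μ; the prior values a0, b0 and the data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
mu = 0.0                                  # known mean
lam_true = 4.0                            # true precision (sigma^2 = 0.25)
x = rng.normal(mu, 1.0 / np.sqrt(lam_true), size=50)

a0, b0 = 1.0, 1.0                         # Gam(lambda | a0, b0) prior (assumed values)
N = x.size
aN = a0 + N / 2.0                         # a_N = a_0 + N/2
bN = b0 + 0.5 * np.sum((x - mu) ** 2)     # b_N = b_0 + (1/2) sum_n (x_n - mu)^2

print(f"posterior Gam(lambda | {aN:.1f}, {bN:.2f}); "
      f"posterior mean {aN / bN:.2f} vs true {lam_true}")
```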
56 Bayesian Inference for the Gaussian (9)

If both μ and λ are unknown, the joint likelihood function is given by
p(x | μ, λ) = Π_{n=1}^{N} (λ/(2π))^(1/2) exp{ −(λ/2)(x_n − μ)² }.

We need a prior with the same functional dependence on μ and λ.
2/19/23
57 Bayesian Inference for the Gaussian (10)

The Gaussian-gamma distribution
p(μ, λ) = N(μ | μ₀, (β λ)⁻¹) Gam(λ | a, b)

• The first factor is quadratic in μ and linear in λ in the exponent (a Gaussian over μ whose precision depends on λ).
• The second factor is a Gamma distribution over λ, independent of μ.

2/19/23
58 Bayesian Inference for the Gaussian (11)

The Gaussian-gamma distribution

2/19/23
59 Bayesian Inference for the Gaussian (12)
Multivariate conjugate priors:
• μ unknown, Λ known: p(μ) is Gaussian.
• Λ unknown, μ known: p(Λ) is a Wishart distribution.
• Both μ and Λ unknown: p(μ, Λ) is a Gaussian-Wishart distribution.

2/19/23
Student’s t-Distribution (1)
60
If we integrate out the precision of a Gaussian under a Gamma prior,

William Sealy Gosset


Pen name: "Student"

where λ = a/b acts like the precision and ν = 2a is the degrees of freedom.

The Student's t-distribution is an infinite mixture of Gaussians (with the same mean but different precisions). 2/19/23


61 Student’s t-Distribution (2)

2/19/23
62 Student’s t-Distribution (3)
Robustness to outliers: Gaussian, Laplacian vs t-distribution.

2/19/23
63 Student’s t-Distribution (4)
The multi-variate case:

where .
Properties:

2/19/23
64 Periodic variables

Examples: calendar time, direction, …

We require p(θ) ≥ 0, ∫₀^{2π} p(θ) dθ = 1, and periodicity p(θ + 2π) = p(θ).

2/19/23
65 von Mises Distribution (1)
These requirements are satisfied by the von Mises distribution
p(θ | θ₀, m) = (1 / (2π I₀(m))) exp{ m cos(θ − θ₀) }
where I₀(m) is the zeroth-order modified Bessel function of the first kind.
θ₀ is the mean, and m is the concentration parameter, playing a role like the precision of a Gaussian.
2/19/23
66 von Mises Distribution (2)

θ₀ is the mean, m is the concentration parameter, like the precision of a Gaussian. 2/19/23
67 Maximum Likelihood for von Mises
Given a data set D = {θ₁, ..., θ_N}, the log likelihood function is given by
ln p(D | θ₀, m) = −N ln(2π) − N ln I₀(m) + m Σ_{n=1}^{N} cos(θ_n − θ₀).

Maximizing with respect to θ₀ we directly obtain
θ₀^ML = tan⁻¹( Σ_n sin θ_n / Σ_n cos θ_n ).

Similarly, maximizing with respect to m we get
A(m_ML) ≡ I₁(m_ML) / I₀(m_ML) = (1/N) Σ_n cos(θ_n − θ₀^ML),

which can be solved numerically for m_ML.
2/19/23
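A numerical sketch of the two ML equations: θ₀ via atan2 of the sine and cosine sums, and m by solving I₁(m)/I₀(m) = r with a root finder. It assumes SciPy is available for the Bessel functions and uses simulated angles of my own choosing:

```python
import numpy as np
from scipy.special import i0e, i1e      # exponentially scaled Bessel functions (avoid overflow)
from scipy.optimize import brentq

rng = np.random.default_rng(7)
theta = rng.vonmises(mu=1.0, kappa=5.0, size=2000)   # simulated angles

# ML mean direction: theta0 = atan2(sum sin, sum cos)
theta0_ml = np.arctan2(np.sin(theta).sum(), np.cos(theta).sum())

# ML concentration: solve I1(m)/I0(m) = (1/N) sum cos(theta_n - theta0_ml)
r = np.mean(np.cos(theta - theta0_ml))
m_ml = brentq(lambda m: i1e(m) / i0e(m) - r, 1e-6, 1e3)   # i1e/i0e equals I1/I0

print(f"theta0_ML = {theta0_ml:.3f}, m_ML = {m_ml:.2f}")
```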


Mixtures of Gaussians (1)
68
Old Faithful data set: the duration of each eruption (seconds) versus the time from the current eruption to the next eruption.

[Figure: a single Gaussian fit vs. a mixture of two Gaussians.]

2/19/23

https://www.kaggle.com/janithwanni/old-faithful
69 Mixtures of Gaussians (2)

Combine simple models into a complex model:

p(x) = Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k)

where each N(x | μ_k, Σ_k) is a component and π_k is its mixing coefficient (π_k ≥ 0, Σ_k π_k = 1). The figure shows an example with K = 3.

2/19/23
70 Mixtures of Gaussians (3)

2/19/23
71 Mixtures of Gaussians (4)

Determining the parameters μ, Σ, and π by maximizing the log likelihood

ln p(X | π, μ, Σ) = Σ_{n=1}^{N} ln { Σ_{k=1}^{K} π_k N(x_n | μ_k, Σ_k) }

involves the log of a sum, so there is no closed-form maximum.

Solution: use standard iterative numerical optimization methods or the expectation-maximization (EM) algorithm (Chapter 9).
2/19/23
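To illustrate the "log of a sum" structure, here is a sketch that evaluates the mixture log likelihood for fixed parameters (the parameters and data are illustrative choices of mine; fitting them would be done with EM, as noted above). It assumes SciPy is available:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(8)
X = rng.normal(size=(500, 2))                     # some 2-D data (illustrative)

# Example mixture parameters (K = 3), chosen by hand, not fitted
pis = np.array([0.5, 0.3, 0.2])
mus = [np.zeros(2), np.array([2.0, 0.0]), np.array([-1.0, 1.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2), np.eye(2)]

# ln p(X) = sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k)
weighted = np.column_stack([
    pi_k * multivariate_normal(mu_k, Sig_k).pdf(X)
    for pi_k, mu_k, Sig_k in zip(pis, mus, Sigmas)
])
log_lik = np.log(weighted.sum(axis=1)).sum()
print(f"mixture log likelihood: {log_lik:.1f}")
```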
72 The Exponential Family (1)

where η is the natural parameter and g(η) satisfies
g(η) ∫ h(x) exp{ ηᵀ u(x) } dx = 1,
so g(η) can be interpreted as a normalization coefficient.

2/19/23
73 The Exponential Family

• General form:
p(x | η) = h(x) g(η) exp{ ηᵀ u(x) }
η are the natural parameters.
The coefficient g(η) ensures the distribution is normalized.

2/19/23
74 The Exponential Family (2.1)
p(x | η) = h(x) g(η) exp{ ηᵀ u(x) }

The Bernoulli distribution:
Bern(x | μ) = μ^x (1 − μ)^(1−x) = exp{ x ln μ + (1 − x) ln(1 − μ) } = (1 − μ) exp{ x ln( μ/(1 − μ) ) }

Comparing it with the general form we see that
η = ln( μ / (1 − μ) )
and so μ = σ(η) = 1 / (1 + exp(−η)), the logistic sigmoid.
2/19/23
75 The Exponential Family (2.2)
p(x | η) = h(x) g(η) exp{ ηᵀ u(x) }

The Bernoulli distribution can hence be written as
p(x | η) = σ(−η) exp(η x)
where u(x) = x, h(x) = 1, and g(η) = σ(−η).

2/19/23
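A tiny sketch of the mapping between μ and the natural parameter η for the Bernoulli case (the value μ = 0.3 is just an example of mine):

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

mu = 0.3
eta = np.log(mu / (1.0 - mu))        # natural parameter eta = ln(mu / (1 - mu)), the logit
print(eta, sigmoid(eta))             # sigmoid(eta) recovers mu = 0.3

# Bernoulli probabilities in exponential-family form: p(x|eta) = sigmoid(-eta) * exp(eta * x)
for x in (0, 1):
    p = sigmoid(-eta) * np.exp(eta * x)
    print(x, p)                      # 0.7 and 0.3, as expected
```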
76 The Exponential Family (3.1)
The multinomial (1-of-K) distribution:
p(x | μ) = Π_{k=1}^{M} μ_k^{x_k} = exp{ Σ_{k=1}^{M} x_k ln μ_k }
where η_k = ln μ_k, u(x) = x, h(x) = 1, and g(η) = 1.
NOTE: the μ_k parameters are not independent, since they must satisfy Σ_{k=1}^{M} μ_k = 1.

2/19/23
The Exponential Family (3.2)
77
Let μ_M = 1 − Σ_{k=1}^{M−1} μ_k and x_M = 1 − Σ_{k=1}^{M−1} x_k, so that only M − 1 parameters remain independent. This leads to

p(x | μ) = Π_{k=1}^{M} μ_k^{x_k} = exp{ Σ_{k=1}^{M} x_k ln μ_k } = h(x) g(η) exp{ ηᵀ u(x) }

exp{ Σ_{k=1}^{M} x_k ln μ_k }
  = exp{ Σ_{k=1}^{M−1} x_k ln μ_k + (1 − Σ_{k=1}^{M−1} x_k) ln(1 − Σ_{k=1}^{M−1} μ_k) }
  = ( 1 − Σ_{k=1}^{M−1} μ_k ) exp{ Σ_{k=1}^{M−1} x_k ln[ μ_k / (1 − Σ_{j=1}^{M−1} μ_j) ] }

Defining the natural parameters
η_k = ln[ μ_k / (1 − Σ_{j=1}^{M−1} μ_j) ]
and inverting gives
μ_k = exp(η_k) / ( 1 + Σ_{j=1}^{M−1} exp(η_j) )
and
g(η) = 1 − Σ_{k=1}^{M−1} μ_k = 1 / ( 1 + Σ_{k=1}^{M−1} exp(η_k) )
2/19/23
78 The Exponential Family (3.3)
p(x | μ) = Π_{k=1}^{M} μ_k^{x_k} = exp{ Σ_{k=1}^{M} x_k ln μ_k } = g(η) exp{ ηᵀ x }

with the softmax function
μ_k = exp(η_k) / ( 1 + Σ_{j=1}^{M−1} exp(η_j) )
and
g(η) = 1 − Σ_{k=1}^{M−1} μ_k = 1 / ( 1 + Σ_{k=1}^{M−1} exp(η_k) ).

Here the η_k parameters are independent; the softmax maps them back to the constrained μ_k.
2/19/23
79 The Exponential Family (3.4)

The Multinomial distribution can then be written as

where

2/19/23
80 The Exponential Family (4)
The Gaussian Distribution

where

2/19/23
81 Estimating η using ML for the Exponential Family

2/19/23
82 ML for the Exponential Family (1)

Taking the gradient ∇_η of both sides of the normalization condition
g(η) ∫ h(x) exp{ ηᵀ u(x) } dx = 1,
we get
−∇ ln g(η) = E[ u(x) ].
Thus the moments of u(x) can be obtained from derivatives of ln g(η).
2/19/23
83 ML for the Exponential Family (2)

Given a data set X = {x₁, ..., x_N}, the likelihood function is given by
p(X | η) = ( Π_{n=1}^{N} h(x_n) ) g(η)^N exp{ ηᵀ Σ_{n=1}^{N} u(x_n) }.

Setting ∇_η ln p(X | η) = 0, we have
−∇ ln g(η_ML) = (1/N) Σ_{n=1}^{N} u(x_n),

which depends on the data only through the sufficient statistic Σ_n u(x_n).
2/19/23
84 Conjugate priors 𝑝 𝑥 𝜂 = ℎ 𝑥 𝑔 𝜂 exp 𝜂𝑇𝑢 𝑥

For any member of the exponential family, there exists a conjugate prior of the form
p(η | χ, ν) = f(χ, ν) g(η)^ν exp{ ν ηᵀ χ },
where f(χ, ν) is a normalization coefficient and g(η) is the same function as above.

Combining it with the likelihood function, we get a posterior of the same form,
p(η | X, χ, ν) ∝ g(η)^(ν+N) exp{ ηᵀ ( Σ_{n=1}^{N} u(x_n) + ν χ ) }.

The prior corresponds to ν pseudo-observations, each with sufficient-statistic value χ.


2/19/23
85 Noninformative Priors (1)
Given 𝑝 𝑥 𝜆 , 𝑤𝑖𝑡ℎ 𝑝𝑟𝑖𝑜𝑟 𝑝(𝜆)

With little or no information available a-priori, we


might choose a non-informative prior.
• λ discrete with K states: assign prior probability 1/K to each state.
• λ ∈ [a, b] real and bounded: p(λ) = 1/(b − a).
• λ real and unbounded: a constant prior is improper (it cannot be normalized).

A constant prior may no longer be constant after a change of variable; consider p(λ) constant and λ = η²: the density over η is then proportional to η, which is not constant. 2/19/23
86 Noninformative Priors (2)
If p(x | μ) = f(x − μ), then μ is known as a location parameter.

Translation invariance: consider the shifted variables x̂ = x + c, μ̂ = μ + c; the density takes the same form.

We want a prior over μ that reflects this translation invariance; hence it should assign equal probability to an interval A ≤ μ ≤ B and to the shifted interval A − c ≤ μ ≤ B − c. For the corresponding prior over μ, we have

∫_A^B p(μ) dμ = ∫_{A−c}^{B−c} p(μ) dμ = ∫_A^B p(μ − c) dμ

for any A and B. Thus p(μ) = p(μ − c), and p(μ) must be constant. 2/19/23
87 Noninformative Priors (3)

Example: the mean of a Gaussian, μ, is a location parameter; the conjugate prior is also a Gaussian, p(μ) = N(μ | μ₀, σ₀²).

As σ₀² → ∞, this becomes constant over μ, and the contribution from the prior to the posterior vanishes. 2/19/23


88 Noninformative Priors (4)

Scale invariance: consider p(x | σ) = (1/σ) f(x/σ), so that σ is a scale parameter, and make the change of variables x̂ = c x, σ̂ = c σ; the density takes the same form.

For the corresponding prior over σ, we require equal probability for the interval A ≤ σ ≤ B and the scaled interval A/c ≤ σ ≤ B/c:
∫_A^B p(σ) dσ = ∫_{A/c}^{B/c} p(σ) dσ
for any A and B. Thus p(σ) ∝ 1/σ, and so this prior is improper too. Note that this corresponds to p(ln σ) being constant.
2/19/23
89 Noninformative Priors (5)
Example: for the standard deviation of a Gaussian, σ (and hence the variance σ²), we have a scale parameter, so the noninformative prior is p(σ) ∝ 1/σ.

For the precision λ = 1/σ², this corresponds to a prior p(λ) ∝ 1/λ.

We know that the conjugate distribution for λ is the Gamma distribution, Gam(λ | a₀, b₀).

A noninformative prior is obtained in the limit a₀ = 0 and b₀ = 0; the contribution from the prior to the posterior then vanishes.

2/19/23
90 Nonparametric Methods (1)

• Parametric distribution models are restricted to specific forms, which may not always be suitable; for example, consider modelling a multimodal distribution with a single, unimodal model.
• Nonparametric approaches make few assumptions about the overall shape of the distribution being modelled.

2/19/23
91 Nonparametric Methods (2)

Two types of nonparametric methods:

• Estimate the density function p(x | C_k) from sample patterns (instance-/memory-based learning).
• Directly estimate the posterior probability P(C_k | x), similar to the nearest-neighbour rule, which bypasses probability estimation and goes directly to decision functions.

2/19/23
92 Kernel Density Estimation (1)
Histogram methods partition the data space into distinct bins of width Δᵢ and count the number of observations, nᵢ, in each bin:
pᵢ = nᵢ / (N Δᵢ)

Often the same width is used for all bins, Δᵢ = Δ; Δ then acts as a smoothing parameter.

In a D-dimensional space, using M bins in each dimension requires M^D bins. 2/19/23
93 Kernel Density Estimation (2)
Assume the observations are drawn from a density p(x), and consider a small region R containing x with probability mass
P = ∫_R p(x) dx.
The probability that K out of N observations lie inside R is Bin(K | N, P), and if N is large, K ≈ N P.
If the volume V of R is sufficiently small, p(x) is approximately constant over R, so P ≈ p(x) V.
Thus
p(x) ≈ K / (N V).
(R must be small enough that p(x) is roughly constant over it, yet large enough that K ≈ N P holds.)

(1) Fix K and determine V from the data → K-nearest-neighbour estimation.
(2) Fix V and determine K from the data → kernel density estimation.
2/19/23
94 Parzen Window(1)
Kernel density estimation: fix V and estimate K from the data. Let R be a hypercube of side h centred on x and define the kernel function (Parzen window)
k(u) = 1 if |u_i| ≤ 1/2 for all i = 1, ..., D, and 0 otherwise.

It follows that the number of data points falling inside the cube centred at x is
K = Σ_{n=1}^{N} k( (x − x_n)/h )

and hence
p(x) = (1/N) Σ_{n=1}^{N} (1/h^D) k( (x − x_n)/h ),
where h^D is the (fixed) volume of the hypercube.


2/19/23
95 Parzen Window (2)

Fix V

2/19/23
96 Parzen Window (3)
To avoid discontinuities in p(x), use a smooth kernel, e.g. a Gaussian:
p(x) = (1/N) Σ_{n=1}^{N} (1 / (2π h²)^{D/2}) exp{ −‖x − x_n‖² / (2h²) }

h acts as a smoother.

Any kernel k(u) such that k(u) ≥ 0 and ∫ k(u) du = 1 will work.

[Figure: green = true density, blue = kernel estimate for several values of h.]
2/19/23
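A compact sketch of kernel density estimation with a Gaussian kernel in one dimension; the bimodal data and the bandwidth values h are illustrative choices of mine:

```python
import numpy as np

rng = np.random.default_rng(9)
# Data from a bimodal "true" density (mixture of two 1-D Gaussians)
data = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(1.0, 1.0, 700)])

def kde_gaussian(x, data, h):
    """p(x) = (1/N) sum_n N(x | x_n, h^2), evaluated at each query point in x."""
    diffs = x[:, None] - data[None, :]                         # (num_query, N)
    kernels = np.exp(-0.5 * (diffs / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

xs = np.linspace(-5, 5, 201)
dx = xs[1] - xs[0]
for h in (0.05, 0.3, 1.0):                                     # h acts as a smoother
    p = kde_gaussian(xs, data, h)
    print(f"h={h}: estimate integrates to ~ {(p * dx).sum():.3f}")
```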


97 Parzen Window (4)

n is N

2/19/23
98 Parzen Window (5)

n is N

2/19/23
99 Parzen Window (6)

n is N

2/19/23
100 Nearest Neighbour Estimation (1)
Nearest-neighbour density estimation: fix K and estimate V from the data. Consider a hypersphere centred on x and let it grow to a volume V* that includes K of the given N data points. Then
p(x) ≈ K / (N V*).

2/19/23
101 K-Nearest Neighbour Estimation (2)

Nearest-neighbour density estimation (as above): fix K, grow a hypersphere around x to a volume V* containing K of the N data points, and set p(x) ≈ K / (N V*).

K acts as a smoother.
2/19/23
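A 1-D sketch of K-nearest-neighbour density estimation, p(x) ≈ K/(N V), where V is the length of the interval reaching the K-th nearest sample. The data, query points, and K values are illustrative assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(10)
data = np.sort(rng.normal(0.0, 1.0, size=500))
N = data.size

def knn_density(x, data, K):
    """p(x) ~= K / (N * V), with V = 2 * (distance to the K-th nearest neighbour) in 1-D."""
    dists = np.sort(np.abs(data[None, :] - x[:, None]), axis=1)   # (num_query, N)
    radius = dists[:, K - 1]                                      # K-th smallest distance
    V = 2.0 * radius                                              # 1-D "volume" (interval length)
    return K / (N * V)

xs = np.array([-2.0, 0.0, 2.0])
for K in (5, 30):                                                 # K acts as a smoother
    print(K, knn_density(xs, data, K))
```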
102 K-Nearest Neighbour Estimation (3)

2/19/23
103 K-Nearest Neighbour Estimation (4)

2/19/23
104 Nonparametric vs. Parametric Methods

• Nonparametric models (other than histograms) require storing, and computing with, the entire data set.
• Parametric models, once fitted, are much more efficient in terms of storage and computation.

2/19/23
105 K-NN for Classification (1)
Given a data set with N_k data points from class C_k, with Σ_k N_k = N, draw a sphere around x containing K points, of which K_k belong to class C_k. We have
p(x | C_k) = K_k / (N_k V)
and correspondingly
p(x) = K / (N V),   p(C_k) = N_k / N.
Since p(C_k | x) = p(x | C_k) p(C_k) / p(x), Bayes' theorem gives
p(C_k | x) = K_k / K.

2/19/23
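The result p(C_k | x) = K_k / K suggests the K-NN classification rule illustrated on the following slides: assign x to the class most common among its K nearest neighbours. A small sketch on toy 2-D data of my own:

```python
import numpy as np

rng = np.random.default_rng(11)
# Two toy classes in the (x1, x2) plane
X0 = rng.normal([0.0, 0.0], 0.8, size=(30, 2))   # class 0 ("o")
X1 = rng.normal([2.0, 2.0], 0.8, size=(30, 2))   # class 1 ("x")
X = np.vstack([X0, X1])
y = np.array([0] * 30 + [1] * 30)

def knn_classify(x_query, X, y, K):
    """Vote among the K nearest training points; p(C_k | x) ~= K_k / K."""
    dists = np.linalg.norm(X - x_query, axis=1)
    nearest = y[np.argsort(dists)[:K]]
    counts = np.bincount(nearest, minlength=2)
    return counts.argmax(), counts / K           # predicted class and class posteriors

for K in (1, 3, 5):
    label, probs = knn_classify(np.array([1.0, 1.2]), X, y, K)
    print(f"K={K}: predicted class {label}, p(C_k|x) ~= {probs}")
```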
107 K-NN Classifier

[Figure: two classes of training points (x and o) in the (x1, x2) plane, with query points (+) to be classified.]

2/19/23
108 1-NN Classifier

[Figure: the same two-class data; each query point (+) is assigned the label of its single nearest neighbour.]

2/19/23
109 3-NN Classifier

[Figure: the same two-class data; each query point (+) is assigned the majority label among its 3 nearest neighbours.]

2/19/23
110 5-NN Classifier

[Figure: the same two-class data; each query point (+) is assigned the majority label among its 5 nearest neighbours.]

2/19/23
111 1-NN Classifier (3)
Voronoi diagram (cells)

2/19/23
112 K-NN Classifier

• K acts as a smoother.
• For 𝑁 → ∞, the error rate of the 1-nearest-neighbour
classifier is never more than twice the optimal error
(obtained from the true conditional class distributions).
2/19/23
