Overview: This document provides an introduction to machine learning concepts, focusing on polynomial curve fitting, probability theory, and decision theory. It discusses various polynomial orders, regularization techniques, and the implications of overfitting, as well as key probability rules and Bayes' theorem. It also covers parameter estimation methods, cross-validation, and concepts in information theory such as entropy and mutual information.

1 Machine Learning
Chapter 1: Introduction

孫民 (Min Sun)
清華大學 (National Tsing Hua University)
Credit: 林嘉文 (Chia-Wen Lin)
3 Polynomial Curve Fitting

Data Set Size: 𝑁 = 10
4 Sum-of-Squares Error Function

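The error function this slide refers to (stated again on slide 37) is the standard sum-of-squares error; the explicit order-M polynomial form is the usual one assumed throughout this deck:

\[
E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{\,y(x_n,\mathbf{w}) - t_n\,\}^2,
\qquad
y(x,\mathbf{w}) = \sum_{j=0}^{M} w_j x^j
\]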
5 0th Order Polynomial

6 1st Order Polynomial

7 3rd Order Polynomial

8 9th Order Polynomial
9 Over-fitting

Root-Mean-Square (RMS) Error: E_RMS = sqrt( 2 E(𝐰*) / N )
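The following is a minimal sketch, not code from the slides, of how the fits and RMS errors of slides 3-9 can be reproduced. It assumes the usual Bishop-style toy data (sin(2πx) plus Gaussian noise with an illustrative noise level of 0.3); `fit_polynomial` and `rms_error` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: targets are sin(2*pi*x) plus Gaussian noise.
N = 10
x_train = np.linspace(0.0, 1.0, N)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=N)
x_test = np.linspace(0.0, 1.0, 100)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.3, size=100)

def fit_polynomial(x, t, M):
    """Least-squares fit of an order-M polynomial, i.e. minimize
    E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2."""
    Phi = np.vander(x, M + 1, increasing=True)   # design matrix [1, x, ..., x^M]
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def rms_error(x, t, w):
    """E_RMS = sqrt(2 * E(w) / N), i.e. the root mean squared residual."""
    M = len(w) - 1
    y = np.vander(x, M + 1, increasing=True) @ w
    return np.sqrt(np.mean((y - t) ** 2))

for M in (0, 1, 3, 9):
    w = fit_polynomial(x_train, t_train, M)
    print(f"M={M}: train E_RMS={rms_error(x_train, t_train, w):.3f}, "
          f"test E_RMS={rms_error(x_test, t_test, w):.3f}")
```

For M = 9 the training error drops to (nearly) zero while the test error grows, which is the over-fitting behaviour the slides illustrate.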
10 Polynomial Coefficients

11 Data Set Size: 𝑁 = 15
9th Order Polynomial

12 Data Set Size: 𝑁 = 100
9th Order Polynomial
13 Regularization

• Penalize large coefficient values
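The regularized error function being referred to (Eq. (1.4), cited again on slide 39) is:

\[
\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{\,y(x_n,\mathbf{w}) - t_n\,\}^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2
\]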
14 Regularization: ln 𝜆 = −18

15 Regularization: ln 𝜆 = 0

16 Regularization: 𝐸_RMS vs. ln 𝜆

17 Polynomial Coefficients
18 Probability Theory
Apples and Oranges

(B)ox is (b)lue or (r)ed

(F)ruit is (a)pple or (o)range
19 Probability Theory – two random variables

• Marginal Probability

• Conditional Probability

• Joint Probability
20 Probability Theory

• Sum Rule

• Product Rule
21 The Rules of Probability

• Sum Rule

• Product Rule
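The two rules themselves were lost in extraction; in the notation used on the following slides they read:

\[
\text{Sum rule: } p(X) = \sum_{Y} p(X,Y)
\qquad
\text{Product rule: } p(X,Y) = p(Y\mid X)\,p(X)
\]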
22 Bayes’ Theorem
P(X, Y) = P(Y|X) P(X)   (Product Rule)
Since P(X, Y) = P(Y, X), and
P(Y, X) = P(X|Y) P(Y),
hence P(Y|X) P(X) = P(X|Y) P(Y), i.e.,

P(Y|X) = P(X|Y) P(Y) / P(X)

Posterior = Likelihood × Prior / Evidence

Posterior ∝ Likelihood × Prior

23 Probability Theory
Apples and Oranges

Given:
p(B=r) = 4/10 = 2/5
p(B=b) = 6/10 = 3/5
p(F=a|B=r) = 2/8 = 1/4
p(F=a|B=b) = 3/4

Question: overall probability of picking an apple, p(F=a)?

Use Sum Rule:
p(F=a) = p(F=a,B=r) + p(F=a,B=b)
Use Product Rule:
p(F=a,B=r) = p(F=a|B=r) p(B=r)
p(F=a,B=b) = p(F=a|B=b) p(B=b)
Hence,
p(F=a) = p(F=a|B=r) p(B=r) + p(F=a|B=b) p(B=b)
       = 1/4 × 2/5 + 3/4 × 3/5 = 2/20 + 9/20 = 11/20
p(F=o) = 1 − p(F=a) = 9/20
24 Probability Theory
Apples and Oranges

Given:
p(B=r) = 4/10 = 2/5
p(B=b) = 6/10 = 3/5
p(F=a|B=r) = 2/8 = 1/4
p(F=a|B=b) = 3/4

Question: if the fruit is an orange, which box did it come from, p(B|F=o)?

Use Bayes' Rule:
p(B|F=o) = p(F=o|B) p(B) / p(F=o)
p(B=r|F=o) = p(F=o|B=r) p(B=r) / p(F=o) = (3/4 × 2/5) / (9/20) = 6/9 = 2/3
p(B=b|F=o) = 1 − p(B=r|F=o) = 1/3
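A small sketch (not part of the slides) that reproduces the computations of slides 23-24 numerically from the probabilities given above:

```python
# Probabilities given on the slides.
p_B = {"r": 2 / 5, "b": 3 / 5}                  # box prior
p_a_given_B = {"r": 1 / 4, "b": 3 / 4}          # p(F=apple | B)

# Sum and product rules: p(F=a) = sum_b p(F=a | B=b) p(B=b)
p_apple = sum(p_a_given_B[b] * p_B[b] for b in p_B)
p_orange = 1 - p_apple
print(p_apple, p_orange)                        # 0.55 (=11/20), 0.45 (=9/20)

# Bayes' rule: p(B=r | F=o) = p(F=o | B=r) p(B=r) / p(F=o)
p_r_given_orange = (1 - p_a_given_B["r"]) * p_B["r"] / p_orange
print(p_r_given_orange, 1 - p_r_given_orange)   # 0.666... (=2/3), 0.333... (=1/3)
```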
25 Probability Densities (continuous variable)
[Figure: probability density p(x) and its cumulative distribution function P(x)]
26 Transformed Densities

x = g(y)
dx/dy = d g(y)/ dy = g’(y)

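With x = g(y) as above, the change-of-variables rule that this slide illustrates is:

\[
p_y(y) = p_x(x)\left|\frac{dx}{dy}\right| = p_x(g(y))\,|g'(y)|
\]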
27 Expectations

Approximate Expectation
(discrete and continuous)

Conditional Expectation
(discrete)
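The definitions these captions annotate did not survive extraction; the standard forms are:

\[
\mathbb{E}[f] = \sum_x p(x)f(x) \quad\text{or}\quad \int p(x)f(x)\,dx,
\qquad
\mathbb{E}[f] \simeq \frac{1}{N}\sum_{n=1}^{N} f(x_n),
\qquad
\mathbb{E}_x[f\mid y] = \sum_x p(x\mid y)\,f(x)
\]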
28 Variances and Covariances

When x, y are independent,
E[xy] = E[x]E[y]. Hence,
Cov[x,y] = 0
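Spelling out the definitions (lost in extraction) shows why independence implies zero covariance:

\[
\operatorname{var}[x] = \mathbb{E}[x^2]-\mathbb{E}[x]^2,
\qquad
\operatorname{cov}[x,y] = \mathbb{E}[xy]-\mathbb{E}[x]\,\mathbb{E}[y]
\]

so if E[xy] = E[x]E[y], the covariance is zero.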
29 The Gaussian Distribution

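The density and its first two moments, which this slide and the next display, are:

\[
\mathcal{N}(x\mid\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\},
\qquad
\mathbb{E}[x]=\mu, \quad \operatorname{var}[x]=\sigma^2
\]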
30 Gaussian Mean and Variance

31 The Multivariate Gaussian
Σ is the covariance matrix; its diagonal entries are the variances σᵢ².
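For reference, the multivariate Gaussian density in D dimensions is:

\[
\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu},\Sigma) =
\frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}}
\exp\!\left\{-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathsf T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\}
\]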
32 Gaussian Parameter Estimation

Likelihood function

𝐱 = (x_1, x_2, ..., x_N), where the x_n are i.i.d.
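For i.i.d. data the likelihood function factorizes:

\[
p(\mathbf{x}\mid\mu,\sigma^2) = \prod_{n=1}^{N}\mathcal{N}(x_n\mid\mu,\sigma^2)
\]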
33 Two Principles for Estimating Parameters

• Maximum likelihood estimation (ML)

  Choose θ that maximizes the probability (likelihood) of the observed data:
  θ_ML = argmax_θ P(D|θ)

• Maximum a posteriori estimation (MAP)

  Choose θ that is most probable given the prior probability and the data:
  θ_MAP = argmax_θ P(θ|D) = argmax_θ P(D|θ) P(θ) / P(D)
34 Maximum (Log) Likelihood
𝐱 = (x_1, x_2, ..., x_N), where the x_n are i.i.d.
θ_ML = argmax_θ p(𝐱|θ)?

Since ln is monotonic, maximizing the likelihood is equivalent to maximizing the log-likelihood:
θ_ML = argmax_θ ln p(𝐱|θ)

For the Gaussian, the ML solutions are the sample mean and the sample variance.
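For the Gaussian case, the resulting ML estimates are the sample mean and sample variance:

\[
\mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}x_n,
\qquad
\sigma^2_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}(x_n-\mu_{\mathrm{ML}})^2
\]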
35 Properties of 𝜇_ML and 𝜎²_ML

(unbiased)

(biased)

(unbiased)
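The statements these three annotations refer to (standard results for the Gaussian ML estimators) are:

\[
\mathbb{E}[\mu_{\mathrm{ML}}]=\mu \ \text{(unbiased)},
\qquad
\mathbb{E}[\sigma^2_{\mathrm{ML}}]=\frac{N-1}{N}\sigma^2 \ \text{(biased)},
\qquad
\tilde{\sigma}^2=\frac{N}{N-1}\sigma^2_{\mathrm{ML}} \ \text{(unbiased)}
\]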
36 Curve Fitting Re-visited

𝛽: inverse variance (precision)

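The Gaussian noise model behind the re-visited curve fitting (it appears again on slide 40) is:

\[
p(t\mid x,\mathbf{w},\beta) = \mathcal{N}\!\left(t\mid y(x,\mathbf{w}),\,\beta^{-1}\right)
\]

where β is the inverse variance (precision) mentioned above.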
37 Maximum Likelihood

Determine 𝐰_ML by minimizing the sum-of-squares error 𝐸(𝐰):

𝐰_ML = argmin_𝐰 (1/2) Σ_{n=1}^{N} ( y(x_n, 𝐰) − t_n )²
38 ML Curve Fitting

Green: actual model
Red: predicted model
39 MAP: A Step towards Bayes

Posterior ∝ Likelihood × Prior

Determine 𝐰_MAP by minimizing the regularized sum-of-squares error, 𝐸̃(𝐰) (Eq. (1.4)).
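For reference, in Bishop's formulation (assumed here), with a Gaussian prior p(𝐰|α) the MAP estimate minimizes

\[
\frac{\beta}{2}\sum_{n=1}^{N}\{\,y(x_n,\mathbf{w})-t_n\,\}^2 + \frac{\alpha}{2}\mathbf{w}^{\mathsf T}\mathbf{w}
\]

so that taking λ = α/β recovers the regularized sum-of-squares error of Eq. (1.4).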


40 Bayesian Curve Fitting

Predictive Distribution
(w for both μ and β)

p(t | x, 𝐰, β) = 𝒩( t | y(x, 𝐰), β⁻¹ )

(Refer to Sec. 3.3 for a detailed derivation.)
41 Bayesian Predictive Distribution

[Figure: ML curve fitting vs. Bayesian curve fitting (predictive distribution)]
42 Cross Validation for Model Selection

• 5-fold cross-validation -> split the training data into 5 equal folds
• Use 4 of them for training and 1 for validation, rotating the held-out fold, as sketched below
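A minimal sketch (an illustration, not code from the slides) of 5-fold cross-validation for selecting the polynomial order M; `fit_polynomial` and `rms_error` are the hypothetical helpers from the earlier sketch.

```python
import numpy as np

def cross_validate(x, t, M, n_folds=5):
    """Average validation RMS error of an order-M polynomial: each fold is held
    out once for validation while the remaining folds are used for training."""
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(x))            # shuffle so folds are not contiguous in x
    folds = np.array_split(idx, n_folds)
    errors = []
    for k in range(n_folds):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        w = fit_polynomial(x[train], t[train], M)   # helper from the earlier sketch
        errors.append(rms_error(x[val], t[val], w))
    return np.mean(errors)

# Usage: pick the order with the lowest average validation error, e.g.
# best_M = min(range(10), key=lambda M: cross_validate(x_train, t_train, M))
```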
43 Cross Validation

44 Curse of Dimensionality

45 Curse of Dimensionality

Polynomial curve fitting with M = 3 in D input dimensions:
the number of coefficients grows as D³.

Volume of a sphere with radius r in D dimensions:
V_D(r) = K_D r^D
[V_D(1) − V_D(1−ε)] / V_D(1) = 1 − (1−ε)^D
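A numerical illustration (not in the original slides) of the shell-volume formula: with shell thickness ε = 0.1,

\[
1-(1-\epsilon)^D = 1-0.9^D \approx 0.10\ (D{=}1),\quad 0.41\ (D{=}5),\quad 0.88\ (D{=}20)
\]

so in high dimensions almost all of the sphere's volume lies in a thin shell near its surface.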
46 Decision Theory

Given training pairs (x, t), predict t for a new x.

• Inference step
  • Determine either p(x, t) or p(t|x).

• Decision step
  • For a given x, determine the optimal t, or the decision/action based on t.
47 Minimum Misclassification Rate

Assume t is class C1 or C2.

[Figure: misclassification regions. Moving the decision boundary from x̂ to x₀ leaves the blue and green areas fixed but reduces the red area.]
48 Minimum Expected Loss
• Example: classify medical images as 'cancer' or 'normal'.

[Loss matrix: rows = truth, columns = decision.]

When a cancer patient is classified as normal -> 1000 loss
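The loss matrix itself did not survive extraction; in Bishop's version of this example (which these slides appear to follow) it is:

\[
L = \begin{pmatrix} 0 & 1000 \\ 1 & 0 \end{pmatrix},
\qquad \text{rows: true class (cancer, normal); columns: decision (cancer, normal)}
\]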
49 Minimum Expected Loss

Loss L_kj: the true class is C_k, but we assign class C_j.

Decision regions are chosen to minimize the expected loss.

Eliminate the common factor p(x).
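Written out (a reconstruction in the standard notation, since the equations were lost), the expected loss is

\[
\mathbb{E}[L] = \sum_k\sum_j\int_{\mathcal{R}_j} L_{kj}\,p(\mathbf{x},\mathcal{C}_k)\,d\mathbf{x}
\]

which, after eliminating the common factor p(x), is minimized by assigning each x to the class j that minimizes

\[
\sum_k L_{kj}\,p(\mathcal{C}_k\mid\mathbf{x})
\]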
50 Reject Option – avoid making a decision
51 Generative vs Discriminative

• Generative approach:
  Model p(x|C_k) and p(C_k) (or the joint p(x, C_k));
  use Bayes' theorem to obtain p(C_k|x).

• Discriminative approach:
  Model p(C_k|x) directly.
52 Why Separate Inference and Decision?

• Minimizing risk (loss matrix may change over time)


• Reject option
• Unbalanced class priors
• Combining models

53 Decision Theory for Regression

• Inference step
  • Determine p(x, t).

• Decision step
  • For a given x, make the optimal prediction, y(x), for t.

• Loss function: L(t, y(x))
54 The Squared Loss Function

𝔼[𝐿] is minimized when:
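The loss and its minimizer that the truncated sentence above refers to are the standard squared-loss results:

\[
\mathbb{E}[L] = \iint \{y(\mathbf{x})-t\}^2\,p(\mathbf{x},t)\,d\mathbf{x}\,dt,
\qquad
y(\mathbf{x}) = \mathbb{E}_t[t\mid\mathbf{x}] = \int t\,p(t\mid\mathbf{x})\,dt
\]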
55 Information Theory
56 Entropy

h(x) is a monotonic function of p(x) and expresses the information content (≥ 0):

h(x) = −log₂ p(x)

If x, y are independent, p(x,y) = p(x) p(y), so
h(x,y) = −log₂ p(x) − log₂ p(y) = h(x) + h(y)

H(x) is the expectation of h(x).
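Taking that expectation gives the entropy:

\[
H[x] = \mathbb{E}[h(x)] = -\sum_x p(x)\log_2 p(x)
\]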
57 Entropy

Important quantity in
• coding theory
• statistical physics
• machine learning

58 Entropy

59 Entropy - coding theory

• Coding theory: x is discrete with 8 possible states; how many bits are needed to transmit the state of x?
• All states equally likely.

Code: 000, 001, 010, 011, 100, 101, 110, 111
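Working this out confirms that the entropy equals the code length:

\[
H[x] = -8\times\frac{1}{8}\log_2\frac{1}{8} = 3\ \text{bits}
\]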
60 Entropy

61 Entropy - statistical physics
In how many ways can N identical objects be allocated to M bins?
Let n_i be the number of objects in the i-th bin.
The number of ways to allocate them is the multiplicity W.

p_i is the probability that an object is assigned to the i-th bin.

Entropy is maximized when:
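The expressions these lines refer to (the standard statistical-physics derivation of entropy, assumed here) are:

\[
W = \frac{N!}{\prod_i n_i!},
\qquad
H = \frac{1}{N}\ln W \;\simeq\; -\sum_i p_i\ln p_i
\quad (p_i = n_i/N,\ \text{via Stirling's approximation as } N\to\infty)
\]

with the entropy maximized when all p_i = 1/M, giving H = ln M.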
64 Differential Entropy – continuous x
Put bins of width Δ along the real line.

Differential entropy is maximized (for fixed σ² and μ) when p(x) is Gaussian,

in which case the entropy depends only on σ.
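The maximizing density and the resulting maximum (the parts lost in extraction) are:

\[
p(x) = \mathcal{N}(x\mid\mu,\sigma^2),
\qquad
H[x] = \tfrac{1}{2}\{1+\ln(2\pi\sigma^2)\}
\]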
65 Conditional Entropy

h(y|x) = −log₂ p(y|x)
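Taking the expectation of h(y|x) under p(x, y) gives the conditional entropy, which satisfies a chain rule with H[x]:

\[
H[y\mid x] = -\sum_x\sum_y p(x,y)\log_2 p(y\mid x),
\qquad
H[x,y] = H[y\mid x] + H[x]
\]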
66 The Kullback-Leibler Divergence (Relative Entropy)
An unknown p(x) is modeled by an approximating distribution q(x); the KL divergence measures the additional information required, on average, to specify x when using q(x) in place of the true p(x).
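The definition this slide refers to, and its key property:

\[
\mathrm{KL}(p\,\|\,q) = -\int p(\mathbf{x})\ln\frac{q(\mathbf{x})}{p(\mathbf{x})}\,d\mathbf{x} \;\ge\; 0,
\quad \text{with equality if and only if } p = q
\]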
67 Mutual Information
h(x) = −log₂ p(x)

If x, y are independent, p(x,y) = p(x) p(y), so
h(x,y) = −log₂ p(x) − log₂ p(y) = h(x) + h(y)

If x, y are not independent, how far is p(x,y) from p(x) p(y)?
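Mutual information quantifies this: it is the KL divergence between the joint distribution and the product of marginals, and it relates to the entropies above:

\[
I[x,y] = \mathrm{KL}\big(p(x,y)\,\|\,p(x)\,p(y)\big)
= -\iint p(x,y)\ln\frac{p(x)\,p(y)}{p(x,y)}\,dx\,dy
\]

\[
I[x,y] = H[x]-H[x\mid y] = H[y]-H[y\mid x]
\]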
