Machine Learning
Chapter 1: Introduction
Min Sun (孫民)
National Tsing Hua University
Credit: Chia-Wen Lin (林嘉文)
2/22/25
Polynomial Curve Fitting
Sum-of-Squares Error Function
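A minimal numpy sketch of the fit: the noisy sin(2πx) targets, the order M = 3, and all variable names are assumptions standing in for the running example, not the slides' own code.

```python
# Least-squares polynomial curve fitting: minimize E(w) = 0.5 * sum_n (y(x_n, w) - t_n)^2
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3                                                # data size, polynomial order
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)   # noisy targets (assumed data)

# Design matrix with columns 1, x, x^2, ..., x^M
Phi = np.vander(x, M + 1, increasing=True)

# w minimizing the sum-of-squares error
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)

E = 0.5 * np.sum((Phi @ w - t) ** 2)                        # error at the optimum
print("w =", w)
print("E(w) =", E)
```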
0th Order Polynomial
1st Order Polynomial
3rd Order Polynomial
9th Order Polynomial
Over-fitting
Data Set Size: N = 15 (9th Order Polynomial)
Data Set Size: N = 100 (9th Order Polynomial)
Regularization
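A sketch of the regularized fit, reusing the same assumed synthetic data as above: the penalized error adds (λ/2)‖w‖² to the sum-of-squares term, and the closed-form ridge solution below is one standard way to minimize it.

```python
# Regularized least squares: w = (lam*I + Phi^T Phi)^{-1} Phi^T t
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 9
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)   # assumed synthetic data
Phi = np.vander(x, M + 1, increasing=True)

for ln_lam in (-18.0, 0.0):                                 # the two values shown on the slides
    lam = np.exp(ln_lam)
    w = np.linalg.solve(lam * np.eye(M + 1) + Phi.T @ Phi, Phi.T @ t)
    print(f"ln lambda = {ln_lam:6.1f}   max|w| = {np.abs(w).max():.3g}")
```

Larger λ shrinks the coefficients, which is the effect the next slides illustrate.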
Regularization: ln λ = −18
Regularization: ln λ = 0
Regularization: E_RMS vs. ln λ
Polynomial Coefficients
Probability Theory
Apples and Oranges
Box is blue or red
Probability Theory – two random variables
• Joint Probability
• Marginal Probability
• Conditional Probability
Probability Theory
• Sum Rule
• Product Rule
The Rules of Probability
• Sum Rule: p(X) = Σ_Y p(X, Y)
• Product Rule: p(X, Y) = p(Y|X) p(X)
Bayes’ Theorem
Product Rule: p(X, Y) = p(Y|X) p(X)
Since p(X, Y) = p(Y, X), and p(Y, X) = p(X|Y) p(Y),
we have p(Y|X) p(X) = p(X|Y) p(Y), hence
p(Y|X) = p(X|Y) p(Y) / p(X),
where the denominator p(X) is the evidence.
Overall probability of picking an apple, p(F=a)?
p(B=r) = 4/10 = 2/5
p(B=b) = 6/10 = 3/5
p(F=a|B=r) = 2/8 = 1/4
p(F=a|B=b) = 3/4
Use the Sum Rule:  p(F=a) = p(F=a, B=r) + p(F=a, B=b)
Use the Product Rule:
p(F=a, B=r) = p(F=a|B=r) p(B=r)
p(F=a, B=b) = p(F=a|B=b) p(B=b)
Hence,
p(F=a) = p(F=a|B=r) p(B=r) + p(F=a|B=b) p(B=b)
       = (1/4)(2/5) + (3/4)(3/5) = 2/20 + 9/20 = 11/20
p(F=o) = 1 − p(F=a) = 9/20
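A small numeric check of the calculation above, using the box contents implied by the slide's probabilities (red box: 2 apples, 6 oranges; blue box: 3 apples, 1 orange); the dictionary layout is just one convenient encoding.

```python
# Sum and product rules on the fruit/box example
p_B = {"r": 4 / 10, "b": 6 / 10}                     # prior over boxes
p_F_given_B = {"r": {"a": 1 / 4, "o": 3 / 4},        # fruit given box
               "b": {"a": 3 / 4, "o": 1 / 4}}

# Product rule: p(F, B) = p(F|B) p(B);  sum rule: p(F) = sum_B p(F, B)
p_F = {f: sum(p_F_given_B[b][f] * p_B[b] for b in p_B) for f in ("a", "o")}
print(p_F)   # {'a': 0.55, 'o': 0.45}  i.e. 11/20 and 9/20
```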
Probability Theory
Apples and Oranges: if the fruit is an orange, which box did it come from? p(B|F=o)?
p(B=r) = 4/10 = 2/5
p(B=b) = 6/10 = 3/5
p(F=a|B=r) = 2/8 = 1/4,  so p(F=o|B=r) = 3/4
p(F=a|B=b) = 3/4,  so p(F=o|B=b) = 1/4
Use Bayes’ Rule:  p(B|F=o) = p(F=o|B) p(B) / p(F=o)
p(B=r|F=o) = (3/4)(2/5) / (9/20) = 6/9 = 2/3
p(B=b|F=o) = 1 − p(B=r|F=o) = 1/3
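And the corresponding check of the Bayes' rule step, reusing the same numbers from the example.

```python
# Bayes' rule for p(B | F = o)
p_B = {"r": 2 / 5, "b": 3 / 5}
p_o_given_B = {"r": 3 / 4, "b": 1 / 4}               # p(F=o | B)

p_o = sum(p_o_given_B[b] * p_B[b] for b in p_B)      # evidence, 9/20
posterior = {b: p_o_given_B[b] * p_B[b] / p_o for b in p_B}
print(posterior)   # {'r': 0.666..., 'b': 0.333...}  i.e. 2/3 and 1/3
```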
Probability Densities (continuous variable)
• Probability density
• Cumulative distribution function
Transformed Densities
If x = g(y), with dx/dy = dg(y)/dy = g′(y), the density transforms as p_y(y) = p_x(g(y)) |g′(y)|.
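A quick Monte Carlo sanity check of this rule; the particular choice x ~ Uniform(0, 1) with x = g(y) the logistic sigmoid is only an illustrative assumption.

```python
# Check p_y(y) = p_x(g(y)) |g'(y)| by sampling.
# x ~ Uniform(0,1), x = sigmoid(y)  =>  y = logit(x) has density s(y)(1 - s(y)).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=200_000)
y = np.log(x / (1 - x))                                  # y = g^{-1}(x)

sig = lambda v: 1.0 / (1.0 + np.exp(-v))
hist, edges = np.histogram(y, bins=60, range=(-6, 6), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
predicted = sig(centers) * (1 - sig(centers))            # p_x(g(y)) |g'(y)| with p_x = 1

print("max abs difference:", np.abs(hist - predicted).max())   # small, ~1e-2
```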
Expectations
• Approximate Expectation (discrete and continuous): E[f] ≈ (1/N) Σ_{n=1}^N f(x_n), with x_n drawn from p(x)
• Conditional Expectation (discrete): E_x[f|y] = Σ_x p(x|y) f(x)
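A sketch of the sampling approximation; the Gaussian p(x) and f(x) = x² are arbitrary illustrative choices.

```python
# Approximate expectation: E[f] ~= (1/N) sum_n f(x_n), x_n drawn from p(x)
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=2.0, size=100_000)   # x_n ~ N(1, 2^2)
f = lambda x: x ** 2

estimate = f(samples).mean()      # Monte Carlo estimate of E[x^2]
exact = 1.0 ** 2 + 2.0 ** 2       # E[x^2] = mu^2 + sigma^2 = 5
print(estimate, exact)
```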
Variances and Covariances
The Gaussian Distribution
Gaussian Mean and Variance
The Multivariate Gaussian
Σ is the covariance matrix; its diagonal entries are the variances σ_i².
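A short sketch that samples a 2-D Gaussian and recovers μ and Σ empirically; the particular μ and Σ below are arbitrary assumptions.

```python
# Sample a multivariate Gaussian and check the sample mean/covariance
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])       # diagonal entries are the variances

X = rng.multivariate_normal(mu, Sigma, size=50_000)
print("sample mean      :", X.mean(axis=0))
print("sample covariance:\n", np.cov(X, rowvar=False))
```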
Gaussian Parameter Estimation
Likelihood function: for x = (x_1, x_2, ..., x_N), if the x_n are i.i.d., p(x|μ, σ²) = ∏_{n=1}^N 𝒩(x_n|μ, σ²).
Two Principles for Estimating Parameters
Maximum likelihood (log-likelihood): θ_ML = argmax_θ ln p(x|θ)
For a Gaussian, the ML mean μ_ML = (1/N) Σ_n x_n is unbiased, while the ML variance σ²_ML = (1/N) Σ_n (x_n − μ_ML)² is biased: E[σ²_ML] = ((N−1)/N) σ². The corrected estimator (N/(N−1)) σ²_ML is unbiased.
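A simulation sketch of this bias; the dataset size N = 5 and the true σ² = 4 are arbitrary assumptions.

```python
# Average the ML and corrected variance estimators over many datasets
import numpy as np

rng = np.random.default_rng(0)
N, trials, sigma2 = 5, 200_000, 4.0
data = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(trials, N))

mu_ml = data.mean(axis=1, keepdims=True)
var_ml = ((data - mu_ml) ** 2).mean(axis=1)              # divide by N (biased)
var_unbiased = ((data - mu_ml) ** 2).sum(axis=1) / (N - 1)

print("E[var_ML]       ~", var_ml.mean(), "  expected", (N - 1) / N * sigma2)
print("E[var_unbiased] ~", var_unbiased.mean(), "  expected", sigma2)
```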
Curve Fitting Re-visited
Maximum Likelihood
Maximizing the likelihood is equivalent to minimizing the sum-of-squares error:
w_ML = argmin_w (1/2) Σ_{n=1}^N { y(x_n, w) − t_n }²
ML Curve Fitting
Predictive Distribution: p(t|x, w, β) = 𝒩(t | y(x, w), β⁻¹)
Plug in the ML estimates for both the mean (via w_ML) and the precision β_ML.
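A sketch of computing w_ML, β_ML and the resulting predictive mean/variance; the synthetic data, order M = 3, and query point x_new are assumptions.

```python
# ML predictive distribution: p(t|x, w_ML, beta_ML) = N(t | y(x, w_ML), 1/beta_ML)
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)
Phi = np.vander(x, M + 1, increasing=True)

w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)            # minimizes sum-of-squares
resid = Phi @ w_ml - t
beta_ml = N / np.sum(resid ** 2)                          # 1/beta_ML = (1/N) sum residual^2

x_new = 0.35
mean = np.vander([x_new], M + 1, increasing=True) @ w_ml
print("predictive mean:", mean[0], "  predictive variance:", 1.0 / beta_ml)
```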
ML Curve Fitting: Bayesian Predictive Distribution
(figure: ML curve fitting vs. Bayesian curve fitting)
Cross Validation for Model Selection
Cross Validation
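A minimal K-fold cross-validation sketch for picking the polynomial order M; the data, K = 5, and the candidate orders are assumptions.

```python
# K-fold cross validation over polynomial order M
import numpy as np

rng = np.random.default_rng(0)
N, K = 30, 5
x = rng.uniform(size=N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)
folds = np.array_split(rng.permutation(N), K)

def cv_error(M):
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.setdiff1d(np.arange(N), test)
        Phi_tr = np.vander(x[train], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi_tr, t[train], rcond=None)
        Phi_te = np.vander(x[test], M + 1, increasing=True)
        errs.append(np.mean((Phi_te @ w - t[test]) ** 2))
    return np.mean(errs)

for M in range(10):
    print(f"M = {M}:  CV error = {cv_error(M):.4f}")
```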
Curse of Dimensionality
Curse of Dimensionality
For a polynomial of order M in D input dimensions, the number of coefficients grows as D^M.
Decision Theory
• Decision step
• For given x, determine optimal t or decision/action based on t.
Minimum Misclassification Rate
Assume t is one of two classes, C1 or C2.
Moving the decision boundary from x̂ to x0 leaves the blue and green error regions fixed but shrinks the red region.
(figure: misclassification regions shown in red, green, and blue)
Minimum Expected Loss
• Example: classify medical images as ‘cancer’ or ‘normal’
• Loss matrix with rows labelled by the true class and columns by the decision.
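A sketch of the decision rule that minimizes expected loss; the loss values and posterior probabilities below are illustrative assumptions, not taken from the slides.

```python
# Minimum expected loss: decide class k minimizing sum_j L[j, k] * p(C_j | x)
import numpy as np

# L[j, k] = loss for deciding class k when the truth is class j
# classes: 0 = cancer, 1 = normal (assumed loss values)
L = np.array([[0.0, 1000.0],      # missing a cancer is very costly
              [1.0, 0.0]])

posterior = np.array([0.05, 0.95])           # assumed p(cancer|x), p(normal|x)
expected_loss = posterior @ L                # expected loss of each possible decision
print("expected loss per decision:", expected_loss)
print("decide:", ["cancer", "normal"][int(np.argmin(expected_loss))])
```

With a highly asymmetric loss, the rule can decide ‘cancer’ even when that class has the smaller posterior.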
Minimum Expected Loss
Reject Option – avoid making a decision
Generative vs Discriminative
• Generative approach: model p(x, C_k) (e.g. p(x|C_k) and p(C_k)), then use Bayes’ theorem to obtain p(C_k|x)
• Discriminative approach: model p(C_k|x) directly
Why Separate Inference and Decision?
Decision Theory for Regression
• Inference step: determine p(x, t).
• Decision step: for given x, make an optimal prediction, y(x), for t.
• Loss function: L(t, y(x)), with expected loss E[L] = ∫∫ L(t, y(x)) p(x, t) dx dt.
The Squared Loss Function
With L(t, y(x)) = {y(x) − t}², E[L] is minimized when y(x) = E_t[t|x], the conditional mean (the regression function).
Information Theory
Entropy
Information content of observing x: h(x) = −log₂ p(x)
Entropy
Important quantity in
• coding theory
• statistical physics
• machine learning
Entropy
Entropy - coding theory
Code: 000, 001, 010, 011, 100, 101, 110, 111
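A sketch computing the entropy behind this code: a uniform 8-state source needs 3 bits per symbol, matching the fixed-length code above, while a skewed source needs fewer on average (the skewed distribution below is an illustrative choice).

```python
# Entropy of an 8-symbol source, in bits
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                           # treat 0 * log 0 as 0
    return float(-(p * np.log2(p)).sum())

uniform = np.full(8, 1 / 8)
skewed = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

print("uniform:", entropy_bits(uniform), "bits")   # 3.0
print("skewed :", entropy_bits(skewed), "bits")    # 2.0
```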
Entropy
Entropy - statistical physics
In how many ways can N identical objects be allocated among M bins, with n_i objects in the i-th bin?
Number of ways to allocate (multiplicity): W = N! / (n_1! n_2! ⋯ n_M!)
Entropy: H = (1/N) ln W → −Σ_i p_i ln p_i as N → ∞, with p_i = n_i/N (using Stirling’s approximation).
The Kullback-Leibler Divergence (Relative Entropy)
An unknown distribution p(x) is modeled by q(x); KL(p‖q) = −∫ p(x) ln{ q(x) / p(x) } dx is the additional information required to specify x when q is used in place of p.
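A small sketch of KL divergence for discrete distributions; the distributions p and q are arbitrary examples.

```python
# KL(p || q): non-negative, zero only when q = p, and asymmetric
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))    # in nats; assumes p, q > 0

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print("KL(p||p) =", kl(p, p))      # 0.0
print("KL(p||q) =", kl(p, q))      # > 0
print("KL(q||p) =", kl(q, p))      # different value: not symmetric
```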
Mutual Information
I[x, y] = KL( p(x, y) ‖ p(x) p(y) ); it is zero iff x and y are independent, and positive if they are not.
Equivalently, I[x, y] = H[x] − H[x|y] = H[y] − H[y|x].
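A sketch computing I[x, y] directly as the KL divergence between the joint and the product of marginals; the small joint table is an arbitrary assumed example.

```python
# Mutual information from a 2x2 joint distribution
import numpy as np

joint = np.array([[0.30, 0.10],    # p(x, y): rows index x, columns index y
                  [0.15, 0.45]])
px = joint.sum(axis=1, keepdims=True)
py = joint.sum(axis=0, keepdims=True)

mi = np.sum(joint * np.log(joint / (px * py)))   # nats; 0 iff x, y independent
print("I[x, y] =", float(mi))
```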