Unit-I Machine Learning Basics
o Speech synthesis
o Imputation of missing values: predict the values of the missing entries

The task, T
• Denoising: predict the clean example x ∈ ℝⁿ from a corrupted example x̃ ∈ ℝⁿ
• Density estimation: learn p_model(x), where x may be discrete or continuous
  o Example: p_model(x_i | x_−i) can be used to predict the missing entries
The performance measure, P
• Accuracy: the proportion of examples for which the model produces the correct output
• Error rate (0–1 loss): the proportion of examples for which the model produces incorrect output
• Average log-probability of some examples (for density estimation)
• It is sometimes difficult to decide what should be measured
• It is sometimes impractical to measure an implicit performance metric
• Evaluate the performance measure using a test set (see the sketch below)
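As a concrete illustration of accuracy and error rate on a test set, here is a minimal sketch; the prediction and label arrays are made-up examples, not from the slides:

```python
import numpy as np

# Hypothetical predictions and ground-truth labels for 8 test examples.
y_true = np.array([0, 1, 2, 1, 0, 2, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 2, 0, 0])

accuracy = np.mean(y_pred == y_true)      # proportion of correct outputs
error_rate = np.mean(y_pred != y_true)    # 0-1 loss: proportion of incorrect outputs

print(f"accuracy   = {accuracy:.3f}")     # 0.750
print(f"error rate = {error_rate:.3f}")   # 0.250
```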
The experience, E
• Supervised learning: learn p(y | x)
  o Experience is a labeled dataset (or datapoints)
  o Each datapoint has a label or target
• The line between supervised and unsupervised learning is often blurred:
  o p(x) = ∏_{i=1}^n p(x_i | x_1, …, x_{i−1})
  o p(y | x) = p(x, y) / Σ_{y′} p(x, y′)
The experience, E
• Semi-supervised learning: some examples include supervision targets, but others do not
• Multi-instance learning: a collection of examples is labeled as containing an example of a class
• Reinforcement learning: interaction with an environment
A dataset
• Design matrix X:
  o Each row is an example
  o Each column is a feature
  o Example: iris dataset: X ∈ ℝ^{150×4} (see the sketch below)
• Vector of labels y
  o Example: 0 is a person, 1 is a car, 2 is a cat, etc.
  o The label can be a set of words (e.g. a transcription)
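To make the design-matrix picture concrete, here is a small sketch that loads the iris dataset; it assumes scikit-learn is available (the slides do not prescribe a particular library):

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data      # design matrix: 150 examples (rows) x 4 features (columns)
y = iris.target    # vector of labels: one integer class per example

print(X.shape)              # (150, 4) -> X is in R^{150x4}
print(y.shape)              # (150,)
print(iris.feature_names)   # names of the 4 feature columns
```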
Example: Linear regression
• Task: regression problem, f: ℝⁿ → ℝ
• ŷ = wᵀx
• Have a test set: X^(test), y^(test)
• Performance measure:
  MSE_test = (1/m) Σ_i (ŷ^(test) − y^(test))²_i = (1/m) ‖ŷ^(test) − y^(test)‖₂²
Linear regression
• Experience: X^(train), y^(train)
• Minimize MSE_train:
  ∇_w MSE_train = 0
  ⇒ (X^(train))ᵀ X^(train) w = (X^(train))ᵀ y^(train)   (normal equations)
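A minimal NumPy sketch of the procedure above: fit w by solving the normal equations on synthetic training data (the data itself is made up for illustration), then evaluate MSE on a held-out test set:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = x . w_true + noise
n_features = 3
w_true = np.array([2.0, -1.0, 0.5])
X_train = rng.normal(size=(100, n_features))
y_train = X_train @ w_true + 0.1 * rng.normal(size=100)
X_test = rng.normal(size=(30, n_features))
y_test = X_test @ w_true + 0.1 * rng.normal(size=30)

# Normal equations: (X^T X) w = X^T y
w = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

# Performance measure: mean squared error on the test set
y_hat = X_test @ w
mse_test = np.mean((y_hat - y_test) ** 2)
print("w =", w)
print("MSE_test =", mse_test)
```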
Figure 5.1 (Goodfellow 2016)
Capacity, overfitting
and underfitting
• Generalization: ability to perform well on
previously unobserved data
• i.i.d. assumptions:
  o The examples in each dataset are independent from each other,
  o and the train set and test set are identically distributed, drawn from the same probability distribution as each other.
  o We call that shared underlying distribution the data-generating distribution.
• Example hypothesis space: linear regression with a bias term, ŷ = b + Σ_{i=1}^n w_i x_i; capacity grows if higher-degree terms are added.
Underfitting and Overfitting in Polynomial Estimation (Figure)
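A rough sketch of the underfitting/overfitting behaviour in polynomial estimation, comparing fits of different degree on synthetic noisy data (the data and the degrees chosen are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples of an underlying quadratic function
x_train = rng.uniform(-1, 1, size=15)
y_train = 1.0 + 2.0 * x_train - 3.0 * x_train**2 + 0.2 * rng.normal(size=15)
x_test = rng.uniform(-1, 1, size=200)
y_test = 1.0 + 2.0 * x_test - 3.0 * x_test**2 + 0.2 * rng.normal(size=200)

for degree in (1, 2, 9):             # underfit, about right, overfit
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Typically the degree-1 fit has high error on both sets (underfitting), while the degree-9 fit drives the training error down but the test error up (overfitting).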
High capacity: non-parametric models
• Nearest neighbor regression: ŷ = y_i, where i = argmin_i ‖X_{i,:} − x‖₂²
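A minimal sketch of nearest-neighbour regression as defined above (1-nearest-neighbour, brute force, on a tiny made-up dataset):

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x):
    """Return y_i of the training example closest to x in squared L2 distance."""
    dists = np.sum((X_train - x) ** 2, axis=1)    # ||X_i,: - x||_2^2 for every i
    i = np.argmin(dists)                          # i = argmin_i ||X_i,: - x||_2^2
    return y_train[i]                             # y_hat = y_i

# Tiny made-up example
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array([10.0, 20.0, 30.0])
print(nearest_neighbor_predict(X_train, y_train, np.array([0.9, 1.2])))   # 20.0
```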
Bayes error
• The error made by an oracle predicting from the true distribution p(x, y).
• This error is due to:
  o There may still be noise in the distribution
  o y might be inherently stochastic
  o y may be deterministic but involve other variables besides x
Effect of Training Set Size (Figure): a moderate amount of noise added to a degree-5 polynomial
The No Free Lunch theorem
• Averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points.
Weight decay
• J(w) = MSE_train + λ wᵀw
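With weight decay, the regularized cost still has a closed-form minimizer. Assuming MSE_train = (1/m) ‖Xw − y‖² (the averaging convention used above), setting ∇_w J = 0 gives (XᵀX + λ m I) w = Xᵀ y. A minimal sketch on made-up data:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize J(w) = MSE_train + lam * w^T w in closed form.

    With MSE_train = (1/m) ||Xw - y||^2, setting the gradient to zero gives
    (X^T X + lam * m * I) w = X^T y.
    """
    m, n = X.shape
    return np.linalg.solve(X.T @ X + lam * m * np.eye(n), X.T @ y)

# Made-up data: larger lam shrinks the weights toward zero
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -2.0, 0.0, 3.0]) + 0.1 * rng.normal(size=50)
for lam in (0.0, 0.1, 10.0):
    print(lam, ridge_fit(X, y, lam))
```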
• A common estimator of θ is the sample mean: θ̂_m = (1/m) Σ_{i=1}^m x^(i)
Example: Gaussian distribution – mean
• Consider a set of samples {x^(1), x^(2), …, x^(m)} drawn i.i.d. from a Gaussian.
• Unbiased estimator of the mean: μ̂_m = (1/m) Σ_{i=1}^m x^(i)
• Unbiased estimator of the variance: σ̃²_m = (1/(m−1)) Σ_{i=1}^m (x^(i) − μ̂_m)²
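A quick numerical check of the bias of the two variance estimators, dividing by m versus by m−1 (the Gaussian parameters and sample size below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true, sigma2_true, m = 0.0, 4.0, 5

biased, unbiased = [], []
for _ in range(100_000):
    x = rng.normal(mu_true, np.sqrt(sigma2_true), size=m)
    mu_hat = x.mean()                                     # sample mean (unbiased for mu)
    biased.append(np.sum((x - mu_hat) ** 2) / m)          # divides by m
    unbiased.append(np.sum((x - mu_hat) ** 2) / (m - 1))  # divides by m - 1

print("true variance                        :", sigma2_true)
print("mean of biased estimator   (1/m)     :", np.mean(biased))    # ~ 3.2 = (m-1)/m * 4
print("mean of unbiased estimator (1/(m-1)) :", np.mean(unbiased))  # ~ 4.0
```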
Variance
• How much do we expect an estimator to vary as a function of the data sample?
• Var(θ̂) = ?
• Standard error: SE(θ̂) = √Var(θ̂)
• Just as we might like an estimator to exhibit low bias, we would also like it to have relatively low variance.
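For the sample mean, Var(μ̂_m) = σ²/m and SE(μ̂_m) = σ/√m. A small sketch checking this empirically (the parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, m = 2.0, 25

# Draw many datasets of size m and look at how much the sample mean varies
means = rng.normal(0.0, sigma, size=(50_000, m)).mean(axis=1)

print("empirical SE of the mean :", means.std())             # ~ 0.4
print("theoretical sigma/sqrt(m):", sigma / np.sqrt(m))       # 0.4
```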
Example: Bernoulli distribution
Figure (Goodfellow 2016)
Bias vs. variance
Maximum Likelihood
Estimation
• Rather than guessing that some function
might make a good estimator and then
analyzing its bias and variance, we would
like to have some principle from which we
can derive specific functions that are good
estimators for different models.
Maximum Likelihood Estimation
• Because the examples are independent: θ_ML = argmax_θ ∏_{i=1}^m p_model(x^(i); θ)
• Taking the log leaves the argmax the same: θ_ML = argmax_θ Σ_{i=1}^m log p_model(x^(i); θ)
• Divide by m: θ_ML = argmax_θ E_{x∼p̂_data} [log p_model(x; θ)]
• MLE minimizes the dissimilarity between the empirical distribution p̂_data and the model distribution, as a function of θ.
• Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by the model.
Conditional log-likelihood
• Supervised learning: θ_ML = argmax_θ Σ_{i=1}^m log P(y^(i) | x^(i); θ), for i.i.d. examples from the dataset
• Example: take the output of linear regression as the mean of a Gaussian, p(y | x) = N(y; ŷ(x; w), σ²)
• Maximizing this conditional log-likelihood is then the same as minimizing MSE_train (sketch below)
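A small numerical sketch of the statement above: when p(y | x) = N(y; ŷ(x; w), σ²), the negative conditional log-likelihood differs from MSE_train only by a positive scale and an additive constant, so both are minimized by the same w. The data and the candidate weight vectors below are made up:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -0.5]) + 0.3 * rng.normal(size=200)
sigma2 = 0.09
m = len(y)

def mse(w):
    return np.mean((X @ w - y) ** 2)

def nll(w):
    # Negative log-likelihood of y under N(y; Xw, sigma2), summed over examples
    resid = X @ w - y
    return 0.5 * m * np.log(2 * np.pi * sigma2) + np.sum(resid ** 2) / (2 * sigma2)

for w in (np.array([1.5, -0.5]), np.array([1.0, 0.0]), np.array([0.0, 0.0])):
    print(w, "MSE:", round(mse(w), 4), "NLL:", round(nll(w), 2))
# The ordering of the candidate weights is identical under both criteria.
```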
Properties of maximum likelihood
• The best estimator asymptotically, as the number of examples m → ∞, in terms of its rate of convergence as m increases.
SVM
• w·x + b ≥ 0 ⇒ +
• w·x + b ≤ 0 ⇒ −
• w·x₊ + b ≥ 1
• w·x₋ + b ≤ −1
SVM
• y_i = +1 for + samples
• y_i = −1 for − samples
• y_i(w·x_i + b) ≥ 1: the same equation for x₊ and x₋
• y_i(w·x_i + b) − 1 ≥ 0
• Constraints: y_i(w·x_i + b) − 1 = 0 for the x_i in the gutter
• width = (x₊ − x₋) · w/‖w‖ = ((1 − b) − (−1 − b))/‖w‖ = 2/‖w‖
• Maximize the width of the street: max 2/‖w‖ ⇒ max 1/‖w‖ ⇒ min ‖w‖ ⇒ min ½‖w‖²
Lagrange multipliers
• L = ½‖w‖² − Σ_i α_i [y_i(w·x_i + b) − 1]
• ∂L/∂w = 0 ⇒ w = Σ_i α_i y_i x_i
• ∂L/∂b = 0 ⇒ Σ_i α_i y_i = 0
• Replacing w in L:
  L = ½ (Σ_i α_i y_i x_i)·(Σ_j α_j y_j x_j) − (Σ_i α_i y_i x_i)·(Σ_j α_j y_j x_j) − Σ_i α_i y_i b + Σ_i α_i
• L = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)
• In practice, numerical optimizers solve this quadratic programming problem for us (rough sketch below).
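A rough sketch of solving the dual numerically on a tiny linearly separable dataset, using SciPy's SLSQP solver; the dataset and the solver choice are illustrative, not part of the slides:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny 2-D linearly separable dataset
X = np.array([[2.0, 2.0], [2.5, 1.5], [0.0, 0.0], [0.5, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = (y[:, None] * X) @ (y[:, None] * X).T        # G_ij = y_i y_j (x_i . x_j)

def neg_dual(alpha):
    # Maximize L = sum(alpha) - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i.x_j
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

cons = {"type": "eq", "fun": lambda a: a @ y}    # sum_i alpha_i y_i = 0
bounds = [(0, None)] * len(y)                    # alpha_i >= 0
res = minimize(neg_dual, np.zeros(len(y)), method="SLSQP",
               bounds=bounds, constraints=cons)

alpha = res.x
w = (alpha * y) @ X                              # w = sum_i alpha_i y_i x_i
sv = np.argmax(alpha)                            # a support vector (alpha_i > 0)
b = y[sv] - w @ X[sv]                            # from y_i (w.x_i + b) = 1
print("alpha:", np.round(alpha, 3))
print("w:", np.round(w, 3), " b:", round(b, 3))
```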
• α_i ≠ 0 only for points in the gutter, also known as support vectors

Kernel trick
• Both L and the decision rule depend on the data only through dot products x_i · x_j, which can be replaced by a kernel function K(x_i, x_j) to obtain non-linear decision boundaries.
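A minimal sketch of the kernel substitution: the Gram matrix of dot products x_i · x_j can be replaced by, for example, an RBF kernel (the kernel choice and bandwidth here are illustrative):

```python
import numpy as np

def linear_gram(X):
    """Gram matrix of plain dot products: K_ij = x_i . x_j."""
    return X @ X.T

def rbf_gram(X, gamma=1.0):
    """Kernel trick: K_ij = exp(-gamma * ||x_i - x_j||^2) replaces x_i . x_j."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
print(linear_gram(X))
print(rbf_gram(X, gamma=0.5))
# In the dual, every occurrence of x_i . x_j is replaced by K(x_i, x_j),
# which implicitly computes dot products in a higher-dimensional feature space.
```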
PCA
• Output: a representation of the input with
  o lower dimensionality
  o no linear correlation between its elements
Decorrelation
• Let us consider the m×n-dimensional design matrix X.
• We will assume that the data has a mean of zero, E[x] = 0.
• PCA finds z = Wᵀx where Var[z] is diagonal.
• Remember that Var[x] = (1/(m−1)) XᵀX = WΛWᵀ, with WᵀW = I.
• Var[z] = (1/(m−1)) ZᵀZ = (1/(m−1)) Wᵀ XᵀX W = Wᵀ W Λ Wᵀ W = Λ, which is diagonal.
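A short sketch of this decorrelation property: take W to be the eigenvectors of the sample covariance of zero-mean data and check that the covariance of z = Wᵀx comes out (approximately) diagonal. The random data below is illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)

# Correlated, zero-mean data: m x n design matrix X
m, n = 500, 3
A = rng.normal(size=(n, n))
X = rng.normal(size=(m, n)) @ A
X -= X.mean(axis=0)                      # enforce E[x] = 0

cov_x = X.T @ X / (m - 1)                # Var[x] = (1/(m-1)) X^T X
eigvals, W = np.linalg.eigh(cov_x)       # cov_x = W Lambda W^T, W orthogonal

Z = X @ W                                # z = W^T x for each example (row)
cov_z = Z.T @ Z / (m - 1)                # Var[z]

print(np.round(cov_z, 4))                # off-diagonal entries ~ 0: decorrelated
print(np.round(eigvals, 4))              # diagonal of Var[z] matches Lambda
```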
• PCA disentangles the unknown factors of variation underlying the data.
• This disentangling takes the form of finding a rotation of the input space that aligns the principal axes of variance with the basis of the new representation space associated with z.
k-means clustering
• Divides the training set into k different clusters of examples
• Provides a k-dimensional one-hot code vector h representing an input x:
  o If x belongs to cluster i, then h_i = 1 and all other entries of the representation h are zero.
  o This is an example of a sparse representation.
k-means algorithm
• Initialize k different centroids {μ^(1), …, μ^(k)} to different values.
• Then alternate until convergence (sketch below):
  o Assignment step: assign each training example to the cluster of its nearest centroid.
  o Update step: set each centroid μ^(i) to the mean of the examples assigned to cluster i.
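A minimal from-scratch sketch of the two alternating steps; the initialization here simply picks k random training points, which is one common choice rather than anything prescribed by the slides:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # initialize k centroids
    for _ in range(n_iters):
        # Assignment step: each example goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned examples
        new_centroids = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                  else centroids[i] for i in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs of made-up data
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels = kmeans(X, k=2)
print(np.round(centroids, 2))            # approximately [0, 0] and [5, 5]
```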
Uniformly Sampled Images (Figure)
Arguments supporting the manifold hypothesis
• Examples we encounter are connected to each other by applying transformations to traverse the manifold.
QMUL face dataset (Figure)
Conclusion
• The manifold assumption is at least
approximately correct.
Congratulations