Unit-I: Machine Learning Basics

The document provides an overview of machine learning concepts, including definitions of tasks, performance measures, and types of learning such as supervised, unsupervised, and reinforcement learning. It discusses the importance of generalization, capacity, overfitting, and underfitting, along with techniques like regularization and cross-validation. Additionally, it touches on the no free lunch theorem and the role of hyperparameters in model training and evaluation.


Machine Learning Basics

Machine Learning
• A computer program is said to learn from experience 𝐸 with respect to some class of tasks 𝑇 and performance measure 𝑃, if its performance at tasks in 𝑇, as measured by 𝑃, improves with experience 𝐸 (Tom M. Mitchell, 1997)
• Example
 𝐸: the experience of playing thousands of games
 𝑇: playing checkers
 𝑃: the fraction of games it wins against human opponents
The task, T
• Tasks are usually described in terms of how the machine learning system should process an example 𝒙 ∈ ℝ^n, where each entry x_i is a feature
 Classification: learn 𝑓: ℝ^n → {1, …, k}
  o y = 𝑓(𝒙) assigns the input to the category with numerical code y
  o Example: object recognition
 Classification with missing inputs: learn a set of functions
  o Example: medical diagnosis
The task, T
 Regression: learn 𝑓: ℝ^n → ℝ
  o Example: predict the claim amount for an insured person
 Transcription: unstructured representation to discrete textual form
  o Example: optical character recognition, speech recognition
 Machine translation: sequence to sequence
  o Example: translate English to French
The task, T
 Structured output: output is a vector or
data structure of values that are tightly
interrelated
o Example: parsing natural language
sentence into a tree that describes
grammatical structure by tagging nodes
of the tree as being verbs, nouns, etc.
o image segmentation: assigning a
pixel to a segment
o Image captioning
The task, T
 Anomaly detection: flag unusual or
atypical events
o Credit card fraud detection
 Synthesis and sampling: generate
examples that are similar to those in the
training data
o Generate textures for video games

o Speech synthesis
 Imputation of missing values: predict
the values of the missing entries
The task, T
 Denoising: 𝑓: 𝒙̃ ∈ ℝ^n → 𝒙 ∈ ℝ^n
  o Predict the clean example 𝒙 from the corrupted example 𝒙̃
  o Estimate p(𝒙 | 𝒙̃)
 Density estimation or probability mass estimation
  o p_model(𝒙): ℝ^n → ℝ (x can be discrete or continuous)
  o Example: p_model(x_i | 𝒙_{−i})
The performance measure, P
• Accuracy:
 The proportion of examples for which the model
produces the correct output
• Error rate (0-1 loss):
 The proportion of examples for which the model
produces incorrect output
• Average log-probability of some examples (for
density estimation)
• It is sometimes difficult to decide what should be measured
• It is sometimes impractical to measure an implicit performance metric
• Evaluate the performance measure using a
test set
The experience, E
• Supervised learning: p(y | 𝒙)
 Experience is a labeled dataset (or datapoints)
 Each datapoint has a label or target
• Unsupervised learning: p(𝒙)
 Experience is an unlabeled dataset
 Clustering, learning probability distributions, denoising, etc.
• The line between supervised and unsupervised is often blurred
 p(𝒙) = ∏_{i=1}^{n} p(x_i | x_1, …, x_{i−1})
 p(y | 𝒙) = p(𝒙, y) / ∑_{y′} p(𝒙, y′)
The experience,
E
• Semi-Supervised learning:
 Some examples include supervision
targets, but others do not.
• Multi-instance learning:
A collection of examples is labeled as
containing an example of a class
• Reinforcement learning:
 Interaction with an environment
A dataset
• Design matrix X:
 Each row is an example
 Each column is a feature
 Example: iris dataset: 𝑋 ∈ ℝ150×4
• Vector of labels y
 Example: 0 is a person, 1 is a car, 2 is a
cat, etc.
 The label can be a set of words
(e.g. transcription)
Example: Linear regression
• Task: regression problem, 𝑓: ℝ^n → ℝ
 ŷ = 𝒘ᵀ𝒙
 Have a test set: (𝑿^test, 𝒚^test)
• Performance measure:
 MSE_test = (1/m) ∑_i (𝒚̂^test − 𝒚^test)_i² = (1/m) ‖𝒚̂^test − 𝒚^test‖₂²
• Experience: (𝑿^train, 𝒚^train)
 Minimize MSE_train
Linear regression
• ∇_𝒘 MSE_train = 0
 (2/m) (𝑿^train)ᵀ (𝑿^train 𝒘 − 𝒚^train) = 0
 ((𝑿^train)ᵀ 𝑿^train) 𝒘 = (𝑿^train)ᵀ 𝒚^train (normal equations)
• We can solve for ŷ = 𝒘ᵀ𝒙 + b using a simple trick (append a constant feature equal to 1)
• b is called the bias term (not to be confused with the statistical bias discussed later)
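The normal equations above can be solved directly with a linear-algebra routine. Below is a minimal NumPy sketch; the synthetic data, variable names, and the constant-feature trick for the bias term are illustrative assumptions, not taken from the slides:

```python
import numpy as np

# Synthetic training data (illustrative only): y = 3*x0 - 2*x1 + 0.5 + noise
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))
y_train = X_train @ np.array([3.0, -2.0]) + 0.5 + 0.1 * rng.normal(size=100)

# Simple trick for the bias term b: append a constant-1 feature to X
X_aug = np.hstack([X_train, np.ones((X_train.shape[0], 1))])

# Normal equations: (X^T X) w = X^T y
w = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_train)
print("weights:", w[:-1], "bias:", w[-1])

# Performance measure on held-out data: MSE_test = (1/m) * ||y_hat - y_test||^2
X_test = rng.normal(size=(20, 2))
y_test = X_test @ np.array([3.0, -2.0]) + 0.5
y_hat = np.hstack([X_test, np.ones((20, 1))]) @ w
print("MSE_test:", np.mean((y_hat - y_test) ** 2))
```

Using np.linalg.solve rather than explicitly inverting 𝑿ᵀ𝑿 is the usual, more numerically stable choice.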
Linear Regression
Figure 5.1 (linear regression example)
Capacity, overfitting
and underfitting
• Generalization: ability to perform well on
previously unobserved data
• i.i.d assumptions:
 The examples in each dataset are
independent from each other,
 and that the train set and test set are
identically distributed, drawn from the
same probability distribution as each
other.
 We call that shared underlying distribution the data-generating distribution, denoted p_data
Capacity, overfitting
and underfitting
• For some fixed value w, the expected
training set error is exactly the same as the
expected test set error, because both
expectations are formed using the same
dataset sampling process.
• In practice:
 expected test set error > expected train
set error
• We need the ability to:
 1. Make the training error small (otherwise: underfitting)
 2. Make the gap between training and test error small (otherwise: overfitting)
Capacity, overfitting
and underfitting
• Underfitting occurs when the model is not
able to obtain a sufficiently low error value
on the training set.
• Overfitting occurs when the gap
between the training error and test
error is too large.
• Informally, a model’s capacity is its ability
to fit a wide variety of functions.
 Low capacity may cause underfitting
 High capacity may cause overfitting
Hypothesis space
• One way to control the capacity of a
learning algorithm is by choosing its
hypothesis space,
 the set of functions that the learning
algorithm is allowed to select as being the
solution.
 For example, linear regression has the set
of all linear functions of its input
 To increase capacity, let the model be a polynomial of degree 9 (see the sketch below):
  ŷ = b + ∑_{i=1}^{9} w_i x^i
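As a rough illustration of how the degree controls capacity, here is a small NumPy sketch that fits polynomials of degree 1, 2 and 9 by least squares; the toy target function and sample size are assumptions made only for this example:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=12)                 # small training set
y = np.sin(3 * x) + 0.1 * rng.normal(size=12)   # illustrative target

def poly_design(x, degree):
    """Columns [1, x, x^2, ..., x^degree]: the hypothesis space of degree-`degree` polynomials."""
    return np.vander(x, degree + 1, increasing=True)

for degree in (1, 2, 9):
    X = poly_design(x, degree)
    # lstsq handles the (possibly ill-conditioned) degree-9 case
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    train_mse = np.mean((X @ w - y) ** 2)
    print(f"degree {degree}: training MSE = {train_mse:.4f}")
```

Training error shrinks as the degree grows, but the high-degree fit is the one most likely to generalize poorly.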
Underfitting and Overfitting in Polynomial Estimation
Figure: fits with degree 1, degree 2, and degree 9 (degree 9: infinitely many solutions pass through the training points)
Capacity
• Representational capacity: the family of functions the learning algorithm can choose from
• Effective capacity may be less than the representational capacity because it is hard to find the best function within that family
• Occam’s razor: Among competing
hypotheses that explain known observations
equally well, one should choose the
“simplest” one.
• Statistical learning theory shows that the
discrepancy between training error and
generalization error is bounded from above by a
quantity that grows as the model capacity grows
but shrinks as the number of training examples
increases
Generalization and Capacity
Figure: relationship between capacity and error (illustration)
High capacity: non-parametric models
• Nearest neighbor regression:
 ŷ = y_i where i = argmin_i ‖𝑿_{i,:} − 𝒙‖₂²
Bayes error
• The error incurred by an oracle making predictions from the true distribution p(𝒙, y).
• This error is due to:
 There may still be noise in the distribution
 y might be inherently stochastic
 y may be deterministic but involve other variables besides 𝒙
Training Set Size
Figure: effect of training set size, using a degree-5 polynomial target with a moderate amount of noise added
The no free lunch theorem
• Averaged over all possible data-generating
distributions, every classification algorithm has
the same error rate when classifying previously
unobserved points.

• In some sense, no machine learning


algorithm is universally any better than
any other.

• The most sophisticated algorithm we can


conceive of has the same average performance
(over all possible tasks) as merely predicting
that every point belongs to the same class.
No free lunch
• The goal of machine learning research is not
to seek a universal learning algorithm or the
absolute best learning algorithm.
• Instead, our goal is to understand what
kinds of distributions are relevant to the
“real world”
 and what kinds of machine learning algorithms perform well on data drawn from the distributions we care about.
Regularization
• We can give a learning algorithm a
preference for one solution in its
hypothesis space to another.
 both functions are eligible, but one is
preferred.
• Example: linear regression with weight decay:
 J(𝒘) = MSE_train + λ𝒘ᵀ𝒘

• We are expressing a preference for the weights to have a smaller L² norm
 Larger λ forces the weights to become smaller (a minimal sketch follows below)
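A minimal sketch of weight decay for linear regression, assuming the J(𝒘) = MSE_train + λ𝒘ᵀ𝒘 objective above and a NumPy-only setting; the data and the exact scaling of λ in the closed-form solution are illustrative choices:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize MSE_train + lam * w^T w.
    Setting the gradient to zero gives (X^T X + m*lam*I) w = X^T y
    (the factor m comes from the 1/m inside the MSE term)."""
    m, n = X.shape
    return np.linalg.solve(X.T @ X + m * lam * np.eye(n), X.T @ y)

# Illustrative data: degree-9 polynomial features of a noisy quadratic
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=15)
y = 1.0 - 2.0 * x + 3.0 * x**2 + 0.1 * rng.normal(size=15)
X = np.vander(x, 10, increasing=True)           # degree-9 model

for lam in (0.0, 1e-3, 1.0):
    w = ridge_fit(X, y, lam)
    print(f"lambda={lam:g}  ||w||^2 = {w @ w:.3f}")  # larger lambda -> smaller weights
```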
Weight Decay
Figure: the true function is quadratic, but here we use only models with degree 9
Regularization
• Regularization is any modification we
make to a learning algorithm that is
intended to reduce its generalization
error but not its training error.
• Regularization is one of the central concerns
of the field of machine learning
Hyperparameters
• The values of hyperparameters are not
adapted by the learning algorithm itself
• Polynomial regression hyperparameters:
 Degree of the polynomial
 𝜆 (control of the weight decay)
• To set the hyperparameters, we need a
validation set
 Typically, one uses about 80% of the
training data for training and 20% for
validation.
Cross validation
• It is used to estimate generalization error of a
learning algorithm A when the given dataset D
is too small for a simple train/test or train/valid
split to yield accurate estimation of
generalization error

• In k-fold cross-validation a partition of the


dataset is formed by splitting it into k non-
overlapping subsets.
 The test error may then be estimated by
taking the average test error across k trials.
On trial i, the i-th subset of the data is used as the test set and the rest of the data is used as the training set.
Cross validation: the k-fold procedure (a minimal sketch is given below)
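A minimal NumPy sketch of the k-fold procedure described above; the helper names, the least-squares model, and the MSE error function are illustrative assumptions:

```python
import numpy as np

def k_fold_indices(m, k, seed=0):
    """Split m example indices into k non-overlapping folds."""
    idx = np.random.default_rng(seed).permutation(m)
    return np.array_split(idx, k)

def cross_validate(X, y, fit, error, k=5):
    """On trial i, fold i is the test set and the rest is the training set.
    Returns the average test error across the k trials."""
    folds = k_fold_indices(len(y), k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        errors.append(error(model, X[test_idx], y[test_idx]))
    return float(np.mean(errors))

# Illustrative use with a least-squares linear model
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
error = lambda w, X, y: float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -1.0, 2.0]) + 0.1 * rng.normal(size=50)
print("5-fold CV estimate of MSE:", cross_validate(X, y, fit, error, k=5))
```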
Point estimation
• Provides the single “best” prediction of some
quantity of interest
• can be a single parameter or a vector of
parameters in some parametric model
• A point estimator or statistic is any function of the
data (assuming i.i.d. samples):
 𝜽̂_m = g(𝒙^(1), 𝒙^(2), …, 𝒙^(m))
• Frequentist perspective: we assume that
the true parameter value θ is fixed but
unknown
• Function estimation is really just the same as
estimating a parameter θ; the function estimator
is simply a point estimator in function space.
Statistical bias
• bias(𝜽̂_m) = 𝔼[𝜽̂_m] − 𝜽, where the expectation is over the data 𝒙^(1), 𝒙^(2), …, 𝒙^(m)
• An estimator 𝜽̂_m is said to be unbiased if bias(𝜽̂_m) = 0, which implies 𝔼[𝜽̂_m] = 𝜽.
• An estimator 𝜽̂_m is said to be asymptotically unbiased if lim_{m→∞} bias(𝜽̂_m) = 0, which implies lim_{m→∞} 𝔼[𝜽̂_m] = 𝜽.
Example: Bernoulli distribution
• Consider a set of samples {x^(1), x^(2), …, x^(m)} that are i.i.d. according to a Bernoulli distribution with mean θ:
 P(x^(i); θ) = θ^{x^(i)} (1 − θ)^{1 − x^(i)}
• A common estimator of θ is the sample mean:
 θ̂_m = (1/m) ∑_{i=1}^{m} x^(i)

Example: Gaussian distribution - mean
• Consider a set of samples {x^(1), x^(2), …, x^(m)} that are i.i.d. according to a Gaussian distribution:
 p(x^(i)) = 𝒩(x^(i); μ, σ²)
• Sample mean estimator:
 μ̂_m = (1/m) ∑_{i=1}^{m} x^(i)
• The sample mean is an unbiased estimator of the Gaussian mean parameter.
Example: Gaussian distribution - variance
• Biased estimator:
 σ̂²_m = (1/m) ∑_{i=1}^{m} (x^(i) − μ̂_m)²
 The sample variance is a biased estimator.
• Unbiased estimator:
 σ̃²_m = (1/(m−1)) ∑_{i=1}^{m} (x^(i) − μ̂_m)²
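A quick Monte Carlo check (an illustrative sketch, not from the slides) that dividing by m underestimates σ² on average while dividing by m − 1 does not:

```python
import numpy as np

# Compare the two variance estimators over many simulated datasets
rng = np.random.default_rng(4)
mu, sigma2, m, trials = 0.0, 4.0, 10, 20_000

biased, unbiased = [], []
for _ in range(trials):
    x = rng.normal(mu, np.sqrt(sigma2), size=m)
    mu_hat = x.mean()
    biased.append(np.sum((x - mu_hat) ** 2) / m)         # divides by m
    unbiased.append(np.sum((x - mu_hat) ** 2) / (m - 1)) # divides by m - 1

print("true sigma^2          :", sigma2)
print("E[biased estimator]   :", np.mean(biased))    # ~ (m-1)/m * sigma^2 = 3.6
print("E[unbiased estimator] :", np.mean(unbiased))  # ~ 4.0
```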

Variance
• how much we expect an estimator to
vary as a function of the data sample?

• Var(θ̂) = ?
• Standard error: SE(θ̂) = ?
• Just as we might like an estimator to
exhibit low bias we would also like it to
have relatively low variance.
Example: Bernoulli
distribution

• Var(θ̂_m) = (1/m) θ(1 − θ): the variance of the estimator decreases as a function of m, the number of examples in the dataset.
Trading off Bias and Variance
Figure (Goodfellow 2016)
Bias vs. variance (illustrations)
Maximum Likelihood Estimation
• Rather than guessing that some function might make a good estimator and then analyzing its bias and variance, we would like to have some principle from which we can derive specific functions that are good estimators for different models.
• θ_ML = argmax_θ p_model(𝕏; θ) = argmax_θ ∏_{i=1}^{m} p_model(x^(i); θ) (independent, identically distributed examples)
• Taking the log gives the same argmax of θ: θ_ML = argmax_θ ∑_{i=1}^{m} log p_model(x^(i); θ)
• Divide by m: θ_ML = argmax_θ 𝔼_{x∼p̂_data} log p_model(x; θ)
 MLE minimizes the dissimilarity between the empirical distribution p̂_data and the model distribution.
• Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by the model.
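A small sketch, assuming a Bernoulli model, showing that the minimizer of the negative log-likelihood coincides with the closed-form sample-mean estimator from the earlier slide; the grid search over θ is purely illustrative:

```python
import numpy as np

# NLL(theta) = -sum_i [ x_i log theta + (1 - x_i) log(1 - theta) ]
rng = np.random.default_rng(5)
x = (rng.uniform(size=200) < 0.3).astype(float)   # samples with true theta = 0.3

thetas = np.linspace(0.01, 0.99, 999)
nll = -(x.sum() * np.log(thetas) + (len(x) - x.sum()) * np.log(1 - thetas))

print("argmin of NLL :", thetas[np.argmin(nll)])  # minimizer of the negative log-likelihood
print("sample mean   :", x.mean())                # closed-form MLE; the two should agree
```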
Conditional log-likelihood
• Supervised learning: θ_ML = argmax_θ ∑_{i=1}^{m} log p(y^(i) | 𝒙^(i); θ)
• Instead of producing a single prediction ŷ, we now think of the model as producing a conditional distribution p(y | 𝒙).
• For linear regression, take the output of linear regression as the mean of a Gaussian with fixed variance:
 p(y | 𝒙) = 𝒩(y; ŷ(𝒙; 𝒘), σ²)
Conditional log-likelihood for linear regression
• To maximize: ∑_{i=1}^{m} log p(y^(i) | 𝒙^(i); θ) over the examples from the dataset
• Same as minimizing: MSE_train (the Gaussian log-likelihood reduces to a squared-error term plus constants)
Properties of maximum likelihood
• The best estimator asymptotically, as the number of examples m → ∞, in terms of its rate of convergence as m increases.
• Consistency: plim_{m→∞} θ̂_m = θ
 Under appropriate conditions:
  o The true distribution must lie within the model family
  o The true distribution must correspond to exactly one value of θ
Bayesian statistics
• the true parameter 𝜃 is unknown or
uncertain and thus is represented as a
random variable.

• This differs from the single point estimate of frequentist statistics.
Maximum A Posteriori (MAP) Estimation
• Add the prior:
 θ_MAP = argmax_θ p(θ | 𝒙) = argmax_θ [log p(𝒙 | θ) + log p(θ)]
• MAP Bayesian inference has the advantage of leveraging information that is brought by the prior and cannot be found in the training data.
 If this prior is given by 𝒩(𝒘; 𝟎, (1/λ)𝑰²), then the log-prior term is proportional to the familiar λ𝒘ᵀ𝒘 weight decay penalty.


Supervised learning examples
• Logistic regression
• Support vector machines
• Decision trees
Logistic regression

• There is no closed-form solution for the optimal


weights.
• Instead, we must search for them by maximizing the log-likelihood.
• We can do this by minimizing the negative log-likelihood (NLL) using gradient descent, as in the sketch below.
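A minimal sketch of that procedure, assuming a NumPy-only setting and plain full-batch gradient descent on the NLL; the learning rate, step count, and synthetic data are illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Minimize the negative log-likelihood by gradient descent.
    For the sigmoid model, grad NLL = X^T (sigmoid(Xw) - y) / m."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(steps):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - y) / m
    return w

# Illustrative, roughly linearly separable data
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
X_aug = np.hstack([X, np.ones((200, 1))])      # bias term via constant feature

w = fit_logistic(X_aug, y)
print("training accuracy:", np.mean((sigmoid(X_aug @ w) > 0.5) == y))
```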
Support vector machines
• Introduced by Vapnik in the 90s
• Widest street approach
• Decision rule: which side of the street?
 o 𝒘·𝒙 + b ≥ 0 ⇒ +
 o 𝒘·𝒙 + b ≤ 0 ⇒ −
• w is perpendicular to the median line of the street, but which w?
 o We add the following scaling constraints:
  • 𝒘·𝒙₊ + b ≥ 1
  • 𝒘·𝒙₋ + b ≤ −1
SVM
• y_i = +1 for + samples, y_i = −1 for − samples
• y_i(𝒘·𝒙_i + b) ≥ 1: the same equation for x₊ and x₋
• Constraint for 𝒙_i in the gutter: y_i(𝒘·𝒙_i + b) − 1 = 0
• width = (𝒙₊ − 𝒙₋)·𝒘/‖𝒘‖ = ((1 − b) − (−1 − b))/‖𝒘‖ = 2/‖𝒘‖
• Maximize the width of the street:
 max 2/‖𝒘‖ ⇒ max 1/‖𝒘‖ ⇒ min (1/2)‖𝒘‖²
Lagrange multipliers
• L = (1/2)‖𝒘‖² − ∑_i α_i [y_i(𝒘·𝒙_i + b) − 1], with α_i ≥ 0
• ∂L/∂𝒘 = 𝒘 − ∑_i α_i y_i 𝒙_i = 𝟎 ⇒ 𝒘 = ∑_i α_i y_i 𝒙_i
• ∂L/∂b = ∑_i α_i y_i = 0
• Replacing 𝒘 in L:
 L = (1/2)(∑_i α_i y_i 𝒙_i)·(∑_j α_j y_j 𝒙_j) − (∑_i α_i y_i 𝒙_i)·(∑_j α_j y_j 𝒙_j) − b ∑_i α_i y_i + ∑_i α_i
 L = ∑_i α_i − (1/2) ∑_i ∑_j α_i α_j y_i y_j 𝒙_i·𝒙_j
• Numerical analysts will solve this problem for the α_i
Kernel trick
• α_i ≠ 0 only for points in the gutter, also known as support vectors
• Decision for 𝒙_j: 𝒘·𝒙_j + b = ∑_i α_i y_i 𝒙_i·𝒙_j + b
• We can change the representation of 𝒙_i to be φ(𝒙_i)
• We do not need direct access to φ(𝒙_i)
 Instead we define k(𝒙_i, 𝒙_j) = φ(𝒙_i)·φ(𝒙_j)
 The function k is called a kernel


Advantages of the kernel trick
• The kernel trick is powerful for two reasons:
 It allows us to learn models that are nonlinear as a function of 𝒙 using convex optimization techniques that are guaranteed to converge efficiently.
 The kernel function often admits an implementation that is significantly more computationally efficient than naively constructing two φ(𝒙) vectors and explicitly taking their dot product.
  o In some cases, φ(𝒙) can even be infinite dimensional.
Gaussian kernel
• Aka RBF: Radial Basis Function

• We can think of the Gaussian kernel as performing a


kind of template matching.
 A training example x associated with training
label y becomes a template for class y.
 When a test point x’ is near x according to
Euclidean distance, the Gaussian kernel has a
large response, indicating that x’ is very
similar to the x template.
 The model then puts a large weight on the
associated training label y.
• Overall, the prediction will combine many such
training labels weighted by the similarity of the
corresponding training examples.
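To make the template-matching intuition concrete, here is a small sketch of an RBF kernel and a kernel-weighted prediction. Note this is not the trained SVM decision function: the learned coefficients α_i are replaced by uniform weights purely for illustration, and all names and data are assumptions:

```python
import numpy as np

def rbf_kernel(x, x_prime, sigma=1.0):
    """Gaussian (RBF) kernel: k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)).
    Large when x' is near x in Euclidean distance (template matching)."""
    diff = x - x_prime
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma**2))

def kernel_predict(x_new, X_train, y_train, sigma=1.0):
    """Combine training labels, weighted by the kernel similarity of the
    corresponding training examples (uniform weights instead of learned alphas)."""
    weights = np.array([rbf_kernel(x_new, xi, sigma) for xi in X_train])
    return np.sign(weights @ y_train)   # y_train in {-1, +1}

rng = np.random.default_rng(7)
X_train = rng.normal(size=(40, 2))
y_train = np.where(X_train[:, 0] > 0, 1.0, -1.0)
print(kernel_predict(np.array([1.5, 0.0]), X_train, y_train))   # likely +1 for this toy data
```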
SVM limitations
• High computational cost of training when the dataset is large.

• The current deep learning renaissance began


when Hinton et al. (2006) demonstrated that
a neural network could outperform the RBF
kernel SVM on the MNIST benchmark.
Watch (SVM)
• https://www.youtube.com/watch?v=_PwhiWxHK8o
K-Nearest Neighbors
• There is not even really a training stage or
learning process.
 We find the k-nearest neighbors to x in the
training data X. We then return the average
of the corresponding y values in the
training set
• In the case of classification, we can
average over one-hot code vectors
 We can then interpret the average over
these one-hot codes as giving a
probability distribution over classes
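A minimal sketch of k-NN classification via averaged one-hot codes, as described above; the Euclidean metric, the value of k, and the toy data are illustrative assumptions:

```python
import numpy as np

def knn_predict_proba(x_new, X_train, Y_onehot, k=3):
    """k-nearest-neighbor classification: find the k nearest training points
    and average their one-hot label vectors to get a class distribution."""
    d = np.linalg.norm(X_train - x_new, axis=1)       # Euclidean distances
    nearest = np.argsort(d)[:k]
    return Y_onehot[nearest].mean(axis=0)             # probability per class

# Illustrative 2-class data
rng = np.random.default_rng(8)
X_train = rng.normal(size=(30, 2))
labels = (X_train[:, 0] > 0).astype(int)
Y_onehot = np.eye(2)[labels]

probs = knn_predict_proba(np.array([0.8, -0.2]), X_train, Y_onehot, k=5)
print("class probabilities:", probs, "-> predicted class", probs.argmax())
```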
K-NN limitations
• It cannot learn that one feature is
more discriminative than another.

 For example, imagine we have a regression task with 𝒙 ∈ ℝ^100 drawn from an isotropic Gaussian distribution,
 but only a single variable is relevant to the output: y = x_1
 Nearest neighbor regression will not be able to detect this simple pattern.
Decision Trees
• How a decision tree might divide ℝ².

Each leaf requires at least one


training example to define,
so it is not possible for the
decision tree to
learn a function that has more local
maxima than the number of training
examples.
Decision trees
limitations
• It struggles to solve some problems that
are easy even for logistic regression.

• If we have a two-class problem and the positive class occurs wherever x_2 > x_1,
 the decision boundary is not axis-aligned.


The decision tree will thus need to
approximate the decision boundary with
many splits, implementing a step function
that constantly walks back and forth across
the true decision function.
Unsupervised learning
• Density estimation
• Learning to draw samples from a distribution
• Learning to denoise data from some
distribution
• Clustering the data into groups of related
examples
• Simpler representation:
 Lower-dimensional representation
 Sparse representation
 Independent representation
Principal Components
Analysis

• lower
dimensionality
• no linear correlation
No linear
correlation
• Let us consider the m×n-dimensional design matrix 𝑿.
• We will assume that the data has a mean of zero, 𝔼[𝒙] = 𝟎.
 If this is not the case, the data can easily be centered by subtracting the mean from all examples in a preprocessing step.
• The unbiased sample covariance matrix associated with X is given by:
 Var(𝒙) = (1/(m−1)) 𝑿ᵀ𝑿
• PCA finds 𝒛 = 𝑾ᵀ𝒙 where Var(𝒛) is diagonal.
Decorrelation
• PCA finds 𝒛 = 𝑾ᵀ𝒙 where Var(𝒛) is diagonal.
 Remember that 𝑿ᵀ𝑿 = 𝑾𝚲𝑾ᵀ
 Var(𝒛) = (1/(m−1)) 𝒁ᵀ𝒁 = (1/(m−1)) 𝑾ᵀ𝑿ᵀ𝑿𝑾 = (1/(m−1)) 𝑾ᵀ𝑾𝚲𝑾ᵀ𝑾 = (1/(m−1)) 𝚲, since 𝑾ᵀ𝑾 = 𝑰
• PCA disentangles the unknown factors of
variation underlying the data.
 this disentangling takes the form of finding a
rotation of the input space that aligns the
principal axes of variance with the basis of the
new representation space associated with z.
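A minimal NumPy sketch of PCA by eigendecomposition of the sample covariance, matching the Var(𝒛) = (1/(m−1))𝚲 derivation above; the synthetic data and function name are assumptions:

```python
import numpy as np

def pca(X, num_components):
    """PCA via eigendecomposition of the sample covariance (data centered first).
    Returns Z = X_centered @ W and the basis W, so that z = W^T x is decorrelated."""
    Xc = X - X.mean(axis=0)                            # center the data
    cov = Xc.T @ Xc / (X.shape[0] - 1)                 # unbiased sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)             # symmetric matrix -> eigh
    order = np.argsort(eigvals)[::-1]                  # sort by decreasing variance
    W = eigvecs[:, order[:num_components]]
    return Xc @ W, W

# Illustrative correlated 2-D data
rng = np.random.default_rng(9)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])
Z, W = pca(X, 2)
print(np.round(np.cov(Z, rowvar=False), 3))            # off-diagonals ~ 0: decorrelated
```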
k-means clustering
• Divides the training set into k different
clusters of examples

• Provides a k-dimensional one-hot code vector h representing an input 𝒙.
 If 𝒙 belongs to cluster i, then h_i = 1 and all other entries of the representation h are zero.
 This is an example of a sparse
representation
k-means algorithm
• Initialize k different centroids {μ^(1), …, μ^(k)} to different values,
• then alternate between two different steps until convergence (a minimal sketch follows below):
 In one step, each training example is assigned to cluster i, where i is the index of the nearest centroid μ^(i).
 In the other step, each centroid μ^(i) is updated to the mean of all training examples x^(j) assigned to cluster i.
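A minimal NumPy sketch of the two alternating steps referenced above; the initialization scheme, convergence test, and toy data are illustrative assumptions:

```python
import numpy as np

def k_means(X, k, iters=100, seed=0):
    """Alternate between assigning examples to the nearest centroid and
    updating each centroid to the mean of its assigned examples."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initialize from data
    for _ in range(iters):
        # assignment step: index of the nearest centroid for each example
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # update step: mean of the examples assigned to each cluster
        new_centroids = np.array([X[assign == i].mean(axis=0) if np.any(assign == i)
                                  else centroids[i] for i in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    one_hot = np.eye(k)[assign]                                # sparse representation h
    return centroids, assign, one_hot

rng = np.random.default_rng(10)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in ([0, 0], [3, 3], [0, 3])])
centroids, assign, h = k_means(X, k=3)
print(np.round(centroids, 2))
```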
How to evaluate clustering?
• Suppose we have images of
 Red trucks, gray trucks
 Red cars, gray cars
• One clustering algorithm may find a cluster of
cars and a cluster of trucks
• Another learning algorithm may find a cluster of red vehicles and a cluster of gray vehicles
• A third algorithm may find 4 clusters …
 Red cars similarity to gray trucks is same
as their similarity to gray cars!
• A distributed representation may have two
attributes for each vehicle
Building a machine learning algorithm
• Simple recipe:
 A dataset
 A cost function (e.g. MSE, negative log-
likelihood)
 A model (e.g. linear, non-linear)
 An optimization procedure (closed form,
gradient descent, stochastic gradient
descent … )
Exercise
• Link equations and definitions
Stochastic gradient descent
• Computing the gradient over the full training set is O(m)
• Instead, sample a mini-batch and estimate the gradient from it (a minimal sketch follows below)
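A minimal sketch of mini-batch SGD for the linear regression/MSE setting used earlier in these slides; the batch size, learning rate, and synthetic data are illustrative assumptions:

```python
import numpy as np

def sgd_linear_regression(X, y, batch_size=16, lr=0.05, epochs=50, seed=0):
    """Stochastic gradient descent on MSE: each step uses a mini-batch,
    so the per-step cost is O(batch_size) instead of O(m)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(m), m // batch_size):
            Xb, yb = X[idx], y[idx]
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)   # mini-batch gradient estimate
            w -= lr * grad
    return w

rng = np.random.default_rng(11)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.05 * rng.normal(size=1000)
print(np.round(sgd_linear_regression(X, y), 3))   # ~ [1, 2, -1]
```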
Challenges motivating
deep learning
• Simple machine learning algorithms work
very well on a wide variety of important
problems.
• However, they have not succeeded in
solving the central problems in AI, such as
recognizing speech or recognizing objects.
• Generalizing to new examples becomes
exponentially more difficult when working
with high- dimensional data,
• Such data also often impose high
computational costs.
Curse of Dimensionality
• Many machine learning problems become exceedingly difficult when the number of dimensions in the data is high.
• Figure: with one variable, 10 regions of interest; with two variables, 10 regions of interest along each axis; with three variables, 10 regions of interest along each axis.
• For d dimensions and v values to be distinguished along each axis, we seem to need O(v^d) regions.
Smoothness prior (aka local consistency)
• The function we learn should not change very much within a small region.
 KNN
 RBF kernel
 Decision trees
• To distinguish O(k) regions in input space, all of these methods require O(k) examples.
 (Voronoi diagram illustration)
Smoothness prior is
not enough!
• Is there a way to represent a complex
function that has many more regions to be
distinguished than the number of training
examples?
 A checkerboard contains many
variations but there is a simple
structure to them.
• The answer is Yes:
 If we introduce some dependencies
between the regions
 via additional assumptions about the
underlying data generating distribution
Deep learning solution
• Deep learning assumes that the data was
generated by the composition of factors or
features, potentially at multiple levels in a
hierarchy
• These apparently mild assumptions allow an
exponential gain in the relationship between
the number of examples and the number of
regions that can be distinguished.
• Deep and distributed representations counter the
curse of dimensionality.
Manifold
• A manifold is a connected set of points
 that can be approximated well by considering
only a small number of degrees of freedom,
or dimensions, embedded in a higher-
dimensional space.
• A manifold is a set of points, associated
with a neighborhood around each
point.
 Transformations can be applied to move
on the manifold from one position to a
neighboring one.
• From any given point, the manifold locally appears to be a Euclidean space.
 In everyday life, we experience the surface
of the world as a 2-D plane, but it is in fact
a spherical manifold in 3-D space.
Manifold
• Figure: a one-dimensional manifold embedded in two-dimensional space.
Manifold
• we allow the dimensionality of the manifold
to vary from one point to another.
 This often happens when a manifold
intersects itself.
 For example, a figure eight is a manifold
that has a single dimension in most places
but two dimensions at the intersection at
the center.
Manifold learning
• Manifold learning algorithms assume that most of ℝ^n consists of invalid inputs,
 interesting inputs occur only along a collection of manifolds,
 interesting variations in the output of the learned function occur only along directions that lie on the manifold,
 or interesting variations happen only when we move from one manifold to another.
• This is the key assumption.
Arguments supporting the manifold hypothesis
• The probability distribution over images, text strings, and sounds that occur in real life is highly concentrated.
 (Figure: uniformly sampled images)
Arguments supporting the manifold hypothesis
• Examples we encounter are connected to
each other by applying transformations to
traverse the manifold.

(Figure: QMUL face dataset)
Conclusion
• The manifold assumption is at least
approximately correct.

• This concludes part I of the course


• You are now prepared to embark upon your
study of deep learning.

Congratulations
