Unit-I: Machine Learning Basics

The document provides an overview of machine learning concepts, including definitions of tasks, performance measures, and types of learning such as supervised, unsupervised, and reinforcement learning. It discusses the importance of generalization, capacity, overfitting, and underfitting, along with techniques like regularization and cross-validation. Additionally, it touches on the no free lunch theorem and the role of hyperparameters in model training and evaluation.


Machine Learning Basics

Machine Learning
• A computer program is said to learn from experience 𝐸 with respect to some class of tasks 𝑇 and performance measure 𝑃, if its performance at tasks in 𝑇, as measured by 𝑃, improves with experience 𝐸 (Tom M. Mitchell, 1997)
• Example
 𝐸: the experience of playing thousands of games
 𝑇: playing checkers
 𝑃: the fraction of games it wins against human opponents
The task, T
• Tasks are usually described in terms of how the machine learning system should process an example 𝒙 ∈ ℝ^n, where each entry x_i is a feature
 Classification: learn 𝑓: ℝ^n → {1, …, k}
  o y = 𝑓(𝒙) assigns the input to the category with numerical code y
  o Example: object recognition
 Classification with missing inputs: learn a set of functions
  o Example: medical diagnosis
The task, T
 Regression: learn 𝑓: ℝ^n → ℝ
  o Example: predict the claim amount for an insured person
 Transcription: unstructured representation to discrete textual form
  o Example: optical character recognition, speech recognition
 Machine translation: sequence to sequence
  o Example: translate English to French
The task, T
 Structured output: output is a vector or
data structure of values that are tightly
interrelated
o Example: parsing natural language
sentence into a tree that describes
grammatical structure by tagging nodes
of the tree as being verbs, nouns, etc.
o image segmentation: assigning a
pixel to a segment
o Image captioning
The task, T
 Anomaly detection: flag unusual or
atypical events
o Credit card fraud detection
 Synthesis and sampling: generate
examples that are similar to those in the
training data
o Generate textures for video games

o Speech synthesis
 Imputation of missing values: predict
the values of the missing entries
The task, T
 Denoising: 𝑓: 𝒙̃ ∈ ℝ^n → 𝒙 ∈ ℝ^n
  o Predict the clean example 𝒙 from the corrupted example 𝒙̃
  o Estimate p(𝒙 | 𝒙̃)
 Density estimation or probability mass estimation
  o p_model(𝒙): ℝ^n → ℝ (x can be discrete or continuous)
  o Example: p_model(x_i | 𝒙_{−i})
The performance measure, P
• Accuracy:
 The proportion of examples for which the model
produces the correct output
• Error rate (0-1 loss):
 The proportion of examples for which the model
produces incorrect output
• Average log-probability of some examples (for
density estimation)
• It is sometimes difficult to decide what should be measured
• It is sometimes impractical to measure an implicit performance metric
• Evaluate the performance measure using a
test set
The experience, E
• Supervised learning: p(y | 𝒙)
 Experience is a labeled dataset (or datapoints)
 Each datapoint has a label or target
• Unsupervised learning: p(𝒙)
 Experience is an unlabeled dataset
 Clustering, learning probability distributions, denoising, etc.
• The line between supervised and unsupervised is often blurred
 p(𝒙) = ∏_{i=1}^{n} p(x_i | x_1, …, x_{i−1})
 p(y | 𝒙) = p(𝒙, y) / ∑_{y′} p(𝒙, y′)
The experience,
E
• Semi-Supervised learning:
 Some examples include supervision
targets, but others do not.
• Multi-instance learning:
A collection of examples is labeled as
containing an example of a class
• Reinforcement learning:
 Interaction with an environment
A dataset
• Design matrix X:
 Each row is an example
 Each column is a feature
 Example: iris dataset: 𝑋 ∈ ℝ150×4
• Vector of labels y
 Example: 0 is a person, 1 is a car, 2 is a
cat, etc.
 The label can be a set of words
(e.g. transcription)
Example: Linear regression
• Task: regression problem, 𝑓: ℝ^n → ℝ
 ŷ = 𝒘ᵀ𝒙
 Have a test set: (𝑿^test, 𝒚^test)
• Performance measure:
 MSE_test = (1/m) ∑_i (𝒚̂^test − 𝒚^test)_i² = (1/m) ‖𝒚̂^test − 𝒚^test‖₂²
• Experience: (𝑿^train, 𝒚^train)
 Minimize MSE_train
Linear regression
• ∇_𝒘 MSE_train = 0
 (2/m) (𝑿^train)ᵀ (𝑿^train 𝒘 − 𝒚^train) = 0
 ((𝑿^train)ᵀ 𝑿^train) 𝒘 = (𝑿^train)ᵀ 𝒚^train (normal equations)
• We can solve for ŷ = 𝒘ᵀ𝒙 + b using a simple trick (append a constant feature equal to 1)
• b is called the bias term (not to be confused with the statistical bias discussed later)
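The normal equations above can be solved directly with a linear-algebra routine. Below is a minimal NumPy sketch; the synthetic data, variable names, and the constant-feature trick for the bias term are illustrative assumptions, not taken from the slides:

```python
import numpy as np

# Synthetic training data (illustrative only): y = 3*x0 - 2*x1 + 0.5 + noise
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))
y_train = X_train @ np.array([3.0, -2.0]) + 0.5 + 0.1 * rng.normal(size=100)

# Simple trick for the bias term b: append a constant-1 feature to X
X_aug = np.hstack([X_train, np.ones((X_train.shape[0], 1))])

# Normal equations: (X^T X) w = X^T y
w = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_train)
print("weights:", w[:-1], "bias:", w[-1])

# Performance measure on held-out data: MSE_test = (1/m) * ||y_hat - y_test||^2
X_test = rng.normal(size=(20, 2))
y_test = X_test @ np.array([3.0, -2.0]) + 0.5
y_hat = np.hstack([X_test, np.ones((20, 1))]) @ w
print("MSE_test:", np.mean((y_hat - y_test) ** 2))
```

Using np.linalg.solve rather than explicitly inverting 𝑿ᵀ𝑿 is the usual, more numerically stable choice.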
Linear Regression
Figure 5.1 (linear regression example)
Capacity, overfitting
and underfitting
• Generalization: ability to perform well on
previously unobserved data
• i.i.d assumptions:
 The examples in each dataset are
independent from each other,
 and that the train set and test set are
identically distributed, drawn from the
same probability distribution as each
other.
 We call that shared underlying distribution the data-generating distribution, denoted p_data
Capacity, overfitting
and underfitting
• For some fixed value w, the expected
training set error is exactly the same as the
expected test set error, because both
expectations are formed using the same
dataset sampling process.
• In practice:
 expected test set error > expected train
set error
• We need the ability to:
 1. Make the training error small (otherwise: underfitting)
 2. Make the gap between training and test error small (otherwise: overfitting)
Capacity, overfitting
and underfitting
• Underfitting occurs when the model is not
able to obtain a sufficiently low error value
on the training set.
• Overfitting occurs when the gap
between the training error and test
error is too large.
• Informally, a model’s capacity is its ability
to fit a wide variety of functions.
 Low capacity may cause underfitting
 High capacity may cause overfitting
Hypothesis space
• One way to control the capacity of a
learning algorithm is by choosing its
hypothesis space,
 the set of functions that the learning
algorithm is allowed to select as being the
solution.
 For example, linear regression has the set
of all linear functions of its input
 To increase capacity, let the model be a polynomial of degree 9 (see the sketch below):
  ŷ = b + ∑_{i=1}^{9} w_i x^i
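As a rough illustration of how the degree controls capacity, here is a small NumPy sketch that fits polynomials of degree 1, 2 and 9 by least squares; the toy target function and sample size are assumptions made only for this example:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=12)                 # small training set
y = np.sin(3 * x) + 0.1 * rng.normal(size=12)   # illustrative target

def poly_design(x, degree):
    """Columns [1, x, x^2, ..., x^degree]: the hypothesis space of degree-`degree` polynomials."""
    return np.vander(x, degree + 1, increasing=True)

for degree in (1, 2, 9):
    X = poly_design(x, degree)
    # lstsq handles the (possibly ill-conditioned) degree-9 case
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    train_mse = np.mean((X @ w - y) ** 2)
    print(f"degree {degree}: training MSE = {train_mse:.4f}")
```

Training error shrinks as the degree grows, but the high-degree fit is the one most likely to generalize poorly.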
Underfitting and Overfitting in Polynomial Estimation
Figure: fits with degree 1, degree 2, and degree 9 (degree 9: infinitely many solutions pass through the training points)
Capacity
• Representational capacity: the family of functions the learning algorithm can choose from
• Effective capacity may be less than the representational capacity because it is hard to find the best function within that family
• Occam’s razor: Among competing
hypotheses that explain known observations
equally well, one should choose the
“simplest” one.
• Statistical learning theory shows that the
discrepancy between training error and
generalization error is bounded from above by a
quantity that grows as the model capacity grows
but shrinks as the number of training examples
increases
Generalization and Capacity
Figure: relationship between capacity and error (illustration)
High capacity: non-parametric models
• Nearest neighbor regression:
 ŷ = y_i where i = argmin_i ‖𝑿_{i,:} − 𝒙‖₂²
Bayes error
• The error incurred by an oracle making predictions from the true distribution p(𝒙, y).
• This error is due to:
 There may still be noise in the distribution
 y might be inherently stochastic
 y may be deterministic but involve other variables besides 𝒙
Training Set Size
Figure: effect of training set size, using a degree-5 polynomial target with a moderate amount of noise added
The no free lunch theorem
• Averaged over all possible data-generating
distributions, every classification algorithm has
the same error rate when classifying previously
unobserved points.

• In some sense, no machine learning


algorithm is universally any better than
any other.

• The most sophisticated algorithm we can


conceive of has the same average performance
(over all possible tasks) as merely predicting
that every point belongs to the same class.
No free lunch
• The goal of machine learning research is not
to seek a universal learning algorithm or the
absolute best learning algorithm.
• Instead, our goal is to understand what
kinds of distributions are relevant to the
“real world”
 and what kinds of machine learning algorithms perform well on data drawn from the distributions we care about.
Regularization
• We can give a learning algorithm a
preference for one solution in its
hypothesis space to another.
 both functions are eligible, but one is
preferred.
• Example: linear regression with weight decay:
 J(𝒘) = MSE_train + λ𝒘ᵀ𝒘

• We are expressing a preference for the weights to have a smaller L² norm
 Larger λ forces the weights to become smaller (a minimal sketch follows below)
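A minimal sketch of weight decay for linear regression, assuming the J(𝒘) = MSE_train + λ𝒘ᵀ𝒘 objective above and a NumPy-only setting; the data and the exact scaling of λ in the closed-form solution are illustrative choices:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize MSE_train + lam * w^T w.
    Setting the gradient to zero gives (X^T X + m*lam*I) w = X^T y
    (the factor m comes from the 1/m inside the MSE term)."""
    m, n = X.shape
    return np.linalg.solve(X.T @ X + m * lam * np.eye(n), X.T @ y)

# Illustrative data: degree-9 polynomial features of a noisy quadratic
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=15)
y = 1.0 - 2.0 * x + 3.0 * x**2 + 0.1 * rng.normal(size=15)
X = np.vander(x, 10, increasing=True)           # degree-9 model

for lam in (0.0, 1e-3, 1.0):
    w = ridge_fit(X, y, lam)
    print(f"lambda={lam:g}  ||w||^2 = {w @ w:.3f}")  # larger lambda -> smaller weights
```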
Weight Decay
Figure: the true function is quadratic, but here we use only models with degree 9
Regularization
• Regularization is any modification we
make to a learning algorithm that is
intended to reduce its generalization
error but not its training error.
• Regularization is one of the central concerns
of the field of machine learning
Hyperparameters
• The values of hyperparameters are not
adapted by the learning algorithm itself
• Polynomial regression hyperparameters:
 Degree of the polynomial
 𝜆 (control of the weight decay)
• To set the hyperparameters, we need a
validation set
 Typically, one uses about 80% of the
training data for training and 20% for
validation.
Cross validation
• It is used to estimate generalization error of a
learning algorithm A when the given dataset D
is too small for a simple train/test or train/valid
split to yield accurate estimation of
generalization error

• In k-fold cross-validation a partition of the


dataset is formed by splitting it into k non-
overlapping subsets.
 The test error may then be estimated by
taking the average test error across k trials.
On trial i, the i-th subset of the data is used as the test set and the rest of the data is used as the training set.
Cross validation: the k-fold procedure (a minimal sketch is given below)
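A minimal NumPy sketch of the k-fold procedure described above; the helper names, the least-squares model, and the MSE error function are illustrative assumptions:

```python
import numpy as np

def k_fold_indices(m, k, seed=0):
    """Split m example indices into k non-overlapping folds."""
    idx = np.random.default_rng(seed).permutation(m)
    return np.array_split(idx, k)

def cross_validate(X, y, fit, error, k=5):
    """On trial i, fold i is the test set and the rest is the training set.
    Returns the average test error across the k trials."""
    folds = k_fold_indices(len(y), k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        errors.append(error(model, X[test_idx], y[test_idx]))
    return float(np.mean(errors))

# Illustrative use with a least-squares linear model
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
error = lambda w, X, y: float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -1.0, 2.0]) + 0.1 * rng.normal(size=50)
print("5-fold CV estimate of MSE:", cross_validate(X, y, fit, error, k=5))
```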
Point estimation
• Provides the single “best” prediction of some
quantity of interest
• can be a single parameter or a vector of
parameters in some parametric model
• A point estimator or statistic is any function of the
data (assuming i.i.d. samples):
 𝜽̂_m = g(𝒙^(1), 𝒙^(2), …, 𝒙^(m))
• Frequentist perspective: we assume that
the true parameter value θ is fixed but
unknown
• Function estimation is really just the same as
estimating a parameter θ; the function estimator
is simply a point estimator in function space.
Statistical bias
• bias(𝜽̂_m) = 𝔼[𝜽̂_m] − 𝜽, where the expectation is over the data 𝒙^(1), 𝒙^(2), …, 𝒙^(m)
• An estimator 𝜽̂_m is said to be unbiased if bias(𝜽̂_m) = 0, which implies 𝔼[𝜽̂_m] = 𝜽.
• An estimator 𝜽̂_m is said to be asymptotically unbiased if lim_{m→∞} bias(𝜽̂_m) = 0, which implies lim_{m→∞} 𝔼[𝜽̂_m] = 𝜽.
Example: Bernoulli distribution
• Consider a set of samples {x^(1), x^(2), …, x^(m)} that are i.i.d. according to a Bernoulli distribution with mean θ:
 P(x^(i); θ) = θ^{x^(i)} (1 − θ)^{1 − x^(i)}
• A common estimator of θ is the sample mean:
 θ̂_m = (1/m) ∑_{i=1}^{m} x^(i)

Example: Gaussian distribution - mean
• Consider a set of samples {x^(1), x^(2), …, x^(m)} that are i.i.d. according to a Gaussian distribution:
 p(x^(i)) = 𝒩(x^(i); μ, σ²)
• Sample mean estimator:
 μ̂_m = (1/m) ∑_{i=1}^{m} x^(i)
• The sample mean is an unbiased estimator of the Gaussian mean parameter.
Example: Gaussian distribution - variance
• Biased estimator:
 σ̂²_m = (1/m) ∑_{i=1}^{m} (x^(i) − μ̂_m)²
 The sample variance is a biased estimator.
• Unbiased estimator:
 σ̃²_m = (1/(m−1)) ∑_{i=1}^{m} (x^(i) − μ̂_m)²
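A quick Monte Carlo check (an illustrative sketch, not from the slides) that dividing by m underestimates σ² on average while dividing by m − 1 does not:

```python
import numpy as np

# Compare the two variance estimators over many simulated datasets
rng = np.random.default_rng(4)
mu, sigma2, m, trials = 0.0, 4.0, 10, 20_000

biased, unbiased = [], []
for _ in range(trials):
    x = rng.normal(mu, np.sqrt(sigma2), size=m)
    mu_hat = x.mean()
    biased.append(np.sum((x - mu_hat) ** 2) / m)         # divides by m
    unbiased.append(np.sum((x - mu_hat) ** 2) / (m - 1)) # divides by m - 1

print("true sigma^2          :", sigma2)
print("E[biased estimator]   :", np.mean(biased))    # ~ (m-1)/m * sigma^2 = 3.6
print("E[unbiased estimator] :", np.mean(unbiased))  # ~ 4.0
```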

Variance
• how much we expect an estimator to
vary as a function of the data sample?

• Var(θ̂) = ?
• Standard error: SE(θ̂) = ?
• Just as we might like an estimator to
exhibit low bias we would also like it to
have relatively low variance.
Example: Bernoulli
distribution

• Var(θ̂_m) = (1/m) θ(1 − θ): the variance of the estimator decreases as a function of m, the number of examples in the dataset.
Trading off Bias and Variance
Figure (Goodfellow 2016)
Bias vs. variance (illustrations)
Maximum Likelihood Estimation
• Rather than guessing that some function might make a good estimator and then analyzing its bias and variance, we would like to have some principle from which we can derive specific functions that are good estimators for different models.
• θ_ML = argmax_θ p_model(𝕏; θ) = argmax_θ ∏_{i=1}^{m} p_model(x^(i); θ) (independent, identically distributed examples)
• Taking the log gives the same argmax of θ: θ_ML = argmax_θ ∑_{i=1}^{m} log p_model(x^(i); θ)
• Divide by m: θ_ML = argmax_θ 𝔼_{x∼p̂_data} log p_model(x; θ)
 MLE minimizes the dissimilarity between the empirical distribution p̂_data and the model distribution.
• Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by the model.
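A small sketch, assuming a Bernoulli model, showing that the minimizer of the negative log-likelihood coincides with the closed-form sample-mean estimator from the earlier slide; the grid search over θ is purely illustrative:

```python
import numpy as np

# NLL(theta) = -sum_i [ x_i log theta + (1 - x_i) log(1 - theta) ]
rng = np.random.default_rng(5)
x = (rng.uniform(size=200) < 0.3).astype(float)   # samples with true theta = 0.3

thetas = np.linspace(0.01, 0.99, 999)
nll = -(x.sum() * np.log(thetas) + (len(x) - x.sum()) * np.log(1 - thetas))

print("argmin of NLL :", thetas[np.argmin(nll)])  # minimizer of the negative log-likelihood
print("sample mean   :", x.mean())                # closed-form MLE; the two should agree
```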
Conditional log-likelihood
• Supervised learning: θ_ML = argmax_θ ∑_{i=1}^{m} log p(y^(i) | 𝒙^(i); θ)
• Instead of producing a single prediction ŷ, we now think of the model as producing a conditional distribution p(y | 𝒙).
• For linear regression, take the output of linear regression as the mean of a Gaussian with fixed variance:
 p(y | 𝒙) = 𝒩(y; ŷ(𝒙; 𝒘), σ²)
Conditional log-likelihood for linear regression
• To maximize: ∑_{i=1}^{m} log p(y^(i) | 𝒙^(i); θ) over the examples from the dataset
• Same as minimizing: MSE_train (the Gaussian log-likelihood reduces to a squared-error term plus constants)
Properties of maximum likelihood
• The best estimator asymptotically, as the number of examples m → ∞, in terms of its rate of convergence as m increases.
• Consistency: plim_{m→∞} θ̂_m = θ
 Under appropriate conditions:
  o The true distribution must lie within the model family
  o The true distribution must correspond to exactly one value of θ
Bayesian statistics
• the true parameter 𝜃 is unknown or
uncertain and thus is represented as a
random variable.

• This differs from the single point estimate of frequentist statistics.
Maximum A Posteriori (MAP) Estimation
• Add the prior:
 θ_MAP = argmax_θ p(θ | 𝒙) = argmax_θ [log p(𝒙 | θ) + log p(θ)]
• MAP Bayesian inference has the advantage of leveraging information that is brought by the prior and cannot be found in the training data.
 If this prior is given by 𝒩(𝒘; 𝟎, (1/λ)𝑰²), then the log-prior term is proportional to the familiar λ𝒘ᵀ𝒘 weight decay penalty.


Supervised learning examples
• Logistic regression
• Support vector machines
• Decision trees
Logistic regression

• There is no closed-form solution for the optimal


weights.
• Instead, we must search for them by maximizing the log-likelihood.
• We can do this by minimizing the negative log-likelihood (NLL) using gradient descent, as in the sketch below.
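A minimal sketch of that procedure, assuming a NumPy-only setting and plain full-batch gradient descent on the NLL; the learning rate, step count, and synthetic data are illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Minimize the negative log-likelihood by gradient descent.
    For the sigmoid model, grad NLL = X^T (sigmoid(Xw) - y) / m."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(steps):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - y) / m
    return w

# Illustrative, roughly linearly separable data
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
X_aug = np.hstack([X, np.ones((200, 1))])      # bias term via constant feature

w = fit_logistic(X_aug, y)
print("training accuracy:", np.mean((sigmoid(X_aug @ w) > 0.5) == y))
```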
Support vector machines
• Introduced by Vapnik in the 90s
• Widest street approach
• Decision rule: which side of the street?
 o 𝒘·𝒙 + b ≥ 0 ⇒ +
 o 𝒘·𝒙 + b ≤ 0 ⇒ −
• w is perpendicular to the median line of the street, but which w?
 o We add the following scaling constraints:
  • 𝒘·𝒙₊ + b ≥ 1
  • 𝒘·𝒙₋ + b ≤ −1
SVM
• y_i = +1 for + samples, y_i = −1 for − samples
• y_i(𝒘·𝒙_i + b) ≥ 1: the same equation for x₊ and x₋
• Constraint for 𝒙_i in the gutter: y_i(𝒘·𝒙_i + b) − 1 = 0
• width = (𝒙₊ − 𝒙₋)·𝒘/‖𝒘‖ = ((1 − b) − (−1 − b))/‖𝒘‖ = 2/‖𝒘‖
• Maximize the width of the street:
 max 2/‖𝒘‖ ⇒ max 1/‖𝒘‖ ⇒ min (1/2)‖𝒘‖²
Lagrange multipliers
• L = (1/2)‖𝒘‖² − ∑_i α_i [y_i(𝒘·𝒙_i + b) − 1], with α_i ≥ 0
• ∂L/∂𝒘 = 𝒘 − ∑_i α_i y_i 𝒙_i = 𝟎 ⇒ 𝒘 = ∑_i α_i y_i 𝒙_i
• ∂L/∂b = ∑_i α_i y_i = 0
• Replacing 𝒘 in L:
 L = (1/2)(∑_i α_i y_i 𝒙_i)·(∑_j α_j y_j 𝒙_j) − (∑_i α_i y_i 𝒙_i)·(∑_j α_j y_j 𝒙_j) − b ∑_i α_i y_i + ∑_i α_i
 L = ∑_i α_i − (1/2) ∑_i ∑_j α_i α_j y_i y_j 𝒙_i·𝒙_j
• Numerical analysts will solve this problem for the α_i
Kernel trick
• α_i ≠ 0 only for points in the gutter, also known as support vectors
• Decision for 𝒙_j: 𝒘·𝒙_j + b = ∑_i α_i y_i 𝒙_i·𝒙_j + b
• We can change the representation of 𝒙_i to be φ(𝒙_i)
• We do not need direct access to φ(𝒙_i)
 Instead we define k(𝒙_i, 𝒙_j) = φ(𝒙_i)·φ(𝒙_j)
 The function k is called a kernel


Advantages of the kernel trick
• The kernel trick is powerful for two reasons:
 It allows us to learn models that are nonlinear as a function of 𝒙 using convex optimization techniques that are guaranteed to converge efficiently.
 The kernel function often admits an implementation that is significantly more computationally efficient than naively constructing two φ(𝒙) vectors and explicitly taking their dot product.
  o In some cases, φ(𝒙) can even be infinite dimensional.
Gaussian kernel
• Aka RBF: Radial Basis Function

• We can think of the Gaussian kernel as performing a


kind of template matching.
 A training example x associated with training
label y becomes a template for class y.
 When a test point x’ is near x according to
Euclidean distance, the Gaussian kernel has a
large response, indicating that x’ is very
similar to the x template.
 The model then puts a large weight on the
associated training label y.
• Overall, the prediction will combine many such
training labels weighted by the similarity of the
corresponding training examples.
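To make the template-matching intuition concrete, here is a small sketch of an RBF kernel and a kernel-weighted prediction. Note this is not the trained SVM decision function: the learned coefficients α_i are replaced by uniform weights purely for illustration, and all names and data are assumptions:

```python
import numpy as np

def rbf_kernel(x, x_prime, sigma=1.0):
    """Gaussian (RBF) kernel: k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)).
    Large when x' is near x in Euclidean distance (template matching)."""
    diff = x - x_prime
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma**2))

def kernel_predict(x_new, X_train, y_train, sigma=1.0):
    """Combine training labels, weighted by the kernel similarity of the
    corresponding training examples (uniform weights instead of learned alphas)."""
    weights = np.array([rbf_kernel(x_new, xi, sigma) for xi in X_train])
    return np.sign(weights @ y_train)   # y_train in {-1, +1}

rng = np.random.default_rng(7)
X_train = rng.normal(size=(40, 2))
y_train = np.where(X_train[:, 0] > 0, 1.0, -1.0)
print(kernel_predict(np.array([1.5, 0.0]), X_train, y_train))   # likely +1 for this toy data
```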
SVM limitations
• High computational cost of training when the dataset is large.

• The current deep learning renaissance began


when Hinton et al. (2006) demonstrated that
a neural network could outperform the RBF
kernel SVM on the MNIST benchmark.
Watch (SVM)
• https://www.youtube.com/watch?v=_PwhiWxHK8o
K-Nearest Neighbors
• There is not even really a training stage or
learning process.
 We find the k-nearest neighbors to x in the
training data X. We then return the average
of the corresponding y values in the
training set
• In the case of classification, we can
average over one-hot code vectors
 We can then interpret the average over
these one-hot codes as giving a
probability distribution over classes
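A minimal sketch of k-NN classification via averaged one-hot codes, as described above; the Euclidean metric, the value of k, and the toy data are illustrative assumptions:

```python
import numpy as np

def knn_predict_proba(x_new, X_train, Y_onehot, k=3):
    """k-nearest-neighbor classification: find the k nearest training points
    and average their one-hot label vectors to get a class distribution."""
    d = np.linalg.norm(X_train - x_new, axis=1)       # Euclidean distances
    nearest = np.argsort(d)[:k]
    return Y_onehot[nearest].mean(axis=0)             # probability per class

# Illustrative 2-class data
rng = np.random.default_rng(8)
X_train = rng.normal(size=(30, 2))
labels = (X_train[:, 0] > 0).astype(int)
Y_onehot = np.eye(2)[labels]

probs = knn_predict_proba(np.array([0.8, -0.2]), X_train, Y_onehot, k=5)
print("class probabilities:", probs, "-> predicted class", probs.argmax())
```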
K-NN limitations
• It cannot learn that one feature is
more discriminative than another.

 For example, imagine we have a regression task with 𝒙 ∈ ℝ^100 drawn from an isotropic Gaussian distribution,
 but only a single variable is relevant to the output: y = x_1
 Nearest neighbor regression will not be able to detect this simple pattern.
Decision Trees
• How a decision tree might divide ℝ².

Each leaf requires at least one


training example to define,
so it is not possible for the
decision tree to
learn a function that has more local
maxima than the number of training
examples.
Decision trees
limitations
• It struggles to solve some problems that
are easy even for logistic regression.

• If we have a two-class problem and the positive class occurs wherever x_2 > x_1,
 the decision boundary is not axis-aligned.


The decision tree will thus need to
approximate the decision boundary with
many splits, implementing a step function
that constantly walks back and forth across
the true decision function.
Unsupervised learning
• Density estimation
• Learning to draw samples from a distribution
• Learning to denoise data from some
distribution
• Clustering the data into groups of related
examples
• Simpler representation:
 Lower-dimensional representation
 Sparse representation
 Independent representation
Principal Components
Analysis

• lower
dimensionality
• no linear correlation
No linear
correlation
• Let us consider the m×n-dimensional design matrix 𝑿.
• We will assume that the data has a mean of zero, 𝔼[𝒙] = 𝟎.
 If this is not the case, the data can easily be centered by subtracting the mean from all examples in a preprocessing step.
• The unbiased sample covariance matrix associated with X is given by:
 Var(𝒙) = (1/(m−1)) 𝑿ᵀ𝑿
• PCA finds 𝒛 = 𝑾ᵀ𝒙 where Var(𝒛) is diagonal.
Decorrelation
• PCA finds 𝒛 = 𝑾ᵀ𝒙 where Var(𝒛) is diagonal.
 Remember that 𝑿ᵀ𝑿 = 𝑾𝚲𝑾ᵀ
 Var(𝒛) = (1/(m−1)) 𝒁ᵀ𝒁 = (1/(m−1)) 𝑾ᵀ𝑿ᵀ𝑿𝑾 = (1/(m−1)) 𝑾ᵀ𝑾𝚲𝑾ᵀ𝑾 = (1/(m−1)) 𝚲, since 𝑾ᵀ𝑾 = 𝑰
• PCA disentangles the unknown factors of
variation underlying the data.
 this disentangling takes the form of finding a
rotation of the input space that aligns the
principal axes of variance with the basis of the
new representation space associated with z.
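A minimal NumPy sketch of PCA by eigendecomposition of the sample covariance, matching the Var(𝒛) = (1/(m−1))𝚲 derivation above; the synthetic data and function name are assumptions:

```python
import numpy as np

def pca(X, num_components):
    """PCA via eigendecomposition of the sample covariance (data centered first).
    Returns Z = X_centered @ W and the basis W, so that z = W^T x is decorrelated."""
    Xc = X - X.mean(axis=0)                            # center the data
    cov = Xc.T @ Xc / (X.shape[0] - 1)                 # unbiased sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)             # symmetric matrix -> eigh
    order = np.argsort(eigvals)[::-1]                  # sort by decreasing variance
    W = eigvecs[:, order[:num_components]]
    return Xc @ W, W

# Illustrative correlated 2-D data
rng = np.random.default_rng(9)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])
Z, W = pca(X, 2)
print(np.round(np.cov(Z, rowvar=False), 3))            # off-diagonals ~ 0: decorrelated
```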
k-means clustering
• Divides the training set into k different
clusters of examples

• Provides a k-dimensional one-hot code vector h representing an input 𝒙.
 If 𝒙 belongs to cluster i, then h_i = 1 and all other entries of the representation h are zero.
 This is an example of a sparse
representation
k-means algorithm
• Initialize k different centroids {μ^(1), …, μ^(k)} to different values,
• then alternate between two different steps until convergence (a minimal sketch follows below):
 In one step, each training example is assigned to cluster i, where i is the index of the nearest centroid μ^(i).
 In the other step, each centroid μ^(i) is updated to the mean of all training examples x^(j) assigned to cluster i.
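A minimal NumPy sketch of the two alternating steps referenced above; the initialization scheme, convergence test, and toy data are illustrative assumptions:

```python
import numpy as np

def k_means(X, k, iters=100, seed=0):
    """Alternate between assigning examples to the nearest centroid and
    updating each centroid to the mean of its assigned examples."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initialize from data
    for _ in range(iters):
        # assignment step: index of the nearest centroid for each example
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # update step: mean of the examples assigned to each cluster
        new_centroids = np.array([X[assign == i].mean(axis=0) if np.any(assign == i)
                                  else centroids[i] for i in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    one_hot = np.eye(k)[assign]                                # sparse representation h
    return centroids, assign, one_hot

rng = np.random.default_rng(10)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in ([0, 0], [3, 3], [0, 3])])
centroids, assign, h = k_means(X, k=3)
print(np.round(centroids, 2))
```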
How to evaluate clustering?
• Suppose we have images of
 Red trucks, gray trucks
 Red cars, gray cars
• One clustering algorithm may find a cluster of
cars and a cluster of trucks
• Another learning algorithm may find a cluster of red vehicles and a cluster of gray vehicles
• A third algorithm may find 4 clusters …
 Red cars similarity to gray trucks is same
as their similarity to gray cars!
• A distributed representation may have two
attributes for each vehicle
Building a machine learning algorithm
• Simple recipe:
 A dataset
 A cost function (e.g. MSE, negative log-
likelihood)
 A model (e.g. linear, non-linear)
 An optimization procedure (closed form,
gradient descent, stochastic gradient
descent … )
Exercise
• Link equations and definitions
Stochastic gradient descent
• Computing the gradient over the full training set is O(m)
• Instead, sample a mini-batch and estimate the gradient from it (a minimal sketch follows below)
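A minimal sketch of mini-batch SGD for the linear regression/MSE setting used earlier in these slides; the batch size, learning rate, and synthetic data are illustrative assumptions:

```python
import numpy as np

def sgd_linear_regression(X, y, batch_size=16, lr=0.05, epochs=50, seed=0):
    """Stochastic gradient descent on MSE: each step uses a mini-batch,
    so the per-step cost is O(batch_size) instead of O(m)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(m), m // batch_size):
            Xb, yb = X[idx], y[idx]
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)   # mini-batch gradient estimate
            w -= lr * grad
    return w

rng = np.random.default_rng(11)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.05 * rng.normal(size=1000)
print(np.round(sgd_linear_regression(X, y), 3))   # ~ [1, 2, -1]
```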
Challenges motivating
deep learning
• Simple machine learning algorithms work
very well on a wide variety of important
problems.
• However, they have not succeeded in
solving the central problems in AI, such as
recognizing speech or recognizing objects.
• Generalizing to new examples becomes
exponentially more difficult when working
with high- dimensional data,
• Such data also often impose high
computational costs.
Curse of Dimensionality
• Many machine learning problems become exceedingly difficult when the number of dimensions in the data is high.
• Figure: with one variable, 10 regions of interest; with two variables, 10 regions of interest along each axis; with three variables, 10 regions of interest along each axis.
• For d dimensions and v values to be distinguished along each axis, we seem to need O(v^d) regions.
Smoothness prior (aka local consistency)
• The function we learn should not change very much within a small region.
 KNN
 RBF kernel
 Decision trees
• To distinguish O(k) regions in input space, all of these methods require O(k) examples.
 (Voronoi diagram illustration)
Smoothness prior is
not enough!
• Is there a way to represent a complex
function that has many more regions to be
distinguished than the number of training
examples?
 A checkerboard contains many
variations but there is a simple
structure to them.
• The answer is Yes:
 If we introduce some dependencies
between the regions
 via additional assumptions about the
underlying data generating distribution
Deep learning solution
• Deep learning assumes that the data was
generated by the composition of factors or
features, potentially at multiple levels in a
hierarchy
• These apparently mild assumptions allow an
exponential gain in the relationship between
the number of examples and the number of
regions that can be distinguished.
• Deep and distributed representations counter the
curse of dimensionality.
Manifold
• A manifold is a connected set of points
 that can be approximated well by considering
only a small number of degrees of freedom,
or dimensions, embedded in a higher-
dimensional space.
• A manifold is a set of points, associated
with a neighborhood around each
point.
 Transformations can be applied to move
on the manifold from one position to a
neighboring one.
• From any given point, the manifold locally appears to be a Euclidean space.
 In everyday life, we experience the surface
of the world as a 2-D plane, but it is in fact
a spherical manifold in 3-D space.
Manifold
• Figure: a one-dimensional manifold embedded in two-dimensional space.
Manifold
• we allow the dimensionality of the manifold
to vary from one point to another.
 This often happens when a manifold
intersects itself.
 For example, a figure eight is a manifold
that has a single dimension in most places
but two dimensions at the intersection at
the center.
Manifold learning
• Manifold learning algorithms assume that most of ℝ^n consists of invalid inputs,
 interesting inputs occur only along a collection of manifolds,
 interesting variations in the output of the learned function occur only along directions that lie on the manifold,
 or interesting variations happen only when we move from one manifold to another.
• This is the key assumption.
Arguments supporting the manifold hypothesis
• The probability distribution over images, text strings, and sounds that occur in real life is highly concentrated.
 (Figure: uniformly sampled images)
Arguments supporting the manifold hypothesis
• Examples we encounter are connected to
each other by applying transformations to
traverse the manifold.

(Figure: QMUL face dataset)
Conclusion
• The manifold assumption is at least
approximately correct.

• This concludes part I of the course


• You are now prepared to embark upon your
study of deep learning.

Congratulations
