
Machine Learning

Classifiers

UNIT-II
Linear Regression: Pictorially

 Linear regression is like fitting a line or (hyper)plane to a set of points.

 What if a line/plane doesn’t model the input-output relationship very well, e.g., if their relationship is better modeled by a nonlinear curve or curved surface? Do linear models become useless in such cases? No. We can even fit a curve using a linear model after suitably transforming the inputs, e.g., mapping x to features [z1, z2] = 𝜙(x).

 (Figure: with the original single feature, output vs. input needs a nonlinear curve; with two transformed features, output vs. (Feature 1, Feature 2) can be fit by a plane, i.e., a linear model.)

 The transformation 𝜙 can be predefined or learned (e.g., using kernel methods or a deep neural network based feature extractor). More on this later.

 The line/plane must also predict outputs well for the unseen (test) inputs.
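To make this concrete, here is a minimal sketch (not from the slides; the data and names are illustrative assumptions) of fitting a nonlinear relationship with a linear model after a predefined polynomial feature transform 𝜙(x) = [1, x, x²]:

```python
import numpy as np

# Toy 1D data whose input-output relationship is nonlinear (quadratic plus noise).
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(scale=0.5, size=x.shape)

# Predefined transformation phi(x) = [1, x, x^2]: the model stays linear in the weights.
Z = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares on the transformed features fits the curve with a "linear" model.
w, *_ = np.linalg.lstsq(Z, y, rcond=None)
y_hat = Z @ w
print("learned weights:", w)
```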


Simplest Possible Linear Regression Model

 This is the base model for all statistical machine learning.
 x is a single-feature data variable.
 y is the value we are trying to predict.
 The regression model is

   y = w0 + w1 x + ε

 Two parameters to estimate: the slope of the line w1 and the y-intercept w0.
 ε is the unexplained, random, or error component.
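A minimal sketch (illustrative, not from the slides) of estimating the two parameters by ordinary least squares on synthetic data; the true values used to generate the data are assumptions for the example:

```python
import numpy as np

# Synthetic data generated from y = w0 + w1*x + eps.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
eps = rng.normal(scale=1.0, size=100)
y = 2.0 + 3.0 * x + eps

# Closed-form least-squares estimates of the slope and intercept.
w1_hat = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # slope = cov(x, y) / var(x)
w0_hat = y.mean() - w1_hat * x.mean()                # intercept from the means
print(f"w0 ≈ {w0_hat:.2f}, w1 ≈ {w1_hat:.2f}")
```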
Perceptron

Cornell Aeronautical Laboratory


Perceptron
– Invented by Frank Rosenblatt in 1957 in an attempt to understand human memory, learning, and cognitive processes.
– The first computational neural network model, with a remarkable learning algorithm:
  • If a function can be represented by a perceptron, the learning algorithm is guaranteed to quickly converge to that hidden function!
– Became the foundation of pattern recognition research.

Rosenblatt & the Mark I Perceptron: the first machine that could "learn" to recognize and identify optical patterns. One of the earliest and most influential neural networks: an important milestone in AI.
Single Layer Feed-forward Neural Networks: Perceptrons

 Single-layer neural network (perceptron network): a network with all the inputs connected directly to the outputs.
 – Output units all operate separately: no shared weights.
 Since each output unit is independent of the others, we can limit our study to single-output perceptrons.
An Artificial Neuron
Node or Unit: A Mathematical Abstraction
(Artificial neuron, node or unit, processing unit i)

 Input edges, each with a weight (positive or negative; the weights can change over time through learning), including a fixed input a0.

 Input function (in_i): the weighted sum of the unit's inputs,

   in_i = Σ_{j=0}^{n} W_{j,i} a_j

 Activation function (g), typically non-linear, applied to the input function:

   a_i = g( Σ_{j=0}^{n} W_{j,i} a_j )

 Output edges, each with a weight (positive or negative; changing over time through learning).

 In short: a processing element producing an output based on a function of its inputs.

 Note: the fixed input and bias weight are conventional; some authors instead use, e.g., a0 = 1 and -W_{0,i}.
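A minimal sketch (illustrative, not from the slides) of one such unit: a weighted sum over the inputs, including the fixed input a0 for the bias weight, passed through an activation function g. The function name and the convention a0 = 1 are assumptions for the example.

```python
import numpy as np

def neuron_output(weights, inputs, g):
    """One artificial neuron: in_i = sum_j W_ji * a_j, then a_i = g(in_i).

    weights[0] is the bias weight W_0i, paired with the fixed input a_0.
    """
    a = np.concatenate(([1.0], inputs))   # prepend the fixed input a_0 = 1 (one common convention)
    in_i = np.dot(weights, a)             # input function: weighted sum of the inputs
    return g(in_i)                        # activation function applied to the input function

# Example with a hard-threshold activation.
step = lambda z: 1.0 if z > 0 else 0.0
print(neuron_output(np.array([-0.5, 1.0, 1.0]), np.array([1.0, 0.0]), step))  # -> 1.0
```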
Activation Functions

 (a) Threshold activation function: a step function or threshold function (outputs 1 when the input is positive; 0 otherwise).
 (b) Sigmoid (or logistic function) activation function (key advantage: differentiable).
 (c) Sign function: +1 if the input is positive, otherwise -1.

 These functions have a threshold (either hard or soft) at zero.
 Changing the bias weight W_{0,i} moves the threshold location.
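The three activation functions above, written out as a small sketch (not from the slides):

```python
import numpy as np

def threshold(z):
    """(a) Step/threshold function: 1 if the input is positive, else 0."""
    return np.where(z > 0, 1.0, 0.0)

def sigmoid(z):
    """(b) Sigmoid/logistic function: smooth, differentiable, soft threshold at 0."""
    return 1.0 / (1.0 + np.exp(-z))

def sign(z):
    """(c) Sign function: +1 if the input is positive, otherwise -1."""
    return np.where(z > 0, 1.0, -1.0)

z = np.array([-2.0, 0.5, 3.0])
print(threshold(z), sigmoid(z), sign(z))
```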
Training NNs

• Optimizing the loss function
  – Almost all DL models these days are trained with a variant of the gradient descent (GD) algorithm.
  – GD applies iterative refinement of the network parameters.
  – GD uses the opposite direction of the gradient of the loss ℒ(θ) with respect to the NN parameters (i.e., −∂ℒ/∂θ_i) for updating θ_i.
 The gradient of the loss function gives the direction of fastest increase of the loss function when the parameters are changed.
Training NNs

• The loss functions for most DL tasks are defined over very high-dimensional spaces.
  – E.g., the ResNet50 NN has about 23 million parameters.
  – This makes the loss function impossible to visualize.
• We can still gain intuitions by studying 1-dimensional and 2-dimensional examples of loss functions.

(Figure: a 1D loss, where the minimum point is obvious, and a 2D loss surface, blue = low loss, red = high loss. Picture from: https://cs231n.github.io/optimization-1/)
Gradient Descent Algorithm

• Steps in the gradient descent algorithm:
  1. Randomly initialize the model parameters θ.
  2. Compute the gradient of the loss function at θ: ∇ℒ(θ).
  3. Update the parameters as θ ← θ − α∇ℒ(θ), where α is the learning rate.
  4. Go to step 2 and repeat (until a terminating criterion is reached).
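A minimal sketch of these four steps (illustrative, not from the slides), using a simple quadratic loss whose gradient is known analytically; the function names are assumptions for the example:

```python
import numpy as np

def gradient_descent(grad_fn, theta0, alpha=0.1, n_steps=100, tol=1e-6):
    """Plain (vanilla) gradient descent: theta <- theta - alpha * grad(theta)."""
    theta = np.asarray(theta0, dtype=float)   # step 1: (random) initialization supplied by caller
    for _ in range(n_steps):
        g = grad_fn(theta)                    # step 2: gradient of the loss at theta
        theta = theta - alpha * g             # step 3: move against the gradient
        if np.linalg.norm(g) < tol:           # step 4: stop when a terminating criterion is met
            break
    return theta

# Example loss L(theta) = (theta1 - 3)^2 + (theta2 + 1)^2, gradient derived by hand.
grad = lambda t: np.array([2 * (t[0] - 3), 2 * (t[1] + 1)])
print(gradient_descent(grad, theta0=np.random.randn(2)))   # approaches [3, -1]
```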
Gradient Descent Algorithm

• Example: a NN with only 2 parameters w1 and w2, i.e., θ = {w1, w2}.
  – In the figure, different colors are the values of the loss over the (w1, w2) plane (minimum loss is ≈ 1.3).

  1. Randomly pick a starting point θ⁰.
  2. Compute the gradient at θ⁰:
       ∇ℒ(θ⁰) = [ ∂ℒ(θ⁰)/∂w1 , ∂ℒ(θ⁰)/∂w2 ]
  3. Multiply it by the learning rate α and update: θ¹ = θ⁰ − α∇ℒ(θ⁰).
  4. Go to step 2, repeat.

(Slide credit: Hung-yi Lee – Deep Learning Tutorial)
Gradient Descent Algorithm

• Example (contd.): eventually, we would reach a minimum.

  1. Randomly pick a starting point θ⁰.
  2. Compute the gradient at the current point θⁱ: ∇ℒ(θⁱ).
  3. Multiply it by the learning rate α and update, e.g., θ² = θ¹ − α∇ℒ(θ¹), θ³ = θ² − α∇ℒ(θ²).
  4. Go to step 2, repeat.

(Slide credit: Hung-yi Lee – Deep Learning Tutorial)


Gradient Descent Algorithm

• The gradient descent algorithm stops when a local minimum of the loss surface is reached.
  – GD does not guarantee reaching a global minimum.
  – However, empirical evidence suggests that GD works well for NNs.

(Picture from: https://blog.paperspace.com/intro-to-optimization-in-deep-learning-gradient-descent/)
Gradient Descent Algorithm

• For most tasks, the loss surface ℒ is highly complex (and non-convex).
• Random initialization in NNs results in different initial parameters.
  – Gradient descent may reach different minima at every run.
  – Therefore, the NN will produce different predicted outputs.
• Currently, we don’t have an algorithm that guarantees reaching a global minimum for an arbitrary loss function.

(Slide credit: Hung-yi Lee – Deep Learning Tutorial)
Backpropagation

• How to calculate the gradients of the loss function in NNs?
• There are two ways:
  1. Numerical gradient: slow, approximate, but easy way.
  2. Analytic gradient: requires calculus; fast, but a more error-prone way.
• In practice the analytic gradient is used.
  – Analytical differentiation for gradient computation is available in almost all deep learning libraries.
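A minimal sketch (illustrative, not from the slides) contrasting the two approaches: a finite-difference numerical gradient used to check a hand-derived analytic gradient. The loss and function names are assumptions for the example.

```python
import numpy as np

def numerical_gradient(loss_fn, theta, eps=1e-5):
    """Slow but easy: central finite differences, one pair of loss evaluations per parameter."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (loss_fn(theta + e) - loss_fn(theta - e)) / (2 * eps)
    return grad

# Example loss and its analytic gradient (derived by calculus).
loss = lambda t: np.sum(t ** 2) + 3 * t[0]
analytic_grad = lambda t: 2 * t + np.array([3.0, 0.0])

theta = np.array([1.5, -2.0])
print(numerical_gradient(loss, theta))  # ≈ [6.0, -4.0]
print(analytic_grad(theta))             # exactly [6.0, -4.0]
```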
Mini-batch Gradient Descent

• It is wasteful to compute the loss over the entire training set to perform a single parameter update for large datasets.
  – E.g., ImageNet has 14M images.
  – GD (a.k.a. vanilla GD) is replaced with mini-batch GD.
• Mini-batch gradient descent
  – Approach:
     Compute the loss on a batch of images, update the parameters θ, and repeat until all images are used.
     At the next epoch, shuffle the training data, and repeat the above process.
  – Mini-batch GD results in much faster training.
  – Typical batch size: 32 to 256 images.
  – It works because the examples in the training data are correlated.
     I.e., the gradient from a mini-batch is a good approximation of the gradient of the entire training set.
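A minimal sketch of the mini-batch loop (illustrative, not from the slides); grad_fn(theta, X_batch, y_batch) is an assumed, user-supplied gradient function:

```python
import numpy as np

def minibatch_gd(grad_fn, theta, X, y, alpha=0.01, batch_size=32, n_epochs=10):
    """Mini-batch gradient descent: shuffle each epoch, update once per batch."""
    n = X.shape[0]
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        order = rng.permutation(n)                 # shuffle the training data each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # next mini-batch of examples
            g = grad_fn(theta, X[idx], y[idx])     # gradient estimated on the batch only
            theta = theta - alpha * g              # one parameter update per batch
    return theta
```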
Stochastic Gradient Descent

• Stochastic gradient descent
  – SGD uses mini-batches that consist of a single input example.
     E.g., a one-image mini-batch.
  – Although this method is very fast, it may cause significant fluctuations in the loss function.
     Therefore, it is less commonly used, and mini-batch GD is preferred.
  – In most DL libraries, SGD typically means mini-batch SGD (with an option to add momentum).
Problems with Gradient Descent

• Besides the local minima problem, the GD algorithm can be very slow at plateaus, and it can get stuck at saddle points.

(Figure: a 1D cost curve over θ with three trouble spots: very slow progress at a plateau, where ∇ℒ(θ) ≈ 0; stuck at a saddle point, where ∇ℒ(θ) = 0; and stuck at a local minimum, where ∇ℒ(θ) = 0.)

(Slide credit: Hung-yi Lee – Deep Learning Tutorial)
Gradient Descent with Momentum

• Gradient descent with momentum uses the momentum of the gradient for parameter optimization.

(Figure: on the cost curve, the real movement is the negative of the gradient plus the momentum; the accumulated momentum can carry the parameters past points where the gradient = 0.)

(Slide credit: Hung-yi Lee – Deep Learning Tutorial)
Gradient Descent with Momentum

• Parameter update in GD with momentum: θ^(t+1) = θ^t − v^t
   Where: v^t = β v^(t−1) + α ∇ℒ(θ^t)
• Compare to vanilla GD: θ^(t+1) = θ^t − α ∇ℒ(θ^t)
• The term v^t is called momentum.
  – This term accumulates the gradients from the past several steps.
  – It is similar to the momentum of a heavy ball rolling down the hill.
• The parameter β is referred to as the coefficient of momentum.
  – A typical value of the parameter β is 0.9.
• This method updates the parameters in the direction of the weighted average of the past gradients.
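A minimal sketch of the momentum update (illustrative, not from the slides), reusing the same hand-written quadratic gradient as in the earlier GD example:

```python
import numpy as np

def gd_momentum(grad_fn, theta0, alpha=0.1, beta=0.9, n_steps=100):
    """GD with momentum: v = beta*v + alpha*grad(theta); theta = theta - v."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)                   # accumulated (weighted) past gradients
    for _ in range(n_steps):
        v = beta * v + alpha * grad_fn(theta)  # momentum term; beta is typically 0.9
        theta = theta - v                      # update along the accumulated direction
    return theta

grad = lambda t: np.array([2 * (t[0] - 3), 2 * (t[1] + 1)])
print(gd_momentum(grad, theta0=np.zeros(2)))   # approaches [3, -1]
```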
Learning Rate

• Learning rate
  – The gradient tells us the direction in which the loss has the steepest rate of increase, but it does not tell us how far along the opposite direction we should step.
  – Choosing the learning rate (also called the step size) is one of the most important hyper-parameter settings for NN training.

(Figure: with a learning rate that is too small, the updates barely move toward the minimum; with one that is too large, the updates overshoot it.)
Learning Rate

• Training loss for different learning rates
  – High learning rate: the loss increases or plateaus too quickly.
  – Low learning rate: the loss decreases too slowly (it takes many epochs to reach a solution).

(Picture from: https://cs231n.github.io/neural-networks-3/)


Logistic Regression
Background: Generative and Discriminative Classifiers

Logistic Regression
• Important analytic tool in natural and social sciences
• Baseline supervised machine learning tool for classification
• Is also the foundation of neural networks
Generative and Discriminative Classifiers

Naive Bayes is a generative classifier: generative models aim to model the joint probability distribution of the input features and the class labels, i.e., they learn how the data is generated.

By contrast:

Logistic regression is a discriminative classifier: discriminative models focus on modeling the conditional probability of the class labels given the input features. Instead of modeling how the data is generated, they directly learn the decision boundary between different classes.
Generative and Discriminative
Classifiers
Suppose we're distinguishing cat from dog images

(Example cat and dog images from ImageNet.)
Generative Classifier:
• Build a model of what's in a cat image
• Knows about whiskers, ears, eyes
• Assigns a probability to any image:
• how cat-y is this image?

Also build a model for dog images

Now given a new image:


Run both models and see which one fits better
Discriminative Classifier
Just try to distinguish dogs from cats

Oh look, dogs have collars!


Let's ignore everything else
Generative vs Discriminative Classifiers
Finding the correct class c from a document d:

 Naive Bayes (generative): models how a document is generated, via the likelihood and the prior.
 Logistic Regression (discriminative): directly models the posterior P(c|d).
Components of a probabilistic machine learning classifier
Given m input/output pairs (x(i), y(i)):

1. A feature representation of the input. For each input observation x(i), a vector of features [x1, x2, ..., xn]. Feature j for input x(i) is xj, more completely xj(i), or sometimes fj(x).
2. A classification function that computes ŷ, the estimated class, via p(y|x), like the sigmoid or softmax functions.
3. An objective function for learning, like cross-entropy loss.
4. An algorithm for optimizing the objective function: stochastic gradient descent.
The two phases of logistic regression

 Training: we learn weights w and b using stochastic gradient descent and cross-entropy loss.

 Test: given a test example x, we compute p(y|x) using the learned weights w and b, and return whichever label (y = 1 or y = 0) has the higher probability.
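A minimal sketch (illustrative, not part of the slides) of both phases for binary logistic regression: stochastic gradient descent on the cross-entropy loss for training, then thresholding p(y=1|x) at test time. The function names and hyper-parameters are assumptions.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, alpha=0.1, n_epochs=100, seed=0):
    """SGD on cross-entropy: for each example, the gradient is (sigmoid(w.x+b) - y) * x."""
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):
            err = sigmoid(X[i] @ w + b) - y[i]   # derivative of cross-entropy w.r.t. z
            w -= alpha * err * X[i]
            b -= alpha * err
    return w, b

def predict(x, w, b):
    """Test phase: return the label with the higher probability."""
    return int(sigmoid(x @ w + b) > 0.5)
```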
Classification in Logistic Regression
Classification Reminder

Positive/negative sentiment

Spam/not spam
Authorship attribution
(Hamilton or Madison?)
Alexander Hamilton
Text Classification: definition
Input:
◦ a document x
◦ a fixed set of classes C = {c1, c2, …, cJ}

Output: a predicted class ŷ ∈ C
Binary Classification in Logistic Regression

 Given a series of input/output pairs:
◦ (x(i), y(i))
 For each observation x(i):
◦ We represent x(i) by a feature vector [x1, x2, …, xn]
◦ We compute an output: a predicted class ŷ(i) ∈ {0, 1}
Features in logistic regression
• For feature xi, weight wi tells us how important xi is
• xi = "review contains ‘awesome’":  wi = +10
• xj = "review contains ‘abysmal’":  wj = -10
• xk = "review contains ‘mediocre’": wk = -2
Logistic Regression for one observation x

 Input observation: vector x = [x1, x2, …, xn]
 Weights: one per feature: W = [w1, w2, …, wn]
◦ Sometimes we call the weights θ = [θ1, θ2, …, θn]
 Output: a predicted class ŷ ∈ {0, 1}

 (multinomial logistic regression: ŷ ∈ {0, 1, 2, 3, 4})
How to do classification
For each feature xi, weight wi tells us the importance of xi
◦ (Plus we'll have a bias b)
We'll sum up all the weighted features and the bias:

  z = Σ_i wi xi + b = w∙x + b

If this sum is high, we say y = 1; if low, then y = 0
But we want a probabilistic classifier
We need to formalize “sum is high”.
We’d like a principled classifier that gives us a probability, just like Naive Bayes did.
We want a model that can tell us:
  p(y=1|x; θ)
  p(y=0|x; θ)

The problem: z isn’t a probability, it’s just a number!

Solution: use a function of z that goes from 0 to 1:
the very useful sigmoid or logistic function

  σ(z) = 1 / (1 + e^(−z))
Idea of logistic regression

We’ll compute w∙x+b


And then we’ll pass it through the
sigmoid function:
σ(w∙x+b)
And we'll just treat it as a probability
Making probabilities with sigmoids
Turning a probability into a classifier

0.5 here is called the decision boundary


• If you increase the decision boundary (e.g., 0.7), you become
more conservative in predicting the positive class, potentially
reducing false positives at the cost of potentially missing some
true positives.

• If you decrease the decision boundary (e.g., 0.3), you become


more liberal in predicting the positive class, potentially increasing
false positives but catching more true positives.
The probabilistic classifier

  P(y=1) = σ(w∙x + b)

(Figure: the sigmoid squashes w∙x + b into a probability between 0 and 1.)

Turning a probability into a classifier

  ŷ = 1  if w∙x + b > 0   (i.e., σ(w∙x + b) > 0.5)
  ŷ = 0  if w∙x + b ≤ 0
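A small sketch (illustrative, with made-up weights) of turning the probability into a hard decision, including the tunable threshold discussed above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify(x, w, b, threshold=0.5):
    """Return (probability of the positive class, predicted label)."""
    p = sigmoid(np.dot(w, x) + b)      # P(y=1|x)
    return p, int(p > threshold)       # raise the threshold to be more conservative

# Toy example with made-up weights.
w, b = np.array([2.0, -1.0]), 0.5
print(classify(np.array([1.0, 0.5]), w, b))        # default 0.5 decision threshold
print(classify(np.array([1.0, 0.5]), w, b, 0.7))   # more conservative positive predictions
```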
Logistic Regression: a text example on sentiment classification
Sentiment example: does y=1 or y=0?

It's hokey . There are virtually no surprises , and the writing is second-rate . So why was it so enjoyable ? For one thing , the cast is great . Another nice touch is the music . I was overcome with the urge to get off the couch and start dancing . It sucked me in , and it'll do the same to you .

Classifying sentiment for input x

Suppose w = … and b = 0.1
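A sketch of the computation for this kind of example. The feature definitions and weight values below are hypothetical stand-ins (the slide's actual numbers are not reproduced in this extract); only b = 0.1 is taken from the slide.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Hypothetical features extracted from the review above, e.g.:
# x1 = count of positive lexicon words, x2 = count of negative lexicon words,
# x3 = 1 if "no" appears, x4 = count of 1st/2nd person pronouns,
# x5 = 1 if "!" appears, x6 = log(word count).
x = np.array([3.0, 2.0, 1.0, 3.0, 0.0, 4.19])

# Hypothetical weights; b = 0.1 as in the slide.
w = np.array([2.5, -5.0, -1.2, 0.5, 2.0, 0.7])
b = 0.1

p_pos = sigmoid(np.dot(w, x) + b)   # P(y=1|x): probability of positive sentiment
print(p_pos, "-> y=1" if p_pos > 0.5 else "-> y=0")   # ≈ 0.70 -> y=1
```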
Overview

• Previous techniques have used real-valued (or discrete-valued) feature vectors and natural measures of distance (e.g., Euclidean).

• Consider a classification problem that involves nominal data – data described by a list of attributes (e.g., categorizing people as short or tall using gender, height, age, and ethnicity).

• How can we use such nominal data for classification? How can we learn the categories of such data? Nonmetric methods such as decision trees provide a way to deal with such data.


• Decision trees attempt to classify a pattern through a sequence of questions. For example,
attributes such as gender and height can be used to classify people as short or tall. But the best
threshold for height is gender dependent.

• A decision tree consists of nodes and leaves, with each leaf denoting a class.

• Classes (tall or short) are the outputs of the tree.

• Attributes (gender and height) are a set of features that describe the data.

• The input data consists of values of the different attributes. Using these attribute values, the
decision tree generates a class as the output for each input data.



Basic Principles
• The top, or first node, is called the root node.

• The last level of nodes are the leaf nodes and contain the final classification.

• The intermediate nodes are the descendant or “hidden” layers.

• Binary trees, like the one shown to the right, are the most popular type of tree. However, M-ary
trees (M branches at each node) are possible.



• Nodes can contain one or more questions. In a binary tree, by convention if the answer to a question is “yes”, the left branch is selected. Note that the same question can appear in multiple places in the tree.

• Decision trees have several benefits over neural network-type approaches, including interpretability and data-driven learning.

• Key questions include how to grow the tree, how to stop growing, and how to prune the tree to increase generalization.

• Decision trees are very powerful and can give excellent performance on closed-set testing. Generalization is a challenge.


Nonlinear Decision Surfaces
• Decision trees can produce nonlinear decision surfaces:

• They are an attractive alternative to other classifiers we have studied because


they are data-driven and can give arbitrarily high levels of precision on the
training data.
• But… generalization becomes a challenge.



Classification and Regression Trees (CART)
• Consider a set D of labeled training data and a set of properties
(or questions), T.

• How do we organize the tree to produce the lowest classification error?

• Any decision tree will successively split the data into smaller and smaller subsets. It would be ideal if all the samples associated with a leaf node were from the same class. Such a subset, or node, is considered pure in this case.

• A generic tree-growing methodology, known as CART, successively splits


nodes until they are pure. Six key questions:

1) Should the questions be binary (e.g., is gender male or female) or numeric


(e.g., is height >= 5’4”) or multi-valued (e.g., race)?

2) Which properties should be tested at each node?

3) When should a node be declared a leaf?

4) If the tree becomes too large, how can it be pruned?

5) If the leaf node is impure, what category should be assigned to it?

6) How should missing data be handled?



Operation



Entropy-Based Splitting Criterion
• We prefer trees that are simple and compact. Why? (Hint: Occam’s Razor.)

• Hence, we seek a property query, Ti, that splits the data at a node to increase the purity at that node. Let i(N) denote the impurity of a node N.

• To split data at a node, we need to find the question that results in the greatest entropy reduction (removes uncertainty in the data):

  i(N) = − Σ_j P(ω_j) log P(ω_j)

  Note this will peak when the two classes are equally likely (same size).
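A short sketch of the entropy impurity (illustrative, not from the slides), computed from the class counts at a node; base-2 logs are an assumption here:

```python
import numpy as np

def entropy_impurity(class_counts):
    """i(N) = -sum_j P(w_j) * log2 P(w_j), using the class frequencies at node N."""
    p = np.asarray(class_counts, dtype=float)
    p = p[p > 0] / p.sum()              # class probabilities; drop empty classes (0*log0 = 0)
    return -np.sum(p * np.log2(p))

print(entropy_impurity([10, 10]))   # 1.0 -> peak: the two classes are equally likely
print(entropy_impurity([20, 0]))    # 0.0 -> pure node
```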
Alternate Splitting Criteria
• Variance impurity:

  i(N) = P(ω_1) P(ω_2)

  because this is related to the variance of a distribution associated with the two classes.

• Gini impurity:

  i(N) = Σ_{i≠j} P(ω_i) P(ω_j) = ½ [ 1 − Σ_j P²(ω_j) ]

  The expected error rate at node N if the category label is selected randomly from the class distribution present at node N.

• Misclassification impurity:

  i(N) = 1 − max_j P(ω_j)

  measures the minimum probability that a training pattern would be misclassified at node N.

• In practice, simple entropy splitting (choosing the question that splits the data into two classes of equal size) is very effective.
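The alternate impurity measures, written out as a brief sketch (illustrative) in the same style as the entropy function above:

```python
import numpy as np

def gini_impurity(class_counts):
    """i(N) = 1/2 * [1 - sum_j P(w_j)^2]: expected error rate of random labeling at N."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    return 0.5 * (1.0 - np.sum(p ** 2))

def misclassification_impurity(class_counts):
    """i(N) = 1 - max_j P(w_j): minimum probability of misclassifying a pattern at N."""
    p = np.asarray(class_counts, dtype=float)
    return 1.0 - p.max() / p.sum()

print(gini_impurity([10, 10]), misclassification_impurity([10, 10]))   # 0.25, 0.5
print(gini_impurity([20, 0]), misclassification_impurity([20, 0]))     # 0.0, 0.0
```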


Choosing A Question
• An obvious heuristic is to choose the query that maximizes the decrease in impurity:

  Δi(N) = i(N) − P_L i(N_L) − (1 − P_L) i(N_R)

  where N_L and N_R are the left and right descendant nodes, i(N_L) and i(N_R) are their respective impurities, and P_L is the fraction of patterns at node N that will be assigned to N_L when query T_i is chosen.

• This approach is considered part of a class of algorithms known as “greedy.”

• Note this decision is “local” and does not guarantee an overall optimal tree.

• A multiway split can be optimized using the gain ratio impurity:

  Δi*_B(s) = max_B  Δi_B(s) / ( − Σ_{k=1}^{B} P_k log P_k )

  where P_k is the fraction of training patterns sent to node N_k, B is the number of splits, and

  Δi_B(s) = i(N) − Σ_{k=1}^{B} P_k i(N_k)
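A brief sketch (illustrative) of scoring a candidate binary split by its impurity decrease, using the entropy impurity as the node impurity:

```python
import numpy as np

def entropy_impurity(class_counts):
    p = np.asarray(class_counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

def impurity_decrease(counts_parent, counts_left, counts_right, impurity):
    """Delta i(N) = i(N) - P_L * i(N_L) - (1 - P_L) * i(N_R)."""
    n_left, n_right = sum(counts_left), sum(counts_right)
    p_left = n_left / (n_left + n_right)    # fraction of patterns sent to the left child
    return (impurity(counts_parent)
            - p_left * impurity(counts_left)
            - (1.0 - p_left) * impurity(counts_right))

# A candidate question that separates the classes fairly well gives a large decrease.
print(impurity_decrease([10, 10], [9, 1], [1, 9], entropy_impurity))   # ≈ 0.53
```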


When To Stop Splitting
• If we continue to grow the tree until each leaf node has the lowest impurity,
then the data will be overfit.
• Two strategies: (1) stop tree from growing or (2) grow and then prune the tree.
• A traditional approach to stopping splitting relies on cross-validation:
 Validation: train a tree on 90% of the data and test on 10% of the data
(referred to as the held-out set).
 Cross-validation: repeat for several independently chosen partitions.
 Stopping Criterion: Continue splitting until the error on the held-out data is
minimized.
• Reduction In Impurity: stop if the candidate split leads to a marginal reduction
of the impurity (drawback: leads to an unbalanced tree).

• Cost-Complexity: use a global criterion function that combines size and impurity:

  α · size + Σ_{leaf nodes} i(N)

  This approach is related to minimum description length when the impurity is based on entropy.

• Other approaches based on statistical significance and hypothesis testing attempt to assess the quality of the proposed split.
Pruning
• The most fundamental problem with decision trees is that they "overfit" the data and hence do not provide good generalization. A solution to this problem is to prune the tree.

• But pruning the tree will always increase the error rate on the training set.

• Cost-complexity pruning: using the criterion α · size + Σ_{leaf nodes} i(N), each node in the tree can be classified in terms of its impact on the cost-complexity if it were pruned. Nodes are successively pruned until certain heuristics are satisfied.

• By pruning the nodes that are far too specific to the training set, it is hoped the tree will have better generalization. In practice, we use techniques such as cross-validation and held-out training data to better calibrate the generalization properties.
ID3 and C4.5
• The Third Interactive Dichotomizer (ID3) uses nominal inputs and allows a node-specific number of branches, Bj. Growing continues until all nodes are pure.
• C4.5, the successor to ID3, is one of the most popular decision tree methods:
   Handles real-valued variables;
   Allows multiway splits for nominal data;
   Splitting based on maximization of the information gain ratio while preserving better than average information gain;
   Stopping based on node purity;
   Pruning based on confidence/average node error rate (pessimistic pruning).
• Bayesian methods and other common modeling techniques have been successfully applied to decision trees.
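To tie the CART ideas together, here is a minimal end-to-end sketch using scikit-learn (an assumption: it is not the library used in these slides). It grows a tree with the entropy splitting criterion and applies cost-complexity pruning via ccp_alpha, illustrating the training-accuracy vs. generalization trade-off discussed above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Entropy-based splitting; ccp_alpha > 0 enables cost-complexity pruning
# (alpha * size + total leaf impurity), trading training accuracy for generalization.
tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.01, random_state=0)
tree.fit(X_train, y_train)

print("training accuracy:", tree.score(X_train, y_train))
print("held-out accuracy:", tree.score(X_test, y_test))
```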


