UNIT-II (ML-I)
Classifiers
Linear Regression: Pictorially
Linear regression is like fitting a line or (hyper)plane to a set of points.

What if a line/plane doesn't model the input-output relationship very well, e.g., if their relationship is better modeled by a nonlinear curve or curved surface? Do linear models become useless in such cases?

No. We can even fit a curve using a linear model, after suitably transforming the inputs, e.g., [z_1, z_2] = φ(x).

[Figure: left, the original (single) feature, where a nonlinear curve is needed to fit the output; right, two transformed features, where a plane (linear model) fits.]

The transformation can be predefined or learned (e.g., using kernel methods or a deep neural network based feature extractor). More on this later.
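As a concrete illustration of the idea above, here is a minimal NumPy sketch (not from the slides) that fits a curve with a purely linear model by first applying a hypothetical quadratic feature map φ(x) = [x, x²]; the data and the feature map are made up for illustration.

```python
import numpy as np

# Illustrative data: a noisy quadratic relationship between a single input and the output.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
y = 0.5 * x**2 - x + rng.normal(scale=0.3, size=x.shape)

# Hypothetical feature map phi(x) = [x, x^2]: the model is still linear in the
# transformed features [z1, z2], even though the fit is a curve in the original x.
Phi = np.column_stack([np.ones_like(x), x, x**2])   # intercept + transformed features

# Ordinary least squares on the transformed inputs.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ w    # fitted values lie on a parabola in the original input space
```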
Slide credit: Carla P. Gomes, CS4700
An Artificial Neuron
Node or Unit: A Mathematical Abstraction

An artificial neuron (node, unit, or processing unit i) consists of:

• Input edges, each with a weight (positive or negative; the weights change over time through learning).
• An input function in_i: the weighted sum of the unit's inputs, including the fixed input a_0:
  in_i = Σ_{j=0}^{n} W_{j,i} a_j
• An activation function g (typically non-linear), applied to the input function:
  a_i = g(in_i) = g( Σ_{j=0}^{n} W_{j,i} a_j )
• Output edges, each with a weight, which likewise change over time through learning.
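The unit described above can be sketched in a few lines of Python. This is a minimal illustration, not from the slides; the weight values, the example inputs, and the choice of tanh as the activation g are assumptions.

```python
import numpy as np

def unit_output(a, W_i, g=np.tanh):
    """Output of unit i: activation g applied to the weighted sum of its
    inputs, including the fixed input a_0 = 1 (so W_i[0] acts as a bias)."""
    a = np.concatenate(([1.0], a))     # prepend the fixed input a_0
    in_i = W_i @ a                     # in_i = sum_{j=0..n} W_{j,i} * a_j
    return g(in_i)                     # a_i = g(in_i)

# Example with three inputs; the weights and the choice of tanh are illustrative.
W_i = np.array([0.3, 0.5, -1.0, 2.0])  # W_{0,i} .. W_{3,i}
print(unit_output(np.array([1.0, 0.2, -0.5]), W_i))
```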
Training NNs
• The loss functions for most DL tasks are defined over very high-dimensional
spaces
– E.g., ResNet50 NN has about 23 million parameters
– This makes the loss function impossible to visualize
• We can still gain intuitions by studying 1-dimensional and 2-dimensional
examples of loss functions
[Figure: a 1D loss, where the minimum point is obvious, and a 2D loss surface (blue = low loss, red = high loss).]
Picture from: https://cs231n.github.io/optimization-1/
Gradient Descent

1. Randomly pick a starting point θ^0
2. Compute the gradient of the loss at θ^0:
   ∇ℒ(θ^0) = [ ∂ℒ(θ^0)/∂w_1 , ∂ℒ(θ^0)/∂w_2 ]
3. Multiply it by the learning rate α and update the parameters:
   θ^1 = θ^0 − α ∇ℒ(θ^0)
4. Go to step 2 and repeat

[Figure: contours of the loss over (w_1, w_2), showing the starting point θ^0, the first update step −α∇ℒ(θ^0), and the minimum θ*.]

Slide credit: Hung-yi Lee – Deep Learning Tutorial
• Example (contd.)
Eventually, we would reach a minimum:
1. Randomly pick a starting point θ^0
2. Compute the gradient at the current point
3. Multiply it by the learning rate α and update: θ^2 = θ^1 − α ∇ℒ(θ^1), θ^3 = θ^2 − α ∇ℒ(θ^2), …
4. Go to step 2 and repeat

[Figure: the sequence of updates θ^0, θ^1, θ^2, … moving downhill across the loss contours over (w_1, w_2).]
• Gradient descent algorithm stops when a local minimum of the loss surface is reached
– GD does not guarantee reaching a global minimum
– However, empirical evidence suggests that GD works well for NNs
Picture from: https://blog.paperspace.com/intro-to-optimization-in-deep-learning-gradient-descent/
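A minimal sketch of the four steps above, using an illustrative two-parameter quadratic loss; the loss function, learning rate, and stopping tolerance are assumptions, not from the slides.

```python
import numpy as np

def grad_L(theta):
    """Gradient of an illustrative two-parameter loss
    L(theta) = (w1 - 3)^2 + 2 * (w2 + 1)^2."""
    w1, w2 = theta
    return np.array([2.0 * (w1 - 3.0), 4.0 * (w2 + 1.0)])

alpha = 0.1                                # learning rate (assumed value)
theta = np.random.randn(2)                 # 1. randomly pick a starting point theta^0
for step in range(1000):
    g = grad_L(theta)                      # 2. compute the gradient at the current theta
    theta = theta - alpha * g              # 3. update: theta <- theta - alpha * grad
    if np.linalg.norm(g) < 1e-6:           # 4. repeat until the gradient (nearly) vanishes
        break

print(theta)   # approaches the minimum at (3, -1)
```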
Backpropagation
• Besides the local minima problem, the GD algorithm can be very slow at plateaus, and it can get stuck at saddle points

[Figure: cost as a function of θ, showing a plateau where ∇ℒ(θ) ≈ 0, a saddle point where ∇ℒ(θ) = 0, and a local minimum where ∇ℒ(θ) = 0.]

Slide credit: Hung-yi Lee – Deep Learning Tutorial
• Gradient descent with momentum uses the momentum of the gradient for parameter optimization

[Figure: cost as a function of θ; the real movement at each step is the negative of the gradient plus the momentum term, so the update can carry past a point where the gradient = 0.]

Slide credit: Hung-yi Lee – Deep Learning Tutorial
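A sketch of the momentum update on the same illustrative loss as in the earlier gradient-descent example; the learning rate α and momentum coefficient β values are assumed for illustration.

```python
import numpy as np

def grad_L(theta):
    # Same illustrative quadratic loss as in the earlier sketch.
    w1, w2 = theta
    return np.array([2.0 * (w1 - 3.0), 4.0 * (w2 + 1.0)])

alpha, beta = 0.05, 0.9        # learning rate and momentum coefficient (assumed values)
theta = np.random.randn(2)
velocity = np.zeros(2)

for step in range(500):
    g = grad_L(theta)
    # Movement = negative of gradient + momentum (an exponentially decaying
    # average of past steps), so the update keeps moving on plateaus and
    # can roll past points where the current gradient is zero.
    velocity = beta * velocity - alpha * g
    theta = theta + velocity

print(theta)   # again converges toward (3, -1)
```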
Learning Rate
• Learning rate
– The gradient tells us the direction in which the loss has the steepest rate of increase,
but it does not tell us how far along the opposite direction we should step
– Choosing the learning rate (also called the step size) is one of the most important
hyper-parameter settings for NN training
[Figure: the effect of a learning rate that is too small vs. one that is too large.]
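To see the effect described above, here is a small sketch comparing a learning rate that is too small, one that works, and one that is too large on a 1-D quadratic loss; the loss and the learning-rate values are made up for illustration.

```python
def grad_L(theta):
    return 2.0 * (theta - 3.0)        # gradient of L(theta) = (theta - 3)^2

for lr in (0.001, 0.1, 1.1):          # too small / reasonable / too large (made-up values)
    theta = 10.0
    for _ in range(50):
        theta = theta - lr * grad_L(theta)
    print(f"lr={lr}: theta after 50 steps = {theta:.3f}")
# lr=0.001 barely moves toward the minimum at 3, lr=0.1 converges, lr=1.1 overshoots and diverges.
```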
Generative and Discriminative Classifiers
Generative Classifier:
• Build a model of what's in a cat image
• Knows about whiskers, ears, eyes
• Assigns a probability to any image: how cat-y is this image?

Naive Bayes is a generative classifier. Logistic regression, by contrast, is a discriminative classifier: it directly models the posterior P(c|d) of a class c given a document d.
Components of a probabilistic machine learning classifier

Given m input/output pairs (x^(i), y^(i)) for a task such as:
◦ Positive/negative sentiment
◦ Spam/not spam
◦ Authorship attribution (Hamilton or Madison?)
Text Classification: definition

Input:
◦ a document x
◦ a fixed set of classes C = {c1, c2, …, cJ}
Output:
◦ a predicted class ŷ ∈ C
Idea of logistic regression

• Compute a weighted sum of the input features plus a bias term: z = w·x + b
• Pass z through the sigmoid function and treat the result as a probability: P(y = 1) = σ(w·x + b)

Turning a probability into a classifier

• ŷ = 1 if w·x + b > 0 (i.e., P(y = 1) > 0.5)
• ŷ = 0 if w·x + b ≤ 0
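A minimal sketch of the probability and the decision rule above; the function names are illustrative, not from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify(x, w, b):
    """Logistic regression: P(y=1|x) = sigmoid(w.x + b).
    Decide y_hat = 1 when w.x + b > 0 (equivalently P(y=1|x) > 0.5), else y_hat = 0."""
    z = np.dot(w, x) + b
    p_pos = sigmoid(z)
    y_hat = 1 if z > 0 else 0
    return y_hat, p_pos
```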
Classification in Logistic Regression

Logistic Regression: a text example on sentiment classification
Sentiment example: does y = 1 or y = 0?

"It's hokey. There are virtually no surprises, and the writing is second-rate. So why was it so enjoyable? For one thing, the cast is great. Another nice touch is the music. I was overcome with the urge to get off the couch and start dancing. It sucked me in, and it'll do the same to you."
Classifying sentiment for input x

Suppose w = […] and b = 0.1. We then compute P(y = 1 | x) = σ(w·x + b) for the review's feature vector x and classify the review as positive if this probability is above 0.5.
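As a worked illustration of this computation, the sketch below plugs a hypothetical six-dimensional feature vector for the review above, together with illustrative weights and the bias b = 0.1, into P(y = 1 | x) = σ(w·x + b). The specific feature definitions and weight values are assumptions, not reproduced from the slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical feature vector for the review above, e.g.
# x1 = count of positive-lexicon words, x2 = count of negative-lexicon words,
# x3 = 1 if "no" occurs, x4 = count of 1st/2nd-person pronouns,
# x5 = 1 if "!" occurs, x6 = log(word count).
x = np.array([3.0, 2.0, 1.0, 3.0, 0.0, 4.19])
w = np.array([2.5, -5.0, -1.2, 0.5, 2.0, 0.7])   # illustrative weights
b = 0.1

p_pos = sigmoid(np.dot(w, x) + b)    # P(y = 1 | x): probability the review is positive
print(p_pos, 1.0 - p_pos)            # roughly 0.70 vs. 0.30 with these made-up values
```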
Overview
• Previous techniques have used real-valued (or discrete-valued) feature vectors together with natural measures of distance between them (e.g., Euclidean).
• Consider a classification problem that involves nominal data – data described by a list of attributes
(e.g., categorizing people as short or tall using gender, height, age, and ethnicity).
• How can we use such nominal data for classification? How can we learn the categories of such
data? Nonmetric methods such as decision trees provide a way to deal with such data (see the sketch after this list).
• A decision tree consists of nodes and leaves, with each leaf denoting a class.
• Attributes (gender and height) are a set of features that describe the data.
• The input data consists of values of the different attributes. Using these attribute values, the
decision tree generates a class as the output for each input data.
• The last level of nodes consists of the leaf nodes, which contain the final classification.
• Binary trees are the most popular type of decision tree, although M-ary trees (M branches at each node) are also possible.
• Decision trees have several benefits over neural network-type approaches, including
interpretability and data-driven learning.
• Key questions include how to grow the tree, how to stop growing, and how to prune the tree to
increase generalization.
• Decision trees are very powerful and can give excellent performance on closed-set testing.
Generalization is a challenge.
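As a small illustration of using nominal attributes with a decision tree, the sketch below one-hot encodes made-up categorical data and grows a binary tree with scikit-learn; scikit-learn, the toy data, and the attribute names are assumptions, not part of these notes.

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy nominal data (made up): classify people as short or tall from categorical attributes.
X = pd.DataFrame({
    "gender":    ["F", "M", "F", "M", "M", "F"],
    "age_group": ["adult", "adult", "child", "child", "adult", "adult"],
})
y = ["short", "tall", "short", "short", "tall", "tall"]

# One-hot encode the nominal attributes, then grow a binary decision tree
# using the entropy splitting criterion.
clf = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0),
)
clf.fit(X, y)
print(clf.predict(pd.DataFrame({"gender": ["M"], "age_group": ["adult"]})))
```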
• Any decision tree will successively split the data into smaller and smaller
subsets. It would be ideal if all the samples associated with a leaf node were
from the same class. Such a subset, or node, is considered pure in this case.
• Hence, we seek a property query, Ti, that splits the data at a node to increase
the purity at that node. Let i(N) denote the impurity of a node N.
• To split data at a node, we need to find the question that results in the greatest
entropy reduction (removes uncertainty in the data):
i(N) = −Σ_j P(ω_j) log P(ω_j)

Note that this entropy impurity peaks when the two classes are equally likely (same size).
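A minimal sketch of the entropy impurity above; the function name and example class fractions are illustrative, and base-2 logarithms are assumed (the slide just writes log).

```python
import numpy as np

def entropy_impurity(class_probs):
    """i(N) = -sum_j P(w_j) * log2(P(w_j)) over the class fractions at node N."""
    p = np.asarray(class_probs, dtype=float)
    p = p[p > 0]                          # treat 0 * log 0 as 0
    return -np.sum(p * np.log2(p))

print(entropy_impurity([0.5, 0.5]))       # 1.0  -- peaks when the classes are equally likely
print(entropy_impurity([0.9, 0.1]))       # ~0.47
print(entropy_impurity([1.0, 0.0]))       # 0.0  -- a pure node
```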
Alternate Splitting Criteria
• Variance impurity:
  i(N) = P(ω_1) P(ω_2)
  so called because this is related to the variance of a distribution associated with the two classes.
• Gini impurity:
  i(N) = Σ_{i≠j} P(ω_i) P(ω_j) = ½ [1 − Σ_j P²(ω_j)]
  This is the expected error rate at node N if the category label is selected randomly
  from the class distribution present at node N.
• Misclassification impurity:
  i(N) = 1 − max_j P(ω_j)
• In practice, simple entropy splitting (choosing the question that splits the data
into two classes of equal size) is very effective.
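A small sketch of the Gini and misclassification impurities as defined above; the function names and example class fractions are illustrative.

```python
import numpy as np

def gini_impurity(class_probs):
    """i(N) = (1/2) * [1 - sum_j P(w_j)^2]: the expected error rate at node N
    if the label is drawn at random from the node's class distribution."""
    p = np.asarray(class_probs, dtype=float)
    return 0.5 * (1.0 - np.sum(p**2))

def misclassification_impurity(class_probs):
    """i(N) = 1 - max_j P(w_j)."""
    return 1.0 - float(np.max(np.asarray(class_probs, dtype=float)))

print(gini_impurity([0.5, 0.5]), misclassification_impurity([0.5, 0.5]))   # 0.25, 0.5
print(gini_impurity([0.9, 0.1]), misclassification_impurity([0.9, 0.1]))   # 0.09, ~0.1
```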
• The goodness of a split s is measured by the resulting drop in impurity:
  Δi(s) = i(N) − Σ_{k=1}^{B} P_k i(N_k)
  where P_k is the fraction of training patterns sent to node N_k, and B is the
  number of splits.
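A minimal sketch of the impurity drop Δi(s) for a candidate split, here evaluated with the entropy impurity; the example split fractions and class distributions are made up.

```python
import numpy as np

def entropy_impurity(class_probs):
    p = np.asarray(class_probs, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def impurity_drop(parent_probs, child_probs_list, child_fractions, impurity=entropy_impurity):
    """Delta i(s) = i(N) - sum_{k=1..B} P_k * i(N_k), where P_k is the fraction
    of the parent node's training patterns sent to child node N_k."""
    drop = impurity(parent_probs)
    for P_k, child_probs in zip(child_fractions, child_probs_list):
        drop -= P_k * impurity(child_probs)
    return drop

# Example: a binary split (B = 2) sending 60% of the patterns to the left child.
print(impurity_drop([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [0.6, 0.4]))   # ~0.53
```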
• But pruning the tree will always increase the error rate on the training set (even though it may improve generalization).