
Machine Learning

Classifiers

UNIT-II
Linear Regression: Pictorially

 Linear regression is like fitting a line or (hyper)plane to a set of points.

 What if a line/plane doesn’t model the input-output relationship very well, e.g., if their relationship is better modeled by a nonlinear curve or curved surface? Do linear models become useless in such cases? No. We can even fit a curve using a linear model after suitably transforming the inputs, e.g., mapping x to features [z1, z2] = 𝜙(x).

 (Figure: with the original single feature, output vs. input needs a nonlinear curve; with two transformed features, output vs. (Feature 1, Feature 2) can be fit by a plane, i.e., a linear model.)

 The transformation 𝜙 can be predefined or learned (e.g., using kernel methods or a deep neural network based feature extractor). More on this later.

 The line/plane must also predict outputs well for the unseen (test) inputs.
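To make this concrete, here is a minimal sketch (not from the slides; the data and names are illustrative assumptions) of fitting a nonlinear relationship with a linear model after a predefined polynomial feature transform 𝜙(x) = [1, x, x²]:

```python
import numpy as np

# Toy 1D data whose input-output relationship is nonlinear (quadratic plus noise).
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(scale=0.5, size=x.shape)

# Predefined transformation phi(x) = [1, x, x^2]: the model stays linear in the weights.
Z = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares on the transformed features fits the curve with a "linear" model.
w, *_ = np.linalg.lstsq(Z, y, rcond=None)
y_hat = Z @ w
print("learned weights:", w)
```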


Simplest Possible Linear Regression Model

 This is the base model for all statistical machine learning.
 x is a single-feature data variable.
 y is the value we are trying to predict.
 The regression model is

   y = w0 + w1 x + ε

 Two parameters to estimate: the slope of the line w1 and the y-intercept w0.
 ε is the unexplained, random, or error component.
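A minimal sketch (illustrative, not from the slides) of estimating the two parameters by ordinary least squares on synthetic data; the true values used to generate the data are assumptions for the example:

```python
import numpy as np

# Synthetic data generated from y = w0 + w1*x + eps.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
eps = rng.normal(scale=1.0, size=100)
y = 2.0 + 3.0 * x + eps

# Closed-form least-squares estimates of the slope and intercept.
w1_hat = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # slope = cov(x, y) / var(x)
w0_hat = y.mean() - w1_hat * x.mean()                # intercept from the means
print(f"w0 ≈ {w0_hat:.2f}, w1 ≈ {w1_hat:.2f}")
```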
Perceptron

Cornell Aeronautical Laboratory


Perceptron
– Invented by Frank Rosenblatt in 1957 in an attempt to understand human memory, learning, and cognitive processes.
– The first computational neural network model, with a remarkable learning algorithm:
  • If a function can be represented by a perceptron, the learning algorithm is guaranteed to quickly converge to that hidden function!
– Became the foundation of pattern recognition research.

Rosenblatt & the Mark I Perceptron: the first machine that could "learn" to recognize and identify optical patterns. One of the earliest and most influential neural networks: an important milestone in AI.
Single Layer Feed-forward Neural Networks: Perceptrons

 Single-layer neural network (perceptron network): a network with all the inputs connected directly to the outputs.
 – Output units all operate separately: no shared weights.
 Since each output unit is independent of the others, we can limit our study to single-output perceptrons.
An Artificial Neuron
Node or Unit: A Mathematical Abstraction
(Artificial neuron, node or unit, processing unit i)

 Input edges, each with a weight (positive or negative; the weights can change over time through learning), including a fixed input a0.

 Input function (in_i): the weighted sum of the unit's inputs,

   in_i = Σ_{j=0}^{n} W_{j,i} a_j

 Activation function (g), typically non-linear, applied to the input function:

   a_i = g( Σ_{j=0}^{n} W_{j,i} a_j )

 Output edges, each with a weight (positive or negative; changing over time through learning).

 In short: a processing element producing an output based on a function of its inputs.

 Note: the fixed input and bias weight are conventional; some authors instead use, e.g., a0 = 1 and -W_{0,i}.
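A minimal sketch (illustrative, not from the slides) of one such unit: a weighted sum over the inputs, including the fixed input a0 for the bias weight, passed through an activation function g. The function name and the convention a0 = 1 are assumptions for the example.

```python
import numpy as np

def neuron_output(weights, inputs, g):
    """One artificial neuron: in_i = sum_j W_ji * a_j, then a_i = g(in_i).

    weights[0] is the bias weight W_0i, paired with the fixed input a_0.
    """
    a = np.concatenate(([1.0], inputs))   # prepend the fixed input a_0 = 1 (one common convention)
    in_i = np.dot(weights, a)             # input function: weighted sum of the inputs
    return g(in_i)                        # activation function applied to the input function

# Example with a hard-threshold activation.
step = lambda z: 1.0 if z > 0 else 0.0
print(neuron_output(np.array([-0.5, 1.0, 1.0]), np.array([1.0, 0.0]), step))  # -> 1.0
```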
Activation Functions

 (a) Threshold activation function: a step function or threshold function (outputs 1 when the input is positive; 0 otherwise).
 (b) Sigmoid (or logistic function) activation function (key advantage: differentiable).
 (c) Sign function: +1 if the input is positive, otherwise -1.

 These functions have a threshold (either hard or soft) at zero.
 Changing the bias weight W_{0,i} moves the threshold location.
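The three activation functions above, written out as a small sketch (not from the slides):

```python
import numpy as np

def threshold(z):
    """(a) Step/threshold function: 1 if the input is positive, else 0."""
    return np.where(z > 0, 1.0, 0.0)

def sigmoid(z):
    """(b) Sigmoid/logistic function: smooth, differentiable, soft threshold at 0."""
    return 1.0 / (1.0 + np.exp(-z))

def sign(z):
    """(c) Sign function: +1 if the input is positive, otherwise -1."""
    return np.where(z > 0, 1.0, -1.0)

z = np.array([-2.0, 0.5, 3.0])
print(threshold(z), sigmoid(z), sign(z))
```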
Training NNs

• Optimizing the loss function
  – Almost all DL models these days are trained with a variant of the gradient descent (GD) algorithm.
  – GD applies iterative refinement of the network parameters.
  – GD uses the opposite direction of the gradient of the loss ℒ(θ) with respect to the NN parameters (i.e., −∂ℒ/∂θ_i) for updating θ_i.
 The gradient of the loss function gives the direction of fastest increase of the loss function when the parameters are changed.
Training NNs

• The loss functions for most DL tasks are defined over very high-dimensional spaces.
  – E.g., the ResNet50 NN has about 23 million parameters.
  – This makes the loss function impossible to visualize.
• We can still gain intuitions by studying 1-dimensional and 2-dimensional examples of loss functions.

(Figure: a 1D loss, where the minimum point is obvious, and a 2D loss surface, blue = low loss, red = high loss. Picture from: https://cs231n.github.io/optimization-1/)
Gradient Descent Algorithm

• Steps in the gradient descent algorithm:
  1. Randomly initialize the model parameters θ.
  2. Compute the gradient of the loss function at θ: ∇ℒ(θ).
  3. Update the parameters as θ ← θ − α∇ℒ(θ), where α is the learning rate.
  4. Go to step 2 and repeat (until a terminating criterion is reached).
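A minimal sketch of these four steps (illustrative, not from the slides), using a simple quadratic loss whose gradient is known analytically; the function names are assumptions for the example:

```python
import numpy as np

def gradient_descent(grad_fn, theta0, alpha=0.1, n_steps=100, tol=1e-6):
    """Plain (vanilla) gradient descent: theta <- theta - alpha * grad(theta)."""
    theta = np.asarray(theta0, dtype=float)   # step 1: (random) initialization supplied by caller
    for _ in range(n_steps):
        g = grad_fn(theta)                    # step 2: gradient of the loss at theta
        theta = theta - alpha * g             # step 3: move against the gradient
        if np.linalg.norm(g) < tol:           # step 4: stop when a terminating criterion is met
            break
    return theta

# Example loss L(theta) = (theta1 - 3)^2 + (theta2 + 1)^2, gradient derived by hand.
grad = lambda t: np.array([2 * (t[0] - 3), 2 * (t[1] + 1)])
print(gradient_descent(grad, theta0=np.random.randn(2)))   # approaches [3, -1]
```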
Gradient Descent Algorithm

• Example: a NN with only 2 parameters w1 and w2, i.e., θ = {w1, w2}.
  – In the figure, different colors are the values of the loss over the (w1, w2) plane (minimum loss is ≈ 1.3).

  1. Randomly pick a starting point θ⁰.
  2. Compute the gradient at θ⁰:
       ∇ℒ(θ⁰) = [ ∂ℒ(θ⁰)/∂w1 , ∂ℒ(θ⁰)/∂w2 ]
  3. Multiply it by the learning rate α and update: θ¹ = θ⁰ − α∇ℒ(θ⁰).
  4. Go to step 2, repeat.

(Slide credit: Hung-yi Lee – Deep Learning Tutorial)
Gradient Descent Algorithm

• Example (contd.): eventually, we would reach a minimum.

  1. Randomly pick a starting point θ⁰.
  2. Compute the gradient at the current point θⁱ: ∇ℒ(θⁱ).
  3. Multiply it by the learning rate α and update, e.g., θ² = θ¹ − α∇ℒ(θ¹), θ³ = θ² − α∇ℒ(θ²).
  4. Go to step 2, repeat.

(Slide credit: Hung-yi Lee – Deep Learning Tutorial)


Gradient Descent Algorithm

• The gradient descent algorithm stops when a local minimum of the loss surface is reached.
  – GD does not guarantee reaching a global minimum.
  – However, empirical evidence suggests that GD works well for NNs.

(Picture from: https://blog.paperspace.com/intro-to-optimization-in-deep-learning-gradient-descent/)
Gradient Descent Algorithm

• For most tasks, the loss surface ℒ is highly complex (and non-convex).
• Random initialization in NNs results in different initial parameters.
  – Gradient descent may reach different minima at every run.
  – Therefore, the NN will produce different predicted outputs.
• Currently, we don’t have an algorithm that guarantees reaching a global minimum for an arbitrary loss function.

(Slide credit: Hung-yi Lee – Deep Learning Tutorial)
Backpropagation

• How to calculate the gradients of the loss function in NNs?
• There are two ways:
  1. Numerical gradient: slow, approximate, but easy way.
  2. Analytic gradient: requires calculus; fast, but a more error-prone way.
• In practice the analytic gradient is used.
  – Analytical differentiation for gradient computation is available in almost all deep learning libraries.
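A minimal sketch (illustrative, not from the slides) contrasting the two approaches: a finite-difference numerical gradient used to check a hand-derived analytic gradient. The loss and function names are assumptions for the example.

```python
import numpy as np

def numerical_gradient(loss_fn, theta, eps=1e-5):
    """Slow but easy: central finite differences, one pair of loss evaluations per parameter."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (loss_fn(theta + e) - loss_fn(theta - e)) / (2 * eps)
    return grad

# Example loss and its analytic gradient (derived by calculus).
loss = lambda t: np.sum(t ** 2) + 3 * t[0]
analytic_grad = lambda t: 2 * t + np.array([3.0, 0.0])

theta = np.array([1.5, -2.0])
print(numerical_gradient(loss, theta))  # ≈ [6.0, -4.0]
print(analytic_grad(theta))             # exactly [6.0, -4.0]
```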
Mini-batch Gradient Descent

• It is wasteful to compute the loss over the entire training set to perform a single parameter update for large datasets.
  – E.g., ImageNet has 14M images.
  – GD (a.k.a. vanilla GD) is replaced with mini-batch GD.
• Mini-batch gradient descent
  – Approach:
     Compute the loss on a batch of images, update the parameters θ, and repeat until all images are used.
     At the next epoch, shuffle the training data, and repeat the above process.
  – Mini-batch GD results in much faster training.
  – Typical batch size: 32 to 256 images.
  – It works because the examples in the training data are correlated.
     I.e., the gradient from a mini-batch is a good approximation of the gradient of the entire training set.
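A minimal sketch of the mini-batch loop (illustrative, not from the slides); grad_fn(theta, X_batch, y_batch) is an assumed, user-supplied gradient function:

```python
import numpy as np

def minibatch_gd(grad_fn, theta, X, y, alpha=0.01, batch_size=32, n_epochs=10):
    """Mini-batch gradient descent: shuffle each epoch, update once per batch."""
    n = X.shape[0]
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        order = rng.permutation(n)                 # shuffle the training data each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # next mini-batch of examples
            g = grad_fn(theta, X[idx], y[idx])     # gradient estimated on the batch only
            theta = theta - alpha * g              # one parameter update per batch
    return theta
```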
Stochastic Gradient Descent

• Stochastic gradient descent
  – SGD uses mini-batches that consist of a single input example.
     E.g., a one-image mini-batch.
  – Although this method is very fast, it may cause significant fluctuations in the loss function.
     Therefore, it is less commonly used, and mini-batch GD is preferred.
  – In most DL libraries, SGD typically means mini-batch SGD (with an option to add momentum).
Problems with Gradient Descent

• Besides the local minima problem, the GD algorithm can be very slow at plateaus, and it can get stuck at saddle points.

(Figure: a 1D cost curve over θ with three trouble spots: very slow progress at a plateau, where ∇ℒ(θ) ≈ 0; stuck at a saddle point, where ∇ℒ(θ) = 0; and stuck at a local minimum, where ∇ℒ(θ) = 0.)

(Slide credit: Hung-yi Lee – Deep Learning Tutorial)
Gradient Descent with Momentum

• Gradient descent with momentum uses the momentum of the gradient for parameter optimization.

(Figure: on the cost curve, the real movement is the negative of the gradient plus the momentum; the accumulated momentum can carry the parameters past points where the gradient = 0.)

(Slide credit: Hung-yi Lee – Deep Learning Tutorial)
Gradient Descent with Momentum

• Parameter update in GD with momentum: θ^(t+1) = θ^t − v^t
   Where: v^t = β v^(t−1) + α ∇ℒ(θ^t)
• Compare to vanilla GD: θ^(t+1) = θ^t − α ∇ℒ(θ^t)
• The term v^t is called momentum.
  – This term accumulates the gradients from the past several steps.
  – It is similar to the momentum of a heavy ball rolling down the hill.
• The parameter β is referred to as the coefficient of momentum.
  – A typical value of the parameter β is 0.9.
• This method updates the parameters in the direction of the weighted average of the past gradients.
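A minimal sketch of the momentum update (illustrative, not from the slides), reusing the same hand-written quadratic gradient as in the earlier GD example:

```python
import numpy as np

def gd_momentum(grad_fn, theta0, alpha=0.1, beta=0.9, n_steps=100):
    """GD with momentum: v = beta*v + alpha*grad(theta); theta = theta - v."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)                   # accumulated (weighted) past gradients
    for _ in range(n_steps):
        v = beta * v + alpha * grad_fn(theta)  # momentum term; beta is typically 0.9
        theta = theta - v                      # update along the accumulated direction
    return theta

grad = lambda t: np.array([2 * (t[0] - 3), 2 * (t[1] + 1)])
print(gd_momentum(grad, theta0=np.zeros(2)))   # approaches [3, -1]
```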
Learning Rate

• Learning rate
  – The gradient tells us the direction in which the loss has the steepest rate of increase, but it does not tell us how far along the opposite direction we should step.
  – Choosing the learning rate (also called the step size) is one of the most important hyper-parameter settings for NN training.

(Figure: with a learning rate that is too small, the updates barely move toward the minimum; with one that is too large, the updates overshoot it.)
Learning Rate

• Training loss for different learning rates
  – High learning rate: the loss increases or plateaus too quickly.
  – Low learning rate: the loss decreases too slowly (it takes many epochs to reach a solution).

(Picture from: https://cs231n.github.io/neural-networks-3/)


Logistic Regression
Background: Generative and Discriminative Classifiers

Logistic Regression
• Important analytic tool in natural and social sciences
• Baseline supervised machine learning tool for classification
• Is also the foundation of neural networks
Generative and Discriminative Classifiers

Naive Bayes is a generative classifier: generative models aim to model the joint probability distribution of the input features and the class labels, i.e., they learn how the data is generated.

By contrast:

Logistic regression is a discriminative classifier: discriminative models focus on modeling the conditional probability of the class labels given the input features. Instead of modeling how the data is generated, they directly learn the decision boundary between different classes.
Generative and Discriminative
Classifiers
Suppose we're distinguishing cat from dog images

(Example cat and dog images from ImageNet.)
Generative Classifier:
• Build a model of what's in a cat image
• Knows about whiskers, ears, eyes
• Assigns a probability to any image:
• how cat-y is this image?

Also build a model for dog images

Now given a new image:


Run both models and see which one fits better
Discriminative Classifier
Just try to distinguish dogs from cats

Oh look, dogs have collars!


Let's ignore everything else
Generative vs Discriminative Classifiers
Finding the correct class c from a document d:

 Naive Bayes (generative): models how a document is generated, via the likelihood and the prior.
 Logistic Regression (discriminative): directly models the posterior P(c|d).
Components of a probabilistic machine learning classifier
Given m input/output pairs (x(i), y(i)):

1. A feature representation of the input. For each input observation x(i), a vector of features [x1, x2, ..., xn]. Feature j for input x(i) is xj, more completely xj(i), or sometimes fj(x).
2. A classification function that computes ŷ, the estimated class, via p(y|x), like the sigmoid or softmax functions.
3. An objective function for learning, like cross-entropy loss.
4. An algorithm for optimizing the objective function: stochastic gradient descent.
The two phases of logistic regression

 Training: we learn weights w and b using stochastic gradient descent and cross-entropy loss.

 Test: given a test example x, we compute p(y|x) using the learned weights w and b, and return whichever label (y = 1 or y = 0) has the higher probability.
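A minimal sketch (illustrative, not part of the slides) of both phases for binary logistic regression: stochastic gradient descent on the cross-entropy loss for training, then thresholding p(y=1|x) at test time. The function names and hyper-parameters are assumptions.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, alpha=0.1, n_epochs=100, seed=0):
    """SGD on cross-entropy: for each example, the gradient is (sigmoid(w.x+b) - y) * x."""
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):
            err = sigmoid(X[i] @ w + b) - y[i]   # derivative of cross-entropy w.r.t. z
            w -= alpha * err * X[i]
            b -= alpha * err
    return w, b

def predict(x, w, b):
    """Test phase: return the label with the higher probability."""
    return int(sigmoid(x @ w + b) > 0.5)
```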
Classification in Logistic Regression
Classification Reminder

Positive/negative sentiment

Spam/not spam
Authorship attribution
(Hamilton or Madison?)
Alexander Hamilton
Text Classification: definition
Input:
◦ a document x
◦ a fixed set of classes C = {c1, c2, …, cJ}

Output: a predicted class ŷ ∈ C
Binary Classification in Logistic Regression

 Given a series of input/output pairs:
◦ (x(i), y(i))
 For each observation x(i):
◦ We represent x(i) by a feature vector [x1, x2, …, xn]
◦ We compute an output: a predicted class ŷ(i) ∈ {0, 1}
Features in logistic regression
• For feature xi, weight wi tells us how important xi is
• xi = "review contains ‘awesome’":  wi = +10
• xj = "review contains ‘abysmal’":  wj = -10
• xk = "review contains ‘mediocre’": wk = -2
Logistic Regression for one observation x

 Input observation: vector x = [x1, x2, …, xn]
 Weights: one per feature: W = [w1, w2, …, wn]
◦ Sometimes we call the weights θ = [θ1, θ2, …, θn]
 Output: a predicted class ŷ ∈ {0, 1}

 (multinomial logistic regression: ŷ ∈ {0, 1, 2, 3, 4})
How to do classification
For each feature xi, weight wi tells us the importance of xi
◦ (Plus we'll have a bias b)
We'll sum up all the weighted features and the bias:

  z = Σ_i wi xi + b = w∙x + b

If this sum is high, we say y = 1; if low, then y = 0
But we want a probabilistic classifier
We need to formalize “sum is high”.
We’d like a principled classifier that gives us a probability, just like Naive Bayes did.
We want a model that can tell us:
  p(y=1|x; θ)
  p(y=0|x; θ)

The problem: z isn’t a probability, it’s just a number!

Solution: use a function of z that goes from 0 to 1:
the very useful sigmoid or logistic function

  σ(z) = 1 / (1 + e^(−z))
Idea of logistic regression

We’ll compute w∙x+b


And then we’ll pass it through the
sigmoid function:
σ(w∙x+b)
And we'll just treat it as a probability
Making probabilities with sigmoids
Turning a probability into a classifier

0.5 here is called the decision boundary


• If you increase the decision boundary (e.g., 0.7), you become
more conservative in predicting the positive class, potentially
reducing false positives at the cost of potentially missing some
true positives.

• If you decrease the decision boundary (e.g., 0.3), you become


more liberal in predicting the positive class, potentially increasing
false positives but catching more true positives.
The probabilistic classifier

  P(y=1) = σ(w∙x + b)

(Figure: the sigmoid squashes w∙x + b into a probability between 0 and 1.)

Turning a probability into a classifier

  ŷ = 1  if w∙x + b > 0   (i.e., σ(w∙x + b) > 0.5)
  ŷ = 0  if w∙x + b ≤ 0
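A small sketch (illustrative, with made-up weights) of turning the probability into a hard decision, including the tunable threshold discussed above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify(x, w, b, threshold=0.5):
    """Return (probability of the positive class, predicted label)."""
    p = sigmoid(np.dot(w, x) + b)      # P(y=1|x)
    return p, int(p > threshold)       # raise the threshold to be more conservative

# Toy example with made-up weights.
w, b = np.array([2.0, -1.0]), 0.5
print(classify(np.array([1.0, 0.5]), w, b))        # default 0.5 decision threshold
print(classify(np.array([1.0, 0.5]), w, b, 0.7))   # more conservative positive predictions
```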
Logistic Regression: a text example on sentiment classification
Sentiment example: does y=1 or y=0?

It's hokey . There are virtually no surprises , and the writing is second-rate . So why was it so enjoyable ? For one thing , the cast is great . Another nice touch is the music . I was overcome with the urge to get off the couch and start dancing . It sucked me in , and it'll do the same to you .

Classifying sentiment for input x

Suppose w = … and b = 0.1
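A sketch of the computation for this kind of example. The feature definitions and weight values below are hypothetical stand-ins (the slide's actual numbers are not reproduced in this extract); only b = 0.1 is taken from the slide.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Hypothetical features extracted from the review above, e.g.:
# x1 = count of positive lexicon words, x2 = count of negative lexicon words,
# x3 = 1 if "no" appears, x4 = count of 1st/2nd person pronouns,
# x5 = 1 if "!" appears, x6 = log(word count).
x = np.array([3.0, 2.0, 1.0, 3.0, 0.0, 4.19])

# Hypothetical weights; b = 0.1 as in the slide.
w = np.array([2.5, -5.0, -1.2, 0.5, 2.0, 0.7])
b = 0.1

p_pos = sigmoid(np.dot(w, x) + b)   # P(y=1|x): probability of positive sentiment
print(p_pos, "-> y=1" if p_pos > 0.5 else "-> y=0")   # ≈ 0.70 -> y=1
```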
Overview

• Previous techniques have used real-valued (or discrete-valued) feature vectors and natural measures of distance (e.g., Euclidean).

• Consider a classification problem that involves nominal data – data described by a list of attributes (e.g., categorizing people as short or tall using gender, height, age, and ethnicity).

• How can we use such nominal data for classification? How can we learn the categories of such data? Nonmetric methods such as decision trees provide a way to deal with such data.


• Decision trees attempt to classify a pattern through a sequence of questions. For example,
attributes such as gender and height can be used to classify people as short or tall. But the best
threshold for height is gender dependent.

• A decision tree consists of nodes and leaves, with each leaf denoting a class.

• Classes (tall or short) are the outputs of the tree.

• Attributes (gender and height) are a set of features that describe the data.

• The input data consists of values of the different attributes. Using these attribute values, the
decision tree generates a class as the output for each input data.



Basic Principles
• The top, or first node, is called the root node.

• The last level of nodes are the leaf nodes and contain the final classification.

• The intermediate nodes are the descendant or “hidden” layers.

• Binary trees, like the one shown to the right, are the most popular type of tree. However, M-ary
trees (M branches at each node) are possible.



• Nodes can contain one or more questions. In a binary tree, by convention if the answer to a question is “yes”, the left branch is selected. Note that the same question can appear in multiple places in the tree.

• Decision trees have several benefits over neural network-type approaches, including interpretability and data-driven learning.

• Key questions include how to grow the tree, how to stop growing, and how to prune the tree to increase generalization.

• Decision trees are very powerful and can give excellent performance on closed-set testing. Generalization is a challenge.


Nonlinear Decision Surfaces
• Decision trees can produce nonlinear decision surfaces:

• They are an attractive alternative to other classifiers we have studied because


they are data-driven and can give arbitrarily high levels of precision on the
training data.
• But… generalization becomes a challenge.



Classification and Regression Trees (CART)
• Consider a set D of labeled training data and a set of properties
(or questions), T.

• How do we organize the tree to produce the lowest classification error?

• Any decision tree will successively split the data into smaller and smaller subsets. It would be ideal if all the samples associated with a leaf node were from the same class. Such a subset, or node, is considered pure in this case.

• A generic tree-growing methodology, known as CART, successively splits


nodes until they are pure. Six key questions:

1) Should the questions be binary (e.g., is gender male or female) or numeric


(e.g., is height >= 5’4”) or multi-valued (e.g., race)?

2) Which properties should be tested at each node?

3) When should a node be declared a leaf?

4) If the tree becomes too large, how can it be pruned?

5) If the leaf node is impure, what category should be assigned to it?

6) How should missing data be handled?



Operation



Entropy-Based Splitting Criterion
• We prefer trees that are simple and compact. Why? (Hint: Occam’s Razor.)

• Hence, we seek a property query, Ti, that splits the data at a node to increase the purity at that node. Let i(N) denote the impurity of a node N.

• To split data at a node, we need to find the question that results in the greatest entropy reduction (removes uncertainty in the data):

  i(N) = − Σ_j P(ω_j) log P(ω_j)

  Note this will peak when the two classes are equally likely (same size).
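A short sketch of the entropy impurity (illustrative, not from the slides), computed from the class counts at a node; base-2 logs are an assumption here:

```python
import numpy as np

def entropy_impurity(class_counts):
    """i(N) = -sum_j P(w_j) * log2 P(w_j), using the class frequencies at node N."""
    p = np.asarray(class_counts, dtype=float)
    p = p[p > 0] / p.sum()              # class probabilities; drop empty classes (0*log0 = 0)
    return -np.sum(p * np.log2(p))

print(entropy_impurity([10, 10]))   # 1.0 -> peak: the two classes are equally likely
print(entropy_impurity([20, 0]))    # 0.0 -> pure node
```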
Alternate Splitting Criteria
• Variance impurity:

  i(N) = P(ω_1) P(ω_2)

  because this is related to the variance of a distribution associated with the two classes.

• Gini impurity:

  i(N) = Σ_{i≠j} P(ω_i) P(ω_j) = ½ [ 1 − Σ_j P²(ω_j) ]

  The expected error rate at node N if the category label is selected randomly from the class distribution present at node N.

• Misclassification impurity:

  i(N) = 1 − max_j P(ω_j)

  measures the minimum probability that a training pattern would be misclassified at node N.

• In practice, simple entropy splitting (choosing the question that splits the data into two classes of equal size) is very effective.
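The alternate impurity measures, written out as a brief sketch (illustrative) in the same style as the entropy function above:

```python
import numpy as np

def gini_impurity(class_counts):
    """i(N) = 1/2 * [1 - sum_j P(w_j)^2]: expected error rate of random labeling at N."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    return 0.5 * (1.0 - np.sum(p ** 2))

def misclassification_impurity(class_counts):
    """i(N) = 1 - max_j P(w_j): minimum probability of misclassifying a pattern at N."""
    p = np.asarray(class_counts, dtype=float)
    return 1.0 - p.max() / p.sum()

print(gini_impurity([10, 10]), misclassification_impurity([10, 10]))   # 0.25, 0.5
print(gini_impurity([20, 0]), misclassification_impurity([20, 0]))     # 0.0, 0.0
```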


Choosing A Question
• An obvious heuristic is to choose the query that maximizes the decrease in impurity:

  Δi(N) = i(N) − P_L i(N_L) − (1 − P_L) i(N_R)

  where N_L and N_R are the left and right descendant nodes, i(N_L) and i(N_R) are their respective impurities, and P_L is the fraction of patterns at node N that will be assigned to N_L when query T_i is chosen.

• This approach is considered part of a class of algorithms known as “greedy.”

• Note this decision is “local” and does not guarantee an overall optimal tree.

• A multiway split can be optimized using the gain ratio impurity:

  Δi*_B(s) = max_B  Δi_B(s) / ( − Σ_{k=1}^{B} P_k log P_k )

  where P_k is the fraction of training patterns sent to node N_k, B is the number of splits, and

  Δi_B(s) = i(N) − Σ_{k=1}^{B} P_k i(N_k)
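A brief sketch (illustrative) of scoring a candidate binary split by its impurity decrease, using the entropy impurity as the node impurity:

```python
import numpy as np

def entropy_impurity(class_counts):
    p = np.asarray(class_counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

def impurity_decrease(counts_parent, counts_left, counts_right, impurity):
    """Delta i(N) = i(N) - P_L * i(N_L) - (1 - P_L) * i(N_R)."""
    n_left, n_right = sum(counts_left), sum(counts_right)
    p_left = n_left / (n_left + n_right)    # fraction of patterns sent to the left child
    return (impurity(counts_parent)
            - p_left * impurity(counts_left)
            - (1.0 - p_left) * impurity(counts_right))

# A candidate question that separates the classes fairly well gives a large decrease.
print(impurity_decrease([10, 10], [9, 1], [1, 9], entropy_impurity))   # ≈ 0.53
```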


When To Stop Splitting
• If we continue to grow the tree until each leaf node has the lowest impurity,
then the data will be overfit.
• Two strategies: (1) stop tree from growing or (2) grow and then prune the tree.
• A traditional approach to stopping splitting relies on cross-validation:
 Validation: train a tree on 90% of the data and test on 10% of the data
(referred to as the held-out set).
 Cross-validation: repeat for several independently chosen partitions.
 Stopping Criterion: Continue splitting until the error on the held-out data is
minimized.
• Reduction In Impurity: stop if the candidate split leads to a marginal reduction
of the impurity (drawback: leads to an unbalanced tree).

• Cost-Complexity: use a global criterion function that combines size and impurity:

  α · size + Σ_{leaf nodes} i(N)

  This approach is related to minimum description length when the impurity is based on entropy.

• Other approaches based on statistical significance and hypothesis testing attempt to assess the quality of the proposed split.
Pruning
• The most fundamental problem with decision trees is that they "overfit" the data and hence do not provide good generalization. A solution to this problem is to prune the tree.

• But pruning the tree will always increase the error rate on the training set.

• Cost-complexity pruning: using the criterion α · size + Σ_{leaf nodes} i(N), each node in the tree can be classified in terms of its impact on the cost-complexity if it were pruned. Nodes are successively pruned until certain heuristics are satisfied.

• By pruning the nodes that are far too specific to the training set, it is hoped the tree will have better generalization. In practice, we use techniques such as cross-validation and held-out training data to better calibrate the generalization properties.
ID3 and C4.5
• The Third Interactive Dichotomizer (ID3) uses nominal inputs and allows a node-specific number of branches, Bj. Growing continues until all nodes are pure.
• C4.5, the successor to ID3, is one of the most popular decision tree methods:
   Handles real-valued variables;
   Allows multiway splits for nominal data;
   Splitting based on maximization of the information gain ratio while preserving better than average information gain;
   Stopping based on node purity;
   Pruning based on confidence/average node error rate (pessimistic pruning).
• Bayesian methods and other common modeling techniques have been successfully applied to decision trees.
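To tie the CART ideas together, here is a minimal end-to-end sketch using scikit-learn (an assumption: it is not the library used in these slides). It grows a tree with the entropy splitting criterion and applies cost-complexity pruning via ccp_alpha, illustrating the training-accuracy vs. generalization trade-off discussed above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Entropy-based splitting; ccp_alpha > 0 enables cost-complexity pruning
# (alpha * size + total leaf impurity), trading training accuracy for generalization.
tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.01, random_state=0)
tree.fit(X_train, y_train)

print("training accuracy:", tree.score(X_train, y_train))
print("held-out accuracy:", tree.score(X_test, y_test))
```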


