Linear Regression

CSCI417

Machine Intelligence
Lecture # 5

Fall 2023

1
Tentative Course Topics

1. Machine Learning Basics
2. Classifying with k-Nearest Neighbors
3. Splitting datasets one feature at a time: decision trees
4. Classifying with probability theory: naïve Bayes
5. Linear/Logistic regression
6. Support vector machines
7. Model Evaluation and Improvement: cross-validation, grid search, evaluation metrics, and scoring
8. Ensemble learning and improving classification with the AdaBoost meta-algorithm
9. Introduction to Neural Networks: building NNs for classification (binary/multiclass)
10. Convolutional Neural Networks (CNN)
11. Pretrained models (VGG, AlexNet, ...)
12. Machine learning pipelines and use cases
Agenda
• Regression Problem: Univariate Linear Regression
• Optimization Technique: Gradient Descent
• Regression Problem: Univariate Linear Regression Example
• Regression Problem: Multivariate Linear Regression
• Optimization Technique: Normal Equations
Regression Problem

Univariate Linear Regression


Linear Regression with One Variable

4
Linear Regression with One Variable
• Linear regression with one variable is also known as "univariate" linear regression.
• We want to predict a single output value y from a single input value x.

Example: Housing Prices (Portland, Oregon). [Figure: scatter plot of price y (in 1000s of dollars, 0-500) against size x (in feet², 0-3000); a house of size x = 1250 feet² maps to a price of about y = 230.]

Supervised Learning: we are given the "right answer" for each example in the data.
The Hypothesis Function
• Our hypothesis function has the general form

  hθ(x) = θ0 + θ1·x    (the hypothesis)

  – like the equation of a straight line.
  – we are trying to create a function hθ (the hypothesis) that maps our input data (the x's) to our output data (the y's).
  – θ0 and θ1 are the parameters/weights.

Notation:
x's = "input" variable / features
y's = "output" variable / "target" variable

(Slides adapted from Andrew Ng.)
Example
• Suppose we have the following set of training data:

  input x | output y
  0       | 4
  1       | 7
  2       | 7
  3       | 8

• Now we can make a random guess about hθ, for example θ0 = 2, θ1 = 2.
• The hypothesis function becomes hθ(x) = 2 + 2·x.
• So, for input x = 1 to our hypothesis, hθ(1) = 4. This is off by 3 (the actual output is 7).
• Note that we will be trying out various values of θ0, θ1 to find the values which provide the best possible "fit", i.e. the most representative "straight line" through the data points mapped on the x-y plane.
Cost Function
• We can measure the accuracy of our hypothesis function by using a cost function.
• It takes an average of all the squared differences between the results of the hypothesis on inputs x(i) and the actual outputs y(i):

  J(θ0, θ1) = (1/2m) · Σ_{i=1..m} ( hθ(x(i)) − y(i) )²    (squared-error function)

Notation:
m = number of training examples
(x(i), y(i)) = the i-th training example

Objective function (our goal): minimize J(θ0, θ1) over θ0, θ1, i.e. choose θ0, θ1 so that hθ(x) is close to the output y for our training examples (x, y).
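The squared-error cost above can be sketched in a few lines of NumPy. This is a minimal illustration, not course code: the name `compute_cost` is ours, and the toy data reuses the x/y pairs from the earlier example slide.

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """Squared-error cost J(theta0, theta1) = (1/2m) * sum((h(x_i) - y_i)^2)."""
    m = len(x)
    predictions = theta0 + theta1 * x      # h_theta(x) for every training example
    errors = predictions - y               # residuals against the actual outputs
    return np.sum(errors ** 2) / (2 * m)

# Toy data from the earlier example: the guess h(x) = 2 + 2x is off by 3 at x = 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([4.0, 7.0, 7.0, 8.0])
print(compute_cost(2.0, 2.0, x, y))        # errors (-2, -3, -1, 0) -> J = 14/8 = 1.75
```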
Simplified Example
• Training data: (x, y) = (1, 1), (2, 2), (3, 3), where m (# of samples) = 3.
• Simplified hypothesis: fix θ0 = 0, so hθ(x) = θ1·x. The cost becomes a function of the single parameter θ1:

  J(θ1) = (1/2m) · Σ_{i=1..m} ( θ1·x(i) − y(i) )²

(For a fixed θ1, hθ is a function of x; J is a function of the parameter θ1.)

• θ1 = 0 (hθ(x) = 0):      J(0) = (1/6)·(1 + 4 + 9) ≈ 2.3
• θ1 = 0.5 (hθ(x) = 0.5·x): J(0.5) = (1/6)·(0.25 + 1 + 2.25) ≈ 0.58
• θ1 = 1 (hθ(x) = x):      J(1) = (1/6)·(0 + 0 + 0) = 0
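The three cost values above can be verified with a short script (assuming NumPy; `J` here is the simplified one-parameter cost with θ0 fixed at 0):

```python
import numpy as np

# Training data from the simplified example
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)

def J(theta1):
    """Cost for the simplified hypothesis h(x) = theta1 * x (theta0 = 0)."""
    errors = theta1 * x - y
    return np.sum(errors ** 2) / (2 * m)

for t in (0.0, 0.5, 1.0):
    print(t, round(J(t), 2))               # 2.33, 0.58, 0.0 as on the slide
```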
Simplified Example
• Plotting J against θ1: J(0) ≈ 2.3, J(0.5) ≈ 0.58, J(1) = 0. The cost curve bottoms out at θ1 = 1.
• Minimum cost = best-fit line (our objective).
Summary of cost function
(For fixed θ0, θ1, hθ(x) is a function of x; J(θ0, θ1) is a function of the parameters.)
• With two parameters, J(θ0, θ1) is visualized with "contour plots" or "contour figures": every point on the same contour line has the same value of J(θ0, θ1).
• Example hypotheses and their corresponding points (θ0, θ1) on the contour plot:
  – hθ(x) = 800 − 0.15·x  →  (θ0, θ1) = (800, −0.15)
  – hθ(x) = 500 − 0.5·x   →  (θ0, θ1) = (500, −0.5)
  – hθ(x) = 100 + 0.1·x   →  (θ0, θ1) = (100, 0.1), near the center of the contours (our objective)
Quick Summary until now

Hypothesis: hθ(x) = θ0 + θ1·x
Cost Function: J(θ0, θ1) = (1/2m) · Σ_{i=1..m} ( hθ(x(i)) − y(i) )²
Goal: minimize J(θ0, θ1) over θ0, θ1.
Our strategy until now: keep changing θ0, θ1 to reduce J(θ0, θ1) until we hopefully end up at a minimum!
Optimization Technique

Gradient Descent

Optimization
• Optimization is the process of finding the set of parameters/weights that minimize the cost function.
• Strategy 1: a first, very bad idea: RANDOM SEARCH
  – simply try out many different random parameters and keep track of what works best (iterative refinement):
  – start with random weights and iteratively refine them over time to get a lower cost.
Optimization
• Strategy 2: Following the Gradient
  – Compute the best direction along which we should change our parameter (weight) vector, which is mathematically guaranteed to be the direction of steepest descent.
  – This direction is given by the gradient of the cost function.
Gradient Descent
• The way we do this is by taking the derivative (the slope of the tangential line to a function) of our cost function.
• The slope of the tangent is the derivative at that point, and it gives us a direction to move towards.
• We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when its value is the minimum.
Gradient Descent
• We make steps down the cost function in the direction of steepest descent, and the size of each step is determined by the parameter α, which is called the learning rate (step size).
• The gradient descent algorithm is:

  repeat until convergence {
      θj := θj − α · ∂J(θ0, θ1)/∂θj      (for j = 0 and j = 1)
  }

• Positive slope (positive derivative): θ will decrease.
• Negative slope (negative derivative): θ will increase.
Two ways to compute the gradient
• There are two ways to compute the gradient:
1) Numerical gradient: slow and approximate, but easy to implement. Approximate, since we have to pick a small value of h while the true gradient is defined as the limit as h goes to zero; it is also very computationally expensive to compute.
2) Analytic gradient: fast and exact, but more error-prone since it requires calculus. It allows us to derive a direct formula for the gradient (no approximations) that is also very fast to compute.

Always use the analytic gradient, but check the implementation with the numerical gradient. This is called a gradient check.
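A gradient check can be sketched as follows: a central-difference numerical gradient is compared against a hand-derived analytic gradient. The function names and the quadratic test function are illustrative, not from the slides.

```python
import numpy as np

def numerical_gradient(J, theta, h=1e-5):
    """Central-difference approximation of dJ/dtheta_j for each parameter."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = h
        grad[j] = (J(theta + step) - J(theta - step)) / (2 * h)
    return grad

# Check an analytic gradient against the numerical one for J(theta) = sum(theta^2)
J = lambda t: np.sum(t ** 2)
analytic = lambda t: 2 * t                 # derived by calculus
theta = np.array([1.0, -2.0, 3.0])
diff = np.max(np.abs(analytic(theta) - numerical_gradient(J, theta)))
print(diff < 1e-6)                         # tiny difference -> the check passes
```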
Gradient Descent for Linear Regression
• When specifically applied to the case of linear regression, a new form of the gradient descent equation can be derived. Substituting our actual cost function and our actual hypothesis function modifies the algorithm to:

  – Start by initializing the parameters randomly.
  – Repeat until convergence {   // until the error is small
      – Compute predicted values with the linear regression hypothesis.
      – Calculate the cost function.
      – If the cost is large, update the parameters using GD:

        θ0 := θ0 − α · (1/m) · Σ_{i=1..m} ( hθ(x(i)) − y(i) )
        θ1 := θ1 − α · (1/m) · Σ_{i=1..m} ( hθ(x(i)) − y(i) ) · x(i)

        (the two updates should be done simultaneously)
    }
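The loop above can be sketched in NumPy for the univariate case. The data, zero initialization, α, and fixed iteration count are illustrative; real code would test for convergence instead of looping a fixed number of times.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iters=5000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        err = (theta0 + theta1 * x) - y    # h_theta(x_i) - y_i for all examples
        grad0 = np.sum(err) / m            # dJ/dtheta0
        grad1 = np.sum(err * x) / m        # dJ/dtheta1
        theta0 -= alpha * grad0            # simultaneous update: both gradients
        theta1 -= alpha * grad1            # were computed before either change
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
t0, t1 = gradient_descent(x, y)
print(round(t0, 3), round(t1, 3))          # approaches (0, 1): the line y = x
```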
Gradient Descent variants
• There are three variants of gradient descent based on the amount of data used
to calculate the gradient:

1. Batch gradient descent


2. Stochastic gradient descent
3. Mini-batch gradient descent

33
Batch Gradient Descent
• Batch gradient descent (also called vanilla gradient descent) calculates the error for each observation in the dataset but performs an update only after all observations have been evaluated.

• One cycle through the entire training dataset is called a training epoch. Therefore, it is often said that batch gradient descent performs model updates at the end of each training epoch.

• Batch gradient descent is not often used, because it represents a huge consumption of computational resources: the entire dataset needs to remain in memory.
Stochastic Gradient Descent (SGD)
• Stochastic gradient descent, often abbreviated SGD, is a variation of the
gradient descent algorithm that calculates the error and updates the model
for each example in the training dataset.

• The noisy update process can allow the model to avoid local minima (e.g.
premature convergence).

• SGD is usually faster than batch gradient descent, but its frequent updates cause a higher variance in the error rate, which can sometimes jump around instead of decreasing.

36
Mini-Batch Gradient Descent
• Mini-batch gradient descent seeks to find a balance between the robustness of
stochastic gradient descent and the efficiency of batch gradient descent.

• It is the most common implementation of gradient descent used in the field of deep
learning.

• It splits the training dataset into small batches that are used to calculate model error
and update model coefficients.

37
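The three variants differ only in how many examples feed each parameter update, so one loop can cover all of them. This is purely a sketch: the name `fit`, the synthetic noiseless data, and the hyperparameters are all illustrative assumptions.

```python
import numpy as np

def fit(X, y, batch_size, alpha=0.05, epochs=200, seed=0):
    """batch_size == m -> batch GD; == 1 -> SGD; in between -> mini-batch GD."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        idx = rng.permutation(m)           # shuffle examples every epoch
        for start in range(0, m, batch_size):
            b = idx[start:start + batch_size]
            err = X[b] @ theta - y[b]      # residuals on this batch only
            theta -= alpha * X[b].T @ err / len(b)
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true = np.array([2.0, -1.0])
y = X @ true                               # noiseless, so all variants agree
for bs in (200, 32, 1):                    # batch, mini-batch, stochastic
    print(bs, np.round(fit(X, y, bs), 2))
```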
Learning Rate
• The gradient tells us the direction, but it does not tell us how far along this
direction we should step.

• The learning rate (step size) determines how big the step would be on
each iteration. It determines how fast or slow we will move towards the
optimal weights.

48
Learning Rate
• If the learning rate is too large, gradient descent may overshoot the minimum and fail to converge.
• If the learning rate is very small, it will take a long time to converge and become computationally expensive.
• The most commonly used rates are: 0.001, 0.003, 0.01 (default), 0.03, 0.1, 0.3.
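The effect can be seen on a toy one-dimensional cost J(θ) = θ², whose gradient is 2θ. The helper name and the three rates chosen here are illustrative.

```python
import numpy as np

def descend(alpha, steps=50):
    """Run gradient descent on J(t) = t^2 (gradient 2t), starting from t = 1."""
    t = 1.0
    for _ in range(steps):
        t -= alpha * 2 * t                 # t_new = (1 - 2*alpha) * t
    return t

for alpha in (0.01, 0.1, 1.1):
    print(alpha, descend(alpha))
# small rate: slow progress; moderate rate: converges;
# too-large rate: overshoots and diverges
```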
Regression Problem

Multivariate Linear Regression


Linear Regression with multiple Variable

50
Multiple variables (Features)
• Linear regression with multiple variables is also known as "multivariate linear regression".

  Size (feet²) x | Price ($1000s) y
  2104 | 460
  1416 | 232
  1534 | 315
  852  | 178

Could the prediction be more accurate if we add the number of rooms?
Multiple variables (Features)

  Size (feet²) x1 | # bedrooms x2 | # floors x3 | Age of home (years) x4 | Price ($1000s) y
  2104 | 5 | 1 | 45 | 460
  1416 | 3 | 2 | 40 | 232
  1534 | 3 | 2 | 30 | 315
  852  | 2 | 1 | 36 | 178
  ...  | ... | ... | ... | ...

Notation:
m = number of training examples
n = number of features
x(i) = the input (features) of the i-th training example
xj(i) = value of feature j in the i-th training example

Example: x(2) = [1416, 3, 2, 40]ᵀ (the second training example), and x4(3) = 30 (the age of the third example).
Hypothesis
• For convenience of notation, define x0 = 1. With feature vector x ∈ ℝⁿ⁺¹ and parameter vector θ ∈ ℝⁿ⁺¹:

  x = [x0, x1, x2, ..., xn]ᵀ    θ = [θ0, θ1, θ2, ..., θn]ᵀ

  hθ(x) = θ0·x0 + θ1·x1 + θ2·x2 + ... + θn·xn = θᵀx
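With x0 = 1 prepended, the hypothesis is a single dot product θᵀx. A minimal NumPy sketch (the weight values are illustrative assumptions, loosely matching the housing example's scale):

```python
import numpy as np

theta = np.array([80.0, 0.1, 10.0])    # illustrative weights: intercept, size, bedrooms
x_raw = np.array([2104.0, 3.0])        # raw features: size (feet^2), # bedrooms
x = np.concatenate(([1.0], x_raw))     # define x0 = 1 for the intercept term
h = theta @ x                          # h_theta(x) = theta^T x
print(h)                               # 80 + 0.1*2104 + 10*3 = 320.4
```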
GD for Multiple Variables
• The update rule generalizes to every parameter j = 0, 1, ..., n:

  θj := θj − α · (1/m) · Σ_{i=1..m} ( hθ(x(i)) − y(i) ) · xj(i)    (with x0(i) = 1)
Gradient Descent in Practice I - Feature Scaling

• We can speed up gradient descent by having each of our input values in roughly the same range.
• Because θ will:
  – descend quickly on small ranges,
  – descend slowly on large ranges, and
  – oscillate inefficiently down to the optimum when the variables are very uneven.
• The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same.
Gradient Descent in Practice I - Feature Scaling

Make sure features are on the same scale using:
– Mean Normalization
– Z-Score Normalization

[Figure: contours of J for un-scaled features (elongated) vs. scaled features (near-circular).]

https://www.blog.nipunarora.net/ml_multi_variate_linear_regression/
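Both normalizations can be sketched per feature column (assuming NumPy; the function names are ours, and the sample matrix reuses the size/bedrooms columns from the housing table):

```python
import numpy as np

def z_score(X):
    """Z-score normalization: (x - mean) / standard deviation, per column."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def mean_normalize(X):
    """Mean normalization: (x - mean) / (max - min), per column."""
    return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [1534.0, 3.0],
              [ 852.0, 2.0]])
print(z_score(X).std(axis=0))          # each scaled column now has unit spread
```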
Features Selection
• We can improve our features and the form of our hypothesis function.
• We can combine multiple features into one. For example, we can combine x1 and x2 into a new feature x3 by taking x1·x2.
  – Ex: if x1 is the length and x2 is the width of a house, we can combine them into a new feature area = length × width.
Optimization Technique:

Normal Equations

Better values for θ
• To solve for θ analytically, use the normal equation. Minimize J(θ) by setting its partial derivatives to zero:

  ∂J(θ)/∂θj = 0 for every j; solve for θ0, θ1, ..., θn.

• This gives the closed-form solution θ = (XᵀX)⁻¹ Xᵀ y.
Example: m = 4

  x0 | Size (feet²) x1 | # bedrooms x2 | # floors x3 | Age (years) x4 | Price ($1000s) y
  1 | 2104 | 5 | 1 | 45 | 460
  1 | 1416 | 3 | 2 | 40 | 232
  1 | 1534 | 3 | 2 | 30 | 315
  1 | 852  | 2 | 1 | 36 | 178

X (the features matrix) is m × (n + 1); y is an m-dimensional vector.
m examples; n features.
• Each training example x(i) ∈ ℝⁿ⁺¹ (with x0(i) = 1). Stacking the transposed examples (x(i))ᵀ as rows gives the design matrix X ∈ ℝ^{m×(n+1)}, and stacking the targets gives y ∈ ℝᵐ.

Example with one feature:

  X = [ 1  x(1)          y = [ y(1)
        1  x(2)                y(2)
        ⋮  ⋮                    ⋮
        1  x(m) ]              y(m) ]
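The normal equation can be tried on the m = 4 housing table directly. A sketch, with one caveat: with 4 examples and 5 columns, XᵀX is singular here (the m ≤ n non-invertible case the slides discuss), so a least-squares solver standing in for the pseudo-inverse is used instead of a plain matrix inverse.

```python
import numpy as np

# Design matrix (x0 = 1 column) and targets from the m = 4 housing example
X = np.array([[1.0, 2104.0, 5.0, 1.0, 45.0],
              [1.0, 1416.0, 3.0, 2.0, 40.0],
              [1.0, 1534.0, 3.0, 2.0, 30.0],
              [1.0,  852.0, 2.0, 1.0, 36.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])

# theta = (X^T X)^(-1) X^T y; with m <= n the plain inverse does not exist,
# so solve the least-squares problem (minimum-norm / pseudo-inverse solution)
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(X @ theta, 1))          # reproduces y: 4 equations, 5 unknowns
```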
Gradient Descent vs. Normal Equation

  Gradient Descent                  | Normal Equation
  Need to choose α                  | No need to choose α
  Needs many iterations             | No need to iterate
  Works well even when n is large   | Slow if n is very large

What if XᵀX is non-invertible (singular/degenerate)?
Possible reasons: redundant features (e.g. size in both feet and meters), or too many features (m ≤ n).

Gradient descent:
– Works well even when n is massive (millions); better suited to big data.
– What is a big n, though? 100 or even 1000 is still (relatively) small; if n is 10,000, then look at using gradient descent.

Normal equation:
– Needs to compute (XᵀX)⁻¹, the inverse of an (n+1) × (n+1) matrix.
– With most implementations, computing a matrix inverse grows as O(n³), so it can be much slower.
Example 2:

  Age (x1) | Height in cm (x2) | Weight in kg (y)
  4 | 89  | 16
  9 | 124 | 28
  5 | 103 | 20

• What are X (the design matrix) and y?

  X = [ 1  4   89          y = [ 16
        1  9  124                28
        1  5  103 ]              20 ]
Check?
• Suppose you have m = 25 training examples with n = 6 features. The normal equation is θ = (XᵀX)⁻¹ Xᵀ y.
• For the given values of m and n, what are the dimensions of θ, X, and y in this equation?

  X is m × (n + 1) = 25 × 7
  y is an m-vector: 25 × 1
  θ is an (n+1)-vector: 7 × 1
