
CS2109S: Introduction to AI and Machine Learning

Lecture 5:
Linear Regression
13 February 2023

1
Announcement

2
Midterm (Confirmed)
• Date & Time: 5 March, 16:00 – 18:00
• Please come at 15:30. You will be allowed into the hall at 15:45
• Venue: MPSH 2A & 2B
• Format: Digital Assessment (Examplify)
• Materials: all topics covered before recess week, i.e. up to and including Lecture 4
• Cheatsheet: 1 x A4 paper, both sides
• Calculators: Not allowed. Examplify has a built-in calculator

More details will be announced later.


3
Materials

4
Recap
• Machine Learning
• What is ML? – a machine that learns from data
• Types of Feedback: supervised, unsupervised, semi-supervised, reinforcement
• Supervised Learning
• Performance Measure
• Regression: mean squared error, mean absolute error
• Classification: correctness, accuracy, confusion matrix, precision, recall, F1
• Decision Trees
• Decision Tree Learning (DTL): greedy, top-down, recursive algorithm
• Entropy and Information Gain
• Different types of attributes: many values, differing costs, missing values
• Pruning: min-sample, max-depth
• Ensemble Methods: bagging, boosting

5
Outline
• Linear Regression
• Gradient Descent
• Gradient Descent Algorithm
• Linear Regression with Gradient Descent
• Variants of Gradient Descent
• Linear Regression: Challenges and Solutions
• Linear Regression with Many Attributes
• Dealing with Features of Different Scales
• Dealing with Non-Linear Relationship
• Normal Equation

6
Outline
• Linear Regression
• Gradient Descent
• Gradient Descent Algorithm
• Linear Regression with Gradient Descent
• Variants of Gradient Descent
• Linear Regression: Challenges and Solutions
• Linear Regression with Many Attributes
• Dealing with Features of Different Scales
• Dealing with Non-Linear Relationship
• Normal Equation

7
Regression
Housing price prediction

[Scatter plot: price ($M) on the y-axis (100–500) vs. size (sqft) on the x-axis (500–2500). What price should we predict for a house of 1150 sqft?]

8
Linear Regression
Housing price prediction ($M vs. sqft)

Candidate lines:
$h_w(x) = 100 + \frac{1}{2}x$
$h_w(x) = 100 + \frac{1}{5}x$
$h_w(x) = 0 + \frac{1}{5}x$

General form: $h_w(x) = w_0 + w_1 x$
Find $w$ that "fits the data well"! What does this mean?

[Scatter plot as before, with the candidate lines and predictions queried at 0, 1150, and 2600 sqft]

9
Linear Regression: Measuring Fit

For a set of $m$ examples $(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})$, we can compute the average (mean) squared error as follows:

$J_{MSE}(w) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right)^2$

This is our loss function; the prediction $h_w(x^{(i)})$ is $\hat{y}^{(i)}$, and the factor $\frac{1}{2}$ is for mathematical convenience. We want to minimize the loss/error!

[Scatter plot: price ($M) vs. size (sqft)]

10
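To make the loss concrete, here is a minimal NumPy sketch (not from the slides; the toy data and the helper name mse_loss are illustrative assumptions) that evaluates $J_{MSE}$ for a candidate line:

```python
import numpy as np

def mse_loss(w0, w1, x, y):
    """J_MSE(w) = 1/(2m) * sum_i (h_w(x^(i)) - y^(i))^2 with h_w(x) = w0 + w1 * x."""
    m = len(x)
    predictions = w0 + w1 * x          # h_w(x^(i)) for every example
    errors = predictions - y           # h_w(x^(i)) - y^(i)
    return np.sum(errors ** 2) / (2 * m)

# Toy data: size (sqft) and price, purely illustrative
x = np.array([500.0, 1000.0, 1500.0, 2000.0, 2500.0])
y = np.array([100.0, 200.0, 300.0, 400.0, 500.0])

print(mse_loss(0.0, 0.2, x, y))     # 0.0: the line 0.2x fits this toy data exactly
print(mse_loss(100.0, 0.2, x, y))   # 5000.0: a worse fit gives a larger loss
```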
Linear Regression: Measuring Fit

How do we know how to position/rotate our line?
(Same setup and loss $J_{MSE}(w)$ as the previous slide.)

11
Linear Regression, Naïve Approach: Enumerate all possible lines

Hypothesis: $h_w(x) = w_0 + w_1 x$. Simplify by fixing $w_0 = 0$ for easier visualization, so $h_w(x) = w_1 x$.

Loss function: $J_{MSE}(w) = \frac{1}{2m} \sum_{i=1}^{m} \left( w_1 x^{(i)} - y^{(i)} \right)^2$

Try $h_w(x) = 0x$ on the three data points $(1, 1), (2, 2), (3, 3)$:

$J_{MSE}(0) = \frac{1^2 + 2^2 + 3^2}{6} = \frac{14}{6}$

[Left plot: the data points and the candidate line, $y$ vs. $x$. Right plot: $J_{MSE}(w)$ vs. $w_1$ (0.5 to 2), with the point for this line marked.]

12
Linear Regression, Naïve Approach: Enumerate all possible lines

Try $h_w(x) = 0.5x$ (same hypothesis, loss, and data as before):

$J_{MSE}(0.5) = \frac{(0.5-1)^2 + (1-2)^2 + (1.5-3)^2}{2 \times 3} = \frac{0.5^2 + 1^2 + 1.5^2}{6} = \frac{3.5}{6}$

13
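These loss values can be checked with a few lines of code. A sketch, assuming the three data points $(1, 1), (2, 2), (3, 3)$ implied by the calculations above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)

for w1 in [0.0, 0.5, 1.0, 1.5, 2.0]:
    loss = np.sum((w1 * x - y) ** 2) / (2 * m)   # J_MSE with w0 fixed at 0
    print(f"w1 = {w1}: J_MSE = {loss:.4f}")

# Matches the slides: J(0) = 14/6 ≈ 2.3333, J(0.5) = 3.5/6 ≈ 0.5833, J(1) = 0
```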
Linear Regression, Naïve Approach: Enumerate all possible lines

Try $h_w(x) = 1x$ (same setup); this line passes through all three data points, so $J_{MSE}(1) = 0$.

14
Linear Regression, Naïve Approach: Enumerate all possible lines

Try $h_w(x) = 1.5x$ (same setup).

15
Linear Regression, Naïve Approach: Enumerate all possible lines

Try $h_w(x) = 2x$ (same setup).

16
Linear Regression, Naïve Approach: Enumerate all possible lines

$h_w(x) = 1x$ has the lowest loss of the candidates: select this line.

17
Linear Regression: Loss Landscape

Plotting $J_{MSE}(w)$ against $w_1$ (same hypothesis, loss, and data as before) traces out the loss landscape. The best line corresponds to the minimum of this curve: that is where we want to get!

Can we do better than enumerating all possible lines?

18
Linear Regression: Better Approach

Instead of enumerating, adjust $w_1$ step by step towards the minimum:
• From the left of the minimum (e.g. $w_1 = 0.5$): $w_1 \leftarrow w_1 + c$
• From the right of the minimum (e.g. $w_1 = 1.5$): $w_1 \leftarrow w_1 - c$

How do we get the appropriate $c$?

19
Linear Regression: Better Approach

Use the negative gradient of the loss:

$-\frac{\partial J_{MSE}(w)}{\partial w_1} = -\frac{1}{m} \sum_{i=1}^{m} \left( w_1 x^{(i)} - y^{(i)} \right) x^{(i)}$

Evaluated on the single example $(x, y) = (1, 1)$:
• At $w_1 = 0.5$: $-(0.5 \times 1 - 1) \times 1 = +0.5$, so we move in the $+c$ direction.
• At $w_1 = 1.5$: $-(1.5 \times 1 - 1) \times 1 = -0.5$, so we move in the $-c$ direction.

The sign (and size) of the gradient tells us which way, and how far, to step:
$w_1 \leftarrow w_1 - \frac{\partial J_{MSE}}{\partial w_1}$

20
Outline
• Linear Regression
• Gradient Descent
• Gradient Descent Algorithm
• Linear Regression with Gradient Descent
• Variants of Gradient Descent
• Linear Regression: Challenges and Solutions
• Linear Regression with Many Attributes
• Dealing with Features of Different Scales
• Dealing with Non-Linear Relationship
• Normal Equation

21
Gradient Descent: Remember Hill-Climbing?

• Start at some $w$
• Pick a nearby $w$ that reduces $J(w)$:

  $w_j \leftarrow w_j - \gamma \frac{\partial J(w_0, w_1, \ldots)}{\partial w_j}$   ($\gamma$ is the learning rate)

• Repeat until a minimum is reached

22
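A generic sketch of this loop in Python (illustrative only, not the course's reference implementation; grad_J is an assumed callable that returns the gradient vector at $w$):

```python
import numpy as np

def gradient_descent(grad_J, w_init, gamma=0.1, n_iters=1000, tol=1e-8):
    """Repeatedly step against the gradient: w_j <- w_j - gamma * dJ/dw_j."""
    w = np.array(w_init, dtype=float)
    for _ in range(n_iters):
        step = gamma * grad_J(w)        # gamma is the learning rate
        w = w - step
        if np.linalg.norm(step) < tol:  # stop once the updates become negligible
            break
    return w

# Example: J(w) = (w0 - 3)^2 + (w1 + 1)^2, whose gradient is easy to write by hand
grad = lambda w: np.array([2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)])
print(gradient_descent(grad, w_init=[0.0, 0.0]))   # converges near [3, -1]
```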
Gradient Descent: 1 Parameter

• Start at some $w_0$
• Pick a nearby $w_0$ that reduces $J(w_0)$:

  $w_0 \leftarrow w_0 - \gamma \frac{\partial J(w_0)}{\partial w_0}$   ($\gamma$ is the learning rate)

• Repeat until a minimum is reached

As it gets closer to a minimum:
• The gradient becomes smaller
• The steps become smaller

[Plot: $J_{MSE}(w)$ vs. $w_0$, with successive steps shrinking as they approach the minimum]

23
Gradient Descent: 1 Parameter

(Same update as before.)

$\gamma$ too large: the steps overshoot the minimum, so the iterates bounce back and forth and may even diverge.

[Plot: $J_{MSE}(w)$ vs. $w_0$, with large steps jumping across the minimum]

24
Gradient Descent: 1 Parameter

(Same update as before.)

$\gamma$ too small: the steps are tiny and convergence is very slow.

[Plot: $J_{MSE}(w)$ vs. $w_0$, with many small steps creeping towards the minimum]

25
Gradient Descent: 2 Parameters

• Start at some $w = (w_0, w_1)$
• Pick a nearby $w$ that reduces $J(w)$:

  $w_j \leftarrow w_j - \gamma \frac{\partial J(w_0, w_1)}{\partial w_j}$   ($\gamma$ is the learning rate)

• Repeat until a minimum is reached

[Surface plot: $J$ over $(w_0, w_1)$]

26
Image Credit: https://www.researchgate.net/figure/The-2D-Caussian-function-of-Example-2-Example-1-Consider-the-following-strongly-convex_fig2_230787652
Gradient Descent: Common Mistakes

• Start at some $w$
• Pick a nearby $w$ that reduces $J(w)$:
  $w_j \leftarrow w_j - \gamma \frac{\partial J(w_0, w_1, \ldots)}{\partial w_j}$   ($\gamma$ is the learning rate)
• Repeat until a minimum is reached

Wrong (sequential update): $w_0$ has already changed by the time $w_1$ is updated!
$w_0 = w_0 - \gamma \frac{\partial J(w_0, w_1)}{\partial w_0}$
$w_1 = w_1 - \gamma \frac{\partial J(w_0, w_1)}{\partial w_1}$

Correct (simultaneous update): evaluate all partial derivatives at the current $w$ first, then update.
$a = \frac{\partial J(w_0, w_1)}{\partial w_0}$,   $b = \frac{\partial J(w_0, w_1)}{\partial w_1}$
$w_0 = w_0 - \gamma a$,   $w_1 = w_1 - \gamma b$

27
Image Credit: https://python.plainenglish.io/logistic-regression-in-machine-learning-from-scratch-872b1fedd05b
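The same point in code, a sketch assuming $w$ is a NumPy array of weights and grad_J is an assumed gradient function: compute every partial derivative at the current weights first, then update.

```python
import numpy as np

def update_wrong(w, grad_J, gamma):
    """Sequential update: w[0] has already changed when the gradient for w[1] is taken."""
    w = w.copy()
    w[0] = w[0] - gamma * grad_J(w)[0]
    w[1] = w[1] - gamma * grad_J(w)[1]   # uses the *new* w[0]: not gradient descent!
    return w

def update_correct(w, grad_J, gamma):
    """Simultaneous update: evaluate the whole gradient at the current w, then step."""
    g = grad_J(w)                        # a = dJ/dw0 and b = dJ/dw1 at the same point
    return w - gamma * g
```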

Linear Regression with Gradient Descent

Hypothesis: $h_w(x) = w_0 + w_1 x$

Loss function: $J_{MSE}(w_0, w_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( w_0 + w_1 x^{(i)} - y^{(i)} \right)^2$

$\frac{\partial J_{MSE}(w_0, w_1)}{\partial w_j} = \frac{\partial}{\partial w_j} \frac{1}{2m} \sum_{i=1}^{m} \left( w_0 + w_1 x^{(i)} - y^{(i)} \right)^2$

$\frac{\partial J_{MSE}(w_0, w_1)}{\partial w_0} = \frac{1}{m} \sum_{i=1}^{m} \left( w_0 + w_1 x^{(i)} - y^{(i)} \right)$

$\frac{\partial J_{MSE}(w_0, w_1)}{\partial w_1} = \frac{1}{m} \sum_{i=1}^{m} \left( w_0 + w_1 x^{(i)} - y^{(i)} \right) \cdot x^{(i)}$

Theorem: the MSE loss function is convex for linear regression, so there is one minimum, the global minimum.

Can we use mean absolute error (MAE) for our $J$? MAE is not fully differentiable (its derivative is undefined at 0).

28
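Putting the two partial derivatives to work, a minimal batch-gradient-descent sketch for the one-feature case (the toy data, learning rate, and iteration count are assumptions for illustration):

```python
import numpy as np

def fit_linear_regression(x, y, gamma=0.05, n_iters=5000):
    """Fit h_w(x) = w0 + w1 * x by batch gradient descent on J_MSE."""
    m = len(x)
    w0, w1 = 0.0, 0.0
    for _ in range(n_iters):
        errors = (w0 + w1 * x) - y                  # h_w(x^(i)) - y^(i) for all i
        grad_w0 = np.sum(errors) / m                # dJ/dw0
        grad_w1 = np.sum(errors * x) / m            # dJ/dw1
        w0, w1 = w0 - gamma * grad_w0, w1 - gamma * grad_w1   # simultaneous update
    return w0, w1

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
print(fit_linear_regression(x, y))   # approaches (0, 1), the line through the data
```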
Linear Regression with Gradient Descent

[Slides 29–37: a sequence of figures showing, side by side, the fitted line $h_w(x)$ on the data (left) and the current point on the contour plot of $J_{MSE}(w_0, w_1)$ (right) as gradient descent steps towards the minimum.]

Credit: Andrew Ng
Variants of Gradient Descent

$w_j \leftarrow w_j - \gamma \frac{\partial J(w_0, w_1, \ldots)}{\partial w_j}$, where $J_{MSE}(w) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right)^2$

[Three housing-price plots ($M vs. sqft), one per variant]

(Batch) Gradient Descent
• Consider all training examples

Mini-batch Gradient Descent
• Consider a subset of training examples at a time
• Cheaper (faster) per iteration
• Randomness, may escape local minima*

Stochastic Gradient Descent (SGD)
• Select one random data point at a time
• Cheapest (fastest) per iteration
• More randomness, may escape local minima*

38
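A sketch of the three variants in one function (illustrative assumptions: the helper grad_on_batch, the toy data, and the hyperparameters); the only difference between the variants is how many examples feed each update:

```python
import numpy as np

def grad_on_batch(w, X, y):
    """MSE gradient of h_w(x) = X @ w, computed on whatever subset it is given."""
    m = len(y)
    return X.T @ (X @ w - y) / m

def gradient_descent_variants(X, y, w, gamma=0.05, n_epochs=2000, batch_size=1):
    """batch_size = len(y): batch GD; 1 < batch_size < len(y): mini-batch; 1: SGD."""
    m = len(y)
    for _ in range(n_epochs):
        order = np.random.permutation(m)          # visit examples in a random order
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            w = w - gamma * grad_on_batch(w, X[idx], y[idx])
    return w

# Toy data: a bias column of ones plus one feature, with y = 2 + 3x exactly
X = np.c_[np.ones(5), np.arange(1.0, 6.0)]
y = 2.0 + 3.0 * np.arange(1.0, 6.0)
print(gradient_descent_variants(X, y, w=np.zeros(2), batch_size=2))  # close to [2, 3]
```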
Variants of Gradient Descent

[Figure illustrating the gradient descent variants]

Credit: analyticsvidhya.com
39
Escaping Local Minima / Plateaus in Non-Convex Optimization

[Two loss surfaces $J(w_0, w_1)$: one for Batch Gradient Descent, one for Stochastic/Mini-batch Gradient Descent. The noisier updates of stochastic/mini-batch gradient descent may escape local minima or plateaus that trap batch gradient descent.]

40
Outline
• Linear Regression
• Gradient Descent
• Gradient Descent Algorithm
• Linear Regression with Gradient Descent
• Variants of Gradient Descent
• Linear Regression: Challenges and Solutions
• Linear Regression with Many Attributes
• Dealing with Features of Different Scales
• Dealing with Non-Linear Relationship
• Normal Equation

41
Linear Regression with Many Attributes

HDB prices from SRX:

x0 (bias)  x1 (Year)  x2 (# bedrooms)  x3 (# bathrooms)  x4 (Size, m2)  y (Price, $)
1          2016       4                2                 113            560,000
1          1998       3                2                 102            739,000
1          1997       3                0                 100            430,000
1          2014       3                2                 84             698,000
1          2016       3                0                 112            688,888
1          1979       2                2                 68             390,000
1          1969       2                1                 53             250,000
1          1986       3                2                 122            788,000
1          1985       3                3                 150            680,000
1          2009       3                2                 90             828,000

Hypothesis: $h_w(x) = w_0 x_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4$

Hypothesis (for $n$ features): $h_w(x) = \sum_{j=0}^{n} w_j x_j = w^T x$, with $w = [w_0, w_1, \ldots, w_n]^T$ and $x = [x_0, x_1, \ldots, x_n]^T$

Weight update (for $n$ features):
$w_j \leftarrow w_j - \gamma \frac{\partial J_{MSE}(w_0, w_1, \ldots, w_n)}{\partial w_j} = w_j - \gamma \frac{1}{m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right) \cdot x_j^{(i)}$

Notation:
• $n$ = number of features
• $x^{(i)}$ = input features of the $i$-th training example
• $x_j^{(i)}$ = value of feature $j$ in the $i$-th training example

42
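The hypothesis and weight update vectorize naturally. A sketch (the rows below are copied from the table; the single-step usage and the tiny learning rate are illustrative assumptions):

```python
import numpy as np

# First four rows of the table above: bias, year, #bedrooms, #bathrooms, size (m^2) -> price ($)
X = np.array([
    [1, 2016, 4, 2, 113],
    [1, 1998, 3, 2, 102],
    [1, 1997, 3, 0, 100],
    [1, 2014, 3, 2,  84],
], dtype=float)
y = np.array([560_000, 739_000, 430_000, 698_000], dtype=float)

def h(X, w):
    return X @ w                       # h_w(x) = w^T x for every example at once

def gradient_step(X, y, w, gamma):
    m = len(y)
    grad = X.T @ (h(X, w) - y) / m     # (1/m) * sum_i (h_w(x^(i)) - y^(i)) * x_j^(i), all j
    return w - gamma * grad            # w_j <- w_j - gamma * dJ/dw_j, all j simultaneously

w = np.zeros(X.shape[1])
w = gradient_step(X, y, w, gamma=1e-8)  # one update; in practice, repeat until convergence
# Note the tiny learning rate: the raw features have very different scales (next slide).
```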
Dealing with Features of Different Scales

HDB prices from SRX:

x1 (# bedrooms)  x2 (Size, m2)  y (Price, $1K)
4                113            560
3                102            739
3                100            430
3                84             698
3                112            688
2                68             390
2                53             250
3                122            788
3                150            680
3                90             828

Hypothesis: $h_w(x) = w_0 x_0 + w_1 x_1 + w_2 x_2$. Simplify: set $w_0 = 0$, $w_1 = w_2 = 1$, so $h_w(x) = 0 x_0 + 1 x_1 + 1 x_2$.

Weight update: $w_j \leftarrow w_j - \gamma \frac{1}{m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right) \cdot x_j^{(i)}$

On the first example (with $\gamma = 1$, a single example):
$h_w(x) = 4 + 113 = 117$
$w_1 \leftarrow 1 - 1 \cdot (117 - 560) \cdot 4 = 1 - 1 \times (-1{,}772) = 1{,}773$
$w_2 \leftarrow 1 - 1 \cdot (117 - 560) \cdot 113 = 1 - 1 \times (-50{,}059) = 50{,}060$

The update to $w_2$ is vastly larger than the update to $w_1$, simply because the size feature has a much larger scale than the number of bedrooms. How to fix this?

Mean normalization: $x_i \leftarrow \frac{x_i - \mu_i}{\sigma_i}$   ($\sigma_i$ = standard deviation)

Other methods of standardization also exist: min-max scaling, robust scaling, etc.
Other solution: a separate learning rate $\gamma_i$ for each weight.

[Contour plot of $J$ over $(w_1, w_2)$: the step $\Delta w$ along the negative gradient is dominated by $w_2$.]

43
Dealing with Features of Different Scales

[Contour plots of $J$ over $(w_1, w_2)$ before and after feature scaling: without scaling the contours are highly elongated and gradient descent zig-zags; after scaling the contours are more circular and gradient descent heads more directly to the minimum.]

44
Image Credit: Andrew Ng
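A minimal mean-normalization sketch (an illustration, not the course's exact code); min-max scaling or robust scaling, mentioned on the previous slide, would replace the mean/standard-deviation computation:

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature column: x_j <- (x_j - mu_j) / sigma_j."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma   # keep mu, sigma to transform future inputs identically

# #bedrooms and size (m^2) from the first four rows of the earlier table
X = np.array([[4, 113], [3, 102], [3, 100], [3, 84]], dtype=float)
X_scaled, mu, sigma = mean_normalize(X)
print(X_scaled.mean(axis=0))   # approximately [0, 0]
print(X_scaled.std(axis=0))    # approximately [1, 1]
```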
Dealing with Non-Linear Relationship

[Plot: Exam Score ($y$) vs. Anxiety ($x$), a clearly non-linear relationship]

Which function? $h_w(x) = w_0 + w_1 x + w_2 x^2$

Generally: $h_w(x) = w_0 + w_1 f_1 + w_2 f_2 + w_3 f_3 + \cdots + w_n f_n$ with transformed features, e.g. $f_1 = x$, $f_2 = x^2$. This is Polynomial Regression.

Need to scale these transformed features!

45
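A sketch of the feature transformation on assumed toy data: once the transformed features $f_1 = x$, $f_2 = x^2$ are built, the problem is linear in $w$ and the usual linear-regression machinery applies unchanged.

```python
import numpy as np

def polynomial_features(x, degree):
    """Build columns [1, x, x^2, ..., x^degree] so plain linear regression fits a polynomial."""
    return np.column_stack([x ** d for d in range(degree + 1)])

x = np.linspace(0.0, 10.0, 20)            # anxiety (toy values)
y = 10.0 * x - x ** 2                     # exam score following a quadratic (toy values)
F = polynomial_features(x, degree=2)      # f0 = 1 (bias), f1 = x, f2 = x^2

# Linear in w once the features are transformed; a least-squares solve (equivalent in
# result to the normal equation of the next section) recovers w ≈ [0, 10, -1].
w, *_ = np.linalg.lstsq(F, y, rcond=None)
print(np.round(w, 3))
```

In practice the higher-degree columns ($x^2$, $x^3$, …) grow quickly, which is exactly why the slide warns that these transformed features need to be scaled.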
Outline
• Linear Regression
• Gradient Descent
• Gradient Descent Algorithm
• Linear Regression with Gradient Descent
• Variants of Gradient Descent
• Linear Regression: Challenges and Solutions
• Linear Regression with Many Attributes
• Dealing with Features of Different Scales
• Dealing with Non-Linear Relationship
• Normal Equation

46
Normal Equation

$X = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_n^{(1)} \\ 1 & x_1^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & & \vdots \\ 1 & x_1^{(m)} & \cdots & x_n^{(m)} \end{bmatrix}$ (the first column of ones is the bias),   $w = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_n \end{bmatrix}$,   $Y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}$,   $h_w(X) = Xw$

Goal: find $w$ that minimizes $J_{MSE}$. Set $\frac{\partial J_{MSE}(w)}{\partial w} = 0$. A bunch of math…

$2X^T X w - 2X^T Y = 0$
$2X^T X w = 2X^T Y$
$X^T X w = X^T Y$
$w = (X^T X)^{-1} X^T Y$   (assume $X^T X$ is invertible)

[Plot: the bowl-shaped $J$ over $(w_0, w_1)$; the gradient is zero at the minimum.]

47
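A direct transcription of the closed form (a sketch with assumed toy data; np.linalg.solve is used rather than forming the inverse explicitly, which gives the same result more stably):

```python
import numpy as np

def normal_equation(X, y):
    """Solve X^T X w = X^T Y, i.e. w = (X^T X)^{-1} X^T Y when X^T X is invertible."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data: a bias column plus one feature, generated from y = 2 + 3x
x = np.arange(1.0, 6.0)
X = np.c_[np.ones_like(x), x]
y = 2.0 + 3.0 * x
print(normal_equation(X, y))   # [2. 3.] -- no learning rate, no iterations
```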
Gradient Descent vs Normal Equation

                               Gradient Descent     Normal Equation
Need to choose γ?              Yes                  No
Iteration(s)                   Many                 None
Large number of features n?    No problem           Slow: (X^T X)^{-1} is O(n^3)
Feature scaling?               May be necessary     Not necessary
Constraints                    -                    X^T X needs to be invertible

48
Summary
• Linear Regression: fitting a line to data
• Gradient Descent
• Gradient Descent Algorithm: follow the negative gradient to reduce the error
• Linear Regression with Gradient Descent: convex optimization, one minimum
• Variants of Gradient Descent: batch, mini-batch, stochastic
• Linear Regression: Challenges and Solutions
• Linear Regression with Many Attributes: $h_w(x) = \sum_{j=0}^{n} w_j x_j = w^T x$
• Dealing with Features of Different Scales: normalize!
• Dealing with Non-Linear Relationship: transform features
• Normal Equation: analytically find the best parameters

49
Coming Up Next Week
• Logistic Regression
• Gradient Descent
• Multi-class classification
• Non-linear decision boundary
• (More) Performance Measure
• Receiver Operating Characteristic (ROC)
• Area under ROC (AUC)
• Model Evaluation & Selection
• Bias & Variance

50
To Do
• Lecture Training 5
• +100 Free EXP
• +50 Early bird bonus
• Problem Set 4
• Out today!

51
