Lecture 5: Linear Regression
13 February 2023
Announcement
Midterm (Confirmed)
• Date & Time: 5 March, 16:00 – 18:00
• Please come at 15:30. You will be allowed into the hall at 15:45
• Venue: MPSH 2A & 2B
• Format: Digital Assessment (Examplify)
• Materials: all topics covered before recess week (up to and including Lecture 4)
• Cheatsheet: 1 x A4 paper, both sides
• Calculators: Not allowed. Examplify has a built-in calculator
Recap
• Machine Learning
• What is ML? – machine that learns through data
• Types of Feedback: supervised, unsupervised, semi-supervised, reinforcement
• Supervised Learning
• Performance Measure
• Regression: mean squared error, mean absolute error
• Classification: correctness, accuracy, confusion matrix, precision, recall, F1
• Decision Trees
• Decision Tree Learning (DTL): greedy, top-down, recursive algorithm
• Entropy and Information Gain
• Different types of attributes: many values, differing costs, missing values
• Pruning: min-sample, max-depth
• Ensemble Methods: bagging, boosting
Outline
• Linear Regression
• Gradient Descent
• Gradient Descent Algorithm
• Linear Regression with Gradient Descent
• Variants of Gradient Descent
• Linear Regression: Challenges and Solutions
• Linear Regression with Many Attributes
• Dealing with Features of Different Scales
• Dealing with Non-Linear Relationship
• Normal Equation
Regression
Housing price prediction
[Figure: scatter plot of house price ($M) against size (sqft); what price should we predict for a house of 1150 sqft?]
Linear Regression
Housing price prediction
Hypothesis: $h_w(x) = w_0 + w_1 x$
Example lines: $h_w(x) = 100 + \frac{1}{2}x$, $h_w(x) = 100 + \frac{1}{5}x$, $h_w(x) = 0 + \frac{1}{5}x$
Find $w$ that "fits the data well"! What does this mean?
[Figure: house price ($M) against size (sqft) with the candidate lines; what would each predict at 0, 1150, and 2600 sqft?]
Linear Regression: Measuring Fit
For a set of $m$ examples $(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})$, we can compute the average (mean) squared error as follows:

$$J_{MSE}(w) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right)^2$$

This is the loss function; the factor $\frac{1}{2}$ is there for mathematical convenience. We want to minimize the loss/error!
[Figure: the data points $(x^{(i)}, y^{(i)})$ and the predictions $\hat{y}^{(i)} = h_w(x^{(i)})$ on the fitted line.]
Linear Regression: Measuring Fit
For a set of $m$ examples, how do we know how to position/rotate our line? We want to pick the line that minimizes the loss/error.
Linear Regression, Naïve Approach: Enumerate all possible lines

Hypothesis: $h_w(x) = w_0 + w_1 x$. Simplify: fix $w_0 = 0$ for easier visualization, so $h_w(x) = w_1 x$.

Loss function:
$$J_{MSE}(w) = \frac{1}{2m} \sum_{i=1}^{m} \left( w_1 x^{(i)} - y^{(i)} \right)^2$$

Data: $(1, 1), (2, 2), (3, 3)$, so $m = 3$.

Candidate $w_1 = 0$, i.e. $h_w(x) = 0x$:
$$J_{MSE}(w) = \frac{1^2 + 2^2 + 3^2}{6} = \frac{14}{6}$$

[Figure: left, the data points and the line $h_w(x) = 0x$; right, $J_{MSE}(w)$ plotted against $w_1$.]
Linear Regression, Naïve Approach: Enumerate all possible lines (continued)

Trying other candidate lines (same hypothesis, loss function, and data):

Candidate $w_1 = 0.5$, i.e. $h_w(x) = 0.5x$:
$$J_{MSE}(w) = \frac{(0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2}{2 \times 3} = \frac{0.5^2 + 1^2 + 1.5^2}{6} = \frac{3.5}{6}$$

Candidate $w_1 = 1$, i.e. $h_w(x) = 1x$: the line passes through every data point, so $J_{MSE}(w) = 0$.

Candidate $w_1 = 1.5$, i.e. $h_w(x) = 1.5x$: $J_{MSE}(w) = \frac{3.5}{6}$.

Candidate $w_1 = 2$, i.e. $h_w(x) = 2x$: $J_{MSE}(w) = \frac{14}{6}$.

Plotting $J_{MSE}(w)$ against $w_1$ shows a bowl shape with its minimum at $w_1 = 1$, which is the best-fitting line.
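The enumeration above is easy to reproduce. Below is a minimal Python sketch (not from the lecture) that computes $J_{MSE}$ for the same candidate values of $w_1$ on the toy data $(1,1), (2,2), (3,3)$ with $w_0$ fixed to 0.

```python
# Naive approach: enumerate candidate w1 values and compute J_MSE for each.
xs = [1, 2, 3]
ys = [1, 2, 3]
m = len(xs)

def j_mse(w1):
    """J_MSE(w) = 1/(2m) * sum((w1 * x - y)^2), with w0 fixed to 0."""
    return sum((w1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

for w1 in [0, 0.5, 1, 1.5, 2]:
    print(f"w1 = {w1:<4} J_MSE = {j_mse(w1):.4f}")
# w1 = 1 gives the smallest loss (0), matching the plot of J_MSE against w1.
```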
Linear Regression: Loss Landscape
[Figure: the loss surface $J_{MSE}(w_0, w_1)$ over both parameters, a convex bowl with a single minimum.]
Gradient Descent (Remember Hill-climbing?)
• Start at some $w$
• Pick a nearby $w$ that reduces $J(w)$:
$$w_j \leftarrow w_j - \gamma \frac{\partial J(w_0, w_1, \ldots)}{\partial w_j}$$
where $\gamma$ is the learning rate.
• Repeat until a minimum is reached
Gradient Descent: 1 Parameter
• Start at some $w_0$
• Pick a nearby $w_0$ that reduces $J(w_0)$:
$$w_0 \leftarrow w_0 - \gamma \frac{\partial J(w_0)}{\partial w_0}$$
where $\gamma$ is the learning rate.
• Repeat until a minimum is reached
[Figure: $J_{MSE}(w)$ against $w_0$ with the descent steps marked.]
As it gets closer to a minimum:
• The gradient becomes smaller
• The steps become smaller
Gradient Descent: 1 Parameter (effect of the learning rate $\gamma$)
• $\gamma$ too large: the steps overshoot the minimum, and the updates may oscillate or diverge.
• $\gamma$ too small: the steps are tiny, and convergence is very slow.
[Figures: $J_{MSE}(w)$ against $w_0$ with the descent steps for a learning rate that is too large and for one that is too small.]
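To make the update rule and the effect of the learning rate concrete, here is a small sketch (not from the slides) that runs 1-parameter gradient descent on $J_{MSE}(w_1)$ for the toy data from earlier. For this loss, $\partial J / \partial w_1 = \frac{1}{m} \sum_i (w_1 x^{(i)} - y^{(i)}) x^{(i)}$.

```python
# Gradient descent on J_MSE(w1) for the toy data, with different learning rates.
xs = [1, 2, 3]
ys = [1, 2, 3]
m = len(xs)

def grad(w1):
    """dJ/dw1 = 1/m * sum((w1 * x - y) * x)"""
    return sum((w1 * x - y) * x for x, y in zip(xs, ys)) / m

for gamma in [0.01, 0.1, 0.5]:    # too small, reasonable, too large
    w1 = 2.0                      # arbitrary starting point
    for _ in range(20):
        w1 = w1 - gamma * grad(w1)
    print(f"gamma = {gamma:<5} w1 after 20 steps = {w1:.4f}")
# gamma = 0.1 approaches the minimum at w1 = 1; gamma = 0.01 makes slow progress;
# gamma = 0.5 overshoots on every step and w1 diverges.
```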
Gradient Descent: 2 Parameters
• Start at some $w = (w_0, w_1)$
• Pick a nearby $w$ that reduces $J(w)$:
$$w_j \leftarrow w_j - \gamma \frac{\partial J(w_0, w_1)}{\partial w_j}$$
where $\gamma$ is the learning rate.
• Repeat until a minimum is reached
[Figure: the surface $J(w_0, w_1)$ with the descent path.]
Image Credit: https://www.researchgate.net/figure/The-2D-Caussian-function-of-Example-2-Example-1-Consider-the-following-strongly-convex_fig2_230787652
Gradient Descent: Common Mistakes
• Start at some $w$
• Pick a nearby $w$ that reduces $J(w)$:
$$w_j \leftarrow w_j - \gamma \frac{\partial J(w_0, w_1, \ldots)}{\partial w_j}$$
where $\gamma$ is the learning rate.
• Repeat until a minimum is reached

Wrong (sequential update): updating $w_0$ first changes the value used when computing $w_1$'s gradient.
$$w_0 = w_0 - \gamma \frac{\partial J(w_0, w_1)}{\partial w_0}$$
$$w_1 = w_1 - \gamma \frac{\partial J(w_0, w_1)}{\partial w_1} \quad (w_0 \text{ changed!})$$

Correct (simultaneous update): compute all gradients at the current $w$ first, then update.
$$a = \frac{\partial J(w_0, w_1)}{\partial w_0}, \qquad b = \frac{\partial J(w_0, w_1)}{\partial w_1}$$
$$w_0 = w_0 - \gamma a, \qquad w_1 = w_1 - \gamma b$$
Image Credit: https://python.plainenglish.io/logistic-regression-in-machine-learning-from-scratch-872b1fedd05b
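The difference between the sequential (wrong) and simultaneous (correct) update is easy to see in code. A minimal sketch, assuming a hypothetical helper `dJ_dw(w, j)` that returns $\partial J / \partial w_j$ evaluated at the current $w$:

```python
# Wrong: sequential update. w[0] is overwritten before w[1]'s gradient is
# computed, so w[1]'s gradient is evaluated at a partially-updated w.
def step_wrong(w, gamma, dJ_dw):
    w[0] = w[0] - gamma * dJ_dw(w, 0)
    w[1] = w[1] - gamma * dJ_dw(w, 1)   # uses the already-changed w[0]!
    return w

# Correct: simultaneous update. Compute all gradients at the current w first,
# then apply all updates.
def step_correct(w, gamma, dJ_dw):
    a = dJ_dw(w, 0)
    b = dJ_dw(w, 1)
    w[0] = w[0] - gamma * a
    w[1] = w[1] - gamma * b
    return w
```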
Linear Regression with Gradient Descent
[Figure sequence: left, the hypothesis $h_w(x)$ fitted to the housing data; right, the contour plot of $J_{MSE}(w_0, w_1)$. Each gradient descent step moves $(w_0, w_1)$ closer to the minimum of the contour plot, and the corresponding line fits the data better and better.]
Credit: Andrew Ng
Variants of Gradient Descent
$$w_j \leftarrow w_j - \gamma \frac{\partial J(w_0, w_1, \ldots)}{\partial w_j}, \qquad J_{MSE}(w) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right)^2$$
• (Batch) Gradient Descent: consider all training examples in each update
• Mini-batch Gradient Descent: consider a subset of training examples at a time; cheaper (faster) per iteration; randomness, may escape local minima*
• Stochastic Gradient Descent (SGD): select one random data point at a time; cheapest (fastest) per iteration; more randomness, may escape local minima*
Variants of Gradient Descent
[Figure: comparison of the three variants. Credit: analyticsvidhya.com]
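The three variants differ only in how many examples each update looks at. A rough sketch (an assumption of mine, not the lecture's reference code), using NumPy arrays `X` of shape `(m, n+1)` with a leading bias column and `y` of shape `(m,)`:

```python
import numpy as np

def gd_epoch(X, y, w, gamma, batch_size=None, rng=np.random.default_rng(0)):
    """One pass over the data.
    batch_size=None -> batch GD (a single update over all m examples),
    batch_size=1    -> stochastic GD,
    otherwise       -> mini-batch GD."""
    m = X.shape[0]
    if batch_size is None:                       # batch gradient descent
        grad = X.T @ (X @ w - y) / m
        return w - gamma * grad
    idx = rng.permutation(m)                     # shuffle for (mini-batch) SGD
    for start in range(0, m, batch_size):
        b = idx[start:start + batch_size]
        grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
        w = w - gamma * grad
    return w
```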
Escaping Local Minima / Plateaus in Non-Convex Optimization
[Figures: two non-convex loss surfaces $J(w_0, w_1)$ illustrating local minima and plateaus.]
Linear Regression with Many Attributes

HDB prices from SRX:
x0 (Bias) | x1 Year | x2 # bedrooms | x3 # bathrooms | x4 Size (m2) | y Price ($)
1 | 2016 | 4 | 2 | 113 | 560,000
1 | 1998 | 3 | 2 | 102 | 739,000
1 | 1997 | 3 | 0 | 100 | 430,000
1 | 2014 | 3 | 2 | 84 | 698,000
1 | 2016 | 3 | 0 | 112 | 688,888
1 | 1979 | 2 | 2 | 68 | 390,000
1 | 1969 | 2 | 1 | 53 | 250,000
1 | 1986 | 3 | 2 | 122 | 788,000
1 | 1985 | 3 | 3 | 150 | 680,000
1 | 2009 | 3 | 2 | 90 | 828,000

Hypothesis: $h_w(x) = w_0 x_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4$

Hypothesis (for $n$ features):
$$h_w(x) = \sum_{j=0}^{n} w_j x_j = \begin{bmatrix} w_0 & w_1 & \cdots & w_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = w^T x$$

Weight update (for $n$ features):
$$w_j \leftarrow w_j - \gamma \frac{\partial J_{MSE}(w_0, w_1, \ldots, w_n)}{\partial w_j} = w_j - \gamma \frac{1}{m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

Notation:
• $n$ = number of features
• $x^{(i)}$ = input features of the $i$-th training example
• $x_j^{(i)}$ = value of feature $j$ in the $i$-th training example
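The per-weight update above vectorizes naturally, since all partial derivatives can be computed at once. A minimal sketch under the same assumptions as before (NumPy, bias column of ones; the function name is illustrative):

```python
import numpy as np

def fit_linear_regression(X, y, gamma=0.01, n_iters=1000):
    """Batch gradient descent for h_w(x) = w^T x.
    X: (m, n+1) array whose first column is all ones (the bias x0).
    y: (m,) array of targets."""
    m, n_plus_1 = X.shape
    w = np.zeros(n_plus_1)
    for _ in range(n_iters):
        errors = X @ w - y                  # h_w(x^(i)) - y^(i) for all i
        grad = (X.T @ errors) / m           # one partial derivative per weight
        w -= gamma * grad                   # simultaneous update of all weights
    return w
```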
Dealing with Features of Different Scales

HDB prices from SRX:
x1 # bedrooms | x2 Size (m2) | y Price ($1K)
4 | 113 | 560
3 | 102 | 739
3 | 100 | 430
3 | 84 | 698
3 | 112 | 688
2 | 68 | 390
2 | 53 | 250
3 | 122 | 788
3 | 150 | 680
3 | 90 | 828

Hypothesis: $h_w(x) = w_0 x_0 + w_1 x_1 + w_2 x_2$. Simplify: set $w_0 = 0$, $w_1 = w_2 = 1$, so $h_w(x) = 0 \cdot x_0 + 1 \cdot x_1 + 1 \cdot x_2$.

Weight update: $w_j \leftarrow w_j - \gamma \frac{1}{m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

Taking a single step on the first example (with $\gamma = 1$):
$$h_w(x) = 4 + 113 = 117$$
$$w_1 \leftarrow 1 - 1 \cdot (117 - 560) \cdot 4 = 1 - 1 \times (-1{,}772) = 1{,}773$$
$$w_2 \leftarrow 1 - 1 \cdot (117 - 560) \cdot 113 = 1 - 1 \times (-50{,}059) = 50{,}060$$

The weight of the large-scale feature (size) changes far more than the weight of the small-scale feature (# bedrooms), so gradient descent takes a very lopsided path across the loss surface.
[Figure: contours of $J$ over $(w_1, w_2)$ with the negative-gradient step $\Delta w$.]

How to fix this?
Mean normalization: $x_i \leftarrow \dfrac{x_i - \mu_i}{\sigma_i}$, where $\sigma_i$ is the standard deviation.
Other methods of standardization also exist: min-max scaling, robust scaling, etc.
Other solution: use a separate learning rate $\gamma_i$ for each weight.
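Mean normalization takes only a couple of lines with NumPy. A hedged sketch (column-wise; a bias column of ones, if present, should be left untouched):

```python
import numpy as np

def mean_normalize(X):
    """Rescale each feature column: x_j <- (x_j - mu_j) / sigma_j.
    Returns the scaled data plus (mu, sigma) so the same transformation
    can be applied to new examples at prediction time."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Example on the two features above (# bedrooms, size in m^2):
X = np.array([[4, 113], [3, 102], [3, 100], [3, 84], [3, 112],
              [2, 68], [2, 53], [3, 122], [3, 150], [3, 90]], dtype=float)
X_scaled, mu, sigma = mean_normalize(X)   # both columns now on a comparable scale
```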
Dealing with Features of Different Scales
[Figures: contours of $J$ over $(w_1, w_2)$ and the gradient descent path, before and after feature scaling.]
Image Credit: Andrew Ng
Dealing with Non-Linear Relationship
[Figure: Exam Score ($y$) against Anxiety ($x$), a non-linear relationship.]
Which function? $h_w(x) = w_0 + w_1 x + w_2 x^2$ (Polynomial Regression)
Generally: $h_w(x) = w_0 + w_1 f_1 + w_2 f_2 + w_3 f_3 + \cdots + w_n f_n$ for transformed features $f_1, \ldots, f_n$.
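Polynomial regression amounts to transforming the input and then running ordinary linear regression on the new features. A small sketch (assuming NumPy; the function name is illustrative):

```python
import numpy as np

def polynomial_features(x, degree):
    """Map a single input x (shape (m,)) to columns [1, x, x^2, ..., x^degree],
    i.e. the transformed features f_1, ..., f_n fed to linear regression."""
    return np.column_stack([x ** d for d in range(degree + 1)])

# Example: a quadratic hypothesis h_w(x) = w0 + w1*x + w2*x^2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X_poly = polynomial_features(x, degree=2)   # columns: bias, x, x^2
# X_poly can now be passed to any linear-regression fitter (gradient descent
# or the normal equation); the model is still linear in the weights w.
```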
Normal Equation

$$X = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_n^{(1)} \\ 1 & x_1^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & & \vdots \\ 1 & x_1^{(m)} & \cdots & x_n^{(m)} \end{bmatrix}, \qquad w = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_n \end{bmatrix}, \qquad Y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$

The first column of $X$ is the bias, so $h_w(X) = Xw$.

Goal: find $w$ that minimizes $J_{MSE}$. Set $\dfrac{\partial J_{MSE}(w)}{\partial w} = 0$. After a bunch of math:
$$2X^T X w - 2X^T Y = 0$$
$$2X^T X w = 2X^T Y$$
$$X^T X w = X^T Y$$
$$w = (X^T X)^{-1} X^T Y \quad \text{(assume $X^T X$ is invertible)}$$

[Figure: the loss surface over $(w_0, w_1)$; the solution sits where the gradient is zero.]
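In code the normal equation is a single call. Solving the linear system $X^T X w = X^T Y$ is preferable to forming the inverse explicitly; a minimal sketch assuming NumPy:

```python
import numpy as np

def normal_equation(X, Y):
    """Solve X^T X w = X^T Y for w.
    X: (m, n+1) design matrix with a leading bias column of ones.
    Y: (m,) vector of targets. Assumes X^T X is invertible."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Mathematically equivalent to w = (X^T X)^{-1} X^T Y, but np.linalg.solve
# avoids computing the inverse explicitly.
```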
Gradient Descent vs Normal Equation
                              | Gradient Descent  | Normal Equation
Need to choose γ?             | Yes               | No
Iteration(s)                  | Many              | None
Large number of features n?   | No problem        | Slow: computing (X^T X)^{-1} is O(n^3)
Feature scaling?              | May be necessary  | Not necessary
Constraints                   | -                 | X^T X needs to be invertible
Summary
• Linear Regression: fitting a line to data
• Gradient Descent
• Gradient Descent Algorithm: follow the negative gradient to reduce the error
• Linear Regression with Gradient Descent: convex optimization, one minimum
• Variants of Gradient Descent: batch, mini-batch, stochastic
• Linear Regression: Challenges and Solutions
• Linear Regression with Many Attributes: $h_w(x) = \sum_{j=0}^{n} w_j x_j = w^T x$
• Dealing with Features of Different Scales: normalize!
• Dealing with Non-Linear Relationship: transform features
• Normal Equation: analytically find the best parameters
Coming Up Next Week
• Logistic Regression
• Gradient Descent
• Multi-class classification
• Non-linear decision boundary
• (More) Performance Measure
• Receiver Operating Characteristic (ROC)
• Area under ROC (AUC)
• Model Evaluation & Selection
• Bias & Variance
To Do
• Lecture Training 5
• +100 Free EXP
• +50 Early bird bonus
• Problem Set 4
• Out today!