
228371

Statistical Modelling for Engineers & Technologists


LECTURE SET 2: Regression
Lecture 1 (Monday 11am) Scatter plots, Regression equations   ← We are here
Lecture 2 (Weds 8am) Errors & Residual Analysis
Lecture 3 (Thurs 12pm) Testing & Interpretation
Lab (2 hours on Friday) “Comparing means and simple linear regression”

Lab 1 (Friday 8am – 10am) OR Lab 2 (Friday 12pm – 2pm)


228371 Lecture set 2: Regression
Lecture 1: Scatter plots, Regression equations

Regression → Fitting Equations to Data

First → Look at the data → Scatterplot

• Consider quantitative data in pairs (x, y)


• x and y need not be in the same units, but are related in some way
• We want to see if there is a relationship between x and y

1. Plot the data using a scatter plot: plot(x, y)

2. Interrogate the data – look for trends (linear?), gaps, outliers, etc.
228371 Lecture set 2: Regression
Lecture 1: Scatter plots, Regression equations

Example: testmarks.txt
1. Plot the data using a scatter plot: plot(x, y)
data<- read.table("testmarks.txt", header = TRUE)
plot(data$maths, data$english, xlim = c(0, 100), ylim = c(0, 100))

Note:
View(data)
228371 Lecture set 2: Regression
Lecture 1: Scatter plots, Regression equations

Example: testmarks.txt
2. Interrogate the data – look for trends (linear?), gaps, outliers, etc.
abline(lm(data$english ~ data$maths), col = "green", lwd = 3)

Trend? POSITIVE, LINEAR-ish?


Gaps? Maybe in maths 60 to 70 ish?
Outliers? Maybe one?
228371 Lecture set 2: Regression
Lecture 1: Scatter plots, Regression equations

Example: testmarks.txt
2. Interrogate the data – look for trends (linear?), gaps, outliers, etc.
abline(lm(data$english ~ data$maths), col = "green", lwd = 3)

• For linear regression, the variability in y is assumed constant as x changes
  (i.e., VERTICAL scatter about the trend)

• Make sure you have your PREDICTOR and RESPONSE variables the right way around

• Can also use boxplots to examine the marginal distribution of each variable
  R → boxplot(data$maths, data$english)
228371 Lecture set 2: Regression
Lecture 1: Scatter plots, Regression equations

Example: testmarks.txt
2. Interrogate the data – look for trends (linear?), gaps, outliers, etc.
abline(lm(data$maths ~ data$english), col = "red", lwd = 3, lty = 2)

Green (solid) line:  PREDICTOR: Maths,   RESPONSE: English
Red (dashed) line:   PREDICTOR: English, RESPONSE: Maths
228371 Lecture set 2: Regression
Lecture 1: Scatter plots, Regression equations

Displaying groups, Example: pulse.txt


1. Plot the data using a scatter plot: plot(x, y)
pulse <- read.table("pulse.txt", header = TRUE)
plot(pulse$pulse1, pulse$pulse2, pch = pulse$ran, col = pulse$ran)
legend("topleft", c("Ran", "Did not Run"), pch = unique(pulse$ran), col = unique(pulse$ran))

Note:
attach(pulse)
….
detach(pulse)

Not sure what is attached?
search()
228371 Lecture set 2: Regression
Lecture 1: Scatter plots, Regression equations

Lowess Smoother – to help identify a trend


horsehearts <- read.table("horsehearts.txt", header = TRUE)
attach(horsehearts)
plot(weight ~ extsys)
lines(lowess(weight ~ extsys))
detach(horsehearts)

Note:
plot(weight, extsys) VS plot(weight ~ extsys)
plot(x, y) takes the x-axis variable first, then the y-axis variable;
the formula version is read as dependent ~ independent (response ~ predictor),
so in plot(weight ~ extsys) weight goes on the y-axis.
228371 Lecture set 2: Regression
Lecture 1: Scatter plots, Regression equations

Scatterplot Matrix
pines <- read.table("pines.txt", header = TRUE)
pairs(pines[2:5], col = pines$area, pch = pines$area)

pairs() shows scatterplots of each pair of variables

plot(pines$third ~ pines$second)
228371 Lecture set 2: Regression
Lecture 1: Scatter plots, Regression equations

Correlation Coefficient, rxy

rxy measures the strength of a linear relationship between two variables X and Y

-1 ≤ rxy ≤ 1

rxy = 0 There is NO linear relationship between X and Y

|rxy|=1 There is a perfect linear relationship between X and Y

+ve rxy Indicates positive relationship (positive slope) X↑ Y↑

-ve rxy Indicates negative relationship (negative slope) X↑ Y↓

$r_{xy} = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}$
228371 Lecture set 2: Regression
Lecture 1: Scatter plots, Regression equations

Correlation Coefficient, rxy Example: pines.txt


pines <- read.table("pines.txt", header = TRUE)
cor(pines[2:5])

Diagonal == 1
Symmetric about the diagonal
Correlation NOT causation

cor.test(pines$top, pines$first)
228371 Lecture set 2: Regression
Lecture 1: Scatter plots, Regression equations

Our Aim

An equation (mathematical model) describing the relationship between a response
variable and one or more explanatory variables.

The model can be a straight line (linear model), or a curved function (e.g., polynomial).

We need to fit whichever model we choose by estimating the model parameters.


228371 Lecture set 2: Regression
Lecture 1: Scatter plots, Regression equations

Terminology

y response variable (dependent)


x1, x2, x3, …, xp explanatory / predictors / covariates / regressor variables (independent)

Regression models: statistical determination of response ~ predictor variables

- We will only consider Normal/Gaussian distributions: $y \sim N(\mu, \sigma^2)$

- $y \sim N(\beta_0 + \beta_1 x, \sigma^2)$, or (equivalently) $y = \beta_0 + \beta_1 x + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$


228371 Lecture set 2: Regression
Lecture 1: Scatter plots, Regression equations

Mechanistic Models (Deterministic Relationships)

Ohm’s Law: V = IR (Voltage = Current × Resistance)

If we don’t know the resistance R, but we do know V and I, we can directly calculate R → R = V/I

If we didn’t know about Ohm’s law, we might want to fit a model instead:

$y = \beta x + \epsilon \;\rightarrow\; V = R\,I + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$

Usually try simplest model first (linear), and then make them more complicated if needed.
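As an illustration (not from the notes): a minimal R sketch of fitting the Ohm's-law model to simulated data, forcing the line through the origin because the mechanistic model has no intercept. The variable names and the "true" resistance of 4.7 ohms are made up for the example.

set.seed(1)
I_amps  <- runif(20, 0.1, 2)                    # hypothetical current measurements (A)
V_volts <- 4.7 * I_amps + rnorm(20, sd = 0.2)   # simulated voltages: "true" R = 4.7 ohms, plus noise
ohm_fit <- lm(V_volts ~ 0 + I_amps)             # ~ 0 + I_amps: no intercept, matching V = R*I
coef(ohm_fit)                                   # the slope estimates R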
228371 Lecture set 2: Regression
Lecture 1: Scatter plots, Regression equations

Empirical Models

- If there is no pre-conceived notion of the form of the relationship → find an EMPIRICAL MODEL
- Do you know anything about the underlying physical mechanism?
- Predict the response as accurately as possible (e.g., instrument calibration curves)
- Start SIMPLE (linear)

Errors

- Reality is messy
- Response and explanatory variables rarely satisfy a mathematical equation exactly
- Experimental situation
- Measurement errors
- Additional unrecorded factors
- Observational data
- Inherent variability
- Even more unrecorded factors
228371 Lecture set 2: Regression
Lecture 1: Scatter plots, Regression equations

Linear Models
[Linear models don’t have to be “straight lines”]
In a linear model, the mean response is linear in the parameters, e.g.,

$y = \beta_0 + \beta_1 x + \epsilon$

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2^2 + \epsilon$

$y = \beta_1 \sin(x_1) + \beta_2 \log(x_2 / x_3) + \beta_3 x_4 x_5 + \epsilon$

$\log(y) = \beta_0 + \beta_1 x + \epsilon$
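A hedged sketch of how such models are fitted in R, assuming a hypothetical data frame dat with columns y, x1 and x2: all three calls are linear models because the β's enter linearly, even where the predictors or the response are transformed.

fit1 <- lm(y ~ x1 + x2, data = dat)        # y = b0 + b1*x1 + b2*x2 + e
fit2 <- lm(y ~ x1 + I(x2^2), data = dat)   # I() protects the squared term inside a formula
fit3 <- lm(log(y) ~ x1, data = dat)        # transformed response, still linear in the parameters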
228371 Lecture set 2: Regression
Lecture 1: Scatter plots, Regression equations

Non-Linear Models

In a non-linear model, the parameters can enter the equation in any form, e.g.,

$y = \beta_0 e^{-\beta_1 x} + \epsilon$   or   $y = \dfrac{\beta_0 + \beta_1 x_1}{\beta_2 x_2 + \beta_3 x_3^2} + \epsilon$

Parameter estimation is much easier for LINEAR models


→ However, mechanistic models are often non-linear
→ So, we try to linearize equations as much as possible by taking logs, etc.
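For example (a sketch, assuming a hypothetical data frame dat with positive y): the non-linear model $y = \beta_0 e^{\beta_1 x} + \epsilon$ can be linearized by taking logs, $\log(y) = \log(\beta_0) + \beta_1 x$, which lm() can then fit. Note that the errors then become additive on the log scale, which is itself a modelling assumption.

lin_fit <- lm(log(y) ~ x, data = dat)   # fit the linearized model
b1_hat  <- coef(lin_fit)[2]             # estimate of beta1
b0_hat  <- exp(coef(lin_fit)[1])        # back-transform the intercept to recover beta0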
228371 Lecture set 2: Regression
Lecture 1: Scatter plots, Regression equations

Regression Analysis

Regression is the tendency of the response variable (y) to vary with one or more explanatory variables (x).

The regression equation describes this relationship mathematically.

Simple regression: one explanatory (or predictor) variable (x)

$y_i = \mu_{y|x} + \epsilon_i$
if linear: $\mu_{y|x} = \beta_0 + \beta_1 x_i$
& the fitted model is also a straight line: $\hat{y}_i = b_0 + b_1 x_i$

Note: $\beta_0, \beta_1$ are unknown model parameters; $b_0, b_1$ are statistics calculated from sample data.

Multiple regression: more than one explanatory (or predictor) variable (x1,…, xp)

The word Regression was first used by Francis Galton (late 1800s) to describe the tendency of tall fathers to
have not-so-tall sons (“regression towards the mean”).
228371 Lecture set 2: Regression
Lecture 1: Scatter plots, Regression equations

Further assumptions for inference

- $y_i$ is normally distributed
- $\mathrm{var}[y_i]$ is constant ($= \sigma^2$), i.e., does not change with x
- $\epsilon_i \sim N(0, \sigma^2)$, independent and identically distributed – iid

Combining this with the linearity of the regression mean, we can build models that look like:

$y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$
Prediction errors: $\epsilon_i = y_i - (\beta_0 + \beta_1 x_i)$

BUT because $\beta_0, \beta_1$ are unknown, so are the errors. So, we ESTIMATE them as well, using $b_0, b_1$:
$\hat{y}_i = b_0 + b_1 x_i$
Prediction errors are estimated with residuals: $e_i = y_i - \hat{y}_i$
228371 Lecture set 2: Regression
Lecture 1: Scatter plots, Regression equations

Graphical depiction of regression assumptions

$y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$
228371
Statistical Modelling for Engineers & Technologists
LECTURE SET 2: Regression
Lecture 1 (Monday 11am) Scatter plots, Regression equations
Lecture 2 (Weds 8am) Errors and Residual Analysis   ← We are here
Lecture 3 (Thurs 12pm) Testing & Interpretation
Lab (2 hours on Friday) “Comparing means and simple linear regression”

Lab 1 (Friday 8am – 10am) OR Lab 2 (Friday 12pm – 2pm)


228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

How to fit the line – Least-Squares Concept

The least-squares regression line is given by the values of $b_0$ and $b_1$ that minimize:

$\sum_{i=1}^{n} e_i^2$

“the sum of the squared residuals”


228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Residual: $e_i = y_i - \hat{y}_i$
How to fit the line – Least-Squares Concept

Points above the fitted line have positive residuals ($y_i - \hat{y}_i > 0$);
points below the line have negative residuals ($y_i - \hat{y}_i < 0$).

The residuals are squared, and summed…
228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Sum of Squared Residuals

Total Sum of Squares:      $SS_{Total} = \sum_{i=1}^{n} (y_i - \bar{y})^2$

Residual Sum of Squares:   $SS_{Res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Regression Sum of Squares: $SS_{Reg} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$

Sum of Squares Identity:   $SS_{Total} = SS_{Reg} + SS_{Res}$

[Mnemonic: $y_i$ – no disguise → true value; $\hat{y}_i$ – with a “hat” → estim-HAT?; $\bar{y}$ – with a bar → mean]


228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Total Sum of Squares: $SS_{Total} = \sum_{i=1}^{n} (y_i - \bar{y})^2$

228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Residual Sum of Squares: $SS_{Res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Residual: $e_i = y_i - \hat{y}_i$

228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Regression Sum of Squares: $SS_{Reg} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$

228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Sum of Squared Residuals

Total Sum of Squares:      $SS_{Total} = \sum_{i=1}^{n} (y_i - \bar{y})^2$    → How much do our DATA vary

Residual Sum of Squares:   $SS_{Res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$    → How good is our MODEL

Regression Sum of Squares: $SS_{Reg} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$    → Difference between the MODEL and no regression

Sum of Squares Identity:   $SS_{Total} = SS_{Reg} + SS_{Res}$
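A small sketch verifying the identity numerically for any fitted lm object (fit here is a placeholder for a model such as the Steam or particleboard fits later in this lecture):

y_obs <- model.response(model.frame(fit))    # observed responses used in the fit
y_hat <- fitted(fit)                         # fitted values
SS_Total <- sum((y_obs - mean(y_obs))^2)
SS_Res   <- sum((y_obs - y_hat)^2)           # same as sum(residuals(fit)^2)
SS_Reg   <- sum((y_hat - mean(y_obs))^2)
all.equal(SS_Total, SS_Reg + SS_Res)         # TRUE, up to floating-point error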
228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Recap: Important Concepts

Errors, $\epsilon_i$: random variables whose values cannot be determined exactly

Residuals, $e_i = y_i - \hat{y}_i$: approximate the errors

Fitted values, $\hat{y}_i$: predict $y_i$ from $x_i$ using $b_0$ and $b_1$

Least-squares: estimates $b_0$ and $b_1$ which minimise the sum of squared residuals

APART FROM the y’s:
Regular letters – estimated model, from sample data
Greek letters – the “real” relationship that we are trying to estimate
228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Formulae you do not need

[R/Minitab will do the calculations for you]

$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$

$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$

$b_1 = \dfrac{S_{xy}}{S_{xx}}$

$b_0 = \bar{y} - b_1 \bar{x}$
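For the curious, a sketch of these formulae computed directly, assuming numeric vectors x and y (hypothetical data); the results should agree with coef(lm(y ~ x)).

Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
b1  <- Sxy / Sxx                  # least-squares slope
b0  <- mean(y) - b1 * mean(x)     # least-squares intercept
c(b0, b1)
coef(lm(y ~ x))                   # should match b0 and b1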
228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Examples: Steam.csv – Look at the data


Q: Is the monthly steam consumption in a chemical plant a function of average operating temperature?
   [RESPONSE: steam consumption; EXPLANATORY: temperature]

Steam <- read.csv("Steam.CSV")
plot(SteamUse ~ Temp, data = Steam)

Note: data = Steam is an alternative to attach()/detach() or the Steam$ prefix.
228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Examples: Steam.csv – Build a model


Q: Is the monthly steam consumption in a chemical plant a function of average operating temperature?

Straight line as tentative model: $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$

model1 <- lm(SteamUse~Temp, data = Steam)


summary(model1)
228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Examples: Steam.csv – Look at the model


Q: Is the monthly steam consumption in a chemical plant a function of average operating temperature?

Straight line as tentative model: $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$

coef(model1)
plot(Steam$SteamUse ~ Steam$Temp)
abline(coef(model1), lty = 2)

Negative linear relationship


228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Examples: Steam.csv – Evaluate the model


Q: Is the monthly steam consumption in a chemical plant a function of average operating temperature?

The proportion of variation explained by the fit is called R-squared, and is given by:

$R^2 = SS_{Reg} / SS_{Total} = 1 - SS_{Res} / SS_{Total}$

where $SS_{Total} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ and $SS_{Res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$


228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Examples: Steam.csv – Evaluate the model


Q: Is the monthly steam consumption in a chemical plant a function of average operating temperature?

$R^2 = SS_{Reg} / SS_{Total} = 1 - SS_{Res} / SS_{Total}$

If all the data lie on a straight line, then $SS_{Res} = 0$ and $R^2 = 1$: the fit explains everything – we have a perfect model.

Good fit: small $SS_{Res}$, and large $R^2$.

How large? Context specific; people often use $R^2 \ge 0.5$ (50%).

In our example: $R^2 \approx 71\%$


summary(model1)$r.squared
228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Residual Analysis
Recall that residual = observed – fitted: $e_i = y_i - \hat{y}_i$

Looking at the residuals shows how well the fit explains the systematic variation in the data.

Ideally, a plot of residuals against fitted values should be random scatter about 0 of constant size.

plot(residuals(model1)~fitted(model1))
abline(h=0)
228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Residual Analysis
Recall that residual = observed – fitted: $e_i = y_i - \hat{y}_i$

Looking at the residuals shows how well the fit explains the systematic variation in the data.
Ideally, a plot of residuals against fitted values should be random scatter about 0 of constant size.
plot(residuals(model1) ~ fitted(model1))
abline(h = 0)

[Plot] Shows random scatter of constant width

Note: can also plot with
plot(model1$residuals ~ model1$fitted.values)
228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Residual Analysis
Recall that residual = observed – fitted: $e_i = y_i - \hat{y}_i$

If the scatter does NOT have constant width and you can see some sort of pattern, this may suggest a transformation
of the y variable.

[Plot] Does NOT show random scatter of constant width – a pattern is visible.

Scatter increases with the fitted values → try a shrinking transformation of y, e.g., log or square root.

“Heteroscedasticity”
228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Examples: Particleboard.csv – Look at the data


Q: Is the stiffness of a board a function of its density?
   [RESPONSE: stiffness; EXPLANATORY: density]
board <- read.csv("Particleboard.csv")
plot(Stiffness ~ Density, data = board)
228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Examples: Particleboard.csv – Build a model


Q: Is the stiffness of a board a function of its density?

Straight line as tentative model: $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$

model1 <- lm(Stiffness ~ Density, data = board)
abline(model1)        # Note: same as abline(coef(model1))

- Strong positive relationship
- Increasing variance
- Linear? Exponential? Transform?


228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Examples: Particleboard.csv – Build a model


Q: Is the stiffness of a board a function of its density?

No known physical law, so we need an empirical one.

Straight line as tentative model: $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$

Could also try: $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i$

Or maybe a log transform will help, and improve the increasing variance issue? (See the sketch below.)

Caution: when we move to empirical models, we can’t really use them to extrapolate.
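A sketch of the two alternatives mentioned above, using the same column names as the rest of this example (the quadratic model is purely illustrative; the log-log model is the one pursued on the following slides):

quad_model <- lm(Stiffness ~ Density + I(Density^2), data = board)   # adds a squared term
log_model  <- lm(log(Stiffness) ~ log(Density), data = board)        # log transform of both variables
summary(quad_model)$r.squared
summary(log_model)$r.squared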
228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Examples: Particleboard.csv – Evaluate the model


Q: Is the stiffness of a board a function of its density?
summary(model1)$r.squared
[1] 0.8447017
plot(residuals(model1) ~ fitted(model1))
abline(h = 0)

[Residual plot] Does NOT show random scatter of constant width – a pattern is visible.

Scatter increases with the fitted values → try a shrinking transformation of y, e.g., log or square root.

“Heteroscedasticity”
228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Examples: Particleboard.csv – log transform


Q: Is the stiffness of a board a function of its density?

plot(log(Stiffness)~log(Density), data = board)


228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Examples: Particleboard.csv – log transform


Q: Is the stiffness of a board a function of its density?

log_model<- lm(log(Stiffness)~log(Density), data = board)


abline(log_model)
228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Examples: Particleboard.csv – log transform


Q: Is the stiffness of a board a function of its density?
summary(log_model)$r.squared
[1] 0.9132671
plot(residuals(log_model)~fitted(log_model))
abline(h=0)

[Residual plot] Shows random scatter of constant width
228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Look for Absence of Pattern in Residuals

The real world is messy → we need statistics.

Statistics is not just about modelling the central curve, it’s about modelling the variability as well.

If the residuals still show some sort of pattern, then the model may be able to be improved.

A model is “good”, when the residuals are totally random.


228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Examples: Particleboard.csv – regular vs log transform


summary(model1)
summary(log_model)

The transformation has increased $R^2$.

NB: cannot compare residual standard errors here because the units have changed (log scale).
228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Transformations: Fixing Linearity and Normality     [$y^*$: transformed → into a star]

Power transformations can: stretch large values (good for left-skewed data), or
shrink large values (good for right-skewed data).

$y^* = \mathrm{sign}(\lambda)\, y^{\lambda}$  if $\lambda \neq 0$
or, $y^* = \log(y)$  if $\lambda = 0$

where $\lambda$ is the power you’re transforming by, and
sign($\lambda$) is a function that is +1 if $\lambda > 0$, and −1 if $\lambda < 0$.

Alternatively, this can be written as:

$y^* = y^{\lambda}$  if $\lambda > 0$
or, $y^* = \log(y)$  if $\lambda = 0$
or, $y^* = -y^{\lambda}$  if $\lambda < 0$
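Written as a small R helper (a sketch; the function name is ours, not a built-in):

power_transform <- function(y, lambda) {
  if (lambda == 0) {
    log(y)                        # lambda = 0 corresponds to the log transform
  } else {
    sign(lambda) * y^lambda       # the sign keeps the ordering of y when lambda < 0
  }
}
# e.g. shrinking large values of a right-skewed, positive variable:
# power_transform(board$Stiffness, 0.5)    # square root
# power_transform(board$Stiffness, -1)     # negative reciprocal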
228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Skewness
[Direction is based on where the longer tail is]

[Figure: left-skewed distribution – long tail on the left; right-skewed distribution – long tail on the right]
228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

Skewness
[Different skewness gets treated differently]

[Figure: left-skewed data → stretch large values; right-skewed data → shrink large values]
228371
Statistical Modelling for Engineers & Technologists
LECTURE SET 2: Regression

Lecture 1 (Monday 11am) Scatter plots, Regression equations
Lecture 2 (Weds 8am) Errors and Residual Analysis
Lecture 3 (Thurs 12pm) Testing & Interpretation   ← We are here
Lab (2 hours on Friday) “Comparing means and simple linear regression”

Lab 1 (Friday 8am – 10am) OR Lab 2 (Friday 12pm – 2pm)


228371 Lecture set 2: Regression
Lecture 3: Testing and Interpretation

Testing Coefficients
Regression model: $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$

If the regression coefficient of a variable, $\beta_i$, is zero, then changes in that variable do not affect the response variable.
[If $\beta_1 = 0$, then $\beta_1 x_i$ is always zero and $x_i$ has no influence on $y_i$]
[If $\beta_1 \neq 0$, then $\beta_1 x_i$ is always SOMETHING and $x_i$ has SOME effect on $y_i$]

We want to know – is this effect statistically significant?

The hypotheses tested are: $H_0: \beta_i = 0$ vs $H_a: \beta_i \neq 0$     Test statistic: $t = \dfrac{\text{Estimate} - 0}{\text{Standard Error}}$

summary(model1)$coefficients

If the p-value is low, the null must go (p < 0.05).

The p-value, Pr(>|t|), is the probability of getting an estimate as extreme (as far from zero) as we did, given $H_0$ is true,
i.e., the coefficient is zero.

Here we can reject the null hypothesis for both the intercept, $\beta_0$, and the slope coefficient, $\beta_1$.
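As a check on where these numbers come from (a sketch, assuming model1 is the Steam model fitted in Lecture 2), the slope's t statistic and p-value can be reproduced from the coefficient table:

coefs   <- summary(model1)$coefficients
t_slope <- coefs["Temp", "Estimate"] / coefs["Temp", "Std. Error"]
p_slope <- 2 * pt(abs(t_slope), df = df.residual(model1), lower.tail = FALSE)
c(t_slope, p_slope)        # should match the Temp row of summary(model1)$coefficients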
228371 Lecture set 2: Regression
Lecture 3: Testing and Interpretation

Interpretation of Intercept
The intercept, $\beta_0$, is the expected response when x = 0 (where the line crosses the y-axis).

print(model1)

Expected steam use at 0°F is 13.6 lb/month.

The intercept is often meaningless to interpret if x = 0 lies outside the data range.


228371 Lecture set 2: Regression
Lecture 3: Testing and Interpretation

Interpretation of Slope
The slope, $\beta_1$, is the expected change in the response when x increases by 1.

print(model1)

We estimate that steam use will change by −0.07983 lb/month for each 1°F increase in average temperature.

Common sense check: a decrease in steam use as average temperature increases, which matches the plot.
228371 Lecture set 2: Regression
Lecture 3: Testing and Interpretation

Interpretation of Slope – with transforms

(1) $y = \beta_0 + \beta_1 x$:  if x is increased by 1 unit, y changes by an addition of $\beta_1$ units.

(2) $\log(y) = \beta_0 + \beta_1 x$:  if x is increased by 1 unit, y changes by a factor of $e^{\beta_1}$.

(3) $\log(y) = \beta_0 + \beta_1 \log(x)$:  if x is multiplied by a factor of 2, y changes by a factor of $2^{\beta_1}$.

If you’ve transformed your data, be careful with your slope interpretations.
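For instance (a sketch, using the log-log particleboard model log_model fitted in Lecture 2, which is case (3) above):

b1 <- coef(log_model)["log(Density)"]
2^b1       # factor by which Stiffness is estimated to change when Density doubles
1.1^b1     # factor for a 10% increase in Density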


228371 Lecture set 2: Regression
Lecture 3: Testing and Interpretation

F-test
There is an overall test to see if the model is useful in predicting change in the response.

This is called the F-test; it uses Mean Squares (MS), which are related to the Sums of Squares.

For a model with p explanatory variables:

F-test statistic: $F = \dfrac{MS_{Reg}}{MS_{Res}} = \dfrac{SS_{Reg}/p}{SS_{Res}/(n-p-1)}$

The null hypothesis is that the model is NOT significant; under it, the test statistic follows an F distribution
with p and n−p−1 degrees of freedom.
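A sketch assembling the F statistic from the sums of squares, assuming fit is a fitted lm with an intercept (for an intercept model the mean of the fitted values equals $\bar{y}$, so $SS_{Reg}$ can be computed from the fitted values):

p <- length(coef(fit)) - 1                          # number of explanatory variables
n <- nobs(fit)
SS_Res <- sum(residuals(fit)^2)
SS_Reg <- sum((fitted(fit) - mean(fitted(fit)))^2)
F_stat <- (SS_Reg / p) / (SS_Res / (n - p - 1))
F_stat
summary(fit)$fstatistic                             # first element should agree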
228371 Lecture set 2: Regression
Lecture 3: Testing and Interpretation

F-test
For simple linear regression (one explanatory variable), the F-test is equivalent to the t-test for the slope: $t^2 = F$.

When you have more than one explanatory variable (multiple regression), the F-test is different.

summary(model1)
You run your test, get the results, look at the p-value,
if the p-value is < 0.05, we can say “the model
explains a significant amount of variation in y”
228371 Lecture set 2: Regression
Lecture 3: Testing and Interpretation

Prediction and Estimation


Regression model: $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
Estimated model: $\hat{y}_i = b_0 + b_1 x_i$, with residuals: $e_i = y_i - \hat{y}_i$

Once we have our model:

- For any value $x_0$, the least-squares line gives the fitted value $b_0 + b_1 x_0$
- This estimates both the mean response, $\mu_{y|x}$, and an actual (individual) response, $y_0$

- We use confidence intervals to estimate mean responses
- We use prediction intervals to predict individual responses

- Confidence intervals address uncertainty in the line location
- Prediction intervals address uncertainty in the line location AND the variation of individual values
  around the line
  Thus, prediction intervals are larger than confidence intervals.

- As $x_0$ gets further from the mean of x, both prediction and confidence intervals get larger.
  [Avoid extrapolation]
228371 Lecture set 2: Regression
Lecture 3: Testing and Interpretation

Prediction and Estimation

Prediction and confidence intervals can be found with R:

PI12 <- predict(model1, data.frame(Density = 12), interval = "prediction")
PI12

CI12 <- predict(model1, data.frame(Density = 12), interval = "confidence")
CI12

Note: predict() expects a data.frame object for the new values it is predicting at.
Here, we get around this by creating a data.frame of one datum using data.frame(Density = 12).
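A sketch extending this to whole bands across the data range, assuming model1 here is the particleboard fit lm(Stiffness ~ Density, data = board) used in the predict() calls above:

newdat <- data.frame(Density = seq(min(board$Density), max(board$Density), length.out = 100))
ci_band <- predict(model1, newdat, interval = "confidence")
pi_band <- predict(model1, newdat, interval = "prediction")
plot(Stiffness ~ Density, data = board)
lines(newdat$Density, ci_band[, "fit"])
lines(newdat$Density, ci_band[, "lwr"], lty = 2)   # confidence band (narrower)
lines(newdat$Density, ci_band[, "upr"], lty = 2)
lines(newdat$Density, pi_band[, "lwr"], lty = 3)   # prediction band (wider)
lines(newdat$Density, pi_band[, "upr"], lty = 3)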
228371 Lecture set 2: Regression
Lecture 3: Testing and Interpretation

Assessing and Correcting Lack of Fit

Assumptions:

1. The relationship is linear

2. Constant Variance: The variance of the response is constant

3. Uncorrelated Errors: Errors are uncorrelated, particularly serially

4. Normal Errors: Errors are normally distributed

If we are going to assume something, we need to assess its validity


228371 Lecture set 2: Regression
Lecture 3: Testing and Interpretation

Assumption 1: The relationship is linear

Assess: by plotting
- Plot the data
- Plot the errors (residuals) vs the explanatory variable(s)

Check: If you can see any patterns → there is some form of non-linearity

Fix: Add in some polynomial terms (dealt with later), or transform the data
228371 Lecture set 2: Regression
Lecture 3: Testing and Interpretation

Assumption 2: Constant Variance

Assess: by plotting
- Plot the data
- Plot the errors (residuals) vs the fitted values

Check: If the variability of y increases as x increases (common) or decreases
→ the variance of the response is not constant.

Fix: Some variance-stabilising transformations exist;
if the error standard deviation is proportional to x, try taking logs.
228371 Lecture set 2: Regression
Lecture 3: Testing and Interpretation

Assumption 3: Uncorrelated Errors

- Most often violated when observations are collected sequentially
- There may be a variable, not included in the model, which changes over time

Assess: - Plot the data
        - Plot the errors (residuals) vs their “order”
        - Durbin-Watson test (see the sketch below)

Check: Look for unusual runs; a Durbin-Watson statistic far from 2 → correlated errors

Fix: Data cleaning? Hard to correct after the fact.
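A sketch of both checks in R; the Durbin-Watson test needs the lmtest package, which is not part of base R (install it first if needed):

# install.packages("lmtest")
library(lmtest)
dwtest(model1)                          # statistic near 2 suggests little serial correlation
plot(residuals(model1), type = "b")     # residuals in data order: look for runs
abline(h = 0, lty = 2)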


228371 Lecture set 2: Regression
Lecture 3: Testing and Interpretation

Assumption 4: Normal Errors

Assess: Plot Normal probability plots (q-q plots)

Check: Look for departures from the Normal distribution;
for smaller sample sizes, the measured residuals may look non-normal even when the errors are normal.

Fix: Model errors SHOULD be normally distributed;
HOWEVER, some violation of this assumption is usually OK.
228371 Lecture set 2: Regression
Lecture 3: Testing and Interpretation

board <- read.csv(“Particleboard.csv")


Useful Plots of Residuals model1 <- lm(Stiffness~Density, data = board)

Residuals vs Fitted values: plot(residuals(model1) ~ fitted(model1))

Histogram of residuals: hist(residuals(model1))

q-q plot of residuals: qqnorm(residuals(model1))

Plot of residuals vs order of the data: plot(residuals(model1)) # assumes data index is collection order

Plot separately in R, or use plot.lm() command which will plot a similar set: plot.lm(model1)

Note: All tests based on F, t, etc., requires normality of residuals (F more than t).
228371 Lecture set 2: Regression
Lecture 3: Testing and Interpretation

Residuals vs Fitted values


res1 <- residuals(model1)
fits1 <- fitted(model1)
plot(res1 ~ fits1, main = "Residuals vs Fitted values")
abline(h = 0, lty = "dashed")
lines(lowess(res1 ~ fits1))

Note: can also plot via


plot(residuals(model1) ~ fitted(model1))
228371 Lecture set 2: Regression
Lecture 3: Testing and Interpretation

Histogram of Residuals
hist(res1, col = "cornsilk", main = "Histogram of residuals")

Note: can add breaks = n to change the number of bins


228371 Lecture set 2: Regression
Lecture 3: Testing and Interpretation

Quantile-Quantile Plot of Residuals


qqnorm(res1, main = "Normal q-q plot of residuals")
qqline(res1)

Note: can do a normality test if worried:
shapiro.test(res1)
228371 Lecture set 2: Regression
Lecture 3: Testing and Interpretation

Identifying Influential Points


Influential points are observations that, when removed, have a large effect on the regression model and
its coefficients.

They can be influential because:
  the X value is unusual (high leverage), OR
  the residual is unusually large, OR
  both.

plot(model1, which = 5)
[plot number 5, please: residuals vs leverage]
228371 Lecture set 2: Regression
Lecture 3: Testing and Interpretation

Identifying Influential Points


Need to be wary of over-interpreting regressions with influential points:
You can end up building a model on an outlier
228371 Lecture set 2: Regression
Lecture 3: Testing and Interpretation

Leverage and Influence

- Outliers in Y are not necessarily influential

- High-leverage observations in X are not necessarily influential

- Influential observations are not necessarily outliers

- In R, we find the leverages ourselves. Rule of thumb: points with more than three times the average
  leverage ($h_{ii} > 3p/n$) are having undue influence.

- There are different ways to measure influence: DFITS, Cook’s distance, etc.

- DFITS: change in the fitted values if you remove the data point

- Cook’s distance: change in the regression coefficients if you remove the data point
  (see the sketch below)
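A sketch of these quantities in base R, using the rule of thumb above (assumes a fitted model model1):

h <- hatvalues(model1)             # leverages h_ii
p <- length(coef(model1))          # number of estimated coefficients
n <- nobs(model1)
which(h > 3 * p / n)               # observations with more than 3x the average leverage
cooks.distance(model1)             # influence of each point on the coefficients
plot(model1, which = 4)            # Cook's distance plot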
228371 Lecture set 2: Regression
Lecture 3: Testing and Interpretation

Identifying Outliers

With a plot: standardized residuals (or their square root) can identify outliers.

Examine outliers, especially:
  when they are large (standardized residuals > 3),
  or
  when there are lots of them (> 10%).

plot(model1, which = 3)
[plot number 3, please: scale-location plot of √|standardized residuals|]
228371 Lecture set 2: Regression
Lecture 3: Testing and Interpretation

What to do with Outliers?

Is there an error in measuring / transcribing? → DELETE from data set

Is there extra information to describe why it’s different? → MAYBE delete from the data set

If neither of the above, proceed with caution → ALWAYS plot the residuals.
228371 Lecture set 2: Regression
Lecture 3: Testing and Interpretation

Residual Plots
The four common residual plots can be plotted easily in one go:
par(mfrow=c(2,2)) # split plotting area into a 2 by 2 grid
plot(model1)
228371
Statistical Modelling for Engineers & Technologists
LECTURE SET 2: Regression

Lecture 1 (Monday 11am) Scatter plots, Regression equations
Lecture 2 (Weds 8am) Errors and Residual Analysis
Lecture 3 (Thurs 12pm) Testing & Interpretation
Lab (2 hours on Friday) “Comparing means and simple linear regression”   ← We are here

Lab 1 (Friday 8am – 10am) OR Lab 2 (Friday 12pm – 2pm)
