228371_Lecture_Notes_Week_2
228371 Lecture set 2: Regression
Lecture 1: Scatter plots, Regression equations
Example: testmarks.txt
1. Plot the data using a scatter plot: plot(x, y)
data <- read.table("testmarks.txt", header = TRUE)
plot(data$maths, data$english, xlim = c(0, 100), ylim = c(0, 100))
Note:
View(data)
2. Interrogate the data – look for trends (linear?), gaps, outliers, etc.
abline(lm(data$english ~ data$maths), col = "green", lwd = 3)
Regressing the other way around (maths on english) gives a different line:
abline(lm(data$maths ~ data$english), col = "red", lwd = 3, lty = 2)
Note:
attach(pulse)   # refer to the columns of pulse by name
...
detach(pulse)   # when finished
Not sure what is attached? search()
Note:
plot(weight, extsys) vs plot(weight ~ extsys)
- plot(x, y): first argument goes on the x-axis, second on the y-axis
- plot(y ~ x): formula notation, dependent ~ independent (response ~ predictor)
Scatterplot Matrix
pines <- read.table("pines.txt", header = TRUE)
pairs(pines[2:5], col = pines$area, pch = pines$area)
plot(pines$third ~ pines$second)
The correlation coefficient $r_{xy}$ measures the strength of a linear relationship between two variables X and Y, with $-1 \le r_{xy} \le 1$:

$$r_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}$$
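As a quick check, this formula can be computed by hand and compared with R's built-in cor() (a minimal sketch with made-up vectors, not course data):

x <- c(1, 2, 3, 4, 5)                      # made-up data
y <- c(2.1, 3.9, 6.2, 7.8, 10.3)
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
r_manual
cor(x, y)                                  # agrees with the hand calculation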
Properties of a correlation matrix:
- Diagonal == 1
- Symmetric about the diagonal
- Correlation is NOT causation
cor.test(pines$top, pines$first)
Our Aim
The model can be a straight line (linear model) or a curved function (e.g., polynomial).
Terminology
If we don't know resistance, R, but we do know V and I, we can directly calculate R: R = V/I.
If we didn't know about Ohm's law, we might want to fit a model instead (see the sketch below).
Usually try the simplest model first (linear), then make it more complicated if needed.
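For instance, with measured voltage V and current I (simulated here, since the notes give no data file), fitting a line through the origin recovers an estimate of R:

set.seed(1)
I <- seq(0.1, 1, by = 0.1)                 # hypothetical currents (A)
V <- 5 * I + rnorm(length(I), sd = 0.05)   # true R = 5 ohms, plus noise
ohm <- lm(V ~ 0 + I)                       # V = R*I, a line with no intercept
coef(ohm)                                  # slope is the estimated resistance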
Empirical Models
- If there is no pre-conceived notion of the form of the relationship → find an EMPIRICAL MODEL
- Do you know anything about the underlying physical mechanism?
- Predict the response as accurately as possible (e.g., instrument calibration curves)
- Start SIMPLE (linear)
Errors
- Reality is messy
- Response and explanatory variables rarely satisfy a mathematical equation exactly
- Experimental situation
- Measurement errors
- Additional unrecorded factors
- Observational data
- Inherent variability
- Even more unrecorded factors
228371 Lecture set 2: Regression
Lecture 1: Scatter plots, Regression equations
Linear Models
[Linear models don’t have to be “straight lines”]
In a linear model, the mean response is linear in the parameters, e.g.,

$y = \beta_0 + \beta_1 x + \epsilon$
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2^2 + \epsilon$
$y = \beta_1 \sin(x_1) + \beta_2 \log(x_3) + \beta_3 x_4 x_5 + \epsilon$
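In R's lm(), any such linear-in-parameters model is written as a formula; a minimal sketch with made-up names (d, x1, x2 are illustrative, not from the notes):

d <- data.frame(x1 = runif(50), x2 = runif(50))
d$y <- 1 + 2 * d$x1 + 3 * d$x2^2 + rnorm(50, sd = 0.1)   # simulated response
lm(y ~ x1 + I(x2^2), data = d)   # I() protects ^ so x2^2 enters as a squared term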
Non-Linear Models
In a non-linear model, the parameters can enter in any form, e.g.,

$y = \beta_0 e^{-\beta_1 x} + \epsilon$ or $y = \dfrac{\beta_0 + \beta_1 x_1}{\beta_2 x_2 + \beta_3 x_3^2} + \epsilon$
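Fitting such models needs non-linear least squares, e.g. R's nls(); a sketch for the exponential model above, with simulated data and guessed starting values:

set.seed(1)
x <- seq(0, 10, by = 0.5)
y <- 3 * exp(-0.4 * x) + rnorm(length(x), sd = 0.05)     # made-up data
nls(y ~ b0 * exp(-b1 * x), start = list(b0 = 1, b1 = 0.1))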
Regression Analysis
Regression is the tendency of the response variable (y) to vary with one or more explanatory variables (x).
Multiple regression: more than one explanatory (or predictor) variable (x1, …, xp).
The word "regression" was first used by Francis Galton (late 1800s) to describe the tendency of tall fathers to have not-so-tall sons ("regression towards the mean").
Model assumptions:
- $y_i$ is normally distributed
- $\text{var}[y_i]$ is constant ($= \sigma^2$), i.e., does not change with x
- $\epsilon_i \sim \text{Normal}(0, \sigma^2)$, independent and identically distributed (iid)

Combining this with the linearity of the regression mean, we can build models that look like:

$y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$

Prediction errors: $\epsilon_i = y_i - (\beta_0 + \beta_1 x_i)$

BUT because $\beta_0, \beta_1$ are unknown, so are the errors. So we ESTIMATE them as well, using $b_0, b_1$:

$\hat{y}_i = b_0 + b_1 x_i$

Prediction errors are estimated with residuals: $e_i = y_i - \hat{y}_i$
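In R this whole fit is one lm() call; a minimal sketch with simulated data (dat, x and y are placeholders, not the course data set). The object model1 defined here matches the name used in the code later in these notes:

set.seed(1)
dat <- data.frame(x = 1:20)
dat$y <- 2 + 0.5 * dat$x + rnorm(20)   # made-up data: b0 = 2, b1 = 0.5
model1 <- lm(y ~ x, data = dat)        # least-squares fit
coef(model1)                           # b0 and b1
head(residuals(model1))                # e_i = y_i - y-hat_i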
228371 Statistical Modelling for Engineers & Technologists
LECTURE SET 2: Regression
Lecture 1 (Monday 11am): Scatter plots, Regression equations
Lecture 2 (Weds 8am): Errors and Residual Analysis ← we are here
Lecture 3 (Thurs 12pm): Testing & Interpretation
Lab (2 hours on Friday): "Comparing means and simple linear regression"
228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

[Notation mnemonic: $y_i$, no disguise → true value; $\hat{y}_i$, with a "hat" → estim-HAT?; $\bar{y}$, with a bar → mean]

How to fit the line – the Least-Squares Concept
Residual: $e_i = y_i - \hat{y}_i$. Least squares chooses $b_0$ and $b_1$ to minimise the sum of squared residuals.

Regression Sum of Squares: $SS_{Reg} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
Sum of Squared Residuals
Total Sum of Squares: $SS_{Total} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ (how much our DATA vary)
Residual Sum of Squares: $SS_{Res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ (how good our MODEL is)
The least-squares estimates are:

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

$$b_1 = \frac{S_{xy}}{S_{xx}}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$
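These hand formulas reproduce lm()'s estimates; a quick check with made-up vectors:

x <- c(1, 3, 5, 7, 9)                      # made-up data
y <- c(2, 4, 5, 8, 11)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
b1 <- Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)
coef(lm(y ~ x))                            # same values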
The proportion of variation explained by the fit is called R-squared, and is given by:

$$R^2 = \frac{SS_{Reg}}{SS_{Total}} = 1 - \frac{SS_{Res}}{SS_{Total}}$$

If all the data lie on a straight line, then $SS_{Res} = 0$ and $R^2 = 1$: the fit explains everything and we have a perfect model.
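In R, R-squared is reported by summary(); it can also be computed from the sums of squares (model1 and dat from the earlier sketch):

summary(model1)$r.squared
1 - sum(residuals(model1)^2) / sum((dat$y - mean(dat$y))^2)   # same value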
Residual Analysis
Recall that residual = observed − fit: $e_i = y_i - \hat{y}_i$
Looking at the residuals shows how well the fit explains the systematic variation in the data.
Ideally, a plot of residuals against fitted values should be random scatter about 0 of constant size.
plot(residuals(model1) ~ fitted(model1))
abline(h = 0)
If the scatter does NOT have constant width and you can see some sort of pattern, this may suggest a transformation of the y variable. Non-constant scatter is called "heteroscedasticity".
If the residuals show increasing variance, maybe a log transform will help and improve the increasing-variance issue, as sketched below.
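A log transform is just a change of formula; a sketch reusing the placeholder dat from earlier (assumes y is positive):

model2 <- lm(log(y) ~ x, data = dat)          # model on the log scale
plot(residuals(model2) ~ fitted(model2))      # re-check the residual spread
abline(h = 0)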
Caution: When we move to empirical models, we can’t really use them to extrapolate.
Statistics is not just about modelling the central curve, it's about modelling the variability as well.
If the residuals still show some sort of pattern, then the model can likely be improved.
NB: residual standard errors cannot be compared here because the units have changed (log scale).
Transformations: Fixing Linearity and Normality
[Notation: $y^*$, with a star → transformed value]
Power transformations can stretch large values (good for left-skewed data) or shrink large values (good for right-skewed data):

$y^* = \text{sign}(\lambda) \, y^{\lambda}$ if $\lambda \neq 0$, or $y^* = \log(y)$ if $\lambda = 0$
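A minimal helper implementing this transform (my own sketch, not course code):

power_tf <- function(y, lambda) {
  # sign(lambda) keeps the ordering of y when lambda is negative
  if (lambda == 0) log(y) else sign(lambda) * y^lambda
}
hist(power_tf(rlnorm(200), 0))   # lambda = 0: log symmetrises right skew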
Skewness
[Direction is based on where the longer tail is: long tail on the left → left-skewed; long tail on the right → right-skewed]
Skewness
[Different skewness gets treated differently: stretch large values for left skew; shrink large values for right skew]
228371 Lecture set 2: Regression
Lecture 3: Testing and Interpretation
Testing Coefficients
Regression model: $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
If the regression coefficient of a variable, $\beta_i$, is zero, then changes in that variable do not affect the response variable.
[If $\beta_1 = 0$, then $\beta_1 x_i$ is always zero and $x_i$ has no influence on $y_i$]
[If $\beta_1 \neq 0$, then $\beta_1 x_i$ is always SOMETHING and $x_i$ has SOME effect on $y_i$]
The hypotheses tested are $H_0: \beta_i = 0$ vs $H_a: \beta_i \neq 0$. Test statistic: $t = \dfrac{\text{Estimate} - 0}{\text{Standard Error}}$
summary(model1)$coefficients
The p-value, Pr(>|t|), is the probability of getting an estimate as extreme (as far from zero) as we did, given H0 is true,
i.e., the coefficient is zero.
Can reject the null hypothesis for both the intercept, 𝛽0 and the slope coefficient, 𝛽1 .
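The "t value" column is literally Estimate divided by Std. Error, which is easy to verify (model1 from the earlier sketch):

co <- summary(model1)$coefficients
co[, "Estimate"] / co[, "Std. Error"]   # reproduces the "t value" column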
Interpretation of Intercept
The intercept, $\beta_0$, is the expected response when x = 0 (where the line crosses the y-axis).
print(model1)
Interpretation of Slope
The slope, $\beta_1$, is the expected increase in response when x increases by 1.
print(model1)
Estimate: steam use will decrease by 0.07983 lb/month for each 1°F increase in average temperature.
Common sense check: steam use decreases as average temperature increases, which matches the plot.
F-test
There is an overall test to see if the model is useful in predicting change in the response.
This is called the F-test; it uses Mean Squares (MS), which are related to the Sums of Squares.
F-test statistic: $F = \dfrac{MS_{Reg}}{MS_{Res}} = \dfrac{SS_{Reg}/p}{SS_{Res}/(n-p-1)}$
The null hypothesis is that the model is NOT significant; under it, the test statistic follows an F distribution with p and n-p-1 degrees of freedom.
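In R the F statistic appears at the bottom of summary(model1); it can also be pulled out directly, or laid out as an ANOVA table (model1 from the earlier sketch):

summary(model1)$fstatistic   # F value with numerator and denominator df
anova(model1)                # SS, MS, F and p-value in one table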
F-test
For simple linear regression (one explanatory variable), the F-test is identical to the t-test for the slope: $t^2 = F$.
When you have more than one explanatory variable (multiple regression), the F-test is different.
summary(model1)
Run your test, get the results, and look at the p-value: if the p-value is < 0.05, we can say "the model explains a significant amount of variation in y".
- As $x_0$ gets further from $\bar{x}$, both prediction and confidence intervals get wider.
[Avoid extrapolation]
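Both intervals come from predict(); a sketch using the placeholder model1 from earlier and an arbitrary new value x0 = 10:

new <- data.frame(x = 10)                        # hypothetical x0
predict(model1, new, interval = "confidence")    # CI for the mean response
predict(model1, new, interval = "prediction")    # wider PI for a new observation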
Assumption: Linearity
Assess: by plotting
- Plot the data
- Plot the errors (residuals) vs the explanatory variable(s)
Check: If you can see any patterns → there is some form of non-linearity
Fix: Add in some polynomial terms (dealt with later; see the sketch below), or transform the data
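Adding a quadratic term uses I() in the formula (placeholder names, dat from the earlier sketch):

model_quad <- lm(y ~ x + I(x^2), data = dat)   # straight line plus curvature
summary(model_quad)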
Assess: by plotting
- Plot the data
- Plot the errors (residuals) vs the fitted values
Plot of residuals vs order of the data: plot(residuals(model1)) # assumes data index is collection order
Plot separately in R, or use plot(model1), which produces a similar set of diagnostic plots (via the plot.lm method).
Note: all tests based on F, t, etc., require normality of the residuals (F more than t).
Histogram of Residuals
res1 <- residuals(model1)   # extract the residuals first
hist(res1, col = "cornsilk", main = "Histogram of residuals")
plot(model1, which = 5)
[plot number 5 please: Residuals vs Leverage]
- In R, we find the leverages ourselves. Rule of thumb: points with more than three times the average leverage ($h_{ii} > 3p/n$) are having undue influence (see the sketch below).
- There are different ways to measure influence: DFITS, Cook's distance, etc.
- Cook's distance: the change in the regression coefficients if you remove the data point.
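A sketch of these checks in R (model1 from the earlier sketch; hatvalues() and cooks.distance() are base R):

h <- hatvalues(model1)        # leverages h_ii
p <- length(coef(model1))     # number of fitted parameters
n <- nobs(model1)
which(h > 3 * p / n)          # points over the 3p/n rule of thumb
head(cooks.distance(model1))  # Cook's distance for each point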
Identifying Outliers
With a plot: standardized residuals (or the square root of their absolute values) can identify outliers:
plot(model1, which = 3)
[plot number 3 please: Scale-Location]
Is there extra information to describe why it's different? → MAYBE delete it from the data set.
If neither of the above, proceed with caution → ALWAYS plot the residuals.
Residual Plots
The four common residual plots can be plotted easily in one go:
par(mfrow=c(2,2)) # split plotting area into a 2 by 2 grid
plot(model1)