228371_Lecture_Notes_Week_2
228371 Lecture set 2: Regression
Lecture 1: Scatter plots, Regression equations
Example: testmarks.txt
1. Plot the data using a scatter plot: plot(x, y)
data <- read.table("testmarks.txt", header = TRUE)
plot(data$maths, data$english, xlim = c(0, 100), ylim = c(0, 100))
Note:
View(data)
2. Interrogate the data – look for trends (linear?), gaps, outliers, etc.
abline(lm(data$english ~ data$maths), col = "green", lwd = 3)
Regressing the other way around (maths on english) gives a different line:
abline(lm(data$maths ~ data$english), col = "red", lwd = 3, lty = 2)
Note:
attach(pulse)   # refer to the columns of pulse by name
...
detach(pulse)   # when finished
Not sure what is attached? search()
Note:
plot(weight, extsys) vs plot(weight ~ extsys)
- plot(x, y): first argument goes on the x-axis, second on the y-axis
- plot(y ~ x): formula notation, dependent ~ independent (response ~ predictor)
Scatterplot Matrix
pines <- read.table("pines.txt", header = TRUE)
pairs(pines[2:5], col = pines$area, pch = pines$area)
plot(pines$third ~ pines$second)
The correlation coefficient $r_{xy}$ measures the strength of a linear relationship between two variables X and Y, with $-1 \le r_{xy} \le 1$:

$$r_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}$$
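As a quick check, this formula can be computed by hand and compared with R's built-in cor() (a minimal sketch with made-up vectors, not course data):

x <- c(1, 2, 3, 4, 5)                      # made-up data
y <- c(2.1, 3.9, 6.2, 7.8, 10.3)
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
r_manual
cor(x, y)                                  # agrees with the hand calculation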
Properties of a correlation matrix:
- Diagonal == 1
- Symmetric about the diagonal
- Correlation is NOT causation
cor.test(pines$top, pines$first)
Our Aim
The model can be a straight line (linear model) or a curved function (e.g., polynomial).
Terminology
If we don't know resistance, R, but we do know V and I, we can directly calculate R: R = V/I.
If we didn't know about Ohm's law, we might want to fit a model instead (see the sketch below).
Usually try the simplest model first (linear), then make it more complicated if needed.
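For instance, with measured voltage V and current I (simulated here, since the notes give no data file), fitting a line through the origin recovers an estimate of R:

set.seed(1)
I <- seq(0.1, 1, by = 0.1)                 # hypothetical currents (A)
V <- 5 * I + rnorm(length(I), sd = 0.05)   # true R = 5 ohms, plus noise
ohm <- lm(V ~ 0 + I)                       # V = R*I, a line with no intercept
coef(ohm)                                  # slope is the estimated resistance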
Empirical Models
- If there is no pre-conceived notion of the form of the relationship → find an EMPIRICAL MODEL
- Do you know anything about the underlying physical mechanism?
- Predict the response as accurately as possible (e.g., instrument calibration curves)
- Start SIMPLE (linear)
Errors
- Reality is messy
- Response and explanatory variables rarely satisfy a mathematical equation exactly
- Experimental situation
- Measurement errors
- Additional unrecorded factors
- Observational data
- Inherent variability
- Even more unrecorded factors
228371 Lecture set 2: Regression
Lecture 1: Scatter plots, Regression equations
Linear Models
[Linear models don’t have to be “straight lines”]
In a linear model, the mean response is linear in the parameters, e.g.,

$y = \beta_0 + \beta_1 x + \epsilon$
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2^2 + \epsilon$
$y = \beta_1 \sin(x_1) + \beta_2 \log(x_3) + \beta_3 x_4 x_5 + \epsilon$
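In R's lm(), any such linear-in-parameters model is written as a formula; a minimal sketch with made-up names (d, x1, x2 are illustrative, not from the notes):

d <- data.frame(x1 = runif(50), x2 = runif(50))
d$y <- 1 + 2 * d$x1 + 3 * d$x2^2 + rnorm(50, sd = 0.1)   # simulated response
lm(y ~ x1 + I(x2^2), data = d)   # I() protects ^ so x2^2 enters as a squared term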
Non-Linear Models
In a non-linear model, the parameters can enter in any form, e.g.,

$y = \beta_0 e^{-\beta_1 x} + \epsilon$ or $y = \dfrac{\beta_0 + \beta_1 x_1}{\beta_2 x_2 + \beta_3 x_3^2} + \epsilon$
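Fitting such models needs non-linear least squares, e.g. R's nls(); a sketch for the exponential model above, with simulated data and guessed starting values:

set.seed(1)
x <- seq(0, 10, by = 0.5)
y <- 3 * exp(-0.4 * x) + rnorm(length(x), sd = 0.05)     # made-up data
nls(y ~ b0 * exp(-b1 * x), start = list(b0 = 1, b1 = 0.1))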
Regression Analysis
Regression is the tendency of the response variable (y) to vary with one or more explanatory variables (x).
Multiple regression: more than one explanatory (or predictor) variable (x1, …, xp).
The word "regression" was first used by Francis Galton (late 1800s) to describe the tendency of tall fathers to have not-so-tall sons ("regression towards the mean").
Model assumptions:
- $y_i$ is normally distributed
- $\text{var}[y_i]$ is constant ($= \sigma^2$), i.e., does not change with x
- $\epsilon_i \sim \text{Normal}(0, \sigma^2)$, independent and identically distributed (iid)

Combining this with the linearity of the regression mean, we can build models that look like:

$y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$

Prediction errors: $\epsilon_i = y_i - (\beta_0 + \beta_1 x_i)$

BUT because $\beta_0, \beta_1$ are unknown, so are the errors. So we ESTIMATE them as well, using $b_0, b_1$:

$\hat{y}_i = b_0 + b_1 x_i$

Prediction errors are estimated with residuals: $e_i = y_i - \hat{y}_i$
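In R this whole fit is one lm() call; a minimal sketch with simulated data (dat, x and y are placeholders, not the course data set). The object model1 defined here matches the name used in the code later in these notes:

set.seed(1)
dat <- data.frame(x = 1:20)
dat$y <- 2 + 0.5 * dat$x + rnorm(20)   # made-up data: b0 = 2, b1 = 0.5
model1 <- lm(y ~ x, data = dat)        # least-squares fit
coef(model1)                           # b0 and b1
head(residuals(model1))                # e_i = y_i - y-hat_i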
228371 Statistical Modelling for Engineers & Technologists
LECTURE SET 2: Regression
Lecture 1 (Monday 11am): Scatter plots, Regression equations
Lecture 2 (Weds 8am): Errors and Residual Analysis ← we are here
Lecture 3 (Thurs 12pm): Testing & Interpretation
Lab (2 hours on Friday): "Comparing means and simple linear regression"
228371 Lecture set 2: Regression
Lecture 2: Errors and Residual Analysis

[Notation mnemonic: $y_i$, no disguise → true value; $\hat{y}_i$, with a "hat" → estim-HAT?; $\bar{y}$, with a bar → mean]

How to fit the line – the Least-Squares Concept
Residual: $e_i = y_i - \hat{y}_i$. Least squares chooses $b_0$ and $b_1$ to minimise the sum of squared residuals.

Regression Sum of Squares: $SS_{Reg} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
Sum of Squared Residuals
Total Sum of Squares: $SS_{Total} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ (how much our DATA vary)
Residual Sum of Squares: $SS_{Res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ (how good our MODEL is)
The least-squares estimates are:

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

$$b_1 = \frac{S_{xy}}{S_{xx}}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$
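These hand formulas reproduce lm()'s estimates; a quick check with made-up vectors:

x <- c(1, 3, 5, 7, 9)                      # made-up data
y <- c(2, 4, 5, 8, 11)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
b1 <- Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)
coef(lm(y ~ x))                            # same values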
The proportion of variation explained by the fit is called R-squared, and is given by:

$$R^2 = \frac{SS_{Reg}}{SS_{Total}} = 1 - \frac{SS_{Res}}{SS_{Total}}$$

If all the data lie on a straight line, then $SS_{Res} = 0$ and $R^2 = 1$: the fit explains everything and we have a perfect model.
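In R, R-squared is reported by summary(); it can also be computed from the sums of squares (model1 and dat from the earlier sketch):

summary(model1)$r.squared
1 - sum(residuals(model1)^2) / sum((dat$y - mean(dat$y))^2)   # same value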
Residual Analysis
Recall that residual = observed − fit: $e_i = y_i - \hat{y}_i$
Looking at the residuals shows how well the fit explains the systematic variation in the data.
Ideally, a plot of residuals against fitted values should be random scatter about 0 of constant size.
plot(residuals(model1) ~ fitted(model1))
abline(h = 0)
If the scatter does NOT have constant width and you can see some sort of pattern, this may suggest a transformation of the y variable. Non-constant scatter is called "heteroscedasticity".
If the residuals show increasing variance, maybe a log transform will help and improve the increasing-variance issue, as sketched below.
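A log transform is just a change of formula; a sketch reusing the placeholder dat from earlier (assumes y is positive):

model2 <- lm(log(y) ~ x, data = dat)          # model on the log scale
plot(residuals(model2) ~ fitted(model2))      # re-check the residual spread
abline(h = 0)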
Caution: When we move to empirical models, we can’t really use them to extrapolate.
Statistics is not just about modelling the central curve, it's about modelling the variability as well.
If the residuals still show some sort of pattern, then the model can likely be improved.
NB: residual standard errors cannot be compared here because the units have changed (log scale).
Transformations: Fixing Linearity and Normality
[Notation: $y^*$, with a star → transformed value]
Power transformations can stretch large values (good for left-skewed data) or shrink large values (good for right-skewed data):

$y^* = \text{sign}(\lambda) \, y^{\lambda}$ if $\lambda \neq 0$, or $y^* = \log(y)$ if $\lambda = 0$
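A minimal helper implementing this transform (my own sketch, not course code):

power_tf <- function(y, lambda) {
  # sign(lambda) keeps the ordering of y when lambda is negative
  if (lambda == 0) log(y) else sign(lambda) * y^lambda
}
hist(power_tf(rlnorm(200), 0))   # lambda = 0: log symmetrises right skew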
Skewness
[Direction is based on where the longer tail is: long tail on the left → left-skewed; long tail on the right → right-skewed]
Skewness
[Different skewness gets treated differently: stretch large values for left skew; shrink large values for right skew]
228371 Lecture set 2: Regression
Lecture 3: Testing and Interpretation
Testing Coefficients
Regression model: $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
If the regression coefficient of a variable, $\beta_i$, is zero, then changes in that variable do not affect the response variable.
[If $\beta_1 = 0$, then $\beta_1 x_i$ is always zero and $x_i$ has no influence on $y_i$]
[If $\beta_1 \neq 0$, then $\beta_1 x_i$ is always SOMETHING and $x_i$ has SOME effect on $y_i$]
The hypotheses tested are $H_0: \beta_i = 0$ vs $H_a: \beta_i \neq 0$. Test statistic: $t = \dfrac{\text{Estimate} - 0}{\text{Standard Error}}$
summary(model1)$coefficients
The p-value, Pr(>|t|), is the probability of getting an estimate as extreme (as far from zero) as we did, given H0 is true,
i.e., the coefficient is zero.
Can reject the null hypothesis for both the intercept, 𝛽0 and the slope coefficient, 𝛽1 .
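The "t value" column is literally Estimate divided by Std. Error, which is easy to verify (model1 from the earlier sketch):

co <- summary(model1)$coefficients
co[, "Estimate"] / co[, "Std. Error"]   # reproduces the "t value" column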
Interpretation of Intercept
The intercept, $\beta_0$, is the expected response when x = 0 (where the line crosses the y-axis).
print(model1)
Interpretation of Slope
The slope, $\beta_1$, is the expected increase in response when x increases by 1.
print(model1)
Estimate: steam use will decrease by 0.07983 lb/month for each 1°F increase in average temperature.
Common sense check: steam use decreases as average temperature increases, which matches the plot.
F-test
There is an overall test to see if the model is useful in predicting change in the response.
This is called the F-test; it uses Mean Squares (MS), which are related to the Sums of Squares.
F-test statistic: $F = \dfrac{MS_{Reg}}{MS_{Res}} = \dfrac{SS_{Reg}/p}{SS_{Res}/(n-p-1)}$
The null hypothesis is that the model is NOT significant; under it, the test statistic follows an F distribution with p and n-p-1 degrees of freedom.
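In R the F statistic appears at the bottom of summary(model1); it can also be pulled out directly, or laid out as an ANOVA table (model1 from the earlier sketch):

summary(model1)$fstatistic   # F value with numerator and denominator df
anova(model1)                # SS, MS, F and p-value in one table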
F-test
For simple linear regression (one explanatory variable), the F-test is identical to the t-test for the slope: $t^2 = F$.
When you have more than one explanatory variable (multiple regression), the F-test is different.
summary(model1)
Run your test, get the results, and look at the p-value: if the p-value is < 0.05, we can say "the model explains a significant amount of variation in y".
- As $x_0$ gets further from $\bar{x}$, both prediction and confidence intervals get wider.
[Avoid extrapolation]
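Both intervals come from predict(); a sketch using the placeholder model1 from earlier and an arbitrary new value x0 = 10:

new <- data.frame(x = 10)                        # hypothetical x0
predict(model1, new, interval = "confidence")    # CI for the mean response
predict(model1, new, interval = "prediction")    # wider PI for a new observation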
Assumption: Linearity
Assess: by plotting
- Plot the data
- Plot the errors (residuals) vs the explanatory variable(s)
Check: If you can see any patterns → there is some form of non-linearity
Fix: Add in some polynomial terms (dealt with later; see the sketch below), or transform the data
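Adding a quadratic term uses I() in the formula (placeholder names, dat from the earlier sketch):

model_quad <- lm(y ~ x + I(x^2), data = dat)   # straight line plus curvature
summary(model_quad)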
Assess: by plotting
- Plot the data
- Plot the errors (residuals) vs the fitted values
Plot of residuals vs order of the data: plot(residuals(model1)) # assumes data index is collection order
Plot separately in R, or use plot(model1), which produces a similar set of diagnostic plots (via the plot.lm method).
Note: all tests based on F, t, etc., require normality of the residuals (F more than t).
Histogram of Residuals
res1 <- residuals(model1)   # extract the residuals first
hist(res1, col = "cornsilk", main = "Histogram of residuals")
plot(model1, which = 5)
[plot number 5 please: Residuals vs Leverage]
- In R, we find the leverages ourselves. Rule of thumb: points with more than three times the average leverage ($h_{ii} > 3p/n$) are having undue influence (see the sketch below).
- There are different ways to measure influence: DFITS, Cook's distance, etc.
- Cook's distance: the change in the regression coefficients if you remove the data point.
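A sketch of these checks in R (model1 from the earlier sketch; hatvalues() and cooks.distance() are base R):

h <- hatvalues(model1)        # leverages h_ii
p <- length(coef(model1))     # number of fitted parameters
n <- nobs(model1)
which(h > 3 * p / n)          # points over the 3p/n rule of thumb
head(cooks.distance(model1))  # Cook's distance for each point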
Identifying Outliers
With a plot: standardized residuals (or the square root of their absolute values) can identify outliers:
plot(model1, which = 3)
[plot number 3 please: Scale-Location]
Is there extra information to describe why it's different? → MAYBE delete it from the data set.
If neither of the above, proceed with caution → ALWAYS plot the residuals.
Residual Plots
The four common residual plots can be plotted easily in one go:
par(mfrow=c(2,2)) # split plotting area into a 2 by 2 grid
plot(model1)