Reg Lin
October 2024
Overview
1 Correlation
Introduction
Correlation coefficients
Example
Special cases
Inference
2 Linear Regression
Introduction
Simple linear regression
Inference
Diagnostics
Multiple regression
Inference
Example
3 Final recap
Correlation Linear Regression Final recap
1 Correlation
Introduction
Some examples
Correlation coefficients
• Correlation coefficient ρ
• Quantifies the strength of the (linear) association between X and Y
• ρ = ±1 when the points of the scatterplot of Y against X lie exactly on a straight line
• ρ = 0: no linear association
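As a small illustration of the extreme values, the sketch below computes the sample correlation with R's `cor()` on made-up vectors. The data are hypothetical and chosen only so that the points are perfectly aligned.

```r
# Hypothetical data: points lying exactly on a straight line
x      <- 1:10
y_up   <- 2 + 3 * x      # aligned points, positive slope
y_down <- 5 - 0.5 * x    # aligned points, negative slope

r_up   <- cor(x, y_up)    # equals 1 (up to rounding)
r_down <- cor(x, y_down)  # equals -1 (up to rounding)
print(c(r_up, r_down))
```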
Correlations in practice
• ρ = 0 does not necessarily mean no association
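A minimal sketch of this point, using hypothetical data in which Y is an exact quadratic function of X: the association is perfect, yet the linear correlation is zero.

```r
# Hypothetical data: Y is a deterministic, non-linear function of X
x <- -5:5
y <- x^2          # perfect (quadratic) association

r <- cor(x, y)    # the linear correlation is 0 (up to rounding)
print(r)
```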
• Correlation depends on the range of values over which X and Y are observed (be careful!)
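The effect of the range can be sketched with simulated data. Everything here (seed, sample size, noise level, and the window 40 to 60) is an arbitrary assumption made for illustration; the same linear relation holds over the whole range, but the correlation computed on a narrow window of X is typically much lower.

```r
set.seed(1)                            # arbitrary seed, for reproducibility
x <- runif(1000, min = 0, max = 100)
y <- x + rnorm(1000, sd = 10)          # same linear relation everywhere

keep <- x > 40 & x < 60                # restrict X to a narrow window

r_full       <- cor(x, y)
r_restricted <- cor(x[keep], y[keep])
print(c(full = r_full, restricted = r_restricted))
```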
Example
library(MASS)  # provides the birthwt data set
cor.test(birthwt$lwt, birthwt$bwt)
Special cases
Non-Gaussian variables
Pearson’s product-moment
Important caveats
Inference
Basic reminders:
• Distribution of a random vector (X, Y)
• Density of (X, Y):
f(x, y) dx dy = P(x ≤ X ≤ x + dx, y ≤ Y ≤ y + dy)
• Marginal densities:
f_X(x) dx = P(x ≤ X ≤ x + dx) and
f_Y(y) dy = P(y ≤ Y ≤ y + dy)
• Covariance:
Cov(X, Y) = E[(X − E(X))(Y − E(Y))] = E(XY) − E(X)E(Y)
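The two expressions of the covariance can be checked numerically. The sketch below uses the ten (X, Y) pairs of the small regression example from these slides and replaces expectations by sample means; note that R's `cov()` divides by n − 1 rather than n.

```r
x <- c(23, 25, 36, 42, 50, 60, 68, 80, 85, 95)
y <- c(15, 35, 30, 50, 50, 45, 52, 70, 75, 80)
n <- length(x)

# Empirical counterparts of the two expressions of Cov(X, Y)
cov_def      <- mean((x - mean(x)) * (y - mean(y)))  # E[(X - E(X))(Y - E(Y))]
cov_shortcut <- mean(x * y) - mean(x) * mean(y)      # E(XY) - E(X)E(Y)

# cov() uses the denominator n - 1, so rescale it to compare
cov_r <- cov(x, y)
print(c(cov_def, cov_shortcut, cov_r * (n - 1) / n))
```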
$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \rho(Y, X)$$
Some properties
• −1 ≤ ρ ≤ 1
• if X and Y are independent, then ρ = 0 (the converse does not hold in general)
• ρ = 0 and (X, Y) jointly Gaussian ⇒ independence
Inference (estimation)
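$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y}$$
is estimated by
$$r = \frac{s_{xy}}{s_x s_y}$$
i.e.
$$r = \frac{\sum (x_i - m_x)(y_i - m_y)}{\sqrt{\sum (x_i - m_x)^2 \sum (y_i - m_y)^2}}$$
or
$$r = \frac{\sum x_i y_i - n\, m_x m_y}{(n - 1)\, s_x s_y}$$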
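The three expressions of r can be compared numerically. This is a minimal sketch on the ten (X, Y) pairs of the small regression example from these slides; the names `r1`, `r2`, `r3` are only illustrative.

```r
x <- c(23, 25, 36, 42, 50, 60, 68, 80, 85, 95)
y <- c(15, 35, 30, 50, 50, 45, 52, 70, 75, 80)
n  <- length(x)
mx <- mean(x)
my <- mean(y)

r1 <- cov(x, y) / (sd(x) * sd(y))                        # s_xy / (s_x s_y)
r2 <- sum((x - mx) * (y - my)) /
  sqrt(sum((x - mx)^2) * sum((y - my)^2))                # sums of deviations
r3 <- (sum(x * y) - n * mx * my) / ((n - 1) * sd(x) * sd(y))

print(c(r1, r2, r3, cor(x, y)))                          # all four agree
```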
Probability distribution of r
Confidence interval
$$CI_{1-\alpha}(\rho) = \left[ \frac{e^{2 z_1} - 1}{e^{2 z_1} + 1} \; ; \; \frac{e^{2 z_2} - 1}{e^{2 z_2} + 1} \right]$$
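The definitions of z1 and z2 are not reproduced in this text. The sketch below assumes the usual Fisher transformation, z = atanh(r) with approximate standard error 1/√(n − 3), so that z1 and z2 are z ∓ z_{1−α/2}/√(n − 3). Under that assumption the interval matches the one reported by `cor.test()`. The data are the ten pairs of the small regression example.

```r
x <- c(23, 25, 36, 42, 50, 60, 68, 80, 85, 95)
y <- c(15, 35, 30, 50, 50, 45, 52, 70, 75, 80)
n     <- length(x)
alpha <- 0.05
r     <- cor(x, y)

# Assumed definition of z1 and z2: Fisher transformation of r
z    <- atanh(r)
half <- qnorm(1 - alpha / 2) / sqrt(n - 3)
z1   <- z - half
z2   <- z + half

# Back-transformation given on the slide
ci <- c((exp(2 * z1) - 1) / (exp(2 * z1) + 1),
        (exp(2 * z2) - 1) / (exp(2 * z2) + 1))
print(ci)
print(cor.test(x, y)$conf.int)   # same interval
```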
Statistical testing
Then, as usual, use p-values or critical values to draw conclusions about the
significance of ρ
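The test statistic itself is not reproduced in this text. As a sketch, the code below assumes the usual statistic for H0: ρ = 0, t = r √(n − 2) / √(1 − r²), compared to a t distribution with n − 2 degrees of freedom; this is the statistic reported by `cor.test()` for Pearson's correlation. The data are the ten pairs of the small regression example.

```r
x <- c(23, 25, 36, 42, 50, 60, 68, 80, 85, 95)
y <- c(15, 35, 30, 50, 50, 45, 52, 70, 75, 80)
n <- length(x)
r <- cor(x, y)

# Assumed statistic for H0: rho = 0
t_r <- r * sqrt((n - 2) / (1 - r^2))
p   <- 2 * pt(-abs(t_r), df = n - 2)   # two-sided p-value

ct <- cor.test(x, y)
print(c(t_by_hand = t_r, t_cor_test = unname(ct$statistic)))
print(c(p_by_hand = p, p_cor_test = ct$p.value))
```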
Recap
2 Linear Regression
Introduction
• Xi and Yi are quantitative
Two situations
Regression
• It also makes little sense when X is not random (e.g. fixed measurement
times in an experiment)
Y = α + βX + ϵ
→ Estimate (and test) α and β
− β is the slope (or coefficient) of the regression line
− α is the intercept
Inference
Least-squares line
• Find the line that minimizes the distance between observations and
predictions
[Figure: scatterplot of Y against X]
• Ordinary least-squares: minimize $E = \sum (y_i - \alpha - \beta x_i)^2$ (sum of the squared residuals)
(a ‘good’ line is the one that minimizes E)
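To make the criterion concrete, the sketch below evaluates E for the least-squares line and for a few other lines on the ten pairs of the small example from these slides. The helper `sse()` and the alternative lines are illustrative choices, not part of the original material.

```r
x <- c(23, 25, 36, 42, 50, 60, 68, 80, 85, 95)
y <- c(15, 35, 30, 50, 50, 45, 52, 70, 75, 80)

# Sum of squared residuals E for a candidate line (alpha, beta)
sse <- function(alpha, beta) sum((y - alpha - beta * x)^2)

fit <- lm(y ~ x)                      # ordinary least squares
a   <- coef(fit)[["(Intercept)"]]
b   <- coef(fit)[["x"]]

e_ols   <- sse(a, b)
e_other <- c(sse(a + 1, b), sse(a, b + 0.05), sse(0, 1))  # arbitrary other lines
print(c(ols = e_ols, other = e_other))                    # e_ols is the smallest
```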
Solution
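Point estimates
$$\hat\beta = \frac{s_{xy}}{s_x^2} = \frac{\sum (x_i - m_x)(y_i - m_y)}{\sum (x_i - m_x)^2}$$
$$\hat\alpha = m_y - \hat\beta\, m_x$$
Variances
$$\widehat{\mathrm{Var}}(\hat\beta) = s_\beta^2 = \frac{s_y^2 / s_x^2 - \hat\beta^2}{n - 2}$$
$$\widehat{\mathrm{Var}}(\hat\alpha) = s_\alpha^2 = \frac{s_\beta^2 \sum_{i=1}^{n} x_i^2}{n}$$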
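These formulas can be applied directly and compared with the output of `lm()`. The sketch below does so on the ten pairs of the small example from these slides; variable names such as `s_beta` are illustrative.

```r
x <- c(23, 25, 36, 42, 50, 60, 68, 80, 85, 95)
y <- c(15, 35, 30, 50, 50, 45, 52, 70, 75, 80)
n   <- length(x)
sxy <- cov(x, y)
sx2 <- var(x)
sy2 <- var(y)

# Point estimates
beta_hat  <- sxy / sx2
alpha_hat <- mean(y) - beta_hat * mean(x)

# Standard errors from the variance formulas
s_beta  <- sqrt((sy2 / sx2 - beta_hat^2) / (n - 2))
s_alpha <- sqrt(s_beta^2 * sum(x^2) / n)

by_hand <- c(alpha_hat, beta_hat, s_alpha, s_beta)
coefs   <- summary(lm(y ~ x))$coefficients
from_lm <- c(coefs[, "Estimate"], coefs[, "Std. Error"])
print(rbind(by_hand, from_lm))       # the two rows agree
```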
Interpretation of coefficients
• H0 : β = β0 vs H1 : β ̸= β0
• Test statistic
$$t_b = \frac{\hat\beta - \beta_0}{s_\beta} \sim t(n-2) \quad \text{under } H_0$$
• Two-sided test
− compute β̂ and tb to be compared to tα/2,n−2
− if |tb | < tα/2,n−2 , do not reject H0
the slope is not significantly different from β0
− if |tb | ≥ tα/2,n−2 , reject H0
conclude that the slope is not β0
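The two-sided test can be sketched in R as follows, here with β0 = 0 and α = 0.05 on the ten pairs of the small example from these slides. The estimate and its standard error are taken from `summary(lm())`; the variable names are illustrative.

```r
x <- c(23, 25, 36, 42, 50, 60, 68, 80, 85, 95)
y <- c(15, 35, 30, 50, 50, 45, 52, 70, 75, 80)
n <- length(x)

coefs    <- summary(lm(y ~ x))$coefficients
beta_hat <- coefs["x", "Estimate"]
s_beta   <- coefs["x", "Std. Error"]

beta0       <- 0       # value of the slope under H0
alpha_level <- 0.05

t_b     <- (beta_hat - beta0) / s_beta
t_crit  <- qt(1 - alpha_level / 2, df = n - 2)   # critical value t_{alpha/2, n-2}
p_value <- 2 * pt(-abs(t_b), df = n - 2)
reject  <- abs(t_b) >= t_crit                    # TRUE: reject H0

print(c(t_b = t_b, t_crit = t_crit, p_value = p_value))
```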
• H0 : α = α0 vs H1 : α ̸= α0
• Test statistic
$$t_a = \frac{\hat\alpha - \alpha_0}{s_\alpha} \sim t(n-2) \quad \text{under } H_0$$
Small example
X 23 25 36 42 50 60 68 80 85 95
Y 15 35 30 50 50 45 52 70 75 80
[Figure: scatterplot of Y against X for the small example]
Estimates
• $\hat\beta = \frac{32711 - 10 \times 56.4 \times 50.2}{9 \times 25.34^2} = 0.761$
• $s_\alpha = 0.0993 \times \sqrt{\frac{37588}{10}} = 6.09$
• $t_b = \frac{0.761}{0.0993} = 7.664 \ge t_{0.025,8} = 2.306$ ($p = 5.94 \times 10^{-5}$)
Confidence intervals
Prediction intervals
• We could also look at the interval where the values of Y should lie
when X = x0
$$s_{pred}^2 = s^2 \times \left( 1 + \frac{1}{n} + \frac{(x_0 - m_x)^2}{\sum (x_i - m_x)^2} \right)$$
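A sketch of how this variance leads to a prediction interval, on the ten pairs of the small example from these slides. Two assumptions are made here: the new value x0 = 70 is arbitrary, and the interval is built as ŷ0 ± t_{α/2, n−2} × s_pred, which is what `predict()` returns with `interval = "prediction"`.

```r
x <- c(23, 25, 36, 42, 50, 60, 68, 80, 85, 95)
y <- c(15, 35, 30, 50, 50, 45, 52, 70, 75, 80)
n   <- length(x)
fit <- lm(y ~ x)

x0 <- 70                                   # hypothetical new value of X
s  <- summary(fit)$sigma                   # residual standard deviation

s_pred <- s * sqrt(1 + 1 / n + (x0 - mean(x))^2 / sum((x - mean(x))^2))
y0_hat <- sum(coef(fit) * c(1, x0))        # predicted value at x0

manual <- y0_hat + c(-1, 1) * qt(0.975, df = n - 2) * s_pred
auto   <- predict(fit, newdata = data.frame(x = x0),
                  interval = "prediction", level = 0.95)
print(manual)
print(auto)                                # columns fit, lwr, upr
```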
Example
[Figure: two side-by-side scatterplots of y against x]
• X: weight
Result
[Figure: scatterplot of PEmax against Weight (kg)]
α̂ = 63.5 (12.7)
β̂ = 1.19 (0.30)
t = 3.94, df = 23, p = 0.0006
With R
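• lm(y ~ x)
• Preceding example
x <- c(23, 25, 36, 42, 50, 60, 68, 80, 85, 95)
y <- c(15, 35, 30, 50, 50, 45, 52, 70, 75, 80)
fit <- lm(y ~ x)  # fit the simple linear regression
summary(fit)      # coefficient table with standard errors and tests
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.27142    6.08872   1.194    0.267
x            0.76114    0.09931   7.664 5.94e-05 ***
• birthwt example
library(MASS)  # provides the birthwt data set
summary(lm(bwt ~ lwt, data = birthwt))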
birthwt example
Call:
lm(formula = bwt ~ lwt, data = birthwt)
Residuals:
Min 1Q Median 3Q Max
-2192.12 -497.97 -3.84 508.32 2075.60
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2369.624 228.493 10.371 <2e-16 ***
lwt 4.429 1.713 2.585 0.0105 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
• X: weight
Diagnostics
Assumptions:
Illustration
Verifying assumptions
Residuals
Residuals vs predicted
[Figure: residuals against predicted PEmax]
[Figure: normal Q-Q plot of the residuals (sample quantiles against theoretical quantiles)]
• Residuals vs fitted
# f1 is a previously fitted lm object
plot(predict(f1), resid(f1),
     xlab = "Predicted newborn weight (kg)", ylab = "Residuals (kg)")
• Q-Q plot
qqnorm(resid(f1))
qqline(resid(f1))
Correlation vs regression
[Figure: two scatterplots of Y against X, one with X ranging from 18 to 24 and one with X ranging from 5 to 25]
• Same model: Y = α + βX + ϵ
• With X ∈ {0, 1}
• The test of β = 0 is the same as the two-sample t-test (the original Student
version with equal variances, not Welch's)
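This equivalence can be checked in R. The sketch below uses the `birthwt` data from the MASS package with `smoke` (coded 0/1) as the binary predictor; the choice of this variable is an assumption made for illustration. The t statistics agree up to sign (the two functions take the difference in opposite directions) and the p-values are identical.

```r
library(MASS)   # provides the birthwt data set

# Regression on a binary predictor (smoke is coded 0/1)
fit <- lm(bwt ~ smoke, data = birthwt)

# Student's two-sample t-test with equal variances (not Welch's)
tt <- t.test(bwt ~ smoke, data = birthwt, var.equal = TRUE)

t_lm <- summary(fit)$coefficients["smoke", "t value"]
p_lm <- summary(fit)$coefficients["smoke", "Pr(>|t|)"]

print(c(t_lm = t_lm, t_ttest = unname(tt$statistic)))  # same absolute value
print(c(p_lm = p_lm, p_ttest = tt$p.value))            # same p-value
```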
Multiple regression
• Model: Y = β0 + β1 X1 + β2 X2 + . . . + βp Xp + ϵ
Inference
• In the end, we still get point estimates β̂k and their variances / standard errors (SE)
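A short sketch of how these quantities can be extracted in R, assuming a model of `bwt` on `age` and `lwt` from the `birthwt` data (MASS package). The choice of predictors is illustrative.

```r
library(MASS)   # provides the birthwt data set

fit   <- lm(bwt ~ age + lwt, data = birthwt)
coefs <- summary(fit)$coefficients

estimates  <- coefs[, "Estimate"]      # point estimates of the beta_k
std_errors <- coefs[, "Std. Error"]    # their standard errors
variances  <- diag(vcov(fit))          # their estimated variances (SE squared)

print(cbind(estimates, std_errors, variances))
```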
Example
Call:
lm(formula = bwt ~ lwt, data = birthwt)
Residuals:
Min 1Q Median 3Q Max
-2192.12 -497.97 -3.84 508.32 2075.60
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2369.624 228.493 10.371 <2e-16 ***
lwt 4.429 1.713 2.585 0.0105 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Interpretation of coefficients
• α: the mean of the outcome Y when all of the predictors are equal to 0
Residuals:
Min 1Q Median 3Q Max
-2233.11 -499.33 9.44 520.48 1897.84
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2214.412 299.311 7.398 4.59e-12 ***
age 8.089 10.063 0.804 0.4225
lwt 4.177 1.744 2.395 0.0176 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residuals:
Min 1Q Median 3Q Max
-1696.93 -481.80 -19.06 447.69 1702.05
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2515.511 289.843 8.679 2.22e-15 ***
age 6.190 9.467 0.654 0.514002
lwt 4.015 1.692 2.373 0.018685 *
smoke -206.773 101.398 -2.039 0.042874 *
ht -623.449 206.135 -3.024 0.002851 **
ui -500.843 141.189 -3.547 0.000495 ***
I(ptl > 0)TRUE -260.002 139.369 -1.866 0.063711 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
3 Final recap
Summing Up