REG2022
LINEAR REGRESSION
Master of Statistics
2014–2015
Contents
2.1.4 Example: MgCO3 Content of Sand Dollars . . . . . . . . . . . . . . . . 29
3.2 Residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.4.2 Test for Autocorrelation . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.5.1 Nonlinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.3 Effects of Measurement Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
5.6.1 Interval Estimation of E(Yh ) . . . . . . . . . . . . . . . . . . . . . . . 157
5.13.2 Example: Crew Productivity Data . . . . . . . . . . . . . . . . . . . . . 201
6.2.2 Ra,p² or MSEp Criterion . . . . . . . . . . . . . . . . . . . . . . 224
6.4 Diagnostic Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
Appendix A 297
Appendix B 302
Bibliography 305
Chapter 1
• Regression Model
Chapter 1. Simple Linear Regression C. Sotto
Reference
Chapter 1. Simple Linear Regression C. Sotto
Functional Relation
Examples:
Y = 2X
Y = (1/2) g X²
Statistical Relation
Y = f (X) + ε
f (X) = β0 + β1X.
Y = β0 + β1X + ε.
Example:
Suppose Y is the height of a tree and X is its age.
The Chapman-Richards growth model is
Y = a[1 − exp(−bX)]^c + ε.
Assumptions:
Additional Assumptions:
εi ~ i.i.d. N(0, σ²), ∀ i
1. That is, each εi is normally distributed, with E(εi) = 0 and Var(εi) = σ², for all i.
2. For two different trials, i and j, the error terms εi and
εj are independent.
Model 1 : Yi = β0 + β1Xi + εi
[Figure: scatter plot of number of Divorces (vertical axis, approx. 120–145).]
b0 = β̂0 = Ȳ − b1X̄

where:
SSXY = Σ(Xi − X̄)(Yi − Ȳ) = ΣXiYi − (ΣXi)(ΣYi)/n
     ≡ (corrected) sum of the cross-products
SSXX = Σ(Xi − X̄)² = ΣXi² − (ΣXi)²/n
     ≡ (corrected) sum of squares for X

SSE = Q(β̂0, β̂1) = Σi=1..n (Yi − β̂0 − β̂1Xi)²
    = Σi=1..n (Yi − Ŷi)² = Σi=1..n ei² ,
Note that:
• ei = Yi − Ŷ i is the difference between observed and
predicted values at Xi
• we can think of ei as an “estimator” of the error εi
s² = MSE = SSE/(n − 2) .

Estimand            Estimator
β1                  β̂1 = SSXY / SSXX
β0                  β̂0 = Ȳ − β̂1X̄
εi = Yi − E(Yi)     ei = Yi − Ŷi
σ²                  s² = MSE = SSE/(n − 2)
SSE = SSYY − (SSXY)² / SSXX

that is,

SSE = Σi=1..n (Yi − Ȳ)² − [Σi=1..n (Xi − X̄)(Yi − Ȳ)]² / Σi=1..n (Xi − X̄)²

where:
SSYY = Σ(Yi − Ȳ)² = ΣYi² − (ΣYi)²/n
SSXX = Σ(Xi − X̄)² = ΣXi² − (ΣXi)²/n
SSXY = Σ(Xi − X̄)(Yi − Ȳ) = ΣXiYi − (ΣXi)(ΣYi)/n
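The computational shortcut above can be checked numerically. A minimal sketch in Python (the data values are hypothetical, chosen only to exercise the formulas):

```python
# Hypothetical toy data to check the shortcut SSE = SSYY - (SSXY)^2/SSXX
# against the direct sum of squared residuals.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(X)
xbar = sum(X) / n
ybar = sum(Y) / n

SS_XX = sum((x - xbar) ** 2 for x in X)
SS_YY = sum((y - ybar) ** 2 for y in Y)
SS_XY = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))

b1 = SS_XY / SS_XX
b0 = ybar - b1 * xbar

# Direct SSE from the residuals
SSE_direct = sum((y - b0 - b1 * x) ** 2 for x, y in zip(X, Y))
# Shortcut formula
SSE_shortcut = SS_YY - SS_XY ** 2 / SS_XX

print(abs(SSE_direct - SSE_shortcut) < 1e-9)  # True
```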
Gauss-Markov Theorem
data TOLUCA;
input SIZE HOURS;
datalines;
80 399
30 121
50 221
90 376
70 361
60 224
120 546
80 352
100 353
50 157
40 160
70 252
90 389
20 113
110 435
100 420
30 212
50 268
90 377
110 421
30 273
90 468
40 244
80 342
70 323
;
run;
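As a cross-check on what PROC REG computes, the least-squares estimates for the Toluca data can be reproduced with the SSXY/SSXX formulas from earlier in the chapter; a sketch in Python, assuming the data exactly as listed in the DATA step above:

```python
# Least-squares fit of HOURS on SIZE for the Toluca data above.
size = [80, 30, 50, 90, 70, 60, 120, 80, 100, 50, 40, 70, 90,
        20, 110, 100, 30, 50, 90, 110, 30, 90, 40, 80, 70]
hours = [399, 121, 221, 376, 361, 224, 546, 352, 353, 157, 160, 252, 389,
         113, 435, 420, 212, 268, 377, 421, 273, 468, 244, 342, 323]

n = len(size)
xbar = sum(size) / n      # 70.0
ybar = sum(hours) / n     # 312.28
ssxy = sum((x - xbar) * (y - ybar) for x, y in zip(size, hours))
ssxx = sum((x - xbar) ** 2 for x in size)

b1 = ssxy / ssxx          # slope estimate
b0 = ybar - b1 * xbar     # intercept estimate
print(round(b1, 4), round(b0, 2))  # 3.5702 62.37
```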
SAS Output
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Chapter 2
• Analysis of Variance
• Coefficient of Determination, R2
Chapter 2. Inferences in Regression Analysis C. Sotto
Normality
The following are two key results that we will use to show
that β̂ 1 has a normal distribution:
β̂1 = Σ(Xi − X̄)Yi / SSXX

Thus,

β̂1 = Σ(Xi − X̄)Yi / SSXX ≡ Σ wiYi ,  with wi = (Xi − X̄)/SSXX ,

implying that it is a weighted sum of independent, normal variables. Hence, by result (2) above, β̂1 has a normal distribution.

Therefore,

β̂1 ~ N( β1 , σ² / Σ(Xi − X̄)² ) .
The exact form of the critical region for the test will be
determined by the alternative hypothesis, HA.
SAS Code
data SAND;
input mgco3 temp;
datalines;
9.20 17.50
9.20 21.00
9.40 20.00
9.00 15.30
8.50 14.00
8.50 13.10
8.80 13.30
8.50 13.00
9.30 19.00
9.00 18.70
;
run;
SAS Output
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t| 95% Confidence Limits
Covariance of Estimates
H0 : β1 = 0 versus HA : β1 ≠ 0 .
t* = (β̂1 − β1)/s(β̂1) = (0.0994 − 0)/√(0.0288/86.329) = 0.0994/√0.0003336 = 5.44 .
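The arithmetic of this test statistic is easy to reproduce outside SAS; a minimal sketch in Python, using only the numbers quoted above:

```python
import math

# t* = (b1 - 0) / s(b1), with s^2(b1) = MSE / SSXX for the sand-dollar example.
b1 = 0.0994
mse = 0.0288
ssxx = 86.329

s_b1 = math.sqrt(mse / ssxx)  # ~ sqrt(0.0003336)
t_star = b1 / s_b1
print(round(t_star, 2))  # 5.44
```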
Writing β̂0 = Σi=1..n kiYi and β̂1 = Σi=1..n wiYi, we obtain

Ŷh = β̂0 + β̂1xh
   = Σ kiYi + xh Σ wiYi
   = Σ (kiYi + xhwiYi)
   = Σ (ki + xhwi)Yi

Thus,

Ŷh = Σ (ki + xhwi)Yi ≡ Σ ciYi ,  with ci = ki + xhwi ,

is also a linear combination of the observations.

E(Ŷh) = β0 + β1xh

Var(Ŷh) = σ² [ 1/n + (xh − X̄)²/SSXX ] .
SAS Code
data SAND;
input mgco3 temp;
datalines;
9.20 17.50
9.20 21.00
9.40 20.00
9.00 15.30
8.50 14.00
8.50 13.10
8.80 13.30
8.50 13.00
9.30 19.00
9.00 18.70
. 15
;
run;
SAS Output
The REG Procedure
Model: MODEL1
Dependent Variable: mgco3
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Output Statistics
Sum of Residuals 0
Sum of Squared Residuals 0.23046
Predicted Residual SS (PRESS) 0.36696
Note that this is the same point estimator that we use for
estimating the mean response:
Ŷ h = β̂ 0 + β̂ 1xh .
The difference now is that the prediction of a single new
observation is more variable than when estimating the
mean response, i.e.
Var(Ŷh(new)) = σ² [ 1 + 1/n + (xh − X̄)²/SSXX ] .

This is estimated as

s²(Ŷh(new)) = MSE [ 1 + 1/n + (xh − X̄)²/SSXX ]
            = MSE + MSE [ 1/n + (xh − X̄)²/SSXX ]

s²(Ŷh(new)) = MSE + s²(Ŷh) .
The plot on the left shows the line of best fit along with
the 95% confidence interval for the mean response of
growth at each value of temperature (X).
SAS Code
proc reg corr simple;
model growth = temp;
plot growth*temp/conf95 vaxis=-1 to 5 by 1;
plot growth*temp/pred95 vaxis=-1 to 5 by 1;
run;
s²(Ȳh(new)) = MSE/m + s²(Ŷh) .

Note the difference between the latter and s²(Ŷh(new)).
Working-Hotelling Band
Ŷh ∓ W s(Ŷh) ,

where:
W = √( 2 F1−α ; 2 , n−2 )  and
s²(Ŷh) = MSE [ 1/n + (xh − X̄)²/SSXX ] .
Note:
Yi − Ȳ  =  (Yi − Ŷi)  +  (Ŷi − Ȳ)

total deviation = deviation around the fitted regression line
                  + deviation of the fitted value around the mean

Σi=1..n (Yi − Ȳ)² = Σ (Yi − Ŷi + Ŷi − Ȳ)²
                  = Σ [ (Yi − Ŷi)² + (Ŷi − Ȳ)² + 2(Yi − Ŷi)(Ŷi − Ȳ) ]
                  = Σ (Yi − Ŷi)² + Σ (Ŷi − Ȳ)² + 2 Σ (Yi − Ŷi)(Ŷi − Ȳ)
                        = SSE          = SSR           = 0
Or, respectively,
• SSR ⇒ the “explained” part
• SSE ⇒ the “unexplained” part
Source of Variation   df     SS    MS                 E(MS)
Regression            1      SSR   MSR = SSR/1        σ² + β1² SSXX
Error                 n − 2  SSE   MSE = SSE/(n − 2)  σ²
F* = MSR/MSE  ~(H0)  F1 , n−2 .
H0 : β1 = 0 versus HA : β1 ≠ 0 .
HA : β1 < 0 or HA : β1 > 0 .
Source of Variation df SS MS F∗ p
Regression 1 0.85354 0.85354 29.63 0.0006
Error 8 0.23046 0.02881
Total 9 1.08400
Parameter Estimates
H0 : β1 = 0 versus HA : β1 ≠ 0 .
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 0 0 . . .
Error 9 1.08400 0.12044
Corrected Total 9 1.08400
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
F* = [ (SSE(R) − SSE(F)) / (dfR − dfF) ] / [ SSE(F) / dfF ]
   = [ (1.08400 − 0.23046)/(9 − 8) ] / [ 0.23046/8 ]
   = 0.85354 / 0.02881

F* = 29.6265
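The general linear test computation is simple enough to verify directly; a sketch in Python, using the SSE and degree-of-freedom values quoted above:

```python
# General linear F-test comparing the reduced (R) and full (F) models.
sse_r, df_r = 1.08400, 9
sse_f, df_f = 0.23046, 8

f_star = ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)
print(round(f_star, 2))  # matches the 29.63 region of the value above
```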
(You will see some examples of this later in the course when
we discuss Multiple Linear Regression and more complex
types of hypotheses in Chapter 5.)
Note:
1. 0 ≤ R2 ≤ 1. Why?
2. If all the data points fall exactly on a line having
non-zero slope, then R2 = 1.
3. If β̂ 1 = 0, then R2 = 0.
4. The square root of R² yields the correlation coefficient, r, between X and Y, i.e.
   r = ±√R² ,
   where the sign is chosen according to the sign of β̂1.
5. The estimated slope β̂1 and r = √R² are related as follows:
   r = β̂1 (sx/sy) ,
   where sx and sy are the standard deviations of X and Y, respectively.
Recall that
r = SSXY / √(SSXX · SSYY) .
Simple Statistics
mgco3 temp
Chapter 3
• Diagnostic Plots
• Lack-of-Fit Test
• Remedial Measures
• Transformations
Chapter 3. Diagnostics and Remedial Measures C. Sotto
• Linearity
The regression function is not linear.
• Homoscedasticity
The error terms do not have constant variance.
• Independence
The error terms are not independent.
• Outliers
The fit is alright except for some outliers.
• Normality
The error terms are not normally distributed.
• Model Extension
Important independent variables are not in the model.
3.2 Residuals
ei = Yi − Ŷ i
• the residual may be regarded as the “observed” error
• it is not the same as the unknown true error
εi = Yi − E(Yi)
• if the model is appropriate for the data at hand, the
residuals should reflect the properties assumed for εi
(i.e. independence, normality, zero mean and constant
variance)
Properties:
1. Mean:
   ē = Σi=1..n ei / n = 0

2. (Sample) Variance:
   s² = Σ(ei − ē)²/(n − 2) = Σ ei²/(n − 2) = SSE/(n − 2) = MSE

3. Non-Independence:
   The residuals are not independent and are subject to two constraints:
   Σi=1..n ei = 0   and   Σi=1..n Xiei = 0
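The two constraints can be seen numerically: residuals from any least-squares line satisfy them automatically. A minimal sketch in Python on hypothetical data:

```python
# Check sum(e_i) = 0 and sum(X_i * e_i) = 0 for a least-squares fit.
X = [2.0, 4.0, 6.0, 8.0]
Y = [1.0, 4.0, 5.0, 9.0]

n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))
      / sum((x - xbar) ** 2 for x in X))
b0 = ybar - b1 * xbar

e = [y - b0 - b1 * x for x, y in zip(X, Y)]
print(abs(sum(e)) < 1e-9,
      abs(sum(x * ei for x, ei in zip(X, e))) < 1e-9)  # True True
```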
Semistudentized Residuals:

e*i = (ei − ē)/√MSE = ei/√MSE

• if √MSE were an estimate of the standard deviation of ei, then e*i would be the studentized residual
  (Studentized and other residuals will be discussed in Chapter 6.)
• the standard deviation of ei is not equal to √MSE; it varies with each ei
• √MSE is only an approximation of the standard deviation of ei; hence, we call e*i a semistudentized residual
Useful for:
1. finding outliers and/or misrecorded values
2. examining the shape of the distribution
• plot of e∗ vs. X
• normal quantile plot
Some Considerations:
Note:
Note:
√MSE · Φ⁻¹( (k − 0.375)/(n + 0.25) ) ,

where Φ(a) = P(Z ≤ a) is the CDF of N(0, 1) at a, so that Φ⁻¹ is the standard normal quantile function.
(See p. 110-112 in Kutner et al. for a detailed description.)
Procedure:
1. Fit a regression line and compute the residuals.
2. Calculate the expected value of the ei under normality.
a. Sort the residuals from lowest to highest.
b. For the k-th ordered residual, its expected value under normality is approximated by:
   √MSE · Φ⁻¹( (k − 0.375)/(n + 0.25) ) .
3. Compute the correlation coefficient between the
original ei and their expected values under normality.
4. Test the resulting correlation coefficient using
Table B.6 in Kutner et al.
SAS Code
proc reg data=dataset;
model1: model Y = X;
output out=resids residual=res; run; quit;
proc sort data=resids; by res; run;
data resids2; set resids;
/*** Suppose MSE = 2588 and n = 50. ***/
exp_res = sqrt(2588)*probit((_n_-0.375)/(50+0.25)); run;
proc corr data=resids2;
var exp_res res; run;
proc gplot data=resids2;
plot exp_res*res; run; quit;
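The same steps can be mirrored outside SAS. A minimal sketch in Python (standard library only), assuming hypothetical residuals and an assumed MSE; `NormalDist().inv_cdf` plays the role of `probit`:

```python
from statistics import NormalDist

# Expected values of the ordered residuals under normality:
# sqrt(MSE) * Phi^{-1}((k - 0.375)/(n + 0.25)).
# The residuals and MSE below are hypothetical, for illustration only.
res = [-3.1, 0.4, 1.2, -0.8, 2.3, -0.2, 0.9, -1.5, 1.8, -1.0]
n = len(res)
mse = 2.5  # assumed error mean square

ordered = sorted(res)
expected = [(mse ** 0.5) * NormalDist().inv_cdf((k - 0.375) / (n + 0.25))
            for k in range(1, n + 1)]

# Pearson correlation of ordered residuals with their normal scores,
# to be compared with the critical values in Table B.6 of Kutner et al.
mo, me = sum(ordered) / n, sum(expected) / n
num = sum((a - mo) * (b - me) for a, b in zip(ordered, expected))
den = (sum((a - mo) ** 2 for a in ordered)
       * sum((b - me) ** 2 for b in expected)) ** 0.5
r = num / den
print(round(r, 3))
```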
Yi = β0 + β1xi + εi ,
εi = ρ εi−1 + ui , |ρ| < 1
ui ∼ N (0, σ 2) independent
ρ : autocorrelation parameter
Hypotheses:
H0 : ρ = 0 versus HA : ρ > 0
Statistic:
D = Σi=2..n (ei − ei−1)² / Σi=1..n ei²
Decision Rule:
D > du ⇒ do not reject H0
D < dl ⇒ reject H0
dl ≤ D ≤ du ⇒ inconclusive
Values for dl and du can be found in Table B.7 in
Kutner et al.
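The Durbin-Watson statistic itself is a one-line computation; a sketch in Python, on a hypothetical residual series in time order:

```python
# Durbin-Watson statistic D = sum_{i=2..n}(e_i - e_{i-1})^2 / sum_i e_i^2.
# The residuals below are hypothetical; small D suggests positive autocorrelation.
e = [0.5, 0.6, 0.2, -0.1, -0.4, -0.3, 0.1, 0.4]

num = sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e)))
den = sum(ei ** 2 for ei in e)
D = num / den
print(round(D, 3))  # 0.565
```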
SAS Code
* from output:
* median grp 1: -19.8759596;
* median grp 2: -2.6840404;
* in the output:
* look at the t-value and p-value for ’Pooled’ method;
* this is the t-value and p-value for Levene’s test;
SAS Output
N Median
13 -19.8759596
N Median
12 -2.6840404
Variable: deviation
1 23.2054 53.4190
2 20.9386 50.1855
Diff (1-2) Pooled 24.1339 43.5582
Diff (1-2) Satterthwaite
Equality of Variances
SAS Code
/*************************************************/
/* IF YOU WANT TO DO THE COMPUTATIONS YOURSELF, */
/* YOU CAN GET THE NECESSARY VALUES USING THE FF */
/*************************************************/
* from output:
* ni = size of each group;
* dbari = mean of the absolute deviations, |ei - med(ei)|;
* cssi = corrected sum of squares, sum(di - dbari)^2;
SAS Output
N Mean Corrected SS
13 44.8150738 12566.61
N Mean Corrected SS
12 28.4503367 9610.29
Procedure:
1. Fit a new regression line without the suspect
observation.
H0 : E(Y) = β0 + β1X
vs.
HA : E(Y) ≠ β0 + β1X.
Figure 3.13: Scatter Plot and Fitted Regression Line for Bank Example.
Assumptions
Notation
Test Procedure
Procedure

Step 1: Fit the full model
   Yij = μj + εij ,  with i = 1, …, nj and j = 1, …, c

   SSE(F) = Σj=1..c Σi=1..nj (Yij − Ȳj)² = SSPE

   dfF = Σj=1..c (nj − 1) = n − c

Step 2: Fit the reduced model
   Yij = β0 + β1Xj + εij ,  with i = 1, …, nj and j = 1, …, c

   SSE(R) = Σj=1..c Σi=1..nj (Yij − β̂0 − β̂1Xj)² = SSE

   dfR = n − 2

Step 3: Compute the general linear test statistic
   F* = [ (SSE − SSPE) / ((n − 2) − (n − c)) ] / [ SSPE/(n − c) ]
      = [ SSLF/(c − 2) ] / [ SSPE/(n − c) ]

   F* = MSLF/MSPE  ~(H0)  Fc−2 , n−c
ΣΣ (Yij − Ŷij)² = ΣΣ (Yij − Ȳj)² + ΣΣ (Ȳj − Ŷij)² ,

This gives
SSE = SSPE + SSLF .
So,
SSLF = Σj=1..c Σi=1..nj (Ȳj − Ŷij)² = Σj=1..c nj (Ȳj − Ŷj)² .
Decision Rule:
ANOVA Table
Example
F* = MSLF/MSPE = 980/5950 = 0.165  ~(H0)  F2 , 4
SAS Code
data LOFFIT;
input X Y;
cards;
50 1530
50 1410
100 1690
100 1550
150 1680
150 1760
200 1850
200 1770
;
run;
SAS Output
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Overview
• Linearity
– modify the regression model
– use a transformation on X and/or Y
• Homoscedasticity
– use the method of weighted least squares
– use a variance stabilizing transformation
• Independence
– use a time series model
– use generalized least squares
– special transformations
• Normality
– use a Generalized Linear Model (GLM)
– use a transformation on Y
• Outliers
– discard outliers from recording errors (be careful!)
– add interaction(s) between independent variables
– use a robust estimation method
• Model Extension
– include more independent variables in the model
(latter part of this course)
3.5.1 Nonlinearity
Some Considerations:
• Find an appropriate model for the problem via
literature search.
• Develop a relationship that makes sense with
regard to the problem. Check diagnostic plots.
This is an iterative process.
• Non-parametric regression methods can be helpful.
3.5.3 Outliers
3.5.4 Non-Independence
3.5.5 Non-Normality
3.6 Transformations
(a) X′ = log(X)  or  X′ = √X
(b) X′ = X²  or  X′ = exp(X)
(c) X′ = 1/X  or  X′ = exp(−X)
(a) Y′ = √Y   (b) Y′ = log(Y)   (c) Y′ = 1/Y

Y′ = { Y^λ ,  if λ ≠ 0
     { ln Y , if λ = 0 (by definition)
Special Cases

 λ      Transform Y′
 2.0    Y²
 0.5    √Y
 0.0    loge(Y)
−0.5    1/√Y
−1.0    1/Y
Likelihood Function:

∏i=1..n fYiλ(Yiλ; xi, β0, β1, σ²) = (2πσ²)^(−n/2) exp[ − Σi=1..n (Yiλ − β0 − β1Xi)² / (2σ²) ]
Idea
Procedure
n = 25 children
Y : plasma level
X: age
SAS Code
data plasma;
input age plasma logplasma;
cards;
0 13.44 1.1284
0 12.84 1.1086
0 11.91 1.0759
0 20.09 1.3030
0 15.60 1.1931
1.0 10.11 1.0048
1.0 11.38 1.0561
1.0 10.28 1.0120
1.0 8.96 .9523
1.0 8.59 .9340
2.0 9.83 .9926
2.0 9.00 .9542
2.0 8.65 .9370
2.0 7.85 .8949
2.0 8.88 .9484
3.0 7.94 .8998
3.0 6.01 .7789
3.0 5.14 .7110
3.0 6.90 .8388
3.0 6.77 .8306
4.0 4.86 .6866
4.0 5.10 .7076
4.0 5.67 .7536
4.0 5.75 .7597
4.0 6.23 .7945
;
run;
data transformed;
set plasma(rename=(plasma=Y));
W1 = 1.0000*((Y** 1 )-1);
W2 = 3.9260*((Y** 0.6)-1);
W3 = 5.8365*((Y** 0.5)-1);
W4 = 68.7428*((Y** 0.1)-1);
W5 = 8.51632*log(Y);
W6 = -53.9767*((Y**-0.3)-1);
W7 = -49.7059*((Y**-0.5)-1);
W8 = -54.4917*((Y**-0.7)-1);
W9 = -65.0484*((Y**-0.9)-1);
W10= -72.5278*((Y**-1 )-1);
run;
data BoxCoxRes;
merge constants(keep=lambda) SSEboxcox;
run;
SAS Output
Obs lambda K1 K2
1. Count Data
In such cases,
√Y = β0 + β1X
is often a good point to start. A slightly better version of this is the Freeman-Tukey transform for stabilizing variance (see Snedecor and Cochran, 1980, p. 447-453), i.e.
√Y + √(Y + 1) = β0 + β1X .
For these types of transformations, as well as for any other types of transformations of Y, you must be able to interpret your scientific questions in terms of parameters for the transformed variable.
2. Proportion Data
Variables
SAS Code
data senic;
input id ls age ir rcr rcxr nb msa reg adc nn afs;
datalines;
1 7.13 55.7 4.1 9.0 39.6 279 2 4 207 241 60.0
2 8.82 58.2 1.6 3.8 51.7 80 2 2 51 52 40.0
...........................................................
112 17.94 56.2 5.9 26.4 91.8 835 1 1 791 407 62.9
113 9.41 59.5 3.1 20.6 91.7 29 2 3 20 22 22.9
;
run;
SAS Output
Descriptive Statistics
Uncorrected Standard
Variable Sum Mean SS Variance Deviation
Correlation
Variable AFS IR
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Output Statistics
Sum of Residuals 0
Sum of Squared Residuals 167.09706
Predicted Residual SS (PRESS) 173.54345
Mean
Source DF Square F Value Pr > F
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Simple Statistics
Simple Statistics
expres res
SAS Graphs
[Graphical output not shown.]
Chapter 4
Simultaneous Inference
and Other Topics
Chapter 4. Simultaneous Inference and Other Topics C. Sotto
Bonferroni inequality:
P(A1ᶜ ∩ A2ᶜ) ≥ 1 − P(A1) − P(A2)
Remarks:
SAS Code
SAS Output
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Parameter Estimates
**********************************************************************
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Parameter Estimates
Example:
Yi = β1Xi + εi ,
E(Yi) = β1Xi .
Remarks:
Note:
Chapter 5
• Matrix Formulation
• Parameter Estimation
• Multicollinearity
Chapter 5. Multiple Linear Regression C. Sotto
Example:
β0 = 10, β1 = 2, β2 = 5

E(Y) = 10 + 2X1 + 5X2
Example:
Y : length in hospital stay
X1 : age of patient
X2 : gender of patient
Model:
Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + β4Xi4 + εi
or,

Yi = { β0 + β1Xi1 + β2Xi2 + εi ,       for fully disabled
     { β0 + β1Xi1 + β2Xi2 + β4 + εi ,  for partially disabled
     { β0 + β1Xi1 + β2Xi2 + β3 + εi ,  for not disabled
SAS Code
proc print;
run;
SAS Output
NEWSAMPLE dataset
Obs Y X Group D1 D2
1 106 43 1 1 0
2 106 41 3 0 0
3 97 47 1 1 0
4 113 46 2 0 1
5 96 45 3 0 0
6 119 41 2 0 1
7 92 47 3 0 0
8 112 41 1 1 0
9 92 48 2 0 1
10 102 48 2 0 1
11 107 42 3 0 0
12 107 47 1 1 0
13 102 43 1 1 0
14 115 44 2 0 1
15 101 42 3 0 0
REHAB2 dataset
1 29 Below_Av 1 0
2 42 Below_Av 1 0
3 38 Below_Av 1 0
4 43 Below_Av 1 0
5 40 Below_Av 1 0
6 30 Below_Av 1 0
7 35 Average 0 1
8 31 Average 0 1
9 31 Average 0 1
10 29 Average 0 1
11 35 Average 0 1
12 33 Average 0 1
13 26 Above_Av 0 0
14 32 Above_Av 0 0
15 21 Above_Av 0 0
16 20 Above_Av 0 0
17 23 Above_Av 0 0
18 22 Above_Av 0 0
Examples:
Yi = β0 + β1Xi + β2Xi² + εi
Yi = β0 + β1Xi + β2Xi² + β3Xi³ + εi

Example:
Yi = β0 + β1Xi1 + β2Xi2 + β3 Xi1Xi2 + εi ,  with Xi3 = Xi1Xi2
Example:
Yi = β0 + β1Xi1 + β2Xi2 + β3Xi2² + β4Xi1Xi2 + εi
Is it a linear model?
X′X β̂ = X′Y  ⇒  β̂ = (X′X)⁻¹X′Y

In matrix formulation:

Ŷ = X β̂ = X(X′X)⁻¹X′ Y = HY ,  where H = X(X′X)⁻¹X′ is the hat matrix

e = Y − Ŷ = Y − X β̂ = Y − X(X′X)⁻¹X′ Y = Y − HY = (I − H)Y

Var(e) = σ²(I − H)
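For simple regression the matrix solution can be written out with an explicit 2×2 inverse, which makes it easy to confirm that β̂ = (X′X)⁻¹X′Y agrees with the scalar formulas from Chapter 1. A sketch in Python on hypothetical data (no linear-algebra library needed):

```python
# beta-hat = (X'X)^{-1} X'Y for simple regression, with the 2x2 inverse
# written out explicitly. Data are hypothetical.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.1, 5.9, 8.2]
n = len(x)

# X'X = [[n, sum x], [sum x, sum x^2]],  X'Y = [sum y, sum xy]
sx, sxx = sum(x), sum(xi * xi for xi in x)
sy, sxy = sum(y), sum(xi * yi for xi, yi in zip(x, y))

det = n * sxx - sx * sx
b0 = (sxx * sy - sx * sxy) / det
b1 = (n * sxy - sx * sy) / det

# Agreement with the scalar formulas b1 = SSXY/SSXX, b0 = ybar - b1*xbar:
xbar, ybar = sx / n, sy / n
b1_scalar = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
             / sum((xi - xbar) ** 2 for xi in x))
print(abs(b1 - b1_scalar) < 1e-9,
      abs(b0 - (ybar - b1_scalar * xbar)) < 1e-9)  # True True
```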
Source of Variation   SS    df     MS
Regression            SSR   p − 1  MSR = SSR/(p − 1)
Error                 SSE   n − p  MSE = SSE/(n − p)
Hypotheses:
H0 : β1 = β2 = … = βp−1 = 0
vs.
HA : βk ≠ 0, for some k

Test Statistic:

F* = MSR/MSE  ~(H0)  Fp−1 , n−p
Decision Rule:
5.4 Coefficients
R² = SSR/SSTO = 1 − SSE/SSTO ∈ [0, 1]
• measures the proportionate reduction of the total
variation in Y associated with the use of the set of
predictors X1, X2, . . . , Xp−1
• a large value of R2 does not necessarily imply that
the model is a useful one
• adding more predictors to the model increases R2
Adjusted Coefficient
Ra² = 1 − [SSE/(n − p)] / [SSTO/(n − 1)] = 1 − (n − 1)/(n − p) · SSE/SSTO
(unlike R², Ra² can be negative when the model fits very poorly)
E(β̂) = β = (β0  β1  β2  ⋯  βp−1)′

Note that:

σ²(β̂) = [ σ²(β̂0)        σ(β̂0, β̂1)     ⋯  σ(β̂0, β̂p−1)  ]
         [ σ(β̂1, β̂0)     σ²(β̂1)         ⋯  σ(β̂1, β̂p−1)  ]
         [ ⋮              ⋮                  ⋱  ⋮               ]
         [ σ(β̂p−1, β̂0)  σ(β̂p−1, β̂1)  ⋯  σ²(β̂p−1)      ]

(β̂k − βk)/s(β̂k)  ~(H0)  tn−p ,  for k = 0, 1, …, p−1
We want to test:
H0 : βk = 0 versus HA : βk ≠ 0 .

Test Statistic:
t* = β̂k / s(β̂k)  ~(H0)  tn−p

Decision Rule:

β̂k ∓ B s(β̂k), for k = 1, 2, …, g ,
where:
B = t1−α/(2g) ; n−p .
Xh = (1  Xh1  Xh2  ⋯  Xh,p−1)′   (p × 1)

E(Yh) = Xh′ β

Ŷh = Xh′ β̂

σ²(Ŷh) = Xh′ [σ²(β̂)] Xh

s²(Ŷh) = MSE · Xh′(X′X)⁻¹Xh = Xh′ [s²(β̂)] Xh
Working-Hotelling Band
5.7 Predictions
Scheffé

Ŷk ∓ S s(Ŷh(new)) ,  k = 1, 2, …, g ,
where:
S = √( g F1−α ; g , n−p )
s²(Ŷh(new)) = MSE + s²(Ŷh) .
[Figure: scatter plot matrix of X1 (approx. 30–90), X2 (approx. 16–19) and Y (approx. 145–245) for the Dwaine Studios data.]
Levene Test
Breusch-Pagan Test
We want to test:
H0 : γ1 = γ2 = · · · = γq−1 = 0
vs.
HA : γk > 0, for some k .
Hypotheses:
Test Statistic:
F* = MSLF/MSPE = [SSLF/(c − p)] / [SSPE/(n − c)]  ~(H0)  Fc−p , n−c ,
where:
c = the number of groups with distinct sets of levels for the X variables
SSLF = SSE − SSPE
Decision Rule:
n = 21 cities
Variables
Y : sales
X1 : persons aged 16 or younger in the community
X2 : per capita disposable personal income in the
community
Question:
Model:
Yi = β0 + β1Xi1 + β2Xi2 + εi
To Do:
SAS Code
data DWAINE;
input X1 X2 Y;
cards;
68.5 16.7 174.4
45.2 16.8 164.4
91.3 18.2 244.2
47.8 16.3 154.6
46.9 17.3 181.6
66.1 18.2 207.5
49.5 15.9 152.8
52.0 17.2 163.2
48.9 16.6 145.4
38.4 16.0 137.2
87.9 18.3 241.9
72.8 17.1 191.1
88.4 17.4 232.0
42.9 15.8 145.3
52.5 17.8 161.1
85.7 18.4 209.7
41.3 16.5 146.4
51.7 16.3 144.0
89.6 18.1 232.6
82.7 19.1 224.1
52.3 16.0 166.5
;
run;
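The multiple regression fit for these data can be reproduced outside SAS by solving the normal equations X′X b = X′Y directly. A sketch in Python with a small Gaussian elimination (data assumed to be exactly the 21 rows listed above); the resulting R² should match the 0.9167 reported later in the output discussion:

```python
# Fit Y = b0 + b1*X1 + b2*X2 for the Dwaine Studios data by solving X'X b = X'Y.
rows = [(68.5,16.7,174.4),(45.2,16.8,164.4),(91.3,18.2,244.2),(47.8,16.3,154.6),
        (46.9,17.3,181.6),(66.1,18.2,207.5),(49.5,15.9,152.8),(52.0,17.2,163.2),
        (48.9,16.6,145.4),(38.4,16.0,137.2),(87.9,18.3,241.9),(72.8,17.1,191.1),
        (88.4,17.4,232.0),(42.9,15.8,145.3),(52.5,17.8,161.1),(85.7,18.4,209.7),
        (41.3,16.5,146.4),(51.7,16.3,144.0),(89.6,18.1,232.6),(82.7,19.1,224.1),
        (52.3,16.0,166.5)]
X = [[1.0, x1, x2] for x1, x2, _ in rows]
Y = [y for _, _, y in rows]
p = 3

# Build the normal equations X'X b = X'Y
XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
XtY = [sum(r[i] * y for r, y in zip(X, Y)) for i in range(p)]

# Gaussian elimination with partial pivoting on the augmented matrix
A = [row[:] + [rhs] for row, rhs in zip(XtX, XtY)]
for c in range(p):
    piv = max(range(c, p), key=lambda r: abs(A[r][c]))
    A[c], A[piv] = A[piv], A[c]
    for r in range(c + 1, p):
        f = A[r][c] / A[c][c]
        for k in range(c, p + 1):
            A[r][k] -= f * A[c][k]
b = [0.0] * p
for r in range(p - 1, -1, -1):
    b[r] = (A[r][p] - sum(A[r][k] * b[k] for k in range(r + 1, p))) / A[r][r]

# Coefficient of multiple determination
yhat = [sum(bi * xi for bi, xi in zip(b, row)) for row in X]
ybar = sum(Y) / len(Y)
r2 = 1 - (sum((yi - yh) ** 2 for yi, yh in zip(Y, yhat))
          / sum((yi - ybar) ** 2 for yi in Y))
print([round(v, 3) for v in b], round(r2, 4))
```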
SAS Output
The REG Procedure
Model: MODEL1
Dependent Variable: Y
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Ŷ = X β̂
e = Y − Ŷ
SAS Graphs
Figure 5.3: Residuals vs. Fitted Values for Dwaine Studios Example.
[Plot not shown.]
SAS Graphs
[Figure: residuals (res) vs. X1 and residuals vs. X2.]
SAS Graphs
[Figure: normal quantile plot of the residuals; horizontal axis: quantiles of the standard normal.]
ANOVA Table
H0 : β1 = 0 and β2 = 0
versus
HA : β1 6= 0 or β2 6= 0 .
Test Statistic:
F* = MSR/MSE = 99.10

R² = SSR/SSTO = 0.9167  and  Ra² = 0.9075
SAS Code
proc reg data=DWAINE alpha=0.05;
model Y = X1 X2 / clb;
run; quit;
SAS Output
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t| 95% Confidence Limits
191.10 ∓ 2.101(2.77)
SAS Code
data DWAINE2;
set DWAINE;
output;
if _n_=21 then do;
X1 = 65.4; X2 = 17.6; Y = .; output;
end;
run;
SAS Output
Output Statistics
Sum of Residuals 0
Sum of Squared Residuals 2180.92741
Predicted Residual SS (PRESS) 3002.92331
Scheffé:       S² = 5.24 ,  S = 2.29
Bonferroni:    B = t0.975 , 18 = 2.101
SAS Code
data DWAINE3;
set DWAINE;
output;
if _n_=21 then do;
X1 = 65.4; X2 = 17.6; Y = .; output;
X1 = 53.1; X2 = 17.7; Y = .; output;
end;
run;
SAS Output
The REG Procedure
Model: MODEL1
Dependent Variable: Y
Output Statistics
Sum of Residuals 0
Sum of Squared Residuals 2180.92741
Predicted Residual SS (PRESS) 3002.92331
Variables
Y : amount of body fat
X1 : skinfold thickness
X2 : thigh circumference
X3 : mid-arm circumference

n = 20 subjects
Model 1:
E(Yi) = β0 + β1Xi1
SSE(X1) = 143.12
Model 2:
Hence,
Note that:
Why?
SST O = SSR + SSE
More Examples

SSR(X3 | X1, X2) = SSE(X1, X2) − SSE(X1, X2, X3) = 109.95 − 98.41 = 11.54
                 = SSR(X1, X2, X3) − SSR(X1, X2) = 396.98 − 385.44 = 11.54

SSR(X3, X2 | X1) = SSE(X1) − SSE(X1, X2, X3) = 143.12 − 98.41 = 44.71
                 = SSR(X1, X2, X3) − SSR(X1) = 396.98 − 352.27 = 44.71
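The two equivalent routes to an extra sum of squares (via SSE or via SSR) can be checked with the body-fat values quoted above; a minimal sketch in Python:

```python
# Extra sum of squares SSR(X3 | X1, X2), computed two equivalent ways
# from the SSE and SSR values quoted in the text.
sse = {("X1",): 143.12, ("X1", "X2"): 109.95, ("X1", "X2", "X3"): 98.41}
ssr = {("X1",): 352.27, ("X1", "X2"): 385.44, ("X1", "X2", "X3"): 396.98}

ssr_x3_given_12 = sse[("X1", "X2")] - sse[("X1", "X2", "X3")]
same_via_ssr = ssr[("X1", "X2", "X3")] - ssr[("X1", "X2")]
print(round(ssr_x3_given_12, 2), round(same_via_ssr, 2))  # 11.54 11.54
```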
For 2 predictors:
SSR(X1, X2) = SSR(X1) + SSR(X2 | X1) = SSR(X2) + SSR(X1 | X2)
Hence,
SAS Code
data BODY;
input X1 X2 X3 Y;
cards;
19.5 43.1 29.1 11.9
24.7 49.8 28.2 22.8
30.7 51.9 37.0 18.7
29.8 54.3 31.1 20.1
19.1 42.2 30.9 12.9
25.6 53.9 23.7 21.7
31.4 58.5 27.6 27.1
27.9 52.1 30.6 25.4
22.1 49.9 23.2 21.3
25.5 53.5 24.8 19.3
31.1 56.6 30.0 25.4
30.4 56.7 28.3 27.2
18.7 46.5 23.0 11.7
19.7 44.2 28.6 17.8
14.6 42.7 21.3 12.8
29.5 54.4 30.1 23.9
27.7 55.3 25.7 22.6
30.2 58.6 24.6 25.4
22.7 48.2 27.1 14.8
25.2 51.0 27.5 21.1
;
run;
SAS Output
Model 1
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Model 2
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Model 3
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Model 4
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
For 3 predictors:
Source of Variation SS df MS
X1 SSR(X1) 1 MSR(X1 )
Source of Variation SS df MS
Total 495.39 19
SAS Code
SAS Output

Model MODELA (Dependent Variable: Y): Parameter Estimates with Type I SS (numeric output not reproduced).

Model MODELB (Dependent Variable: Y): Parameter Estimates with Type I SS (numeric output not reproduced).

Model MODELC (Dependent Variable: Y): Parameter Estimates with Type II SS (numeric output not reproduced).
H0 : βk = 0 versus HA : βk ≠ 0

Statistic:
        t* = β̂k / s(β̂k)  ~  t(n−p) under H0

Consider k = 3. Testing

H0 : β3 = 0 versus HA : β3 ≠ 0,

we obtain:
        t* = −2.186 / 1.596 = −1.37

        (t*)² = (−1.37)² = 1.8769 = F*
Full Model:
Y = β0 + β1X1 + β2X2 + β3X3 + ε
Reduced Model:
Y = β0 + β1X1 + ε
Statistic:
        F* = [SSE(X1) − SSE(X1, X2, X3)] / [(n − 2) − (n − 4)]  ÷  SSE(X1, X2, X3) / (n − 4)

        F* = (44.71 / 2) ÷ (98.41 / 16) = 3.63 = F(0.95; 2, 16)

F* falls exactly on the critical boundary ⇒ borderline case
Statistic:
        F* = MSR(Xq, Xq+1, . . . , Xp−1 | X1, X2, . . . , Xq−1) / MSE(X1, X2, . . . , Xp−1)

Besides tests of the form

(1) H0 : β2 = β3 = 0 vs. HA : β2 ≠ 0 or β3 ≠ 0,

other, more general forms of tests on the βk are also possible. For instance, for 3 predictors:

(2) H0 : β1 = β2 vs. HA : β1 ≠ β2
(3) H0 : β2 + β3 = 0 vs. HA : β2 + β3 ≠ 0
(4) H0 : β2 + β3 = 5 vs. HA : β2 + β3 ≠ 5
(5) H0 : β1 = 3 and β3 = 5 vs. HA : β1 ≠ 3 or β3 ≠ 5

For all such tests, one can use the general linear F-test:

        F* = [(SSE(R) − SSE(F)) / (dfR − dfF)] / [SSE(F) / dfF]  ~  F(dfR − dfF, dfF) under H0.
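The general linear F-test is easy to check numerically. The sketch below (Python, offered as an illustration alongside the SAS analyses in these notes) plugs in the body fat figures quoted earlier: dropping X2 and X3 from the full model increases SSE by 44.71 on 2 df, against SSE(F) = 98.41 on 16 df.

```python
# General linear F-test: compare a nested reduced model against the full model.
def general_linear_F(sse_reduced, df_reduced, sse_full, df_full):
    """F* = [(SSE(R) - SSE(F)) / (dfR - dfF)] / [SSE(F) / dfF]."""
    return ((sse_reduced - sse_full) / (df_reduced - df_full)) / (sse_full / df_full)

# Body fat example: reduced model (X1 only) vs. full model (X1, X2, X3).
f_star = general_linear_F(sse_reduced=98.41 + 44.71, df_reduced=18,
                          sse_full=98.41, df_full=16)
print(round(f_star, 2))  # 3.63, right at F(0.95; 2, 16) -- the borderline case
```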
SAS Code

SAS Output

[Sequence of ANOVA tables (Source, DF, Mean Square, F Value, Pr > F) — numeric output not reproduced.]
Note:
        R²Y2|1 = (rY2 − rY1 r12)² / [(1 − r²Y1)(1 − r²12)]
More Examples

        R²Y1|23 = SSR(X1|X2, X3) / SSE(X2, X3)
        R²Y2|13 = SSR(X2|X1, X3) / SSE(X1, X3)
        R²Y3|12 = SSR(X3|X1, X2) / SSE(X1, X2)
        R²Y4|123 = SSR(X4|X1, X2, X3) / SSE(X1, X2, X3)

For the body fat data:

        R²Y2|1 = SSR(X2|X1) / SSE(X1) = 33.17 / 143.12 = 0.232
        R²Y3|12 = SSR(X3|X1, X2) / SSE(X1, X2) = 11.54 / 109.95 = 0.105
        R²Y1|2 = SSR(X1|X2) / SSE(X2) = 3.47 / 113.42 = 0.031
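These partial coefficients are simple ratios, so they can be verified in one line each. A minimal Python check using the extra sums of squares quoted above:

```python
# Coefficient of partial determination: extra regression sum of squares for a
# predictor, relative to the SSE of the model that excludes it.
def partial_r2(ssr_extra, sse_without):
    return ssr_extra / sse_without

r2_y2_1  = partial_r2(33.17, 143.12)   # X2 given X1
r2_y3_12 = partial_r2(11.54, 109.95)   # X3 given X1, X2
r2_y1_2  = partial_r2(3.47, 113.42)    # X1 given X2
print(round(r2_y2_1, 3), round(r2_y3_12, 3), round(r2_y1_2, 3))  # 0.232 0.105 0.031
```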
Correlation Transform
5.13 Multicollinearity
Model 1: Yi = β0 + β1X1i + β2X2i + εi
SSR(X1|X2) = SSR(X1)
SSR(X2|X1) = SSR(X2)
Variables
Y : crew productivity
X1 : crew size
X2 : level of bonus pay
(X1 and X2 are uncorrelated)
Model 2: Yi = β0 + β1X1i + εi
Model 3: Yi = β0 + β2X2i + εi
Model Predictors β̂ 1 β̂ 2
1 X1 , X2 5.375 9.25
2 X1 5.375 –
3 X2 – 9.25
SAS Code
data CREW;
input X1 X2 Y;
cards;
4 2 42
4 2 39
4 3 48
4 3 51
6 2 49
6 2 53
6 3 61
6 3 60
;
run;
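Because the CREW design is balanced, X1 and X2 are exactly uncorrelated, so each slope estimate is unchanged by including the other predictor. The Python sketch below fits the three models by least squares, as an illustration of the SAS results that follow:

```python
import numpy as np

# CREW data from the datalines above.
X1 = np.array([4, 4, 4, 4, 6, 6, 6, 6], dtype=float)
X2 = np.array([2, 2, 3, 3, 2, 2, 3, 3], dtype=float)
Y  = np.array([42, 39, 48, 51, 49, 53, 61, 60], dtype=float)

def ols(*cols):
    """Least-squares coefficients with an intercept column prepended."""
    X = np.column_stack([np.ones_like(Y)] + list(cols))
    return np.linalg.lstsq(X, Y, rcond=None)[0]

b_joint = ols(X1, X2)   # Model 1: both predictors
b_x1 = ols(X1)          # Model 2: X1 alone
b_x2 = ols(X2)          # Model 3: X2 alone
print(b_joint[1], b_x1[1])   # both 5.375
print(b_joint[2], b_x2[1])   # both 9.25
```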
SAS Output
3 Variables: X1 X2 Y
Simple Statistics (minimum and maximum)
X1     4.00000    6.00000
X2     2.00000    3.00000
Y     39.00000   61.00000

Pearson correlations among X1, X2 and Y (numeric output not reproduced).
Model 1: Analysis of Variance and Parameter Estimates (numeric output not reproduced).

Model 2: Analysis of Variance and Parameter Estimates (numeric output not reproduced).

Model 3: Analysis of Variance and Parameter Estimates (numeric output not reproduced).
In practice,
1. perfect correlation between predictors is rarely encountered;
2. multicollinearity does not inhibit our ability to obtain a good fit;
3. however, the usual interpretations of the regression coefficients are not fully applicable and become somewhat less meaningful.
7. (X ′X) Matrix
• severe multicollinearity has the effect of making
the determinant of (X ′X) come close to zero
Variables (n = 20 subjects)
Y : amount of body fat
X1 : skinfold thickness
X2 : thigh circumference
X3 : mid-arm circumference
Model 1: Yi = β0 + β1X1i + εi
Model 2: Yi = β0 + β2X2i + εi
Model 3: Yi = β0 + β1X1i + β2X2i + εi
Model 4: Yi = β0 + β1X1i + β2X2i + β3X3i + εi

Model   Predictors     β̂1        β̂2        s(β̂1)    s(β̂2)
1       X1             0.8572    –         0.1288   –
2       X2             –         0.8566    –        0.1100
3       X1, X2         0.2224    0.6594    0.3034   0.2912
4       X1, X2, X3     4.3341    −0.2859   3.0155   2.5820
SSR(X2) = 381.97
SSR(X2|X1) = 33.17
Model   Predictors     Prediction at                     Ŷh       s(Ŷh)
3       X1, X2         Xh1 = 25, Xh2 = 50                19.36    0.624
4       X1, X2, X3     Xh1 = 25, Xh2 = 50, Xh3 = 29      19.19    0.621
Variables
Y : measure of satisfaction with life, Y ∈ [1, 20]
X1 : family income in 1000 dollars
X2 : measure of occupational prestige, X2 ∈ [0, 100]
X3 : number of years of education
X4 : frequency of attendance at religious services
X5 : population of current residence
Model:
Chapter 6
• Outlier Detection
• Diagnostics Techniques
• Remedial Measures
Chapter 6. Model Building, Diagnostics and Remedial Measures C. Sotto
Objective
Data
Graphs
Figure 6.2: Scatter Plot Matrix for Surgical Unit Example.
[Figure: scatter plot matrix of X1, X2, X3, X4 and Time — not reproduced.]

Correlation matrix:
X1 X2 X3 X4 Time LogTime
X1 1.00000000 0.09011973 -0.14963411 0.5024157 0.3725187 0.3464042
X2 0.09011973 1.00000000 -0.02360544 0.3690256 0.5539760 0.5928888
X3 -0.14963411 -0.02360544 1.00000000 0.4164245 0.5802438 0.6651216
X4 0.50241567 0.36902563 0.41642451 1.0000000 0.7223266 0.7262058
Time 0.37251865 0.55397598 0.58024382 0.7223266 1.0000000 0.9130965
LogTime 0.34640419 0.59288884 0.66512160 0.7262058 0.9130965 1.0000000
[Figure: normal Q–Q plot and residual plot for the model for Time; cases 13 and 18 stand out.]
[Figure: normal Q–Q plot and residual plot for the model for LogTime; cases 22 and 27 stand out.]
With k candidate predictors, there are 2^k different possible models.
• R²p or SSEp
• R²a,p or MSEp
• Cp
• PRESSp
6.2.2 R²a,p or MSEp Criterion

        R²a,p = 1 − [(n − 1)/(n − p)] · (SSEp / SSTO) = 1 − MSEp / [SSTO/(n − 1)]

• R²a,p gives essentially the same information as MSEp:

        R²a,p increases ⇐⇒ MSEp decreases
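The two expressions for R²a,p are algebraically identical, as a quick numerical check confirms. The values below are taken from the body fat example (SSEp = 98.41 for the three-predictor model, SSTO = 495.39, n = 20):

```python
# Both forms of the adjusted R^2 criterion, evaluated on the same inputs.
n, p = 20, 4                  # p counts the intercept plus 3 predictors
sse_p, ssto = 98.41, 495.39
mse_p = sse_p / (n - p)

ra2_form1 = 1 - ((n - 1) / (n - p)) * (sse_p / ssto)
ra2_form2 = 1 - mse_p / (ssto / (n - 1))
print(round(ra2_form1, 4))  # identical to ra2_form2
```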
        Cp = SSEp / MSE(X1, X2, . . . , XP−1) − (n − 2p)

where:
        MSE(X1, X2, . . . , XP−1) is the MSE for the model containing all (P − 1) predictors.

• in a plot of Cp vs. p
  ⇒ models with little bias fall near the line Cp = p
  ⇒ models with substantial bias fall considerably above the line Cp = p
  ⇒ Cp values below the line Cp = p are interpreted as showing no bias
• AIC and SBC differ only in the penalty term for the
number of parameters in the model
⇒ for n ≥ 8, SBC penalty > AIC penalty
⇒ SBC favors more parsimonious models
• differs from SSE in that, for PRESS, the fitted value for the ith case is obtained after deleting the ith case from the data set

Yi : observed value for the ith case
Ŷi(i) : fitted value for the ith case obtained by fitting the model without the ith case, i.e., using only the remaining (n − 1) cases
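PRESS need not be computed by refitting the model n times: the deleted residual satisfies di = ei / (1 − hii), with hii the hat-matrix diagonal. A small Python illustration on simulated data (the data are made up, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 15
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, 1.5]) + rng.normal(scale=0.5, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
e = y - H @ y                              # ordinary residuals
press = np.sum((e / (1 - np.diag(H))) ** 2)
sse = np.sum(e ** 2)
print(press >= sse)  # True: PRESS is never smaller than SSE
```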
Thus, P = 5 and 1 ≤ p ≤ 5.
R²p Plot [figure not reproduced]
MSEp Plot [figure not reproduced]
Cp Plot [figure not reproduced]
PRESSp Plot [figure not reproduced]
SAS Code
data SURG;
input X1 X2 X3 X4 X5 X6 X7 X8 TIME logTime;
cards;
6.7 62 81 2.59 50 0 1 0 695 6.544
5.1 59 66 1.70 39 0 0 0 403 5.999
7.4 57 83 2.16 55 0 0 0 710 6.565
6.5 73 41 2.01 48 0 0 0 349 5.854
7.8 65 115 4.30 45 0 0 1 2343 7.759
...........................................
...........................................
...........................................
...........................................
...........................................
3.9 82 103 4.55 50 0 1 0 1078 6.983
6.6 77 46 1.95 50 0 1 0 405 6.005
6.4 85 40 1.21 58 0 0 1 579 6.361
6.4 59 85 2.33 63 0 1 0 550 6.310
8.8 78 72 3.20 56 0 0 0 651 6.478
;
run;
SAS Output
[Subset-selection summary table (R-Square, Adjusted R-Square, C(p), Variables in Model) — numeric output not reproduced.]
SAS Options
The BEST= option of the MODEL statement is used with the RSQUARE,
ADJRSQ and CP model-selection methods. A small value of the BEST= option
greatly reduces the CPU time required for large problems.
If the BEST= option is omitted and the number of regressors is less than
eleven, all possible subsets are evaluated. If the BEST= option is omitted and
the number of regressors is greater than ten, the number of subsets selected is
at most equal to the number of regressors.
The SELECTION= option of the MODEL statement specifies the method used to select the model; its value can be FORWARD (or F), BACKWARD (or B), STEPWISE, MAXR, MINR, RSQUARE, ADJRSQ, CP or NONE (uses the full model). Default: NONE.
The SSE option of the MODEL statement computes the error sum of squares
for each model selected. (Only available when SELECTION=RSQUARE,
ADJRSQ or CP.)
Step 1
Step 2
Step 3
• Calculate
        FS1|S2 = MSR(xS1 | xS2) / MSE(xS1, xS2).
• Using FS1|S2 , determine if xS1 can be removed from
the model now that xS2 is included in the model.
SAS Code
data HALD;
input X1 X2 X3 X4 Y;
cards;
7 26 6 60 78.5
1 29 15 52 74.3
11 56 8 20 104.3
11 31 8 47 87.6
7 52 6 33 95.9
11 55 9 22 109.2
3 71 17 6 102.7
1 31 22 44 72.5
2 54 18 22 93.1
21 47 4 26 115.9
1 40 23 34 83.8
11 66 9 12 113.3
10 68 8 12 109.4
;
run;
SAS Output
Cp SELECTION PROCEDURE
Number in
Model C(p) R-Square Variables in Model
2 2.6782 0.9787 X1 X2
3 3.0182 0.9823 X1 X2 X4
3 3.0413 0.9823 X1 X2 X3
3 3.4968 0.9813 X1 X3 X4
4 5.0000 0.9824 X1 X2 X3 X4
2 5.4959 0.9725 X1 X4
3 7.3375 0.9728 X2 X3 X4
2 22.3731 0.9353 X3 X4
2 62.4377 0.8470 X2 X3
2 138.2259 0.6801 X2 X4
1 138.7308 0.6745 X4
1 142.4864 0.6663 X2
2 198.0947 0.5482 X1 X3
1 202.5488 0.5339 X1
1 315.1543 0.2859 X3
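One line of this table can be reproduced directly from the Hald data given in the data step above. The Python sketch below recomputes Cp for the {X1, X2} subset as an illustration of what PROC REG reports:

```python
import numpy as np

# Hald cement data (X1..X4, Y), as listed in the data step for this example.
data = np.array([
    [7, 26, 6, 60, 78.5], [1, 29, 15, 52, 74.3], [11, 56, 8, 20, 104.3],
    [11, 31, 8, 47, 87.6], [7, 52, 6, 33, 95.9], [11, 55, 9, 22, 109.2],
    [3, 71, 17, 6, 102.7], [1, 31, 22, 44, 72.5], [2, 54, 18, 22, 93.1],
    [21, 47, 4, 26, 115.9], [1, 40, 23, 34, 83.8], [11, 66, 9, 12, 113.3],
    [10, 68, 8, 12, 109.4]])
Xall, y = data[:, :4], data[:, 4]
n = len(y)

def sse(X):
    """Error sum of squares for a least-squares fit with intercept."""
    Xm = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xm, y, rcond=None)
    return float(np.sum((y - Xm @ beta) ** 2))

mse_full = sse(Xall) / (n - 5)                   # full model: P = 5 parameters
cp = sse(Xall[:, :2]) / mse_full - (n - 2 * 3)   # subset {X1, X2}: p = 3
print(round(cp, 4))  # close to the 2.6782 shown above
```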
SAS Output
[Subset-selection summary table (R-Square, Adjusted R-Square, C(p), Variables in Model) — numeric output not reproduced.]
SAS Output
STEPWISE SELECTION PROCEDURE
[Stepwise steps: Analysis of Variance and parameter estimate tables at each step — numeric output not reproduced.]
All variables left in the model are significant at the 0.1500 level.
No other variable met the 0.1500 significance level for entry into the model.
SAS Output
FORWARD SELECTION PROCEDURE
[Forward-selection steps: Analysis of Variance and parameter estimate tables — numeric output not reproduced.]
No other variable met the 0.5000 significance level for entry into the model.
SAS Output
BACKWARD SELECTION PROCEDURE
[Backward-elimination steps: Analysis of Variance and parameter estimate tables — numeric output not reproduced.]
All variables left in the model are significant at the 0.1000 level.
Stepwise Regression
Step 1
Step 2
Step 3
        Fx4|x1 = SSR(x4|x1) / MSE(x1, x4) = 1190.925 / 7.476 = 159.29,
Step 4
Step 5
Step 6
                                                                  F*       p-value
        Fx3|x1,x2 = SSR(x3|x1,x2) / MSE(x1,x2,x3) = 9.794 / 5.346 = 1.833   0.2089
        Fx4|x1,x2 = SSR(x4|x1,x2) / MSE(x1,x2,x4) = 9.932 / 5.330 = 1.86    0.2054
Note:
It is, of course, clear in advance that at this step x4 cannot enter the model, since it was removed from the model in the previous step based on the same Fx4|x1,x2 value.
SAS Code
data one;
retain seed 31491711;
nobs = 30;
p = 0.5;
mu = 20;
var = 100;
do i=1 to nobs;
array x{25} x1-x25;
e = sqrt(var)*rannor(seed);
x1 = sqrt(var)*rannor(seed);
do k = 2 to 24;
x[k] = sqrt(var)*
( sqrt(0.35)*x[k-1]/sqrt(var) +
sqrt(0.65)*rannor(seed) ) ;
end;
x25 = ranbin(seed,1,p);
y = mu + e;
output;
end;
run;
[Selection output for the simulated data: successive Analysis of Variance and parameter estimate tables — numeric output not reproduced.]
All variables left in the model are significant at the 0.1500 level.
No other variable met the 0.1500 significance level for entry into the
model.
[Successive Analysis of Variance and parameter estimate tables — numeric output not reproduced.]
All variables left in the model are significant at the 0.1000 level.
[Successive Analysis of Variance and parameter estimate tables — numeric output not reproduced.]
No other variable met the 0.5000 significance level for entry into the model.
Example
Figure 6.11: Scatter Plot for Regression with One Predictor Variable
Illustrating Outlying Cases.
• Residuals
ei = Yi − Ŷ i
• Semistudentized Residuals
        e*i = ei / √MSE

• Studentized Residuals
        ri = ei / s(ei), where s(ei) = √[MSE(1 − hii)]

  Note: hii is the ith diagonal element of the hat matrix, H = X(X′X)−1X′.

• Deleted Residuals
        di = Yi − Ŷi(i) = ei / (1 − hii)

  Note: Ŷi(i) is the predicted value for the ith case obtained from fitting the same regression model without the ith case.
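The deleted-residual shortcut di = ei / (1 − hii) can be verified against a genuine leave-one-out refit. A Python sketch on simulated data (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 12
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 3 + 2 * X[:, 1] + rng.normal(scale=0.7, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
e = y - H @ y
h = np.diag(H)
d_shortcut = e / (1 - h)           # deleted residuals, no refitting needed

i = 4                              # arbitrary case to delete
Xd, yd = np.delete(X, i, axis=0), np.delete(y, i)
beta_i, *_ = np.linalg.lstsq(Xd, yd, rcond=None)
d_direct = y[i] - X[i] @ beta_i    # refit without case i, then predict it
print(abs(d_direct - d_shortcut[i]) < 1e-8)  # True
```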
        |ti| ≤ t(1 − α/2n; n − p − 1)  ⇒  case i is not outlying in Y
DFFITS

        DFFITSi = (Ŷi − Ŷi(i)) / √[MSE(i) hii] = ti √[hii / (1 − hii)]
Rule of Thumb:
Cook’s Distance

        Di = Σj (Ŷj − Ŷj(i))² / (p MSE) = [e²i / (p MSE)] · [hii / (1 − hii)²]
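The two expressions for Di agree exactly: the sum over all fitted values collapses to the closed form in ei and hii. A simulated Python check (illustrative data only):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 15, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 1 + 0.5 * X[:, 1] + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
e = y - H @ y
h = np.diag(H)
mse = np.sum(e ** 2) / (n - p)

i = 3
# Closed form in the residual and leverage of case i.
D_closed = (e[i] ** 2 / (p * mse)) * (h[i] / (1 - h[i]) ** 2)

# Direct form: refit without case i and compare all n fitted values.
Xd, yd = np.delete(X, i, axis=0), np.delete(y, i)
beta_i, *_ = np.linalg.lstsq(Xd, yd, rcond=None)
D_sum = np.sum((H @ y - X @ beta_i) ** 2) / (p * mse)
print(abs(D_closed - D_sum) < 1e-8)  # True
```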
Rule of Thumb:
DFBETAS

        DFBETASk(i) = (β̂k − β̂k(i)) / √[MSE(i) ckk],

where: ckk = kth diagonal element of (X′X)−1
Rule of Thumb:
SAS Code
6.4.4 Examples
        (VIF)k = (1 − R²k)−1,   k = 1, . . . , p − 1

where:
        R²k = coefficient of multiple determination when Xk is regressed on the other (p − 2) predictor variables

        R²k = 0       ⇒ (VIF)k = 1
        R²k = 1       ⇒ (VIF)k = ∞
        R²k ∈ (0, 1)  ⇒ (VIF)k > 1
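The definition translates directly into code: regress each Xk on the remaining predictors and invert 1 − R²k. For a balanced design like the crew example of Section 5.13, the predictors are orthogonal and every VIF equals 1 (Python sketch, illustrative only):

```python
import numpy as np

def vif(X):
    """Variance inflation factors; X is n x (p-1), no intercept column."""
    n, k = X.shape
    out = []
    for j in range(k):
        # Regress X_j on the other predictors (with intercept).
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1 - np.sum(resid ** 2) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out.append(1 / (1 - r2))
    return np.array(out)

# Balanced design: X1 and X2 uncorrelated, so both VIFs are 1.
X = np.array([[4, 2], [4, 2], [4, 3], [4, 3],
              [6, 2], [6, 2], [6, 3], [6, 3]], dtype=float)
print(vif(X))
```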
Rule of Thumb:
Objective
Variables
ID : identification number
LSTAY : average length of stay of all patients
AGE : average age of patients
INFRISK : average estimated probability of acquiring
infection in the hospital
RURATIO : ratio of cultures performed to number of
patients without symptoms
RUXRAY : ratio of number of X-rays performed to
number of patients
NBEDS : average number of beds
MEDSCH : medical school affiliation (1 = yes, 2 = no)
REGION : geographic region, coded 1, 2, 3 and 4
AVECEN : average number of patients per day during study period
NNURSE : average number of nurses
FACILI : percent of 35 potential facilities and services provided
Data
ID LSTAY AGE INFRISK RURATIO RUXRAY NBEDS MEDSCH REGION AVECEN NNURSE FACILI
[data rows not reproduced]
Analysis
Correlation Analysis
9 Variables: AGE INFRISK RURATIO RUXRAY NBEDS AVECEN NNURSE FACILI LOGSTAY
Simple Statistics (table not reproduced)

Pearson Correlation Coefficients (p-value beneath each coefficient)
          AGE      INFRISK  RURATIO  RUXRAY   NBEDS    AVECEN   NNURSE   FACILI   LOGSTAY
AGE 1.00000 0.02518 -0.10113 0.16099 -0.19787 -0.17221 -0.23643 -0.16352 0.17064
0.8525 0.4542 0.2316 0.1401 0.2002 0.0766 0.2242 0.2044
INFRISK 0.02518 1.00000 0.44783 0.33396 0.49007 0.50085 0.53009 0.45334 0.47137
0.8525 0.0005 0.0111 0.0001 <.0001 <.0001 0.0004 0.0002
RURATIO -0.10113 0.44783 1.00000 0.19482 0.16780 0.20362 0.23884 0.23954 0.25483
0.4542 0.0005 0.1464 0.2121 0.1287 0.0736 0.0727 0.0557
RUXRAY 0.16099 0.33396 0.19482 1.00000 0.06682 0.08554 0.06020 0.12833 0.36377
0.2316 0.0111 0.1464 0.6214 0.5269 0.6564 0.3414 0.0054
NBEDS -0.19787 0.49007 0.16780 0.06682 1.00000 0.99000 0.90893 0.76448 0.57431
0.1401 0.0001 0.2121 0.6214 <.0001 <.0001 <.0001 <.0001
AVECEN -0.17221 0.50085 0.20362 0.08554 0.99000 1.00000 0.90389 0.72942 0.60799
0.2002 <.0001 0.1287 0.5269 <.0001 <.0001 <.0001 <.0001
NNURSE -0.23643 0.53009 0.23884 0.06020 0.90893 0.90389 1.00000 0.70706 0.47005
0.0766 <.0001 0.0736 0.6564 <.0001 <.0001 <.0001 0.0002
FACILI -0.16352 0.45334 0.23954 0.12833 0.76448 0.72942 0.70706 1.00000 0.40391
0.2242 0.0004 0.0727 0.3414 <.0001 <.0001 <.0001 0.0018
LOGSTAY 0.17064 0.47137 0.25483 0.36377 0.57431 0.60799 0.47005 0.40391 1.00000
0.2044 0.0002 0.0557 0.0054 <.0001 <.0001 0.0002 0.0018
Stepwise Procedure
[Stepwise steps: Analysis of Variance and parameter estimate tables — numeric output not reproduced.]
All variables left in the model are significant at the 0.1500 level.
No other variable met the 0.1500 significance level for entry into the model.
Cp Selection Procedure
[Cp selection table (C(p), R-Square, Variables in Model) — numeric output not reproduced.]
[Analysis of Variance and Parameter Estimates for each fitted model — numeric output not reproduced.]
Diagnostics
Sum of Residuals 0
Sum of Squared Residuals 0.16182
Predicted Residual SS (PRESS) 0.19371
(Comparison of two fits; column labels were lost in extraction.)

β̂0            0.61043       0.61887
s(β̂0)         0.08881       0.12477
β̂AGE          0.00388       0.00399
s(β̂AGE)       0.00163       0.00211
β̂RUXRAY       0.00117       0.00152
s(β̂RUXRAY)    0.00041881    0.00043724
β̂AVECEN       0.00029261    0.00015680
s(β̂AVECEN)    0.00004558    0.00006216
Examples:
Growth from birth to maturity in human subjects
typically is nonlinear in nature, characterized by
rapid growth shortly after birth, pronounced
growth during puberty, and a leveling off sometime
before adulthood.
Dose-response relationships tend to be nonlinear
with little or no change in response for low dose
levels of a drug, followed by rapid S-shaped
changes occurring in the more active dose region,
and finally with dose response leveling off as it
reaches a saturated level.
Exponential model:
        Yi = γ0 + γ1 exp(γ2Xi) + εi,   E(Yi) = γ0 + γ1 exp(γ2Xi)
        With γ0 = 100, γ1 = −50, γ2 = −2:
        E(Y) = 100 − 50 exp(−2X)

Logistic model:
        Yi = γ0 / [1 + γ1 exp(γ2Xi)] + εi,   E(Yi) = γ0 / [1 + γ1 exp(γ2Xi)]
        With γ0 = 10, γ1 = 20, γ2 = −2:
        E(Y) = 10 / [1 + 20 exp(−2X)]
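Evaluating the two mean functions at a few X values shows the limiting behaviour: each approaches its asymptote γ0 as X grows. A short Python illustration using the parameter values quoted above:

```python
import math

def exp_mean(x, g0=100.0, g1=-50.0, g2=-2.0):
    """Exponential mean function E(Y) = g0 + g1 exp(g2 X)."""
    return g0 + g1 * math.exp(g2 * x)

def logistic_mean(x, g0=10.0, g1=20.0, g2=-2.0):
    """Logistic mean function E(Y) = g0 / (1 + g1 exp(g2 X))."""
    return g0 / (1 + g1 * math.exp(g2 * x))

print(exp_mean(0.0))        # 50.0 = gamma0 + gamma1 at X = 0
print(exp_mean(5.0))        # close to the asymptote 100
print(logistic_mean(5.0))   # close to the asymptote 10
```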
Appendix A
P-Value
2-Tailed P-Value:
        PH0(|T| ≥ |t0|) = PH0(T ≥ |t0|) + PH0(T ≤ −|t0|)
                        = 2 PH0(T ≥ |t0|), for symmetric distributions.

For one-sided tests, the p-value is defined on just one tail:

1-Tailed P-Value:
        PH0(T ≥ |t0|) or PH0(T ≤ −|t0|).
Example 1:
to = 2.5 df = 8 α = 0.05
t1−α/2 ; df = t1−0.025 ; 8 = 2.306
t1−α ; df = t1−0.05 ; 8 = 1.860
2-tailed p-value:
P (|T | ≥ 2.5) = 2P (T ≥ 2.5) = 2×0.0185 = 0.0370
1-tailed p-value:
P (T ≥ 2.5) = 0.0185
Example 2:
to = 2.0 df = 8 α = 0.05
t1−α/2 ; df = t1−0.025 ; 8 = 2.306
t1−α ; df = t1−0.05 ; 8 = 1.860
2-tailed p-value:
P (|T | ≥ 2.0) = 2P (T ≥ 2.0) = 2×0.0403 = 0.0806
1-tailed p-value:
P (T ≥ 2.0) = 0.0403
Appendix B
Matrix Expressions
For simple linear regression:

        X′X = [ n      ΣXi  ]          X′Y = [ ΣYi   ]
              [ ΣXi    ΣXi² ]                [ ΣXiYi ]

        Y′Y = ΣYi²          Y′JY = (ΣYi)²

        (X′X)−1 = 1 / [n Σ(Xi − X̄)²] · [  ΣXi²   −ΣXi ]
                                        [ −ΣXi     n   ]
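The closed-form inverse for the simple-regression case can be verified against a numerical inverse on any data set (the values below are arbitrary, for illustration only):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 7.0])   # arbitrary predictor values
n = len(x)
X = np.column_stack([np.ones(n), x])       # design matrix with intercept

sxx = np.sum((x - x.mean()) ** 2)
# Closed form: (X'X)^{-1} = 1/(n Sxx) * [[sum(x^2), -sum(x)], [-sum(x), n]]
closed = np.array([
    [np.sum(x ** 2) / (n * sxx), -np.sum(x) / (n * sxx)],
    [-np.sum(x) / (n * sxx),      n / (n * sxx)],
])
numeric = np.linalg.inv(X.T @ X)
print(np.allclose(closed, numeric))  # True
```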
For multiple linear regression:

        X = [ 1  X11  X12  · · ·  X1,p−1 ]        Y = [ Y1 ]
            [ 1  X21  X22  · · ·  X2,p−1 ]            [ Y2 ]
            [ ·   ·    ·    · ·    ·     ]            [ ·  ]
            [ 1  Xi1  Xi2  · · ·  Xi,p−1 ]            [ Yi ]
            [ ·   ·    ·    · ·    ·     ]            [ ·  ]
            [ 1  Xn1  Xn2  · · ·  Xn,p−1 ]            [ Yn ]

        X′X = [ n        ΣXi1         ΣXi2         ΣXi3         · · ·  ΣXi,p−1     ]
              [ ΣXi1     ΣXi1²        ΣXi1Xi2      ΣXi1Xi3      · · ·  ΣXi1Xi,p−1  ]
              [ ΣXi2     ΣXi2Xi1      ΣXi2²        ΣXi2Xi3      · · ·  ΣXi2Xi,p−1  ]
              [ ΣXi3     ΣXi3Xi1      ΣXi3Xi2      ΣXi3²        · · ·  ΣXi3Xi,p−1  ]
              [ ·        ·            ·            ·            · ·    ·           ]
              [ ΣXi,p−1  ΣXi,p−1Xi1   ΣXi,p−1Xi2   ΣXi,p−1Xi3   · · ·  ΣXi,p−1²    ]

        X′Y = [ ΣYi, ΣYiXi1, ΣYiXi2, . . . , ΣYiXi,p−1 ]′
Bibliography
Draper, N. and Smith, H. (1998). Applied Regression Analysis (3rd ed). NY:
John Wiley.
Kutner, M., Nachtsheim, C., Neter, J. and Li, W. (2005). Applied Linear
Statistical Models (5th ed). NY: McGraw-Hill Irwin.
Mosteller, F. and Tukey, J.W. (1977). Data Analysis and Regression: A Second
Course in Statistics. Reading, MA: Addison-Wesley.
SAS Institute Inc. (2008). SAS 9.2 Help and Documentation. Cary, NC: SAS
Institute Inc.
Sokal, R.R. and Rohlf, F.J. (1993). Biometry: The Principles and Practice of
Statistics in Biological Research (3rd ed). NY: Freeman.