
Interuniversity Institute for Biostatistics

and statistical Bioinformatics

LINEAR REGRESSION

Prof. dr. Cristina Sotto

Master of Statistics
2014–2015
Contents

1 Simple Linear Regression 1

1.1 Historical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Functional vs. Statistical Relation . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Formal Statement of the Model . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.3.1 Graphical Representation of the Simple Linear Regression Model . . . . . 10

1.3.2 Interpretation of the Regression Parameters . . . . . . . . . . . . . . . . 11

1.4 Estimation of the Regression Parameters . . . . . . . . . . . . . . . . . . . . . 13

1.4.1 Method of Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . 13

1.4.2 Properties of Least Squares Estimators . . . . . . . . . . . . . . . . . . 19

1.5 Example: Toluca Company . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2 Inferences in Regression Analysis 23

2.1 Inference for β1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.1.1 Sampling Distribution of β̂ 1 . . . . . . . . . . . . . . . . . . . . . . . . 25

2.1.2 Confidence Interval for β1 . . . . . . . . . . . . . . . . . . . . . . . . . 28

2.1.3 Hypothesis Test for β1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

2.1.4 Example: MgCO3 Content of Sand Dollars . . . . . . . . . . . . . . . . 29

2.2 Inference for β0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

2.3 Inference for Mean Response . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.3.1 Sampling Distribution of Ŷ h . . . . . . . . . . . . . . . . . . . . . . . . 33

2.3.2 Example: MgCO3 Content of Sand Dollars . . . . . . . . . . . . . . . . 35

2.4 Predicting New Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

2.4.1 Predicting a Single New Observation . . . . . . . . . . . . . . . . . . . 38

2.4.2 Predicting the Mean of m New Observations . . . . . . . . . . . . . . . 42

2.5 Confidence Band for the Regression Line . . . . . . . . . . . . . . . . . . . . . 43

2.6 Analysis of Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

2.7 F -Test versus t-Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

2.8 General Linear Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

2.9 R2 Coefficient of Determination . . . . . . . . . . . . . . . . . . . . . . . . . . 57

3 Diagnostics and Remedial Measures 60

3.1 Departures from the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

3.2 Residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

3.3 Diagnostics Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

3.3.1 Univariate Plots of X and Y . . . . . . . . . . . . . . . . . . . . . . . 64

3.3.2 Bivariate Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

3.4 Formal Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

3.4.1 Tests for Normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

3.4.2 Test for Autocorrelation . . . . . . . . . . . . . . . . . . . . . . . . . . 78

3.4.3 Tests for Non-Constancy of Variance . . . . . . . . . . . . . . . . . . . 79

3.4.4 Outlier Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

3.4.5 Lack-of-Fit Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

3.5 Remedial Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

3.5.1 Nonlinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

3.5.2 Non-Constancy of Error Variance . . . . . . . . . . . . . . . . . . . . . 100

3.5.3 Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

3.5.4 Non-Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

3.5.5 Non-Normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

3.6 Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

3.6.1 Linearizing Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

3.6.2 Non-Normality or Unequal Error Variance . . . . . . . . . . . . . . . . . 104

3.6.3 Box-Cox Family of Transformations . . . . . . . . . . . . . . . . . . . . 105

3.6.4 Two Special Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

3.7 Case Study: SENIC Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

4 Simultaneous Inference and Other Topics 124

4.1 Joint Estimation of β0 and β1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

4.1.1 Statement vs. Family Confidence . . . . . . . . . . . . . . . . . . . . . 126

4.1.2 Bonferroni Joint Confidence Intervals . . . . . . . . . . . . . . . . . . . 127

4.2 Regression Through the Origin . . . . . . . . . . . . . . . . . . . . . . . . . . 131

4.3 Effects of Measurement Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

4.3.1 Errors in the Response Variable . . . . . . . . . . . . . . . . . . . . . . 134

4.3.2 Errors in the Predictor Variable . . . . . . . . . . . . . . . . . . . . . . 135

5 Multiple Linear Regression 136

5.1 Multiple Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

5.2 General Linear Regression Model . . . . . . . . . . . . . . . . . . . . . . . . . 140

5.2.1 General Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

5.2.2 Specific Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

5.3 Matrix Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

5.3.1 Estimation of Regression Coefficients . . . . . . . . . . . . . . . . . . . 149

5.3.2 Fitted Values and Residuals . . . . . . . . . . . . . . . . . . . . . . . . 150

5.3.3 ANOVA Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

5.3.4 F -Test for Regression Relation . . . . . . . . . . . . . . . . . . . . . . 153

5.4 Coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

5.4.1 Multiple Determination . . . . . . . . . . . . . . . . . . . . . . . . . . 154

5.4.2 Multiple Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

5.5 Inference About the Regression Parameters . . . . . . . . . . . . . . . . . . . . 155

5.5.1 Interval Estimation of βk . . . . . . . . . . . . . . . . . . . . . . . . . 155

5.5.2 Test for βk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

5.5.3 Joint Inferences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

5.6 Inference About Mean Response . . . . . . . . . . . . . . . . . . . . . . . . . . 157

5.6.1 Interval Estimation of E(Yh ) . . . . . . . . . . . . . . . . . . . . . . . 157

5.6.2 Confidence Region for Regression Surface . . . . . . . . . . . . . . . . . 158

5.6.3 Simultaneous Confidence Intervals for Several Mean Responses . . . . . 158

5.7 Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

5.7.1 New Observation, Yh(new) . . . . . . . . . . . . . . . . . . . . . . . . . 159

5.7.2 Mean of m New Observations . . . . . . . . . . . . . . . . . . . . . . . 159

5.7.3 g New Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

5.8 Diagnostics and Remedial Measures . . . . . . . . . . . . . . . . . . . . . . . . 160

5.8.1 Diagnostic Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160

5.8.2 Formal Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

5.8.3 Remedial Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

5.9 Case Study: Dwaine Studios Data . . . . . . . . . . . . . . . . . . . . . . . . . 165

5.10 Extra Sum of Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

5.10.1 Example: Body Fat Data . . . . . . . . . . . . . . . . . . . . . . . . . 178

5.10.2 Decomposition of SSR . . . . . . . . . . . . . . . . . . . . . . . . . . 180

5.10.3 ANOVA Table with SSR Decomposition . . . . . . . . . . . . . . . . . 186

5.10.4 Extra Sum of Squares in Test for Regression . . . . . . . . . . . . . . . 189

5.11 Coefficient of Partial Determination . . . . . . . . . . . . . . . . . . . . . . . . 195

5.12 Standardized Multiple Regression . . . . . . . . . . . . . . . . . . . . . . . . . 197

5.13 Multicollinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199

5.13.1 Uncorrelated Predictors . . . . . . . . . . . . . . . . . . . . . . . . . . 200

5.13.2 Example: Crew Productivity Data . . . . . . . . . . . . . . . . . . . . . 201

5.13.3 Correlated Predictors . . . . . . . . . . . . . . . . . . . . . . . . . . . 207

5.13.4 Effects of Multicollinearity . . . . . . . . . . . . . . . . . . . . . . . . . 209

5.13.5 Example: Body Fat Data . . . . . . . . . . . . . . . . . . . . . . . . . 211

5.13.6 Example: Life Satisfaction Data . . . . . . . . . . . . . . . . . . . . . . 214

6 Model Building, Diagnostics and Remedial Measures 216

6.1 Model Building Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217

6.2 Criteria for Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222

6.2.1 R²p or SSEp Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . 224

6.2.2 R²a,p or MSEp Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . 224

6.2.3 Mallows Cp Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225

6.2.4 AIC and SBC Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . 226

6.2.5 P RESSp Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227

6.2.6 Example: Surgical Unit Data . . . . . . . . . . . . . . . . . . . . . . . 228

6.3 Automatic Search Procedures for Model Selection . . . . . . . . . . . . . . . . 236

6.3.1 “Best” Subsets Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 236

6.3.2 Stepwise Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237

6.3.3 Forward Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241

6.3.4 Backward Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241

6.3.5 Example: HALD Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 242

6.3.6 Danger of Automatic Selection Procedures . . . . . . . . . . . . . . . . 255

6.4 Diagnostic Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265

6.4.1 Identifying Outlying Y Observations . . . . . . . . . . . . . . . . . . . . 266

6.4.2 Identifying Outlying X Observations . . . . . . . . . . . . . . . . . . . 268

6.4.3 Identifying Influential Cases . . . . . . . . . . . . . . . . . . . . . . . . 269

6.4.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272

6.4.5 Multicollinearity Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . 273

6.5 Case Study: SENIC Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275

6.6 Remedial Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290

6.6.1 Weighted Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . 291

6.6.2 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292

6.6.3 Robust Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293

6.6.4 Regression with Autocorrelated Errors . . . . . . . . . . . . . . . . . . . 294

6.6.5 Nonlinear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295

Appendix A 297

Appendix B 302

Bibliography 305

Chapter 1

Simple Linear Regression:


Linear Regression with
One Independent Variable

Chapter 1 in Kutner et al.

• Regression Model

• Least Squares Estimation of the Coefficients

• Estimation of the Error Variance σ 2

• Properties of the Least Squares Estimators


1.1 Historical Background

Sir Francis Galton (1822-1911) studied the heights of
fathers (X) and the heights of their adult sons (Y )
using a linear equation.

He observed a peculiar biological phenomenon – that the


heights of descendants of tall (or short) ancestors tend to
regress back towards a normal average (also known as
regression toward the mean). Hence, the term
“regression” was coined.

Reference

“Regression towards mediocrity in hereditary stature.”


Journal of the Anthropological Institute (1886) 15,
246-263.


1.2 Functional vs. Statistical Relation

Functional Relation

Suppose f is a known function.


Y = f (X)
Whenever X is known, Y is completely known.

Examples:
Y = 2X
Y = (1/2) g X²

Figure 1.1: Example of a Functional Relation.


In general, we do not know the true relationship, f ,


between X and Y , but we would like to describe or
somehow use this relationship. For instance, when:
• X is easier or cheaper to observe than Y
• we wish to discover a relation between X and Y
• given a value of X, we might want to predict Y

X : independent or predictor/explanatory variable


Y : dependent or response or outcome variable

Oftentimes, a relation between two variables, X and Y ,


is not a functional one.

Figure 1.2: Statistical Relation between Midyear and Year-End


Performance Evaluation.


Statistical Relation

Y = f (X) + ε

ε is the error committed by using f (X) to approximate


Y . We use this to describe a statistical relation.

Since we do not know Y exactly for every X, we will


often approximate the relation between X and Y .

Linear Regression Model

When we observe pairs (X, Y ), we would like to write a


relation like the previous one with uniformly small error.

However, we usually do not know f (·), so we start with a


simple guess.

If a scatterplot of (X, Y ) is approximately linear, then we


can let

f (X) = β0 + β1X.

The linear regression model (of the relation between X


and Y ) is thus

Y = β0 + β1X + ε.


• This is a very simple model for describing the


relationship between two variables.

• f (X) is the equation for a straight line.

• β1 is the slope of the line and β0 is the intercept.

Nonlinear Regression Model

Example:
Suppose Y is the height of a tree and X is its age.
The Chapman-Richards growth model is

Y = a[1 − exp(bX)]^c + ε.

Figure 1.3: Curvilinear Statistical Relation between Age


and Steroid Level in Healthy Females Aged 8 to 25.


1.3 Formal Statement of the Model

A random experiment is repeated n times under identical


conditions. For each trial, i = 1, . . . , n, the value of Xi is
set (known) and the response Yi is recorded. We use a
Simple Linear Regression Model of the form:
Yi = β0 + β1Xi + εi.

Assumptions:

1. The value of X is precisely known.


2. Y is a continuous variable and it is random.
3. β0 and β1 are parameters. That is, they are:
(a) unknown
(b) constant, not random
(c) not dependent on the trial number
4. ε is a random error. It is not observable.

Additional Assumptions:
εi ∼ iid N (0, σ²), ∀ i
1. That is, εi is normally distributed, E(εi) = 0, and
V ar(εi) = σ 2, for all i.
2. For two different trials, i and j, the error terms εi and
εj are independent.

The Simple Linear Regression Model (SLRM) is:


• simple since there is only 1 predictor/independent
variable
• linear or linear in the parameters since no
parameter appears as an exponent or is multiplied or
divided by another parameter

In addition to linearity in the parameters, a SLR model


can also be further linear in the predictor variable, when
the latter appears in the model in the first power.

A model that is linear in the parameters and in the


predictor variable is also called a first-order linear model.

Consider the following 2 models:

Model 1 : Yi = β0 + β1Xi + εi

Model 2 : Yi = β0 + β1Xi + β2Xi² + εi

Models 1 and 2 are both linear in the parameters, and


can thus both be considered as linear models.

Model 1 is also linear in the predictor and is thus a


first-order linear model. Model 2, however, is not linear in
the predictor, though it is a linear model. Model 2 is a
second-order linear model.


1.3.1 Graphical Representation of the Simple


Linear Regression Model

The SLR model essentially postulates that:


• there is a probability distribution for Y at each level
of X
• the means of these probability distributions vary in
some systematic function with X

Figure 1.4: Illustration of Simple Linear Regression Model.


1.3.2 Interpretation of the Regression


Parameters

In the SLR model, we have


Yi = f (Xi) + εi
Yi = β0 + β1Xi + εi

where εi ∼ N (0, σ 2), for all i = 1, 2, . . . , n.

If we take expectations of the model above, this will yield


the true regression relation, which is, in this case, a true
regression line:
E(Yi) = β0 + β1Xi .

Of course the true model cannot be observed since β0 and


β1 are not known. We must estimate them from the data.
This gives the estimated or fitted regression line:
Ŷ i = β̂ 0 + β̂ 1Xi ,
where:
• β̂ 0 estimates β0
• β̂ 1 estimates β1
• Ŷ i estimates E(Yi)

Note that Ŷ i is actually an estimate for E(Yi), i.e.


Ŷ i ≡ Ê(Yi) .

The estimated regression coefficients β̂ 0 and β̂ 1 are


respectively interpreted in the same way that the
intercept and the slope of a line are interpreted.

• β̂ 0 is the estimated mean response when X = 0.

• β̂ 1 is the estimated change in the mean response for a


unit increase in X.

Figure 1.5: Meaning of Parameters of Simple Linear Regression Model.

We will also estimate σ 2, the variance of ε, using σ̂ 2.


1.4 Estimation of the Regression


Parameters

1.4.1 Method of Least Squares

The method of least squares is used to estimate β0 and


β1, and also σ 2 in an indirect way.

Example: Divorce Data

The table gives the annual number of divorces recorded


in England and Wales between 1975 and 1980.
Year 1975 1976 1977 1978 1979 1980
Divorces 120.5 126.7 129.1 143.7 138.7 148.3

They are plotted below.


Figure 1.6: Scatter Plot of Divorce Data.

[Scatter plot: Divorces (vertical axis) against Year, 1975–1980 (horizontal axis).]


We want to find estimates for β0 and β1 that minimize


the “sum of squared errors” or SSE.

For each pair, (Xi, Yi), we know


εi = Yi − (β0 + β1Xi) .
So,
εi² = [Yi − (β0 + β1Xi)]² .

Summing over all observations,

Q(β0, β1) = Σᵢ₌₁ⁿ εi² = Σᵢ₌₁ⁿ [Yi − (β0 + β1Xi)]² .

Taking partial derivatives of Q(β0, β1) w.r.t. β0 and β1


and setting the resulting expressions equal to zero, we
obtain the Normal Equations:
Σ Yi = n b0 + b1 Σ Xi
Σ XiYi = b0 Σ Xi + b1 Σ Xi²

Note that b0 and b1 denote the specific values that


minimize Q and replace the parameters (β0, β1). In other
words, b0 and b1 are the solutions to our minimization
problem.

Note: Some textbooks use b0 and b1, while others use


β̂ 0 and β̂ 1.
b0 ≡ β̂ 0 and b1 ≡ β̂ 1 .

A little algebra yields the estimators for the parameters:


b1 = β̂ 1 = Σ(Xi − X̄)(Yi − Ȳ ) / Σ(Xi − X̄)² = SSXY / SSXX

b0 = β̂ 0 = Ȳ − b1X̄

where:
SSXY = Σ(Xi − X̄)(Yi − Ȳ ) = Σ XiYi − (Σ Xi)(Σ Yi)/n
     ≡ (corrected) sum of the cross-products
SSXX = Σ(Xi − X̄)² = Σ Xi² − (Σ Xi)²/n
     ≡ (corrected) sum of squares for X

The resulting estimators, b0 and b1 (or, equivalently, β̂ 0


and β̂ 1) are referred to as the least squares
estimators for β0 and β1, respectively.
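As a small worked illustration of these formulae, consider the divorce data above. The numbers below are computed here as a sketch; they are not quoted from the course pack:

X̄ = 1977.5 ,   Ȳ = 134.5 ,   SSXX = 17.5 ,   SSXY = 94.8

β̂ 1 = 94.8/17.5 ≈ 5.417   and   β̂ 0 = Ȳ − β̂ 1X̄ ≈ 134.5 − 5.417(1977.5) ≈ −10577.9 ,

so the fitted line is Ŷ ≈ −10577.9 + 5.417X, an estimated increase of about 5.4 in the recorded divorce figure per year over 1975–1980.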

Fitted Regression Line

With β̂ 0 and β̂ 1, we can compute the fitted regression line


Ŷ = β̂ 0 + β̂ 1X .
This line is observable and should be close to the true
regression line.

If we evaluate the fitted line at the point X = Xi, then


we can interpret Ŷ i as the estimated mean response at
X = Xi .

Estimation of the Error Variance σ 2

The minimum value of Q(β0, β1) is obtained when


β0 = β̂ 0 and β1 = β̂ 1. The quantity Q(β̂ 0, β̂ 1) is SSE.

• SSE is the sum of squared deviations between the


observations and the fitted line.
• SSE is a measure of how well the linear regression
line fits the data.

SSE = Q(β̂ 0, β̂ 1) = Σᵢ₌₁ⁿ (Yi − β̂ 0 − β̂ 1Xi)² = Σᵢ₌₁ⁿ (Yi − Ŷ i)² = Σᵢ₌₁ⁿ ei² ,

where ei is called the residual for observation i.

Note that:
• ei = Yi − Ŷ i is the difference between observed and
predicted values at Xi
• we can think of ei as an “estimator” of the error εi

Now, back to estimating σ 2.

Recall that σ 2 is the common variance for ε1, ε2, . . . , εn.

Because e1, e2, . . . , en estimate the εi’s, SSE should


provide some information about the true regression line.

In fact, an unbiased estimator of σ 2 is given by:

s² = MSE = SSE/(n − 2) .

Unbiased means that E(s2) = σ 2. In other words, if the


experiment is repeated a “large” number of times, the
average of the s2 values will be equal to σ 2.

Think about the case n = 2.

Summary of Least Squares Estimators

Estimand                     Estimator

β1                           β̂ 1 = SSXY / SSXX

β0                           β̂ 0 = Ȳ − β̂ 1X̄

E(Yh) = β0 + β1Xh            Ŷ h = β̂ 0 + β̂ 1Xh

εi = Yi − E(Yi)              ei = Yi − Ŷ i

σ²                           s² = MSE = SSE/(n − 2)


Other Easy Formulae for SSE


SSE = Σᵢ₌₁ⁿ Yi² − β̂ 0 Σᵢ₌₁ⁿ Yi − β̂ 1 Σᵢ₌₁ⁿ XiYi

SSE = SSYY − (SSXY)²/SSXX

SSE = Σᵢ₌₁ⁿ (Yi − Ȳ )² − [ Σᵢ₌₁ⁿ (Xi − X̄)(Yi − Ȳ ) ]² / Σᵢ₌₁ⁿ (Xi − X̄)²

where:
SSYY = Σᵢ₌₁ⁿ (Yi − Ȳ )² = Σᵢ₌₁ⁿ Yi² − (Σᵢ₌₁ⁿ Yi)²/n
SSXX = Σᵢ₌₁ⁿ (Xi − X̄)² = Σᵢ₌₁ⁿ Xi² − (Σᵢ₌₁ⁿ Xi)²/n
SSXY = Σᵢ₌₁ⁿ (Xi − X̄)(Yi − Ȳ ) = Σᵢ₌₁ⁿ XiYi − (Σᵢ₌₁ⁿ Xi)(Σᵢ₌₁ⁿ Yi)/n

(Note: Maximum Likelihood Estimation, or MLE, is another way to


obtain estimates of the regression parameters. While LSE does not
require the distributional assumption of normality to obtain the
estimates, MLE does.)

1.4.2 Properties of Least Squares Estimators

Gauss-Markov Theorem

Under the assumptions given in Section 1.3, the least


squares estimators, β̂ 0 and β̂ 1, are
1. unbiased, and
2. have a minimum variance among all linear
unbiased estimators of β0 and β1

The estimators β̂ 0 and β̂ 1 are said to be the Best Linear


Unbiased Estimators (BLUE) of β0 and β1, respectively.

Note that this theorem does not require the assumption


of normality of the errors. That is, the errors can have
any distribution and β̂ 0 and β̂ 1 are still BLUE.

Means and Variances


 
E(β̂ 0) = β0   and   Var(β̂ 0) = σ² [ 1/n + X̄²/SSXX ]

E(β̂ 1) = β1   and   Var(β̂ 1) = σ²/SSXX

E(Ŷ h) = β0 + β1Xh   and   Var(Ŷ h) = σ² [ 1/n + (Xh − X̄)²/SSXX ]


1.5 Example: Toluca Company

Figure 1.7: SYSTAT Scatter Plot and Fitted Regression Line


for Toluca Company Example.


Simple Linear Regression with SAS

data TOLUCA;
input SIZE HOURS;
datalines;
80 399
30 121
50 221
90 376
70 361
60 224
120 546
80 352
100 353
50 157
40 160
70 252
90 389
20 113
110 435
100 420
30 212
50 268
90 377
110 421
30 273
90 468
40 244
80 342
70 323
;
run;

proc reg data = TOLUCA;


model HOURS = SIZE;
run;


SAS Output

The REG Procedure


Model: MODEL1
Dependent Variable: HOURS

Number of Observations Read 25


Number of Observations Used 25

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 252378 252378 105.88 <.0001


Error 23 54825 2383.71562
Corrected Total 24 307203

Root MSE 48.82331 R-Square 0.8215


Dependent Mean 312.28000 Adj R-Sq 0.8138
Coeff Var 15.63447

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 62.36586 26.17743 2.38 0.0259


SIZE 1 3.57020 0.34697 10.29 <.0001
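Reading off the parameter estimates, the fitted regression line for the Toluca data is Ŷ = 62.37 + 3.5702 · SIZE (work hours against lot size), with s = Root MSE = 48.82 hours and R² = 0.8215.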

Chapter 2

Inferences in Regression Analysis

Chapter 2 in Kutner et al.

• Inference for Regression Coefficients

• Inference for Mean Response

• Predicting New Observations

• Analysis of Variance

• General Linear Model Test

• Coefficient of Determination, R2


In linear regression, estimation can be done using LSE,


which does not require the normality assumption. From
LSE, we obtain β̂ 1 and β̂ 0, which are the point estimators
of the parameters β1 and β0, respectively.

Statistical inference, however, is based on the normal


distribution.
• The errors, εi, are assumed to be mutually
independent and (identically) normally distributed.
• We rely on the Central Limit Theorem and assume
that the inference is approximately correct.

Inference generally takes the form of either of two related


techniques:
1. Hypothesis Tests
2. Confidence Intervals

2.1 Inference for β1

For most problems we are primarily interested in β1.


Why?
β1 = 0 ⇐⇒ no linear relation between X and Y
But,
β1 = 0 ⇎ no relation between X and Y

2.1.1 Sampling Distribution of β̂ 1

For inference about β1, we need to know the variability of


β̂ 1. That is, if we take another sample and compute
another value of β̂ 1, how close will it be to the original
estimate?

The distribution of values of β̂ 1, resulting from repeated


sampling, is called the sampling distribution of β̂ 1.

Note that in Section 1.4.2, we said that


E(β̂ 1) = β1   and   Var(β̂ 1) = σ²/SSXX .

(See Kutner et al., p. 42-43, for the proofs of these results.)

Note also that since s2 = M SE is an estimate for σ 2,


then an estimate for V ar(β̂ 1) is given by:
s²(β̂ 1) = MSE / SSXX .

This gives us the mean and the variance for the


distribution of β̂ 1. However, it does not tell us the form
(shape) of the distribution.

So how do we determine the shape of the sampling


distribution of β̂ 1?


Normality

If the Y ’s (or equivalently, the ε’s) are normally


distributed, then the distribution of β̂ 1 is also normal.

The following are two key results that we will use to show
that β̂ 1 has a normal distribution:

1. β̂ 1 is a linear combination of the observed Yi’s.


This means that β̂ 1 is a weighted sum of the
observations, i.e.
β̂ 1 = Σᵢ₌₁ⁿ wiYi ,   where   wi = (Xi − X̄) / Σᵢ₌₁ⁿ (Xi − X̄)² .
(See Kutner et al. for a formal proof.)

2. A linear combination of independent, normal random


variables is also normally distributed.
(This is a theorem that we do not prove in this course.)

As a result of this theorem,


• A sum of independent, normal random variables
has a normal distribution.

• Also, multiplying a normal random variable by a


constant (e.g., wi), results in a normal random
variable.

To show result (1) above, consider:


β̂ 1 = SSXY / SSXX = Σ(Xi − X̄)(Yi − Ȳ ) / SSXX
    = Σ(Xi − X̄)Yi / SSXX − Σ(Xi − X̄)Ȳ / SSXX
    = Σ(Xi − X̄)Yi / SSXX − Ȳ [ Σ(Xi − X̄) ] / SSXX
    = Σ(Xi − X̄)Yi / SSXX − 0 ,
since Σ(Xi − X̄) = 0. Thus,

β̂ 1 = Σ(Xi − X̄)Yi / SSXX ≡ Σ wiYi ,   with   wi = (Xi − X̄)/SSXX ,

implying that it is a weighted sum of independent, normal variables. Hence, by result (2) above, β̂ 1 has a normal distribution.

Therefore,
β̂ 1 ∼ N( β1 , σ² / Σ(Xi − X̄)² ) .

Although the sampling distribution of β̂ 1 is known, σ 2 is


not. But, M SE provides an independent estimate of σ 2.

So, we can use the t-distribution for hypothesis tests and


confidence intervals for β1.

2.1.2 Confidence Interval for β1

The upper and lower bounds for a level (1 − α) × 100%


confidence interval are given by:
β̂ 1 ∓ t_{1−α/2, n−2} √( s²(β̂ 1) )

or

β̂ 1 ∓ t_{1−α/2, n−2} √( MSE / Σ(Xi − X̄)² ) ,

where t_{1−α/2, n−2} is the upper (α/2) quantile from a


t-distribution with (n − 2) degrees of freedom (df ).

2.1.3 Hypothesis Test for β1

To test the hypotheses H0 : β1 = c, the following test


statistic is used:
t∗ = (β̂ 1 − c) / √( MSE / Σ(Xi − X̄)² )   ∼   t_{n−2} under H0 .

That is, under the null hypothesis, the test statistic t∗


has a t-distribution with (n − 2) df .

The exact form of the critical region for the test will be
determined by the alternative hypothesis, HA.

2.1.4 Example: MgCO3 Content of Sand Dollars

The skeletal chemical composition is believed to be


related to the mean (summer) water temperature. The
summary statistics and parameter estimates, from n = 10
observations, are as follows:

X̄ = 16.49 SSXX = 86.329


Ȳ = 8.94 SSY Y = 1.084
β̂ 1 = 0.0994 SSXY = 8.584
β̂ 0 = 7.3003 M SE = 0.0288

SAS Code

data SAND;
input mgco3 temp;
datalines;
9.20 17.50
9.20 21.00
9.40 20.00
9.00 15.30
8.50 14.00
8.50 13.10
8.80 13.30
8.50 13.00
9.30 19.00
9.00 18.70
;
run;

proc reg data = SAND;


model mgco3 = temp / CLB COVB;
run;


SAS Output

The REG Procedure


Model: MODEL1
Dependent Variable: mgco3

Number of Observations Read 10


Number of Observations Used 10

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 0.85354 0.85354 29.63 0.0006


Error 8 0.23046 0.02881
Corrected Total 9 1.08400

Root MSE 0.16973 R-Square 0.7874


Dependent Mean 8.94000 Adj R-Sq 0.7608
Coeff Var 1.89853

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t| 95% Confidence Limits

Intercept 1 7.30034 0.30597 23.86 <.0001 6.59476 8.00592


temp 1 0.09943 0.01827 5.44 0.0006 0.05731 0.14156

Covariance of Estimates

Variable Intercept temp

Intercept 0.0936198873 -0.005502675


temp -0.005502675 0.0003336977


To evaluate if temperature has any explanatory ability for


MgCO3 content, we test

H0 : β1 = 0 versus HA : β1 6= 0 .

Constructing the test statistic,

t∗ = (β̂ 1 − β1) / s(β̂ 1) = (0.0994 − 0) / √(0.0288/86.329) = 0.0994 / √0.0003336 = 5.44 .

Under H0, that is, if H0 is true, then the statistic t∗ has a


t-distribution with
df = n − 2 = 8.
This results in a p-value of
P (|t∗| ≥ 5.44) = 2P (t∗ ≥ 5.44) = 0.0006.
(See Appendix A for details about the p-value.)

So, we reject H0 : β1 = 0 and conclude that β1 6= 0.

This means that there is evidence to conclude that


summer temperature is linearly related to MgCO3
content.
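The CLB limits in the SAS output above can be reproduced by hand (a quick sketch of the check, using t_{0.975, 8} = 2.306 from the t-table):

0.09943 ∓ 2.306 (0.01827)   =⇒   (0.0573, 0.1416) ,

which matches the reported 95% confidence limits (0.05731, 0.14156) for the slope.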


2.2 Inference for β0

The sampling distribution of β̂ 0 is


  
β̂ 0 ∼ N( β0 , σ² [ 1/n + X̄²/SSXX ] ) .

Using exactly the same type of argument as for that for


β̂ 1, it can be shown that β̂ 0 is a linear combination of
independent, normally distributed variables.
β̂ 0 = Ȳ − β̂ 1X̄ = Σ (Yi/n) − (Σ wiYi) X̄
    = Σ (Yi/n) − Σ wiX̄Yi
    = Σ [ 1/n − wiX̄ ] Yi
Thus,
β̂ 0 = Σ [ 1/n − wiX̄ ] Yi ≡ Σ kiYi ,   with   ki = 1/n − wiX̄ .

Note also that an estimate for V ar(β̂ 0) is given by:


 
s²(β̂ 0) = MSE [ 1/n + X̄²/SSXX ] .

So we can use the usual form of a hypothesis test and/or


a confidence interval for β0.

A (1 − α) × 100% confidence interval for β0 is given by:


β̂ 0 ∓ t_{1−α/2, n−2} √( s²(β̂ 0) ) .

To test the hypotheses H0 : β0 = c, the following test statistic is used:

t∗ = (β̂ 0 − c) / √( s²(β̂ 0) )   ∼   t_{n−2} under H0 .

2.3 Inference for Mean Response

Often, one of the major goals of regression analysis is to


estimate the mean of the distribution of responses for a
specific value of the predictor X. Typically, we use a
confidence interval for this task.
First, we need the sampling distribution of Ŷ h ≡ Ê(Yh)
at the value X = xh.

2.3.1 Sampling Distribution of Ŷ h

Recall that the estimated mean response at X = xh is


Ê(Yh) ≡ Ŷ h = β̂ 0 + β̂ 1xh .

In Sections 2.1 and 2.2, it was noted that β̂ 0 and β̂ 1 are


linear combinations of the Yi’s.

Writing β̂ 0 = Σᵢ₌₁ⁿ kiYi and β̂ 1 = Σᵢ₌₁ⁿ wiYi, we obtain

Ŷ h = β̂ 0 + β̂ 1xh
    = Σᵢ₌₁ⁿ kiYi + xh Σᵢ₌₁ⁿ wiYi
    = Σᵢ₌₁ⁿ (kiYi + xhwiYi)
    = Σᵢ₌₁ⁿ (ki + xhwi)Yi
Thus,
Ŷ h = Σᵢ₌₁ⁿ (ki + xhwi)Yi ≡ Σᵢ₌₁ⁿ ciYi ,   with   ci = ki + xhwi ,
is also a linear combination of the observations.

Because the Yi’s are independent, normally distributed


variables, Ŷ h also has a normal distribution. In fact,

E(Ŷ h) = β0 + β1xh
Var(Ŷ h) = σ² [ 1/n + (xh − X̄)²/SSXX ] .

We can substitute M SE for σ 2 and use the t-distribution


to construct intervals for the mean value of Y at X = xh.

A (1 − α) × 100% confidence interval for E(Yh) is:


Ŷ h ∓ t_{1−α/2, n−2} √( MSE [ 1/n + (xh − X̄)²/SSXX ] ) .

2.3.2 Example: MgCO3 Content of Sand Dollars

Suppose now we would like to estimate the mean MgCO3


content for sand dollars when the mean summer
temperature is 15◦C (i.e. Xh = 15).

A point estimate for this mean response is


Ŷ h = 7.30034 + 0.09943(15) = 8.7918.

But what is a likely range of values for the mean MgCO3


content?

The standard error for Ŷ h is


s(Ŷ h) = √( MSE [ 1/n + (xh − X̄)²/SSXX ] )
       = √( 0.0288 [ 1/10 + (15 − 16.49)²/86.329 ] ) = 0.0602
From the t-table, t0.975 , 8 = 2.306. Hence, a 95%
confidence interval for the mean response when X = 15 is

8.7918 ∓ 2.306(0.0602) =⇒ (8.6530, 8.9306).

We are thus 95% confident that the true mean MgCO3


content when the mean summer temperature is 15◦C is
between 8.65 and 8.93.


SAS Code

data SAND;
input mgco3 temp;
datalines;
9.20 17.50
9.20 21.00
9.40 20.00
9.00 15.30
8.50 14.00
8.50 13.10
8.80 13.30
8.50 13.00
9.30 19.00
9.00 18.70
. 15
;
run;

proc reg data = SAND;


model mgco3 = temp / CLM CLI;
run;


SAS Output
The REG Procedure
Model: MODEL1
Dependent Variable: mgco3

Number of Observations Read 11


Number of Observations Used 10
Number of Observations with Missing Values 1

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 0.85354 0.85354 29.63 0.0006


Error 8 0.23046 0.02881
Corrected Total 9 1.08400

Root MSE 0.16973 R-Square 0.7874


Dependent Mean 8.94000 Adj R-Sq 0.7608
Coeff Var 1.89853

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 7.30034 0.30597 23.86 <.0001


temp 1 0.09943 0.01827 5.44 0.0006

Output Statistics

Dependent Predicted Std Error


Obs Variable Value Mean Predict 95% CL Mean 95% CL Predict Residual

1 9.2000 9.0404 0.0568 8.9095 9.1713 8.6277 9.4531 0.1596


2 9.2000 9.3884 0.0983 9.1617 9.6152 8.9361 9.8408 -0.1884
3 9.4000 9.2890 0.0836 9.0962 9.4818 8.8527 9.7253 0.1110
4 9.0000 8.8217 0.0579 8.6881 8.9552 8.4081 9.2352 0.1783
5 8.5000 8.6924 0.0704 8.5302 8.8546 8.2687 9.1161 -0.1924
6 8.5000 8.6029 0.0819 8.4139 8.7919 8.1683 9.0375 -0.1029
7 8.8000 8.6228 0.0792 8.4401 8.8055 8.1909 9.0547 0.1772
8 8.5000 8.5930 0.0833 8.4008 8.7852 8.1569 9.0290 -0.0930
9 9.3000 9.1896 0.0706 9.0268 9.3524 8.7657 9.6135 0.1104
10 9.0000 9.1597 0.0672 9.0049 9.3146 8.7388 9.5807 -0.1597
11 . 8.7918 0.0602 8.6531 8.9306 8.3766 9.2071 .

Sum of Residuals 0
Sum of Squared Residuals 0.23046
Predicted Residual SS (PRESS) 0.36696


2.4 Predicting New Observations

In this section, we consider two possibilities in the


prediction of the response for new observations:

• prediction of a single new observation, Yh(new),


corresponding to a given level X = xh of the
predictor variable

• prediction of the mean of m new observations on Y


for a given level X = xh of the predictor variable

2.4.1 Predicting a Single New Observation

To construct a prediction interval for a single new


observation at X = xh, we need to account for two
sources of variability:
1. the variability in the possible location of the
distribution of Y ; and,
2. the variability within the probability distribution of Y .

The first component is the variability in estimating the


line (i.e. the mean response).

The second is the variability of a new observation, given


its mean value (this corresponds to the variability of εi).


The point estimator for a new observation is just the


value of the fitted line at X = xh, i.e.
Ŷ h(new) = β̂ 0 + β̂ 1xh .

Note that this is the same point estimator that we use for
estimating the mean response:
Ŷ h = β̂ 0 + β̂ 1xh .
The difference now is that the prediction of a single new
observation is more variable than when estimating the
mean response, i.e.
 
Var( Ŷ h(new) ) = σ² [ 1 + 1/n + (xh − X̄)²/SSXX ] .

This is estimated as

s²( Ŷ h(new) ) = MSE [ 1 + 1/n + (xh − X̄)²/SSXX ]
              = MSE + MSE [ 1/n + (xh − X̄)²/SSXX ]
              = MSE + s²(Ŷ h) .

Note that the estimated variance of a single new


prediction (term on the LHS) is always larger than the
estimated variance of the estimated mean response
(second term on the RHS).

To construct the prediction interval, we just construct a


confidence interval based on the t-distribution with
(n − 2) df . The prediction interval is given by:
Ŷ h(new) ∓ t_{1−α/2, n−2} s( Ŷ h(new) ) .

This interval is called a prediction interval because we


are predicting a new observation of a random variable.
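As a quick numerical check (a sketch using the sand dollar output of Section 2.3.2, with xh = 15):

s( Ŷ h(new) ) = √( MSE + s²(Ŷ h) ) = √( 0.0288 + 0.0602² ) ≈ 0.1801 ,

so the 95% prediction interval is 8.7918 ∓ 2.306 (0.1801) = (8.377, 9.207), matching (up to rounding) the "95% CL Predict" limits (8.3766, 9.2071) printed earlier.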

Example: Tree Growth Data (Juday)

Consider the scatter plot of data collected by Dr. Juday


showing the growth of white spruce trees and its relation
to the previous year’s mean annual temperature. The
least squares regression line is also shown on the graph.
Figure 2.1: Scatter Plot and Fitted Regression Line for Tree Growth Data.


To make the difference more clear between a confidence


interval for the mean response and a prediction interval,
look at the following graphs.
Figure 2.2: Confidence Intervals for the Figure 2.3: Prediction Intervals for
Mean Response for Tree Growth Data. Future Responses for Tree Growth Data.

The plot on the left shows the line of best fit along with
the 95% confidence interval for the mean response of
growth at each value of temperature (X).

The plot on the right shows the (pointwise) 95%


prediction interval for future growth observations at each
value of temperature (X).

SAS Code
proc reg corr simple;
model growth = temp;
plot growth*temp/conf95 vaxis=-1 to 5 by 1;
plot growth*temp/pred95 vaxis=-1 to 5 by 1;
run;


2.4.2 Predicting the Mean of m New


Observations

Consider m new observations at the same level X = xh.


Let Ȳh(new) denote the mean response of these new cases.

The estimator for Ȳh(new) is the predicted value itself,


since all cases with the same level of X take the same
prediction, i.e.
Ȳh(new) = Ŷ h .

The variance is given by


 
 
Var( Ȳh(new) ) = σ² [ 1/m + 1/n + (xh − X̄)²/SSXX ] ,

and the estimated variance is

s²( Ȳh(new) ) = MSE [ 1/m + 1/n + (xh − X̄)²/SSXX ]
             = MSE/m + s²(Ŷ h) .

Note the difference between the latter and s²( Ŷ h(new) ).

The prediction interval for the mean response of m new observations at the same level X = xh is thus given by:

Ŷ h ∓ t_{1−α/2, n−2} s( Ȳh(new) ) .
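For illustration (a sketch computed here from the sand dollar quantities, with m = 3 new observations at xh = 15):

s²( Ȳh(new) ) = 0.0288/3 + 0.0602² ≈ 0.0132 ,   s( Ȳh(new) ) ≈ 0.115 ,

giving 8.7918 ∓ 2.306 (0.115) = (8.527, 9.057): narrower than the single-observation prediction interval, but wider than the confidence interval for the mean response.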


2.5 Confidence Band for the Regression


Line

In contrast to a confidence interval for a single mean


response, a confidence band for the entire regression
line,
E(Y ) = β0 + β1X,
can also be obtained. This band provides the region
within which the entire regression line is expected to lie.

Whereas the confidence limits for a single mean response


are expected to encompass E(Yh) at the single level
X = xh, the confidence band must encompass the entire
regression line.

Working-Hotelling Band

The Working-Hotelling (1 − α) × 100% confidence band


for the regression line has the following two boundary
values at any level X = xh :

Ŷ h ∓ W s(Ŷ h) ,

where:

W = √( 2 F_{1−α; 2, n−2} )   and   s²(Ŷ h) = MSE [ 1/n + (xh − X̄)²/SSXX ] .
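For instance (a sketch using a standard F table, not a value quoted in the course pack): for the sand dollar data with n = 10 and α = 0.05, W = √( 2 F_{0.95; 2, 8} ) = √( 2 × 4.46 ) ≈ 2.99, noticeably larger than t_{0.975, 8} = 2.306, so the band at any xh is wider than the pointwise confidence interval (see also the Note below).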

Note:

1. For the same confidence level, the confidence band


differs from a CI for the mean response only in the
constant multiple, W .
• CI for the mean response, E(Yh) ⇒ uses t-value
• confidence band for reg’n. line ⇒ uses W -value

2. The confidence band is wider than the pointwise


confidence intervals for the mean response,
• since, for the same α-level and df , W ≥ t
• this adjustment is necessary to ensure that the
confidence band encompasses the entire regression
line, and not just a single mean response

2.6 Analysis of Variance

Analysis of Variance (or ANOVA) is another tool for


interpreting the results from a linear regression.

We start with the total variation in the response, Y :


SSTO ≡ SSYY = Σᵢ₌₁ⁿ (Yi − Ȳ )² .

This is referred to as the “total sum of squares” or


“sum of squares total” (SST O) or “sum of
squares total (corrected)”.

In the latter, the word “corrected” means that the Y


values are corrected or adjusted for the mean, Ȳ . Hence,
SST O measures the variability of the Y values around
their observed mean.

Analysis of variance methods partition SST O into


different sources. Those sources that account for a large
part of the total variation are “important” in explaining
the responses, while those that account for small
amounts are less important. (The ANOVA F -test helps
us determine which are small and which are large.)

We now show how the SST O is partitioned into different


sources.

The deviation (Yi − Ȳ ) will be partitioned into different


component parts.

Figure 2.4: Illustration of Partitioning of Total Deviations (Yi − Ȳ ) for Toluca


Company Example (not drawn to scale; only observations Y1 and Y2 are shown).


So we begin with (Yi − Ȳ ), then add and subtract Ŷ i.

(Yi − Ȳ ) = (Yi − Ŷ i) + (Ŷ i − Ȳ )
total deviation = deviation around fitted regression line + deviation of fitted value around mean

We now take the square of these equivalent expressions for the deviation and then sum over all observations:

SSTO = Σᵢ₌₁ⁿ (Yi − Ȳ )²

     = Σᵢ₌₁ⁿ (Yi − Ŷ i + Ŷ i − Ȳ )²

     = Σᵢ₌₁ⁿ [ (Yi − Ŷ i)² + (Ŷ i − Ȳ )² + 2(Yi − Ŷ i)(Ŷ i − Ȳ ) ]

     = Σᵢ₌₁ⁿ (Yi − Ŷ i)² + Σᵢ₌₁ⁿ (Ŷ i − Ȳ )² + 2 Σᵢ₌₁ⁿ (Yi − Ŷ i)(Ŷ i − Ȳ )

SSTO = SSE + SSR + 0

The first term,


SSE = Σᵢ₌₁ⁿ (Yi − Ŷ i)² = Σᵢ₌₁ⁿ ei² ,

is just the sum of the squared residuals and is called the


“error sum of squares” or “sum of squares due to
error” (SSE).

The second term,


SSR = Σᵢ₌₁ⁿ (Ŷ i − Ȳ )² ,

is called the “regression sum of squares” or “sum


of squares due to regression”. It describes the
variability accounted for by the fitted regression line.

The last term,


2 Σᵢ₌₁ⁿ (Yi − Ŷ i)(Ŷ i − Ȳ ) = 0 .

It is not at all obvious that this should equal zero (but it


does!). The reason is that the least squares regression
estimators (β̂ 0 and β̂ 1) are constructed so that this sum
is equal to zero.

Note that we have divided or decomposed the total


variation in the response, i.e. the variation of the
responses Yi about their mean Ȳ , into 2 parts:
• that part that can be explained by the regression line
• that part which remains unexplained

Or, respectively,
• SSR ⇒ the “explained” part
• SSE ⇒ the “unexplained” part


To compute all these quantities, we use the following


formulae:
SSTO = SSYY = Σᵢ₌₁ⁿ (Yi − Ȳ )² = Σᵢ₌₁ⁿ Yi² − (Σᵢ₌₁ⁿ Yi)²/n

SSR = (SSXY)²/SSXX = β̂ 1² SSXX

SSE = SSTO − SSR

Recall that this last equation is true because


SST O = SSR + SSE .

To summarize this partitioning, we use an analysis of


variance (ANOVA) table. It is a handy way of keeping
track of the sums of squares (SS).

Source of Variation    df       SS      MS                   E(MS)

Regression             1        SSR     MSR = SSR/1          σ² + β1² SSXX

Error                  n − 2    SSE     MSE = SSE/(n − 2)    σ²

Total                  n − 1    SSTO


Usually, the expected mean squares, or E(M S), column


is not printed. You are supposed to figure that out for
yourself. Instead, an F -statistic is usually included as
part of the ANOVA table.

The F -statistic is defined as:

F∗ = MSR / MSE   ∼   F_{1, n−2} under H0 .

Hence, assuming H0 is true, F ∗ has an F -distribution


with 1 and (n − 2) degrees of freedom.
(Check Kutner et al., Table B.4, p. 1320, for tabled values
of the F -distribution, or use software capable of giving
quantiles and/or tail probabilities for the F -distribution.)

The F -statistic tests the null hypothesis that the


regression model is “significant”, that is, whether the
regression model has any predictive or explanatory ability.

Specifically, the F -statistic tests the following hypotheses:

H0 : β1 = 0 versus HA : β1 6= 0 .

(Note that this is the same set of hypotheses as that of the


t-test for the significance of β1.)


Motivation for the F -test can be gained by looking at the


expected mean squares, E(M S), for M SR and M SE.

Consider the ratio of E(M SR) to E(M SE):

E(MSR) / E(MSE) = ( σ² + β1² SSXX ) / σ²

If β1 is close to 0, then the ratio will be approximately


equal to 1, and consequently, we can expect the statistic
F ∗ = M SR/M SE to be close to 1.

Conversely, if β1 6= 0, the numerator is larger and the


ratio will be greater than 1, which in turn will lead to an
F ∗-statistic that is larger than or very different from 1.

Hence, large values of the F ∗-statistic would support the


alternative hypothesis, while values of the F ∗ -statistic
close to 1 support the null hypothesis.

Note also that the F -test automatically does a two-sided


test. In contrast with the t-test, the F -test cannot test
for specific alternative hypotheses of the form

HA : β1 < 0 or HA : β1 > 0 .


Example: MgCO3 Content of Sand Dollars

Recall the summary statistics, for n = 10 observations,


for the regression of MgCO3 content on mean summer
temperature (◦C).
SSXX = 86.329 n = 10
SSY Y = 1.084 M SE = 0.0288
SSXY = 8.584

The ANOVA table and parameter estimates are shown


below.
Analysis of Variance

Source of Variation df SS MS F∗ p
Regression 1 0.85354 0.85354 29.63 0.0006
Error 8 0.23046 0.02881
Total 9 1.08400

Parameter Estimates

Predictor Coef Stdev t∗ p


Constant 7.3003 0.3060 23.86 <0.0001
Temperature 0.0994 0.0183 5.44 0.0006
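These entries can be reproduced from the summary statistics above (a sketch of the hand computation):

SSR = (SSXY)²/SSXX = (8.584)²/86.329 = 0.8535 ,
SSE = SSTO − SSR = 1.0840 − 0.8535 = 0.2305 ,   MSE = 0.2305/8 = 0.0288 ,
F∗ = MSR/MSE = 0.8535/0.0288 ≈ 29.6 .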


2.7 F -Test versus t-Test

Since the F -test for simple linear regression and the


t-test both test the null hypothesis H0 : β1 = 0, one
might wonder if there is a link between those tests.

Indeed there is!


2
∗ M SR β̂ 1 SSXX
F = =
M SE M SE
2
β̂ 1
=
M SE/SSXX
 2
 β̂ 1 
F∗ = 


∗ 2
 = (t )

s(β̂ 1)

So, for the case of one explanatory variable, the


F -statistic is equal to the t-statistic squared.

Note that in our example (5.44)2 = 29.63.

Hence, in simple linear regression, either the F -test or


the t-test can be used to test

H0 : β1 = 0 versus HA : β1 6= 0 .

However, the t-test is more flexible, since it can be used


for one-sided alternatives.

2.8 General Linear Test

In addition to the F - and t-tests, there is one more


method for testing H0 : β1 = 0.

The General Linear Test is based on the reduction in


sum of squares, and can be used for more general types
of hypothesis for a linear model.
(We will also use this test later on for other forms of
hypotheses in the case of Multiple Linear Regression.)

The test is based on the following rationale:


By how much is the model improved by adding
another explanatory variable?

The full model,


Yi = β0 + β1Xi + εi ,
gives
SSE(F ) = Σᵢ₌₁ⁿ (Yi − Ŷ i)² = SSE .

The reduced model,

Yi = β0 + εi ,

yields

SSE(R) = Σᵢ₌₁ⁿ (Yi − Ȳ )² = SSTO .

Note that the reduced model is the resulting model if we


assume that the null hypothesis, H0 : β1 = 0, is true.

To test if the full model fits the data significantly better


than the reduced model, we can use an F -test of a
different form.
F∗ = [ (SSE(R) − SSE(F )) / (dfR − dfF) ] / [ SSE(F ) / dfF ]   ∼   F_{dfR−dfF , dfF} under H0 ,
where dfR and dfF are the respective error df under the
reduced and full models.

If F ∗ is large in comparison to the F -quantile, we


conclude that the full model fits significantly better than
the reduced model. That is, β1 6= 0.

Note: The general linear test approach above can be


used for highly complex tests of linear statistical models,
as well as for simple tests.

How can we do this?


1. Fit the full model and obtain SSE(F ).
2. Fit the reduced model and obtain SSE(R).
3. Compute F ∗ to test whether the full model
significantly improves the reduced model.
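In SAS, the same reduction-in-SSE comparison can also be requested directly with the TEST statement of PROC REG, without fitting the reduced model by hand (a minimal sketch; for this simple model it reproduces the overall F -test, and the explicit two-model fit shown next gives the same value):

proc reg data = SAND;
   model mgco3 = temp;
   * general linear test of H0: beta1 = 0 (full vs. intercept-only model) ;
   test temp = 0;
run;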

SAS Code and SAS Output

proc reg data = SAND;


model1: model mgco3 = temp;
model2: model mgco3 = ;
run;

Model: model1 - Dependent Variable: mgco3

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 0.85354 0.85354 29.63 0.0006


Error 8 0.23046 0.02881
Corrected Total 9 1.08400

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 7.30034 0.30597 23.86 <.0001


temp 1 0.09943 0.01827 5.44 0.0006

Model: model2 - Dependent Variable: mgco3

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 0 0 . . .
Error 9 1.08400 0.12044
Corrected Total 9 1.08400

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 8.94000 0.10975 81.46 <.0001


F∗ = [ (SSE(R) − SSE(F )) / (dfR − dfF) ] / [ SSE(F ) / dfF ]
   = [ (1.08400 − 0.23046) / (9 − 8) ] / [ 0.23046 / 8 ]
   = 0.85354 / 0.02881
F∗ = 29.6265

For some pre-specified significance level, α, we compare


F ∗ = 29.6265 to the tabulated F -value, F1−α ; 1,8 .
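For α = 0.05, the tabulated value is F_{0.95; 1, 8} = 5.32 (read here from a standard F table as a sketch), so F∗ = 29.63 far exceeds it and H0 : β1 = 0 is rejected, in agreement with the p-value of 0.0006.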

Note that, in this situation, the General Linear Test and


the overall F -test (from the ANOVA table of Model 1
above) yield the same value. It is because they are both
testing the hypothesis H0 : β1 = 0.

If we consider another hypothesis, different from that


being tested by the overall ANOVA F -test, then the
latter and the General Linear Test will not necessarily
yield the same test statistic value.

(You will see some examples of this later in the course when
we discuss Multiple Linear Regression and more complex
types of hypotheses in Chapter 5.)

2.9 R2 Coefficient of Determination

The coefficient of determination, R2, measures the


proportion of total variability in Y that is explained by
the linear regression with X.

R² = SSR/SSTO = (SSTO − SSE)/SSTO = 1 − SSE/SSTO

Note:

1. 0 ≤ R2 ≤ 1. Why?
2. If all the data points fall exactly on a line having
non-zero slope, then R2 = 1.
3. If β̂ 1 = 0, then R2 = 0.
4. The square root of R2 yields the correlation
coefficient, r, between X and Y , i.e.

r = ± √R² ,
where the sign is chosen according to the sign of β̂ 1.

5. The estimated slope β̂ 1 and r = √R² are related
as follows:
r = β̂ 1 (sx/sy) ,
where sx and sy are the standard deviations of X and
Y , respectively.

Recall that
r = SSXY / √( SSXX · SSYY ) .

So, one may think of r as a kind of dimensionless


regression coefficient, where the “scale” of X and Y has
been changed to units of standard deviation.

That is, for a one standard deviation increase (sx) in X,


we expect an r standard deviation increase (sy ) in Y .

In other words, r measures the strength of the linear


relationship between X and Y .

Common Misconceptions About R2:

1. A high coefficient of determination indicates that


useful predictions can be made.
2. A high coefficient of determination indicates that the
estimated line is a good fit.
3. A coefficient of determination near zero indicates that
X and Y are not related.


Example: MgCO3 Content of Sand Dollars

SSR = 0.8535 and SSTO = 1.0840

Thus,

R² = 0.8535/1.0840 = 0.787   and   r = √R² = 0.8871 .
Therefore, about 79% of the variability in MgCO3
content is accounted for by its regression on the mean
summer temperature.

SAS Code and SAS Output

proc corr data = SAND;


var mgco3 temp; run;

The CORR Procedure


2 Variables: mgco3 temp

Simple Statistics

Variable N Mean Std Dev Sum Minimum Maximum

mgco3 10 8.94000 0.34705 89.40000 8.50000 9.40000


temp 10 16.49000 3.09711 164.90000 13.00000 21.00000

Pearson Correlation Coefficients, N = 10


Prob > |r| under H0: Rho=0

mgco3 temp

mgco3 1.00000 0.88735


0.0006

temp 0.88735 1.00000


0.0006
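The simple statistics in this output also illustrate Note 5 of the previous section (a quick sketch of the check): r = β̂ 1 (sx/sy) = 0.09943 (3.09711/0.34705) ≈ 0.887, in agreement with the printed Pearson correlation of 0.88735.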

Chapter 3

Diagnostics and Remedial Measures

Chapter 3 in Kutner et al.

• Diagnostic Plots

• Test for Constancy of Variance

• Test for Normality

• Lack-of-Fit Test

• Remedial Measures

• Transformations


3.1 Departures from the Model

• Linearity
The regression function is not linear.

• Homoscedasticity
The error terms do not have constant variance.

• Independence
The error terms are not independent.

• Outliers
The fit is alright except for some outliers.

• Normality
The error terms are not normally distributed.

• Model Extension
Important independent variables are not in the model.

⇒ The assumptions must be checked. Why?

⇒ How can we check these assumptions?


3.2 Residuals
ei = Yi − Ŷ i
• the residual may be regarded as the “observed” error
• it is not the same as the unknown true error
εi = Yi − E(Yi)
• if the model is appropriate for the data at hand, the
residuals should reflect the properties assumed for εi
(i.e. independence, normality, zero mean and constant
variance)

Properties:

1. Mean:
   ē = Σᵢ₌₁ⁿ ei / n = 0

2. (Sample) Variance:
   s² = Σᵢ₌₁ⁿ (ei − ē)²/(n − 2) = Σᵢ₌₁ⁿ ei²/(n − 2) = SSE/(n − 2) = MSE

3. Non-Independence:
   The residuals are not independent and are subject to two constraints:
   Σᵢ₌₁ⁿ ei = 0   and   Σᵢ₌₁ⁿ Xiei = 0


Semistudentized Residuals:
e∗i = (ei − ē)/√MSE = ei/√MSE

• if √MSE were an estimate of the standard deviation
of ei, then e∗i would be the studentized residual
(Studentized and other residuals will be discussed in Chapter 6.)

• the standard deviation of ei is not equal to √MSE ;
it varies for each ei

• √MSE is only an approximation of the standard
deviation of ei; hence, we call e∗i a semistudentized
residual

3.3 Diagnostics Plots

Diagnostic plots are often as important as any of


the calculations.
⇒ they can tell you whether regression is even
appropriate
In the following sections, we will present a list of plots to
make and descriptions of what to look for. For all the
plots below, X denotes the generic explanatory variable,
while Y denotes the generic response.
63
Chapter 3. Diagnostics and Remedial Measures C. Sotto

A few questions to consider when you analyze plots.


1. Do the residuals follow any pattern indicating
nonlinearity?
2. Are there any outliers?
3. Does the assumption of constant variance look
correct?
4. Label any qualitative variables on the plot. Any
patterns?

3.3.1 Univariate Plots of X and Y

Useful for:
1. finding outliers and/or misrecorded values
2. examining the shape of the distribution

Typical Univariate Plots:


• box plots of X and Y
• stem-and-leaf plots for X and Y
• histograms for X and Y
• dot plots for X and Y
• time plots for X and Y

64
Chapter 3. Diagnostics and Remedial Measures C. Sotto

3.3.2 Bivariate Plots

Some of the more commonly used bivariate plots for regression analysis are:

• plot of X vs. Y
• plot of e vs. X
• plot of e vs. Ŷ
• plot of e vs. time order
• plot of e∗ vs. X
• normal quantile plot

(The plots of e vs. X, e vs. Ŷ and e vs. time order are collectively referred to as residual plots.)
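Most of these plots can be produced directly from PROC REG; a minimal sketch with the generic data set and variable names used elsewhere in these notes (individual plots are requested with the PLOT statement, as in the case study later in this chapter):

SAS Code (sketch)

proc reg data=dataset;
  model Y = X;
  plot residual.*X;             /* residuals versus the predictor      */
  plot residual.*predicted.;    /* residuals versus the fitted values  */
run; quit;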

3.3.2.1 Plot of X vs. Y

• a plot of X vs. Y is called a scatter plot


• useful even before diagnostic checking, to provide
insight into the relationship between X and Y
• can indicate nonlinearity, outliers, and non-constancy
of the error variance

Some Considerations:

1. Is the relationship linear? Nonlinear?


2. Are there two-dimensional outliers?
3. Does the assumption of constant variance look
reasonable?
65
Chapter 3. Diagnostics and Remedial Measures C. Sotto

The relation between X and Y in the following scatter


plot looks “almost” linear.

Figure 3.1: Fairly Linear Scatter Plot.

Thus, for this data, the assumption of linearity does not


seem completely unreasonable.

66
Chapter 3. Diagnostics and Remedial Measures C. Sotto

Note:

In some situations, a scatter plot distinguishing different


values of categorical variables can be informative.

In the scatter plot below, each group is given a different


plotting symbol.

Figure 3.2: Scatter Plot of Data by Sub-Groups.

Here, we observe a pattern that is clustered by subgroups.

67
Chapter 3. Diagnostics and Remedial Measures C. Sotto

3.3.2.2 Plot of the Residuals versus X

• more effective in detecting nonlinearity than a scatter plot (nonlinearity is often more prominent in a residual plot than in a scatter plot)
• can also indicate other forms of model departure
besides nonlinearity (e.g., non-constancy of variance)
• any observable pattern(s) in the plot of the residuals
versus X indicates a problem with the model
assumptions!

The residuals in this graph clearly show a pattern of


low-high-low, which is typical when the relationship
between X and Y is nonlinear.
Figure 3.3: Nonlinear Residuals.

68
Chapter 3. Diagnostics and Remedial Measures C. Sotto

This next graph shows a "random" pattern for the residuals. A look at the histogram of these residuals indicates that they are at least approximately normal.
Figure 3.4: Random Residuals.

Finally, the following graph shows some indication of


non-constant error variance.
Figure 3.5: Non-Constant Variance.

69
Chapter 3. Diagnostics and Remedial Measures C. Sotto

Note:

Some specific patterns that may be evident in a residual


plot, as well as what they imply, are provided below.
Figure 3.6: Prototype Residual Plots.

• Figure 3.6(a) is ideal


• Figure 3.6(b) is indicative of nonlinearity of the
regression relation
• Figure 3.6(c) is indicative of non-constant error
variance
• Figure 3.6(d) is indicative of a linear time-related
trend effect
70
Chapter 3. Diagnostics and Remedial Measures C. Sotto

3.3.2.3 Plot of the Residuals versus Ŷ

• for simple linear regression (a single predictor), this


plot has the same information as the plot of the
residuals versus X, but on a different scale
• for multiple linear regression, this plot lets us examine
patterns in the residuals with increasing values of Y
• no systematic pattern(s) is ideal

3.3.2.4 Plot of the Residuals versus Time Collection or Spatial Arrangement

• also called a sequence plot


• only meaningful when data are collected in a time
sequence or some other type of sequence
• any pattern indicates a lack of independence
Figure 3.7: Residual Time Sequence Plots
Illustrating Non-Independence of Error Terms.

71
Chapter 3. Diagnostics and Remedial Measures C. Sotto

3.3.2.5 Plot of the Semistudentized Residuals versus X

Semistudentized residuals: e∗i = ei/√MSE

• useful for identifying outliers


• using the semistudentized version of the residual
makes it a little easier to look for outliers
• cases with e∗i outside the range (-3,3) can be
considered outliers
(Other outlier detection methods will be discussed in Chapter 6.)
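A small sketch of this plot in SAS, following the style of the other examples in these notes (the MSE value plugged in below is a placeholder to be read off the PROC REG output for the data at hand):

SAS Code (sketch)

proc reg data=dataset;
  model Y = X;
  output out=resids residual=res;
run; quit;

data resids2; set resids;
  estar = res/sqrt(2588);       /* e*_i = e_i/sqrt(MSE); 2588 is a placeholder MSE */
  flag  = (abs(estar) > 3);     /* 1 = potential outlier */
run;

proc gplot data=resids2;
  plot estar*X / vref=-3 0 3;   /* reference lines at -3, 0 and 3 */
run; quit;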

3.3.2.6 Plot of the Residuals in a Normal Quantile Plot

• also called a quantile-quantile (QQ) plot


• ordered residuals are plotted against their
corresponding quantiles from a normal distribution
e.g., the 10th percentile in the data is plotted against
the 10th percentile from a normal distribution

• if the residuals are normally distributed, then the


points should lie approximately on a straight line
• non-normal residuals show up when observations in
the tails of the distribution deviate from a straight line

72
Chapter 3. Diagnostics and Remedial Measures C. Sotto

This example plot shows residuals that deviate slightly


from normality, but are, by no means, pathological.

When compared with a normal distribution, the largest residuals are slightly too large, while the smallest residuals are slightly too small.

Figure 3.8: Normal Quantile Plot.

The next example shows a more problematic case.

73
Chapter 3. Diagnostics and Remedial Measures C. Sotto

Here, the graph shows that the moderate/middle values


of the residuals are close to the line, but large residuals
are much too large (compared to what is expected from a
normal distribution) and small residuals are also too
large.

Figure 3.9: Non-Normal Residuals.

In fact, the distribution of these residuals has a truncated


lower tail and is skewed right.

74
Chapter 3. Diagnostics and Remedial Measures C. Sotto

Normal Probability Plot (NPP)

• similar in principle to the QQ-plot


• each residual is plotted against its expected value
under normality
• the expected value, under normality, of the k-th smallest residual from a data set of size n, is approximated by:

  √MSE · Φ⁻¹[ (k − 0.375) / (n + 0.25) ] ,

  where Φ(a) = P(Z ≤ a) is the CDF of N(0, 1) at a, so Φ⁻¹(·) is the corresponding quantile
  (See p. 110-112 in Kutner et al. for a detailed description.)

• as in a QQ-plot, a plot that is approximately or nearly


linear suggests agreement with normality

Figure 3.10: Normal Probability Plot When


Error Term Distribution is Not Normal.

75
Chapter 3. Diagnostics and Remedial Measures C. Sotto

3.3.2.7 Plot of the Residuals versus Other Possible Predictors

• residuals can additionally be plotted against other


potential predictor variables that are not included in
the model but may have important effects on the
response
• any distinctive pattern(s) may indicate improvement
in the model by the addition of the omitted predictor
variable(s)

3.4 Formal Tests

Graphical analysis can often be subjective, especially


when patterns are not very distinctive. In such cases,
formal tests can be considered to assess potential
violations of the model assumptions.

3.4.1 Tests for Normality

• goodness-of-fit tests can be used to examine the


normality of the error terms
e.g., Chi-Square test, Shapiro-Wilk test,
Kolmogorov-Smirnov test, Lilliefors test
• a simple test based on the NPP, i.e. correlation test
for normality, can also be used
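As a sketch, several of these goodness-of-fit tests (Shapiro-Wilk, Kolmogorov-Smirnov, Cramér-von Mises, Anderson-Darling) can be obtained by running PROC UNIVARIATE with the NORMAL option on the saved residuals (generic data set and variable names assumed):

SAS Code (sketch)

proc reg data=dataset;
  model Y = X;
  output out=resids residual=res;
run; quit;

proc univariate data=resids normal;
  var res;
  qqplot res / normal(mu=0 sigma=est);   /* QQ-plot with a fitted normal reference line */
run;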

76
Chapter 3. Diagnostics and Remedial Measures C. Sotto

Correlation Test for Normality

Procedure:
1. Fit a regression line and compute the residuals.
2. Calculate the expected value of the ei under normality.
   a. Sort the residuals from lowest to highest.
   b. For the k-th ordered residual, its expected value under normality is approximated by:

      √MSE · Φ⁻¹[ (k − 0.375) / (n + 0.25) ] .
3. Compute the correlation coefficient between the
original ei and their expected values under normality.
4. Test the resulting correlation coefficient using
Table B.6 in Kutner et al.

SAS Code
proc reg data=dataset;
model1: model Y = X;
output out=resids residual=res; run; quit;
proc sort data=resids; by res; run;
data resids2; set resids;
/*** Suppose MSE = 2588 and n = 50. ***/
exp_res = sqrt(2588)*probit((_n_-0.375)/(50+0.25)); run;
proc corr data=resids2;
var exp_res res; run;
proc gplot data=resids2;
plot exp_res*res; run; quit;

77
Chapter 3. Diagnostics and Remedial Measures C. Sotto

3.4.2 Test for Autocorrelation

• usually only meaningful for data collected in a time


series or in some other sequence
• in cross-sectional studies, data collection should be
done so that observations are independent

Durbin-Watson Test (Kutner et al., p. 487)

Yi = β0 + β1Xi + εi ,
εi = ρ εi−1 + ui ,   |ρ| < 1 ,
ui ∼ N(0, σ²), independent
ρ : autocorrelation parameter

Hypotheses:
H0 : ρ = 0   versus   HA : ρ > 0

Statistic:

D = Σi=2..n (ei − ei−1)² / Σi=1..n ei²
Decision Rule:
D > du ⇒ do not reject H0
D < dl ⇒ reject H0
dl ≤ D ≤ du ⇒ inconclusive
Values for dl and du can be found in Table B.7 in
Kutner et al.
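In SAS, the Durbin-Watson statistic is available as an option of the MODEL statement in PROC REG. A sketch (the data set must be sorted in time, or sequence, order before fitting; "time" below is an assumed sequence variable). The bounds dl and du are still taken from Table B.7:

SAS Code (sketch)

proc sort data=dataset; by time; run;   /* sort in time/sequence order (variable name assumed) */

proc reg data=dataset;
  model Y = X / dw;                     /* prints the Durbin-Watson statistic D */
run; quit;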
78
Chapter 3. Diagnostics and Remedial Measures C. Sotto

3.4.3 Tests for Non-Constancy of Variance

3.4.3.1 Brown-Forsythe (Modified Levene) Test

• robust against departures from normality


• detects variance of megaphone type
• requires large sample size
Procedure:
1. Arrange the residuals by increasing values of X.
2. Split the sample into two (more or less equal) groups.
Group 1: n1 observations with X ≤ X̃
Group 2: n2 observations with X > X̃

3. Compute di1 = |ei1 − ẽ1| and di2 = |ei2 − ẽ2|, where


ẽ1 and ẽ2 are the group medians.
Hypotheses:

H0 : σ²grp 1 = σ²grp 2   versus   HA : σ²grp 1 ≠ σ²grp 2

Statistic: Two-Sample t-Test

t∗BF = (d̄1 − d̄2) / [ sp √(1/n1 + 1/n2) ]  ∼  tn−2  under H0 ,

where

s²p = [ Σi (di1 − d̄1)² + Σi (di2 − d̄2)² ] / (n − 2) .
79
Chapter 3. Diagnostics and Remedial Measures C. Sotto

Example: Toluca Company

Calculations for Modified Levene Test for Constancy of Error Variance

80
Chapter 3. Diagnostics and Remedial Measures C. Sotto

SAS Code

/* get residuals from linear regression of Y on X */


proc reg data=toluca;
model Y=X/P;
output out=resids p=pred r=res;
run; quit;

/* sort residuals by value of X */


proc sort data=resids;
by X; run;

/* create groups containing lower half and upper half of data */


data resid_grps;
set resids;
if _n_ <= 13 then grp=1; * _n_ is the obs. number;
if _n_ > 13 then grp=2; * 13 is half of n (n=25);
run;

/* get median of the groups created */


proc means data=resid_grps n median;
var res;
by grp; run;

* from output:
* median grp 1: -19.8759596;
* median grp 2: -2.6840404;

/* create data with deviations of residuals from the group median */


data levene;
set resid_grps;
if grp=1 then median = -19.8759596;
if grp=2 then median = - 2.6840404;
deviation = abs(res-median); * abs = absolute value;
run;

/* do a two sample t-test on the deviations */


proc ttest data=levene;
class grp;
var deviation; run;

* in the output:
* look at the t-value and p-value for ’Pooled’ method;
* this is the t-value and p-value for Levene’s test;

81
Chapter 3. Diagnostics and Remedial Measures C. Sotto

SAS Output

The MEANS Procedure

---------------------------- grp=1 --------------------------

Analysis Variable : res Residual

N Median

13 -19.8759596

---------------------------- grp=2 --------------------------

Analysis Variable : res Residual

N Median

12 -2.6840404

82
Chapter 3. Diagnostics and Remedial Measures C. Sotto

The TTEST Procedure

Variable: deviation

grp N Mean Std Dev Std Err Minimum Maximum

1 13 44.8151 32.3607 8.9753 4.04E-9 123.4


2 12 28.4503 29.5578 8.5326 3.2980 87.0000
Diff (1-2) 16.3647 31.0518 12.4307

grp Method Mean 95% CL Mean Std Dev

1 44.8151 25.2597 64.3705 32.3607


2 28.4503 9.6702 47.2305 29.5578
Diff (1-2) Pooled 16.3647 -9.3500 42.0795 31.0518
Diff (1-2) Satterthwaite 16.3647 -9.2533 41.9828

grp Method 95% CL Std Dev

1 23.2054 53.4190
2 20.9386 50.1855
Diff (1-2) Pooled 24.1339 43.5582
Diff (1-2) Satterthwaite

Method Variances DF t Value Pr > |t|

Pooled Equal 23 1.32 0.2010


Satterthwaite Unequal 22.999 1.32 0.1993

Equality of Variances

Method Num DF Den DF F Value Pr > F

Folded F 12 11 1.20 0.7710

83
Chapter 3. Diagnostics and Remedial Measures C. Sotto

SAS Code

/*************************************************/
/* IF YOU WANT TO DO THE COMPUTATIONS YOURSELF, */
/* YOU CAN GET THE NECESSARY VALUES USING THE FF */
/*************************************************/

proc means data=levene n mean css;


var deviation;
by grp;
output out=summaries n=ni mean=dbari css=cssi;
run;

* from output:
* ni = size of each group;
* dbari = mean of the absolute deviations, |ei - med(ei)|;
* cssi = corrected sum of squares, sum(di - dbari)^2;

* from these you can compute the statistic according


to the formula in the course notes (p. 79);

SAS Output

The MEANS Procedure

------------------------------- grp=1 --------------------------------

Analysis Variable : deviation

N Mean Corrected SS

13 44.8150738 12566.61

------------------------------- grp=2 --------------------------------

Analysis Variable : deviation

N Mean Corrected SS

12 28.4503367 9610.29

84
Chapter 3. Diagnostics and Remedial Measures C. Sotto

3.4.3.2 Breusch-Pagan Test

• requires εi to be independent and normally distributed


• requires large sample size
• can detect relationships such as
  loge σ²i = γ0 + γ1Xi

Procedure:
Regress the squared residuals, e²i , against Xi and obtain SSR* from this regression.

Hypotheses:
H0 : γ1 = 0   versus   HA : γ1 > 0

Statistic:

χ²BP = (SSR*/2) / (SSE/n)²  ∼  χ²1  under H0
SAS Code
proc reg data=dataset;
model1: model Y = X;
output out=resids residual=res;
run; quit;

data resids2; set resids;


ressq=res*res; run;

proc reg data=resids2;


model2: model ressq = X;
run; quit;
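The χ²BP statistic itself can then be assembled by hand, SSR* coming from the ANOVA table of model2 above and SSE and n from the original regression. A sketch with placeholder values:

data bp;
  SSR_star = 100;   SSE = 500;   n = 50;     /* placeholders: replace with the actual values */
  chi_bp  = (SSR_star/2) / ((SSE/n)**2);
  p_value = 1 - probchi(chi_bp, 1);          /* upper-tail p-value from chi-square(1) */
run;

proc print data=bp; run;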

85
Chapter 3. Diagnostics and Remedial Measures C. Sotto

3.4.4 Outlier Identification

• semistudentized residuals that are > 3 or < −3


• observations with outlying residuals on a residual plot
• observations whose residual or Y value is an outlier
on a box plot
• refit the model to the data without the suspect
observation, then compute a prediction interval for a
new Y at the X-value of the suspect observation
Does the interval include the suspect observation?
• many other, more formal techniques are available for identifying outliers (see Barnett and Lewis, Outliers in Statistical Data, 1984)

Figure 3.11: Residual Plot with Outlier.

86
Chapter 3. Diagnostics and Remedial Measures C. Sotto

Figure 3.12: Distorting Effect on Residuals Caused by an Outlier


When Remaining Data Follow Linear Regression.

Procedure:
1. Fit a new regression line without the suspect
observation.

2. Calculate the probability that in n observations, a


deviation from the fitted line as great as that of the
outlier will be obtained by chance.

More details coming soon!

(This approach and other more formal outlier detection methods will be discussed more extensively in Chapter 6.)

87
Chapter 3. Diagnostics and Remedial Measures C. Sotto

3.4.5 Lack-of-Fit Test

This section describes a testing method to check if the


linear regression model is appropriate. That is,

H0 : E(Y) = β0 + β1X
vs.
HA : E(Y) ≠ β0 + β1X.

Essentially, this test answers the question:

Is the linearity assumption appropriate?

This test procedure is only possible if repeated responses


are available at some of the X values.

Figure 3.13: Scatter Plot and Fitted Regression Line for Bank Example.

88
Chapter 3. Diagnostics and Remedial Measures C. Sotto

Assumptions

The responses, Y , for a given X, are assumed to be:


• independent
• normally distributed
• with same variance, σ 2

Notation

1. Let Yij be the ith replicate for the j th value of X.

2. Let c be the number of distinct values of X. So,


j = 1, 2, . . . , c.

3. Let nj be the number of replicates at Xj . So


i = 1, 2, . . . , nj .

Test Procedure

1. Using the replicate responses, estimate the error


variance at each distinct value, Xj , of X.

2. Pool these estimates to form an estimated variance of


“pure error.”

3. Compare this pure error with SSE using an F -test.

89
Chapter 3. Diagnostics and Remedial Measures C. Sotto

Procedure

Step 1:
Yij = μj + εij ,   with i = 1, . . . , nj and j = 1, . . . , c

SSE(F) = Σj Σi (Yij − Ȳj)² = SSPE

dfF = Σj (nj − 1) = n − c

Step 2:
Yij = β0 + β1Xj + εij ,   with i = 1, . . . , nj and j = 1, . . . , c

SSE(R) = Σj Σi (Yij − β̂0 − β̂1Xj)² = SSE

dfR = n − 2

Step 3:

F∗ = [ (SSE − SSPE) / ((n − 2) − (n − c)) ] / [ SSPE / (n − c) ]
   = [ SSLF / (c − 2) ] / [ SSPE / (n − c) ]
   = MSLF / MSPE  ∼  Fc−2, n−c  under H0
90
Chapter 3. Diagnostics and Remedial Measures C. Sotto

Idea of the Test

To perform the test, we decompose (partition) SSE into


the sum of squares due to “pure error” (SSP E) and the
sum of squares due to “lack of fit” (SSLF ).

Yij − Ŷij = (Yij − Ȳj) + (Ȳj − Ŷij)

Squaring both sides and summing over all observations, we get the following sums of squares:

Σj Σi (Yij − Ŷij)² = Σj Σi (Yij − Ȳj)² + Σj Σi (Ȳj − Ŷij)² ,

where the sums are over i and j.

This gives
SSE = SSPE + SSLF .

The second term, SSLF, can actually be computed more easily in the following way.

For each X level, Xj, Ŷij is the same for all replicates. So,

SSLF = Σj Σi (Ȳj − Ŷij)² = Σj nj (Ȳj − Ŷj)² .

91
Chapter 3. Diagnostics and Remedial Measures C. Sotto

Figure 3.14: Illustration of Decomposition of


Error Deviation (Yij − Ŷ j ) for Bank Example.

The actual test statistic is an F-test:

F∗ = MSLF / MSPE = [ SSLF/(c − 2) ] / [ SSPE/(n − c) ]  ∼  Fc−2, n−c  under H0 .

Decision Rule:

if F∗ ≤ F1−α; c−2, n−c  ⇒  conclude H0
if F∗ > F1−α; c−2, n−c  ⇒  conclude HA

92
Chapter 3. Diagnostics and Remedial Measures C. Sotto

ANOVA Table

General ANOVA Table for Testing Lack of Fit of


Simple Linear Regression Function

Note that the expected mean squares are given by:

E(MSLF) = σ² + Σj nj [E(Yij) − (β0 + β1Xj)]² / (c − 2)

E(MSPE) = σ²

If the regression function is not linear, i.e.
E(Yij) ≠ β0 + β1Xj ,
then MSLF is expected to be larger than MSPE, thus leading to a rejection of the null hypothesis of linearity.

93
Chapter 3. Diagnostics and Remedial Measures C. Sotto

Example

We demonstrate this test with an artificial example,


shown in the following table and graph.
 j   i    X      Y     ȲX      Ŷ     Y − ȲX   (Y − ȲX)²
 1   1    50   1530   1470   1487       60       3600
 1   2    50   1410   1470   1487      -60       3600
 2   1   100   1690   1620   1599       70       4900
 2   2   100   1550   1620   1599      -70       4900
 3   1   150   1680   1720   1711      -40       1600
 3   2   150   1760   1720   1711       40       1600
 4   1   200   1850   1810   1823       40       1600
 4   2   200   1770   1810   1823      -40       1600
                                      Total     23400

Figure 3.15: Lack-of-Fit Test.

94
Chapter 3. Diagnostics and Remedial Measures C. Sotto

Least squares regression gives the following equation for


the fitted line:
Ŷ = 1375 + 2.24X .
The residual sum of squares is SSE = 25360 with
n − 2 = 6 degrees of freedom.

Note that in this example, we have n = 8 observations


and c = 4 distinct values of X.

We now construct the F -test for the hypothesis that X


and Y are linearly related.

H0 : E(Y) = β0 + β1X   vs.   HA : E(Y) ≠ β0 + β1X.

The values needed for the test are:

SSPE = 23400

SSLF = SSE − SSPE = 25360 − 23400 = 1960

MSLF = SSLF / (c − 2) = 1960 / 2 = 980

MSPE = SSPE / (n − c) = 23400 / 4 = 5850

F∗ = MSLF / MSPE = 980 / 5850 = 0.168  ∼  F2, 4  under H0
95
Chapter 3. Diagnostics and Remedial Measures C. Sotto

So, if H0 is true, F ∗ has an F -distribution with 2 and 4


degrees of freedom.
• Clearly 0.168 is not large when compared to 1,
so we fail to reject H0. (F0.95;2,4 = 6.94)
• There is no evidence to suggest that the linear
model is inappropriate.
• The straight line relationship seems OK.

SAS Code

data LOFFIT;
input X Y;
cards;
50 1530
50 1410
100 1690
100 1550
150 1680
150 1760
200 1850
200 1770
;
run;

proc reg data=LOFFIT;


model Y = X / LACKFIT;
run;

96
Chapter 3. Diagnostics and Remedial Measures C. Sotto

SAS Output

The SAS System

The REG Procedure


Model: MODEL1
Dependent Variable: Y

Number of Observations Read 8


Number of Observations Used 8

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 125440 125440 29.68 0.0016


Error 6 25360 4226.66667
Lack of Fit 2 1960.00000 980.00000 0.17 0.8514
Pure Error 4 23400 5850.00000
Corrected Total 7 150800

Root MSE 65.01282 R-Square 0.8318


Dependent Mean 1655.00000 Adj R-Sq 0.8038
Coeff Var 3.92827

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 1375.00000 56.30275 24.42 <.0001


X 1 2.24000 0.41118 5.45 0.0016

97
Chapter 3. Diagnostics and Remedial Measures C. Sotto

3.5 Remedial Measures

Overview

• Linearity
– modify the regression model
– use a transformation on X and/or Y
• Homoscedasticity
– use the method of weighted least squares
– use a variance stabilizing transformation
• Independence
– use a time series model
– use generalized least squares
– special transformations
• Normality
– use a Generalized Linear Model (GLM)
– use a transformation on Y
• Outliers
– discard outliers from recording errors (be careful!)
– add interaction(s) between independent variables
– use a robust estimation method
• Model Extension
– include more independent variables in the model
(latter part of this course)
98
Chapter 3. Diagnostics and Remedial Measures C. Sotto

If a linear model is not appropriate, then one can either


• use a more appropriate model, or
• employ some transformation and analyse the
transformed data. But be careful when making
conclusions!
(In the following subsections, we provide an overview of what to do.
Details for some of the more advanced remedial methods can be
found in Chapters 11-15 of Kutner et al.)

3.5.1 Nonlinearity

What do we do if the model is nonlinear?


1. Transform the variables so that a linear relationship
holds for the transformed variables.
2. Use a nonlinear regression model.
E.g.: nonlinear regression model for logistic growth

Y = a / [1 + b exp(−cX)] + ε

Some Considerations:
• Find an appropriate model for the problem via
literature search.
• Develop a relationship that makes sense with
regard to the problem. Check diagnostic plots.
This is an iterative process.
• Non-parametric regression methods can be helpful.
99
Chapter 3. Diagnostics and Remedial Measures C. Sotto

3.5.2 Non-Constancy of Error Variance

What do we do if the error variance is not constant?


1. Some transformations, not all, can be effective in
stabilizing the error variance.

2. Use weighted least squares (WLS) regression.


• Observations with high variance are given less
weight in the estimation procedure, while those
with low variance are given higher weights.
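A minimal sketch of WLS in PROC REG (the weight variable w is assumed to have been constructed beforehand, e.g. as the reciprocal of an estimated variance for each observation):

SAS Code (sketch)

proc reg data=dataset;
  weight w;            /* larger w = more influence; e.g. w = 1/(estimated variance) */
  model Y = X;
run; quit;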

3.5.3 Outliers

What do we do if outlier(s) are found?


1. Check if these are simply incorrectly recorded data.

2. Fit the regression model with and without the outlier.


Do the results change much?
• If not, report the results including the outlier, but
note that it is present.
• If results do change substantially, report both.

3. Use a robust estimation procedure.

100
Chapter 3. Diagnostics and Remedial Measures C. Sotto

3.5.4 Non-Independence

What do we do if there is non-independence in time or


space?
There are a number of techniques that directly
model the dependence structure.
E.g.: autoregressive error models

3.5.5 Non-Normality

What do we do if the errors are non-normal?


1. Transformations
• Non-normality and heteroscedasticity usually occur together.
• Sometimes, a single transformation can address
both.
2. Rank Regression
• does not require normality of error terms
3. Non-Parametric Regression
E.g.: Regression Trees
4. Generalized Linear Models
• useful for discrete types of response variables
• does not require normality of error terms
E.g.: Logistic Regression
101
Chapter 3. Diagnostics and Remedial Measures C. Sotto

3.6 Transformations

The idea is to replace X and/or Y with transformed


values, then fit a simple linear regression. The purpose of
the transformation is to meet the assumptions for linear
regression. Tests of significance and confidence intervals
are performed on the scale of the transformed values.
Figure 3.16: Scatter Plots and Residual Plots for Sales Training Example.

102
Chapter 3. Diagnostics and Remedial Measures C. Sotto

3.6.1 Linearizing Transforms

If the plot of the data indicates some form of nonlinear


relationship between X and Y , try to find a suitable
transformation that will produce a more linear type of
relationship. Examples of such transformations are shown
below.
Figure 3.17: Prototype Regression Patterns
and Possible Transformations for X.

(a) X′ = log(X)   or   X′ = √X
(b) X′ = X²       or   X′ = exp(X)
(c) X′ = 1/X      or   X′ = exp(−X)
103
Chapter 3. Diagnostics and Remedial Measures C. Sotto

If the only problem we observe is nonlinearity, we


transform X only. The reason is that transformations on
the Y variable will induce a problem of non-constant
error variance.

3.6.2 Non-Normality or Unequal Error Variance

Often we observe non-constant error variance along with


nonlinearity and non-normality. For such cases, we
transform Y . Examples of such transformations are
shown below.
Figure 3.18: Prototype Regression Patterns with Unequal Error Variances.


(a) Y′ = √Y   (b) Y′ = log(Y)   (c) Y′ = 1/Y

Note: Whenever 0 is within the range of the variable to


be transformed, it may be necessary to add a constant
shift to the transformation.
reciprocal transformation:    X′ = 1/(X + k)
square root transformation:   X′ = √(X + k)
logarithmic transformation:   Y′ = log(Y + k)
104
Chapter 3. Diagnostics and Remedial Measures C. Sotto

3.6.3 Box-Cox Family of Transformations

Can correct for:


• Skewness
• Non-Normality
• Unequal Variances
• Nonlinearity

An appropriate transform is chosen from the power family

Y′ = Y^λ ,   if λ ≠ 0
Y′ = ln Y ,  if λ = 0 (by definition)

Special Cases

  λ       Transform Y′
  2.0     Y²
  0.5     √Y
  0.0     loge(Y)
 −0.5     1/√Y
 −1.0     1/Y

105
Chapter 3. Diagnostics and Remedial Measures C. Sotto

The estimates for the parameters β0, β1, σ² and λ are obtained by maximizing the following likelihood function.

Likelihood Function:

∏i=1..n fYiλ(Yi^λ; Xi, β0, β1, σ²) = exp{ −Σi=1..n (Yi^λ − β0 − β1Xi)² / (2σ²) } / (2πσ²)^(n/2)

It can be shown that the ML estimate for λ is given by:

λ̂ML = argmin SSE(W | X) ,

with:

W = K1 (Y^λ − 1) ,   if λ ≠ 0
W = K2 ln Y ,        if λ = 0 (by def.)

where:

K2 = ( ∏i=1..n Yi )^(1/n)   and   K1 = 1 / (λ K2^(λ−1)) .

Idea

Try a handful of values for λ within some range (e.g.,


-2.0, -1.8, -1.6,..., 1.6, 1.8, 2.0) and choose the one that
minimizes SSE.
(Since SSE depends on the value of λ, a “standardized”
variable W is used, that is, one that has the same SSE for
all λ. See p. 135 of Kutner et al. for details.)
106
Chapter 3. Diagnostics and Remedial Measures C. Sotto

Procedure

For each value of λ,

1. Compute the transformation W = K1 (Y^λ − 1) (or W = K2 ln Y when λ = 0).
2. Regress W on X.
3. Compute SSE.

Find the approximate value of λ that minimizes the resulting SSE.

Then the model becomes

Yi^λ = β0 + β1Xi + εi .
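As an alternative to coding this grid search by hand (as is done for the example below), PROC TRANSREG can carry out a Box-Cox search over a grid of λ values; a sketch for a generic data set:

SAS Code (sketch)

proc transreg data=dataset;
  model boxcox(Y / lambda=-2 to 2 by 0.1) = identity(X);   /* reports the best lambda over the grid */
run;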

Example: Plasma Levels

n = 25 children

Y : plasma level

X: age

107
Chapter 3. Diagnostics and Remedial Measures C. Sotto

SAS Code
data plasma;
input age plasma logplasma;
cards;
0 13.44 1.1284
0 12.84 1.1086
0 11.91 1.0759
0 20.09 1.3030
0 15.60 1.1931
1.0 10.11 1.0048
1.0 11.38 1.0561
1.0 10.28 1.0120
1.0 8.96 .9523
1.0 8.59 .9340
2.0 9.83 .9926
2.0 9.00 .9542
2.0 8.65 .9370
2.0 7.85 .8949
2.0 8.88 .9484
3.0 7.94 .8998
3.0 6.01 .7789
3.0 5.14 .7110
3.0 6.90 .8388
3.0 6.77 .8306
4.0 4.86 .6866
4.0 5.10 .7076
4.0 5.67 .7536
4.0 5.75 .7597
4.0 6.23 .7945
;
run;

proc transpose data=plasma(keep=plasma) out=transplasma; run;


data constants;
set transplasma;
do lambda = 1, 0.6, 0.5, 0.1, 0, -0.3, -0.5, -0.7, -0.9, -1;
K2 = geomean(of col1-col25);
K1 = 1/(lambda*(K2**(lambda-1)));
output;
end;
keep lambda K1 K2; run;

proc print data=constants;


var lambda K1 K2; run;

/***** SEE OUTPUT *****/

108
Chapter 3. Diagnostics and Remedial Measures C. Sotto

data transformed;
set plasma(rename=(plasma=Y));
W1 = 1.0000*((Y** 1 )-1);
W2 = 3.9260*((Y** 0.6)-1);
W3 = 5.8365*((Y** 0.5)-1);
W4 = 68.7428*((Y** 0.1)-1);
W5 = 8.51632*log(Y);
W6 = -53.9767*((Y**-0.3)-1);
W7 = -49.7059*((Y**-0.5)-1);
W8 = -54.4917*((Y**-0.7)-1);
W9 = -65.0484*((Y**-0.9)-1);
W10= -72.5278*((Y**-1 )-1);
run;

proc reg data=transformed;


model1 : model W1 = age;
model2 : model W2 = age;
model3 : model W3 = age;
model4 : model W4 = age;
model5 : model W5 = age;
model6 : model W6 = age;
model7 : model W7 = age;
model8 : model W8 = age;
model9 : model W9 = age;
model10: model W10= age;
ods output ANOVA=WANOVA;
run;

data SSEboxcox; set WANOVA;


if source ne ’Error’ then delete;
drop FValue ProbF;
run;

data BoxCoxRes;
merge constants(keep=lambda) SSEboxcox;
run;

proc print data=BoxCoxRes;


var lambda--SS;
run;

/***** SEE OUTPUT *****/

109
Chapter 3. Diagnostics and Remedial Measures C. Sotto

SAS Output
Obs lambda K1 K2

1 1.0 1.0000 8.51632


2 0.6 3.9260 8.51632
3 0.5 5.8365 8.51632
4 0.1 68.7428 8.51632
5 0.0 . 8.51632
6 -0.3 -53.9767 8.51632
7 -0.5 -49.7059 8.51632
8 -0.7 -54.4917 8.51632
9 -0.9 -65.0484 8.51632
10 -1.0 -72.5278 8.51632

Obs lambda Model Dependent Source DF SSE

1 1.0 model1 W1 Error 23 77.98306


2 0.6 model2 W2 Error 23 52.76359
3 0.5 model3 W3 Error 23 48.36992
4 0.1 model4 W4 Error 23 36.36943
5 0.0 model5 W5 Error 23 34.51942
6 -0.3 model6 W6 Error 23 31.17628
7 -0.5 model7 W7 Error 23 30.55957
8 -0.7 model8 W8 Error 23 31.09062
9 -0.9 model9 W9 Error 23 32.70442
10 -1.0 model10 W10 Error 23 33.90890

Box-Cox Results for Plasma Levels Example.

110
Chapter 3. Diagnostics and Remedial Measures C. Sotto

3.6.4 Two Special Cases

In this section, we look at two types of response variable


that usually necessitate a special type of transformation:
1. Count Data
2. Proportion Data
Count data are often beset by overdispersion, i.e.
increasing variance with increasing levels of the mean.
Proportion data, on the other hand, are not only bounded
between 0 and 1, but are also less likely to satisfy the
constant variance assumption.
111
Chapter 3. Diagnostics and Remedial Measures C. Sotto

1. Count Data

Data that represent counts are often Poisson distributed.


For such variables, E(Y ) = V ar(Y ), which, of course,
immediately violates the constant variance assumption.

In such cases, the regression

√Y = β0 + β1X

is often a good starting point. A slightly better version of this is the Freeman-Tukey transform for stabilizing variance (see Snedecor and Cochran, 1980, p. 447-453), i.e.

√Y + √(Y + 1) = β0 + β1X .

For these types of transformations, as well as for any other types of transformations of Y, you must be able to interpret your scientific questions in terms of parameters for the transformed variable.

2. Proportion Data

Data denoting proportions (or percentages) of a certain


characteristic arise quite frequently in biological, as well
as non-biological, studies.
For example,
Y = % cover
Y = proportion of diseased cases
112
Chapter 3. Diagnostics and Remedial Measures C. Sotto

These types of variables have at least two characteristics


that cause difficulties in linear regression.
1. The proportion, Y ∈ [0, 1]. The linear regression
equation should somehow account for this.

2. The variance of Y typically depends on E(Y ).


Recall the binomial variance:
V ar(Y ) = np(1 − p), where E(Y ) = np.

There are two ways around this problem:

1. arcsin(√Y) transform
   • quite old-fashioned, but seems to work okay
   • handles 0 and 1 values without additional fuss

2. logit transformation: Y′ = logit(Y) = log[ Y / (1 − Y) ]
   • more modern approach than the arcsin(√Y) transform
   • has some nice statistical properties, useful for modeling
   • the log(·) operation can be problematic for 0 and 1 values; sometimes, a small value is added to or subtracted from zeros/ones
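A small data-step sketch of the count and proportion transformations of this section (the variable names count and prop, and the adjustment constant 0.0001, are illustrative only):

SAS Code (sketch)

data transformed; set dataset;
  ft_y     = sqrt(count) + sqrt(count + 1);      /* Freeman-Tukey transform for counts   */
  arcsin_p = arsin(sqrt(prop));                  /* arcsine square-root transform        */
  p_adj    = min(max(prop, 0.0001), 0.9999);     /* keep the logit finite at 0 and 1     */
  logit_p  = log(p_adj/(1 - p_adj));             /* logit transform                      */
run;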

113
Chapter 3. Diagnostics and Remedial Measures C. Sotto

3.7 Case Study: SENIC Data

Variables

X: Available Facilities and Services (afs)


• percent of 35 potential facilities and services
that are provided by the hospital

Y : Infection Risk (ir)


• average probability of acquiring infection in
the hospital (in percent)

SAS Code

data senic;
input id ls age ir rcr rcxr nb msa reg adc nn afs;
datalines;
1 7.13 55.7 4.1 9.0 39.6 279 2 4 207 241 60.0
2 8.82 58.2 1.6 3.8 51.7 80 2 2 51 52 40.0
...........................................................
112 17.94 56.2 5.9 26.4 91.8 835 1 1 791 407 62.9
113 9.41 59.5 3.1 20.6 91.7 29 2 3 20 22 22.9
;
run;

proc reg data=senic corr simple;


model ir = afs / I P CLM CLI;
output out=senic P=pir R=res
U95=PU L95=PL
U95M=MU L95M=ML;
plot ir*afs / conf95 pred95;
test afs=0;
run;

114
Chapter 3. Diagnostics and Remedial Measures C. Sotto

/***** RESIDUAL PLOT *****/


proc gplot data=senic;
plot res*afs=1 / vref=0;
run;

/***** LACK-OF-FIT TEST *****/


proc reg;
model ir = afs / lackfit;
run;

/***** BREUSCH-PAGAN TEST *****/


proc sort data=senic;
by afs;
run;

data senic2; set senic;


ressq=res*res;
run;

proc reg data=senic2;


model ressq = afs;
run;

/***** NORMALITY PROBABILITY PLOT *****/


proc sort data=senic;
by res;
run;

*** warning: 1.50538 and 113 are not constants!!! ***;


data senic3; set senic;
expres=sqrt(1.50538)*probit((_n_-0.375)/(113+0.25));
run;

proc gplot data=senic3;


plot expres*res=1;
run;

/***** CORRELATION TEST FOR NORMALITY *****/


proc corr data=senic3;
var expres res;
run;

115
Chapter 3. Diagnostics and Remedial Measures C. Sotto

SAS Output

The REG Procedure

Number of Observations Read 113


Number of Observations Used 113

Descriptive Statistics

Uncorrected Standard
Variable Sum Mean SS Variance Deviation

Intercept 113.00000 1.00000 113.00000 0 0


AFS 4877.00000 43.15929 236367 231.06619 15.20086
IR 492.10000 4.35487 2344.41000 1.79803 1.34091

Correlation

Variable AFS IR

AFS 1.0000 0.4126


IR 0.4126 1.0000

116
Chapter 3. Diagnostics and Remedial Measures C. Sotto

The REG Procedure


Model: MODEL1
Dependent Variable: IR

Number of Observations Read 113


Number of Observations Used 113

X’X Inverse, Parameter Estimates, and SSE

Variable Intercept AFS IR

Intercept 0.0808266347 -0.001667708 2.7840161759


AFS -0.001667708 0.0000386408 0.0363965906
IR 2.7840161759 0.0363965906 167.09705947

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 34.28276 34.28276 22.77 <.0001


Error 111 167.09706 1.50538
Corrected Total 112 201.37982

Root MSE 1.22694 R-Square 0.1702


Dependent Mean 4.35487 Adj R-Sq 0.1628
Coeff Var 28.17397

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 2.78402 0.34882 7.98 <.0001


AFS 1 0.03640 0.00763 4.77 <.0001

117
Chapter 3. Diagnostics and Remedial Measures C. Sotto

Output Statistics

Dependent Predicted Std Error


Obs Variable Value Mean Predict 95% CL Mean 95% CL Predict Residual

1 4.1000 4.9678 0.1727 4.6256 5.3100 2.5126 7.4230 -0.8678


2 1.6000 4.2399 0.1179 4.0062 4.4735 1.7974 6.6823 -2.6399
3 2.7000 3.5119 0.2110 3.0938 3.9301 1.0450 5.9789 -0.8119
4 5.6000 4.2399 0.1179 4.0062 4.4735 1.7974 6.6823 1.3601
5 5.7000 4.2399 0.1179 4.0062 4.4735 1.7974 6.6823 1.4601
6 5.1000 4.2399 0.1179 4.0062 4.4735 1.7974 6.6823 0.8601
7 4.6000 4.2399 0.1179 4.0062 4.4735 1.7974 6.6823 0.3601
8 5.4000 4.9678 0.1727 4.6256 5.3100 2.5126 7.4230 0.4322
9 4.3000 4.2399 0.1179 4.0062 4.4735 1.7974 6.6823 0.0601
10 6.3000 4.2399 0.1179 4.0062 4.4735 1.7974 6.6823 2.0601
11 4.9000 5.6957 0.3038 5.0938 6.2977 3.1911 8.2004 -0.7957
..................................................................................
..................................................................................
101 2.6000 3.6175 0.1929 3.2353 3.9997 1.1564 6.0786 -1.0175
102 4.3000 4.3454 0.1154 4.1167 4.5742 1.9034 6.7874 -0.0454
103 2.7000 3.6175 0.1929 3.2353 3.9997 1.1564 6.0786 -0.9175
104 6.6000 5.0734 0.1897 4.6974 5.4493 2.6132 7.5335 1.5266
105 4.5000 4.3454 0.1154 4.1167 4.5742 1.9034 6.7874 0.1546
106 2.9000 3.6175 0.1929 3.2353 3.9997 1.1564 6.0786 -0.7175
107 1.4000 3.6175 0.1929 3.2353 3.9997 1.1564 6.0786 -2.2175
108 2.1000 3.6175 0.1929 3.2353 3.9997 1.1564 6.0786 -1.5175
109 5.7000 5.0734 0.1897 4.6974 5.4493 2.6132 7.5335 0.6266
110 5.8000 3.6175 0.1929 3.2353 3.9997 1.1564 6.0786 2.1825
111 4.4000 5.0734 0.1897 4.6974 5.4493 2.6132 7.5335 -0.6734
112 5.9000 5.0734 0.1897 4.6974 5.4493 2.6132 7.5335 0.8266
113 3.1000 3.6175 0.1929 3.2353 3.9997 1.1564 6.0786 -0.5175

Sum of Residuals 0
Sum of Squared Residuals 167.09706
Predicted Residual SS (PRESS) 173.54345

Test 1 Results for Dependent Variable IR

Mean
Source DF Square F Value Pr > F

Numerator 1 34.28276 22.77 <.0001


Denominator 111 1.50538

118
Chapter 3. Diagnostics and Remedial Measures C. Sotto

/***** LACK-OF-FIT-TEST *****/

The REG Procedure


Model: MODEL1
Dependent Variable: IR

Number of Observations Read 113


Number of Observations Used 113

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 34.28276 34.28276 22.77 <.0001


Error 111 167.09706 1.50538
Lack of Fit 24 43.97440 1.83227 1.29 0.1925
Pure Error 87 123.12266 1.41520
Corrected Total 112 201.37982

Root MSE 1.22694 R-Square 0.1702


Dependent Mean 4.35487 Adj R-Sq 0.1628
Coeff Var 28.17397

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 2.78402 0.34882 7.98 <.0001


AFS 1 0.03640 0.00763 4.77 <.0001

119
Chapter 3. Diagnostics and Remedial Measures C. Sotto

/***** BREUSCH-PAGAN TEST *****/

The REG Procedure


Model: MODEL1
Dependent Variable: ressq

Number of Observations Read 113


Number of Observations Used 113

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 4.21231 4.21231 0.86 0.3564


Error 111 545.21169 4.91182
Corrected Total 112 549.42400

Root MSE 2.21626 R-Square 0.0077


Dependent Mean 1.47874 Adj R-Sq -0.0013
Coeff Var 149.87553

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 2.02936 0.63008 3.22 0.0017


AFS 1 -0.01276 0.01378 -0.93 0.3564

120
Chapter 3. Diagnostics and Remedial Measures C. Sotto

/***** CORRELATION TEST FOR NORMALITY *****/

The CORR Procedure

2 Variables: expres res

Simple Statistics

Variable N Mean Std Dev Sum

expres 113 0 1.21492 0


res 113 0 1.22145 0

Simple Statistics

Variable Minimum Maximum Label

expres -3.11828 3.11828


res -2.74734 3.87313 Residual

Pearson Correlation Coefficients, N = 113


Prob > |r| under H0: Rho=0

expres res

expres 1.00000 0.99427


<.0001

res 0.99427 1.00000


Residual <.0001

121
Chapter 3. Diagnostics and Remedial Measures C. Sotto

SAS Graphs

Fitted Regression Line

Prediction and Confidence Intervals

122
Chapter 3. Diagnostics and Remedial Measures C. Sotto

Plot of Semistudentized Residuals

Normal Probability Plot

123
Chapter 4

Simultaneous Inference
and Other Topics

Chapter 4 in Kutner et al.

• Joint Estimation of Regression Parameters

• Regression Through the Origin

• Effect of Measurement Errors

124
Chapter 4. Simultaneous Inference and Other Topics C. Sotto

4.1 Joint Estimation of β0 and β1

Section 4.1 in Kutner et al.

In Chapter 2, we looked at inferences (e.g., tests of


hypothesis, confidence intervals) for β0 or for β1.

Sometimes, we want to make joint inferences about both


parameters, β0 and β1. This type of inference is naturally
different from individual inferences. Why?

Consider a 95% CI for β0, as well as a 95% CI for β1.


The probability of the first CI being correct for β0 is 0.95,
and the same is true for the second CI. But...
What is the probability of both CIs being correct?
Assuming that the inferences for β0 and for β1 are
independent, the probability that both CIs are correct
would be
(0.95)2 = 0.9025,
which is already different from the specified confidence
level of 95% for each of the CIs. This value may decrease
even further as more parameters are considered.

Note also that inferences are not independent, since they


are based on the same data set, and this makes
computation of the joint probability that both CIs are
correct even more difficult.
125
Chapter 4. Simultaneous Inference and Other Topics C. Sotto

4.1.1 Statement vs. Family Confidence

A family of estimates (or tests) is an entire


series/set of estimates (or tests).

A statement confidence coefficient represents the


proportion of correct estimates that are obtained when
repeated samples are selected and the specified
confidence interval is calculated for each sample.

A family confidence coefficient indicates the


proportion of families of estimates that are entirely
correct when repeated samples are selected and the
specified confidence intervals for the entire family are
calculated for each sample. In other words, it is
the probability, in advance of sampling, that the
entire family of statements will be correct.

Example: Consider the family of estimates for (β0, β1),


with a family confidence coefficient of 0.95.

• If repeated samples are selected and interval estimates


for both β0 and β1 are calculated for each sample by
specified procedures, then 95% of the samples would
lead to a family of estimates where both confidence
intervals are correct.
• For 5% of the samples, either one or both of the
interval estimates would be incorrect.
126
Chapter 4. Simultaneous Inference and Other Topics C. Sotto

4.1.2 Bonferroni Joint Confidence Intervals

To obtain Bonferroni joint confidence intervals for β0 and β1, we start with ordinary confidence intervals for each of these parameters, each with statement confidence coefficient (1 − α), i.e.

β̂0 ∓ t1−α/2, n−2 · s(β̂0)
β̂1 ∓ t1−α/2, n−2 · s(β̂1)
Suppose we let

A1 = the event that the first CI does not cover β0


A2 = the event that the second CI does not cover β1

We know that P (A1) = α and P (A2) = α.

The probability that one or both of the intervals are incorrect can be expressed as

P(A1 ∪ A2) = P(A1) + P(A2) − P(A1 ∩ A2) .

We want the probability that both of the intervals are correct, i.e.

1 − P(A1 ∪ A2) = 1 − P(A1) − P(A2) + P(A1 ∩ A2)
P[(A1 ∪ A2)ᶜ]  = 1 − P(A1) − P(A2) + P(A1 ∩ A2)
P(A1ᶜ ∩ A2ᶜ)   = 1 − P(A1) − P(A2) + P(A1 ∩ A2) ,   and since P(A1 ∩ A2) ≥ 0,
P(A1ᶜ ∩ A2ᶜ)   ≥ 1 − P(A1) − P(A2)
127
Chapter 4. Simultaneous Inference and Other Topics C. Sotto

Bonferroni inequality:
P(A1ᶜ ∩ A2ᶜ) ≥ 1 − P(A1) − P(A2)

Since P(A1) = P(A2) = α, we obtain:

P(A1ᶜ ∩ A2ᶜ) ≥ 1 − 2P(A1) ,   or   P(A1ᶜ ∩ A2ᶜ) ≥ 1 − 2α .

So if β0 and β1 are separately estimated with 95%


confidence intervals, then the Bonferroni inequality
guarantees us a family confidence coefficient of at least
90% that both intervals based on the same sample are
correct.

To obtain Bonferroni joint confidence intervals for β0 and β1, we estimate, separately, β0 and β1, by means of ordinary confidence intervals, each with statement confidence coefficient (1 − α/2). Using the Bonferroni inequality, this will yield a family confidence coefficient of (1 − α).

Thus, the (1 − α) family confidence limits for β0 and β1 for regression using the Bonferroni procedure are:

β̂0 ∓ t1−α/4, n−2 · s(β̂0)
β̂1 ∓ t1−α/4, n−2 · s(β̂1)

128
Chapter 4. Simultaneous Inference and Other Topics C. Sotto

Remarks:

1. The approach provides a lower bound for the family


confidence coefficient. That is, the probability, in
advance of sampling, that the entire family of CIs is
correct is at least (1 − α).
2. For g simultaneous CIs with a family confidence coefficient of (1 − α), each CI should be constructed with a statement confidence coefficient of (1 − α/g).
3. The technique is best used when the number of
simultaneous estimates is not too large, since for a
fixed α, very large values of g would yield interval
estimates that may be too wide to be useful.
4. The statement confidence coefficients need not all be
equal, e.g., when some parameters are more important
and thus stricter levels are preferred for them.

SAS Code

/***** TO GET 90% JOINT CIS *****/

proc reg data=toluca;


model Y=X / clb;
run; quit;

/***** TO GET 95% JOINT CIS *****/

proc reg data=toluca alpha=0.025;


model Y=X / clb;
run; quit;

129
Chapter 4. Simultaneous Inference and Other Topics C. Sotto

SAS Output

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 62.36586 26.17743 2.38 0.0259


X 1 3.57020 0.34697 10.29 <.0001

Parameter Estimates

Variable DF 95% Confidence Limits

Intercept 1 8.21371 116.51801


X 1 2.85244 4.28797

**********************************************************************

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 62.36586 26.17743 2.38 0.0259


X 1 3.57020 0.34697 10.29 <.0001

Parameter Estimates

Variable DF 97.5% Confidence Limits

Intercept 1 -0.40436 125.13607


X 1 2.73821 4.40220

130
Chapter 4. Simultaneous Inference and Other Topics C. Sotto

4.2 Regression Through the Origin

Section 4.4 in Kutner et al.

Sometimes, the nature of the variables for regression


naturally prescribe that whenever X = 0, then Y must
also be zero. Thus, the regression function passes
through the origin at (0,0).

Figure 4.1: Scatter Plot and Regression Through Origin


for Warehouse Example.

Example:

X = number of shoes in stock at a store


Y = volume of shoe sales in that store

131
Chapter 4. Simultaneous Inference and Other Topics C. Sotto

The general idea is that we fit the model

Yi = β1Xi + εi ,

where β1 is a parameter, Xi are known constants and εi


are independent N (0, σ 2) (just as they are in the typical
simple linear regression model).

This means that we force β0 to be 0.

The true regression function is

E(Yi) = β1Xi .

We minimize the SSE in the usual way. This yields the following parameter estimate for β1:

β̂1 = Σi Xi Yi / Σi Xi²

Everything follows as with the usual least squares regression line, except that we now have

MSE = Σi ei² / (n − 1) ,

with (n − 1) degrees of freedom for estimating variance.
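In SAS, regression through the origin is obtained with the NOINT option of PROC REG; a minimal sketch (note the caveats in the remarks that follow):

SAS Code (sketch)

proc reg data=dataset;
  model Y = X / noint;     /* suppresses the intercept, i.e. forces beta_0 = 0 */
run; quit;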

132
Chapter 4. Simultaneous Inference and Other Topics C. Sotto

Remarks:

1. We no longer have the property that

   Σi ei = 0 .

   Thus, residuals in a residual plot will usually not be balanced around the zero line.

2. SSE = Σ ei² may exceed SSTO = Σ (Yi − Ȳ)², and thus, R² can be negative!

   Hence, R² is not meaningful for regression through the origin.

3. The variability of the regression line and for prediction


of new observations increases with distance away from
X = 0. This is a consequence of our assumption that
E(Y ) = 0 when X = 0.

We have to be very careful about using this model even


when we KNOW that the line passes through the origin.
The reason is that near the origin, the relationship may
be nonlinear, and thus, might not be well described by a
straight line.

133
Chapter 4. Simultaneous Inference and Other Topics C. Sotto

4.3 Effects of Measurement Errors

Section 4.5 in Kutner et al.

4.3.1 Errors in the Response Variable

If there are random measurement errors in the response


variable, Y , it is not a big deal as long as these errors are
uncorrelated and unbiased.

Measurement errors in Y that are of a random nature,


i.e. are uncorrelated and unbiased, are absorbed by the
model error term ε.
• ε reflects the combined effects of a large number
of factors not considered in the model

• one such factor would be the random variation


due to inaccuracy in the process of measuring Y

Because of the “lumping” of all types of errors in Y into


the error term ε, it is not possible to distinguish between
measurement error for Y and errors arising from
“unexplained variability.”

134
Chapter 4. Simultaneous Inference and Other Topics C. Sotto

4.3.2 Errors in the Predictor Variable

Measurement error in the predictor variable, X, can


cause big problems.

If X is measured with substantial error,

• β̂ 1 will be biased (downwards) for β1

• β̂ 1 will not be consistent for β1, i.e.


as n → ∞, β̂ 1 does not converge to β1

Note:

1. The magnitude of the problem depends on the


magnitude of the measurement error and the spread
of the Xs, i.e. on SSXX .

2. The problem is worse if SSY Y is relatively large


compared to SSXX .

135
Chapter 5

Multiple Linear Regression

Chapters 6 & 7 in Kutner et al.

• Multiple Regression Models

• Matrix Formulation

• Parameter Estimation

• Inference and Predictions

• Extra Sum of Squares

• Multicollinearity

136
Chapter 5. Multiple Linear Regression C. Sotto

5.1 Multiple Regression Models

Oftentimes, a regression model with a single predictor


variable is not able to provide an adequate description of
the response variable. This occurs when several key
variables affect the response variable in important and
predictive ways.

In such cases, Multiple Regression Models may be


more useful than Simple (Linear) Regression Models.

First-Order Model with Two Predictors

Suppose we have two predictor variables X1 and X2 and


assume the following model:
Yi = β0 + β1Xi1 + β2Xi2 + εi .

• Yi is response value of case i (or subject i)


• Xi1 and Xi2 are the values of the two predictor
variables for case i (or subject i)
• β0, β1 and β2 are the parameters of the model
• the coefficients β1 and β2 are called partial regression
coefficients
The two predictor variables in the model are said to have
additive effects or not to interact.
137
Chapter 5. Multiple Linear Regression C. Sotto

If we assume E(εi) = 0, then the regression function is:


E(Y ) = β0 + β1X1 + β2X2 ,
which is a plane.
(Recall that for SLR, the regression function is a line.)

It further follows that


εi = Yi − E(Yi) .

The regression function, usually called the regression


surface or response surface, can be other types of
surfaces, but for a first-order regression model with two
predictors, it is a plane.

Figure 5.1: Response Function – A Plane – for Sales Promotion Example.

138
Chapter 5. Multiple Linear Regression C. Sotto

Meaning of Regression Coefficients

• β0 represents the mean response when X1 = 0 and


X2 = 0, provided the scope of the model (as well as
the data) includes the point (X1 = 0, X2 = 0)

• β1 represents the change in the mean response per


unit increase in X1, when X2 is held constant

• β2 represents the change in the mean response per


unit increase in X2, when X1 is held constant

Example:

β0 = 10 , β1 = 2 , β2 = 5   ⟹   E(Y) = 10 + 2X1 + 5X2

If X2 is held constant at X2 = 2, then

E(Y) = 10 + 2X1 + 5(2) = 20 + 2X1 .
When X1 is increased by 1 unit, the mean response,
E(Y ), increases by β1 = 2 units, keeping X2 constant.

Similarly, when X2 is increased by 1 unit, the mean


response, E(Y ), increases by β2 = 5 units, keeping X1
constant.

What is the change in the mean response, E(Y ), when


X1 and X2 are each increased by 1 unit?
139
Chapter 5. Multiple Linear Regression C. Sotto

5.2 General Linear Regression Model

Let X1, X2, . . . , Xp−1, for p > 1, denote a set of (p − 1)


predictor variables.

5.2.1 General Form

Yi = β0 + β1Xi1 + β2Xi2 + . . . + βp−1Xi,p−1 + εi , (5.1)

εi ∼ N (0, σ 2), for i = 1, . . . , n

β0, β1, . . . , βp−1 : parameters


Xi1, Xi2, . . . , Xi,p−1 : known constants

The response function is a hyperplane.

Each parameter βk indicates the change in the mean


response, E(Y ), with a unit increase in the predictor
variable Xk , when all other predictor variables in the
regression model are held constant.

Note: If we let Xi0 ≡ 1, then we can write the model as:

Yi = β0Xi0 + β1Xi1 + β2Xi2 + . . . + βp−1Xi,p−1 + εi ,

or equivalently,

Yi = Σk=0..p−1 βk Xik + εi .
140
Chapter 5. Multiple Linear Regression C. Sotto

The term linear in General Linear Regression Model


refers to the fact that model (5.1) is linear in the
parameters; it does NOT refer to the shape of the
response surface.

A regression model is linear in the parameters if it can be


written in the form:
Yi = ci0β0 + ci1β1 + ci2β2 + · · · + ci,p−1βp−1 + εi ,
where the terms ci0, ci1, . . . , ci,p−1 are coefficients
involving the predictor variables.

An example of a nonlinear regression model is the


following:
Yi = β0 exp(β1Xi) + εi .

5.2.2 Specific Forms

Model (5.1) encompasses many different types of models


for a variety of situations. We consider a few of these
now.

5.2.2.1 Distinct Predictors

If X1, X2, . . . , Xp−1 represent different predictor


variables, then model (5.1) is first-order model with no
interaction effects between the predictors.

141
Chapter 5. Multiple Linear Regression C. Sotto

5.2.2.2 Qualitative Predictors

• qualitative predictor variables, such as gender or race,


can be included in model (5.1) by incorporating
so-called indicator variables or dummy variables that
take on the values 0 or 1 to identify the categories of
the qualitative predictor variable
• for a categorical variable with m levels, you need to create (m − 1) dummy variables
(Read Section 8.3 of Kutner et al.)

Example:
Y  : length of hospital stay
X1 : age of patient
X2 : gender of patient

Model: Yi = β0 + β1Xi1 + β2Xi2 + εi ,

Xi1 : age of patient i
Xi2 : gender of patient i  =  1 if patient i is female, 0 if patient i is male

Yi = β0 + β1Xi1 + εi ,        for male patients
Yi = β0 + β1Xi1 + β2 + εi ,   for female patients

• β2 is the difference between the mean response of female and male patients having the same age
142
Chapter 5. Multiple Linear Regression C. Sotto

Suppose we want to add another predictor,


representing disability status, which has 3 levels:
(a) not disabled
(b) partially disabled
(c) fully disabled
Let

X3 = 1 if not disabled, 0 otherwise
X4 = 1 if partially disabled, 0 otherwise

  Disability Status     X3   X4
  not disabled           1    0
  partially disabled     0    1
  fully disabled         0    0

Model:
Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + β4Xi4 + εi

or,

Yi = β0 + β1Xi1 + β2Xi2 + εi ,        for fully disabled
Yi = β0 + β1Xi1 + β2Xi2 + β4 + εi ,   for partially disabled
Yi = β0 + β1Xi1 + β2Xi2 + β3 + εi ,   for not disabled

• β3 is the difference between the mean response of


not disabled and fully disabled patients having the
same age and gender
• β4 is the difference between the mean response of
partially disabled and fully disabled patients having
the same age and gender
143
Chapter 5. Multiple Linear Regression C. Sotto

SAS Code

data sample; data rehab;


input Y X Group; input Y X1 $;
cards; label
106 43 1 Y=’Days for Physical Therapy’
106 41 3 X=’Physical Fitness Status’;
97 47 1 cards;
113 46 2 29.0 Below_Av
96 45 3 42.0 Below_Av
119 41 2 38.0 Below_Av
92 47 3 43.0 Below_Av
112 41 1 40.0 Below_Av
92 48 2 30.0 Below_Av
102 48 2 35.0 Average
107 42 3 31.0 Average
107 47 1 31.0 Average
102 43 1 29.0 Average
115 44 2 35.0 Average
101 42 3 33.0 Average
; 26.0 Above_Av
run; 32.0 Above_Av
21.0 Above_Av
data newsample; 20.0 Above_Av
set sample; 23.0 Above_Av
D1 = (group=1); 22.0 Above_Av
D2 = (group=2); ;
run; run;

proc print; data rehab2;


run; set rehab;
X1a = (X1=’Below_Av’);
X1b = (X1=’Average’);
run;

proc print;
run;

144
Chapter 5. Multiple Linear Regression C. Sotto

SAS Output

NEWSAMPLE dataset

Obs Y X Group D1 D2

1 106 43 1 1 0
2 106 41 3 0 0
3 97 47 1 1 0
4 113 46 2 0 1
5 96 45 3 0 0
6 119 41 2 0 1
7 92 47 3 0 0
8 112 41 1 1 0
9 92 48 2 0 1
10 102 48 2 0 1
11 107 42 3 0 0
12 107 47 1 1 0
13 102 43 1 1 0
14 115 44 2 0 1
15 101 42 3 0 0

REHAB2 dataset

Obs Y X1 X1a X1b

1 29 Below_Av 1 0
2 42 Below_Av 1 0
3 38 Below_Av 1 0
4 43 Below_Av 1 0
5 40 Below_Av 1 0
6 30 Below_Av 1 0
7 35 Average 0 1
8 31 Average 0 1
9 31 Average 0 1
10 29 Average 0 1
11 35 Average 0 1
12 33 Average 0 1
13 26 Above_Av 0 0
14 32 Above_Av 0 0
15 21 Above_Av 0 0
16 20 Above_Av 0 0
17 23 Above_Av 0 0
18 22 Above_Av 0 0

145
Chapter 5. Multiple Linear Regression C. Sotto

5.2.2.3 Polynomial Regression

• polynomial regression models are general linear


regression models containing squared and higher-order
terms of the predictor variable(s)
• such models have a curvilinear response function
(Read Section 8.1 of Kutner et al.)

Examples:
Yi = β0 + β1Xi + β2Xi² + εi
Yi = β0 + β1Xi + β2Xi² + β3Xi³ + εi

5.2.2.4 Transformed Variables

• models with transformed variables involve complex,


curvilinear response functions, but can still be
considered as general linear regression models if they
can be expressed in the form of (5.1)
Examples:

log Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + εi
   ⟹ Let Yi′ = log Yi .

Yi = 1 / (β0 + β1Xi1 + β2Xi2 + εi)
   ⟹ Let Yi′ = 1/Yi .
146
Chapter 5. Multiple Linear Regression C. Sotto

5.2.2.5 Interaction Effects

• interaction effects between predictor variables can


also be included as terms in model (5.1), particularly
when the effects of the predictors are nonadditive
• interpretation of the regression coefficients for models
with interaction effects differs from the additive case!!
(Read Section 8.2 of Kutner et al.)

Example:
   Yi = β0 + β1Xi1 + β2Xi2 + β3 Xi1Xi2 + εi ,    with Xi3 = Xi1Xi2

When Xi1 is increased by 1 unit, the change in the mean response,
E(Y ), is (β1 + β3X2).

When Xi2 is increased by 1 unit, the change in the mean response,
E(Y ), is (β2 + β3X1).
The change in the mean response for a unit increase
in one predictor variable is dependent on the level of
the other predictor.
The effect, on the mean response, of increasing X1 by
1 unit varies according to the level of X2 and vice
versa.
Thus, the two predictors interact and do not have
additive effects.
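To see where (β1 + β3X2) comes from, compare the mean response at
(X1 + 1, X2) with that at (X1, X2); this one-step check follows directly
from the model:

   E(Y | X1 + 1, X2) − E(Y | X1, X2)
      = [β0 + β1(X1 + 1) + β2X2 + β3(X1 + 1)X2] − [β0 + β1X1 + β2X2 + β3X1X2]
      = β1 + β3X2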

147
Chapter 5. Multiple Linear Regression C. Sotto

5.2.2.6 Combination of Cases

• several specific cases can be combined and still yield


a general linear regression model of the form (5.1)
• for instance, squares (or higher-ordered powers) of
predictor variables can be further combined with
interaction effects in the model
(Read Section 8.5 & 8.6 of Kutner et al.)

Example:
   Yi = β0 + β1Xi1 + β2Xi2 + β3Xi2² + β4Xi1Xi2 + εi

Is it a linear model?

5.3 Matrix Formulation

Consider the following multiple regression model:

Yi = β0 + β1Xi1 + β2Xi2 + . . . + βp−1Xi,p−1 + εi,


εi ∼ N (0, σ 2), for i = 1, . . . , n
This model can be written using vectors and matrices as:

   ⎡ Y1 ⎤     ⎡ 1  X11  X12  ···  X1,p−1 ⎤     ⎡  β0  ⎤     ⎡ ε1 ⎤
   ⎢ Y2 ⎥     ⎢ 1  X21  X22  ···  X2,p−1 ⎥     ⎢  β1  ⎥     ⎢ ε2 ⎥
   ⎢  ⋮ ⎥  =  ⎢ ⋮    ⋮    ⋮    ⋱     ⋮   ⎥  ×  ⎢   ⋮  ⎥  +  ⎢  ⋮ ⎥
   ⎣ Yn ⎦     ⎣ 1  Xn1  Xn2  ···  Xn,p−1 ⎦     ⎣ βp−1 ⎦     ⎣ εn ⎦

148
Chapter 5. Multiple Linear Regression C. Sotto

General Linear Regression Model


Y = X · β + ε
(n × 1) (n × p) (p × 1) (n × 1)

Y : vector of response values


β : vector of unknown regression parameters
X : matrix of known constants
ε : vector of independent and identically
distributed error terms εi
εi ∼ N (0, σ 2), for all i = 1, 2, . . . , n
E(ε) = 0 and σ 2(ε) = σ 2 · I
(Read Chapter 5 of Kutner et al. to review matrix algebra.)

5.3.1 Estimation of Regression Coefficients

Least Squares Criterion:


         n
   Q  =  Σ  (Yi − β0 − β1Xi1 − . . . − βp−1Xi,p−1)²
        i=1
Find the estimators β̂ 0, β̂ 1, . . . , β̂ p−1 that minimize Q.

X ′X β̂ = X ′Y ⇒ β̂ = (X ′X)−1X ′Y

Maximum Likelihood (ML) estimation leads to the same estimators.
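As an illustration only, the normal-equations solution can be computed
directly in PROC IML; a minimal sketch, assuming a data set called MYDATA
with response Y and predictors X1 and X2 (hypothetical names). This is, of
course, exactly what PROC REG computes internally:

   proc iml;
      use mydata;                      /* hypothetical data set */
      read all var {X1 X2} into X0;
      read all var {Y} into Y;
      n = nrow(X0);
      X = j(n, 1, 1) || X0;            /* design matrix with intercept column */
      beta_hat = solve(X`*X, X`*Y);    /* solves (X'X) b = X'Y */
      print beta_hat;
   quit;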
149
Chapter 5. Multiple Linear Regression C. Sotto

5.3.2 Fitted Values and Residuals

Fitted Values: Ŷ i = β̂ 0 + β̂ 1Xi1 + . . . + β̂ p−1Xi,p−1


Residuals: ei = Yi − Ŷ i

In matrix formulation:

   Ŷ = X β̂ = X(X′X)⁻¹X′ Y = HY ,        where  H = X(X′X)⁻¹X′

   e = Y − Ŷ = Y − X β̂
     = Y − X(X′X)⁻¹X′ Y
     = Y − HY = (I − H)Y

   E(e) = E{(I − H)Y }
        = (I − H) E(Y )
        = [I − X(X′X)⁻¹X′] E(Y )
        = E(Y ) − X(X′X)⁻¹X′ E(Y )
        = E(Y ) − X(X′X)⁻¹X′ Xβ
        = Xβ − X (X′X)⁻¹X′X β
        = Xβ − X I β
        = Xβ − Xβ
   E(e) = 0
150
Chapter 5. Multiple Linear Regression C. Sotto

   Var(e) = Var{(I − H)Y }
          = (I − H) Var(Y ) (I − H)′
          = (I − H) Var(ε) (I − H)′
          = (I − H) σ²I (I − H)′
          = σ² (I − H) I (I − H)′
          = σ² (I − H)(I − H)′
   Var(e) = σ² (I − H)

   E(e) = 0    and    Var(e) ≡ σ²(e) = σ²(I − H)

   v̂ar(e) ≡ s²(e) = MSE (I − H)

5.3.3 ANOVA Results

Source of Variation      SS       df        MS
Regression               SSR      p − 1     MSR = SSR/(p − 1)
Error                    SSE      n − p     MSE = SSE/(n − p)
Total                    SSTO     n − 1

151
Chapter 5. Multiple Linear Regression C. Sotto

The sums of squares can be written as quadratic forms (e.g., y′Ay):

   SSTO = Y′Y − (1/n) Y′J Y = Y′ [I − (1/n)J] Y

   SSR  = β̂′X′Y − (1/n) Y′J Y = Y′ [H − (1/n)J] Y

   SSE  = e′e = (Y − Xβ̂)′(Y − Xβ̂) = Y′Y − β̂′X′Y = Y′(I − H)Y

The expected values of the mean squares help in formulating tests of
hypotheses:
   E(MSE) = σ²
   E(MSR) ⇒ depends on the form of the model

Consider the following model where p = 3:

   Yi = β0 + β1Xi1 + β2Xi2 + εi .

   E(MSR) = σ² + G/2
where:
   G = β1² Σᵢ (Xi1 − X̄1)² + β2² Σᵢ (Xi2 − X̄2)²
         + 2β1β2 Σᵢ (Xi1 − X̄1)(Xi2 − X̄2) ,    with sums over i = 1, . . . , n

Note: β1 = β2 = 0 ⇒ E(M SR) = σ 2


152
Chapter 5. Multiple Linear Regression C. Sotto

5.3.4 F -Test for Regression Relation

Is there a regression relation between Y and the set of


predictors X1, X2, . . . , Xp−1?

Hypotheses:

H0 : β1 = β2 = . . . = βp−1 = 0
vs.
HA : βk ≠ 0, for some k

Test Statistic:
   F* = MSR / MSE    ~    F(p−1 , n−p)   under H0

Decision Rule:

F ∗ ≤ F1−α ; p−1 , n−p ⇒ conclude H0


F ∗ > F1−α ; p−1 , n−p ⇒ conclude HA

153
Chapter 5. Multiple Linear Regression C. Sotto

5.4 Coefficients

5.4.1 Multiple Determination

   R² = SSR/SSTO = 1 − SSE/SSTO     ∈ [0, 1]
• measures the proportionate reduction of the total
variation in Y associated with the use of the set of
predictors X1, X2, . . . , Xp−1
• a large value of R2 does not necessarily imply that
the model is a useful one
• adding more predictors to the model increases R2

Adjusted Coefficient of Multiple Determination

   Ra² = 1 − [SSE/(n − p)] / [SSTO/(n − 1)] = 1 − [(n − 1)/(n − p)] · SSE/SSTO

• adjusts R2 for the number of predictors in the model


• can actually become smaller when another predictor is
introduced into the model

5.4.2 Multiple Correlation



   R = √R²
154
Chapter 5. Multiple Linear Regression C. Sotto

5.5 Inference About the Regression


Parameters

   E(β̂) = β = (β0, β1, β2, · · · , βp−1)′

σ 2(β̂) = σ 2(X ′X)−1

σ̂ 2(β̂) = M SE(X ′X)−1

Note that:
            ⎡ σ²(β̂0)         σ(β̂0, β̂1)      ···   σ(β̂0, β̂p−1)  ⎤
            ⎢ σ(β̂1, β̂0)      σ²(β̂1)         ···   σ(β̂1, β̂p−1)  ⎥
   σ²(β̂) =  ⎢     ⋮               ⋮          ⋱          ⋮       ⎥
            ⎣ σ(β̂p−1, β̂0)    σ(β̂p−1, β̂1)    ···   σ²(β̂p−1)     ⎦

(An estimate of σ 2(β̂), i.e. σ̂ 2(β̂), can be obtained using


the covb option in the MODEL statement. See pp. 29-30.)

5.5.1 Interval Estimation of βk

   (β̂k − βk) / s(β̂k)    ~    t(n−p) ,    for k = 0, 1, . . . , p − 1

Hence, a (1 − α)100% confidence interval for βk is:

   β̂k ∓ t(1−α/2 ; n−p) s(β̂k)


155
Chapter 5. Multiple Linear Regression C. Sotto

5.5.2 Test for βk

We want to test:
H0 : βk = 0 versus HA : βk 6= 0 .

Test Statistic:

   t* = β̂k / s(β̂k)    ~    t(n−p)   under H0

Decision Rule:

   |t*| ≤ t(1−α/2 ; n−p)  ⇒  conclude H0
   |t*| > t(1−α/2 ; n−p)  ⇒  conclude HA

5.5.3 Joint Inferences

Bonferroni Joint Confidence Intervals:

For g (≤ p) parameters to be estimated simultaneously,

   β̂k ∓ B s(β̂k) ,   for k = 1, 2, . . . , g ,
where:
   B = t(1 − α/(2g) ; n−p) .

156
Chapter 5. Multiple Linear Regression C. Sotto

5.6 Inference About Mean Response

5.6.1 Interval Estimation of E(Yh )

Let X h be the vector of known constants corresponding


to subject h, i.e.

   Xh = (1, Xh1, Xh2, · · · , Xh,p−1)′       (p × 1)

   E(Yh) = X′h β
   Ŷh = X′h β̂

   σ²(Ŷh) = X′h σ²(β̂) Xh
   s²(Ŷh) = MSE [X′h (X′X)⁻¹ Xh] = X′h s²(β̂) Xh

Hence, the (1 − α)100% confidence limits are:

   Ŷh ∓ t(1−α/2 ; n−p) s(Ŷh) .

157
Chapter 5. Multiple Linear Regression C. Sotto

5.6.2 Confidence Region for Regression Surface

Working-Hotelling Band

Boundary points of the confidence region at Xh are obtained from

   Ŷh ∓ W s(Ŷh) ,
where:
   W = √( p F(1−α ; p , n−p) ) .

5.6.3 Simultaneous Confidence Intervals for


Several Mean Responses

To estimate, simultaneously, several E(Yh) corresponding


to different Xh:

1. Use W-H Bands


2. Use Bonferroni (g intervals)
   Ŷh ∓ B s(Ŷh) ,      where  B = t(1 − α/(2g) ; n−p) .

158
Chapter 5. Multiple Linear Regression C. Sotto

5.7 Predictions

5.7.1 New Observation, Yh(new)

Let Yh(new) correspond to X h.


   Ŷh ∓ t(1−α/2 ; n−p) s(Ŷh(new)) ,
where:
   s²(Ŷh(new)) = MSE + s²(Ŷh) .

5.7.2 Mean of m New Observations

Consider m new observations at the same level X h. Let


Ȳh(new) denote the mean response of these new cases.
 
   Ŷh ∓ t(1−α/2 ; n−p) s(Ȳh(new)) ,
where:
   s²(Ȳh(new)) = MSE/m + s²(Ŷh) .

5.7.3 g New Observations

Scheffé
   Ŷh ∓ S s(Ŷh(new)) ,    for each of the g new levels Xh ,
where:
   S = √( g F(1−α ; g , n−p) )
   s²(Ŷh(new)) = MSE + s²(Ŷh) .

159
Chapter 5. Multiple Linear Regression C. Sotto

5.8 Diagnostics and Remedial Measures

5.8.1 Diagnostic Plots

5.8.1.1 Scatter Plot Matrix

• visual representation of the correlation matrix for the


entire set of variables
• provides information about:
⇒ the relationship between the response and each
of the predictor variables
⇒ the bivariate relationships among the predictors
⇒ any gaps in the data points
⇒ outlying data points

5.8.1.2 Residual Plots

• residuals can be plotted against Ŷ , against each Xk ,


against time or against normal quantiles (e.g., NPP)
• provides information about:
⇒ constancy of variance
⇒ outliers
⇒ possible correlations between error terms
⇒ adequacy of the regression function of each Xk
⇒ normality (from NPP)
160
Chapter 5. Multiple Linear Regression C. Sotto

• residuals can also be plotted against important


predictor variables that were omitted from the model
⇒ to see if the omitted variables have substantial
additional effects on the response
• residuals can also be plotted against interaction terms
⇒ for potential interaction effects not included in the
regression model
Figure 5.2: Scatter Plot Matrix for Y and Two Predictor Variables, X1 and X2 .


161
Chapter 5. Multiple Linear Regression C. Sotto

5.8.2 Formal Tests

5.8.2.1 Correlation Test for Normality

• same as for simple linear regression


• compute the correlation coefficient between the
ordered residuals and their expected values under
normality (as in Section 3.4.1) and compare with
values in Table B.6 in Kutner et al.

5.8.2.2 Test of Constancy of Error Variance

Levene Test

• applied in the same way as in simple linear regression,


but for each of the predictor variables in the multiple
linear regression model

Breusch-Pagan Test

• if non-constancy of error variance is observed for only


one predictor, then the test is applied in the same way
as in simple linear regression
• if non-constancy of error variance is a function of
several predictors, say X1, X2, . . . , Xq−1, for q ≤ p,
then consider the regression:
loge σi2 = γ0 + γ1xi1 + . . . + γq−1xi,q−1
162
Chapter 5. Multiple Linear Regression C. Sotto

We want to test:
H0 : γ1 = γ2 = · · · = γq−1 = 0
vs.
HA : γk > 0, for some k .

Regress the squared residuals, e2i , on the predictors of


interest, Xi1, Xi2, . . . , Xi,q−1, and obtain SSR∗ from
this regression.
Statistic:

   χ²BP = (SSR*/2) / (SSE/n)²    ~    χ²(q−1)   under H0
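A sketch of how this version of the Breusch-Pagan statistic can be computed
by hand in SAS, assuming a data set MYDATA with response Y and predictors
of interest X1 and X2 (hypothetical names, so q − 1 = 2). SSR* is the Model
sum of squares from the regression of the squared residuals, and SSE comes
from the original fit:

   proc reg data=mydata;
      model Y = X1 X2;
      output out=resid r=e;       /* residuals from the original fit */
   run; quit;

   data resid2;
      set resid;
      esq = e*e;                  /* squared residuals */
   run;

   proc reg data=resid2;
      model esq = X1 X2;          /* SSR* = Model SS of this regression */
   run; quit;

   /* Chi-square statistic: (SSR*/2) / (SSE/n)**2, compared to a
      chi-square distribution with q-1 degrees of freedom. */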

5.8.2.3 F -Test for Lack-of-Fit

• the lack-of-fit test for simple linear regression can be


extended to test whether the multiple regression
response function
E(Y ) = β0 + β1X1 + β2X2 + . . . + βp−1Xp−1
is an appropriate response surface
• the test requires replicate observations on Y
corresponding to levels of each of the X variables
that are constant from trial to trial
163
Chapter 5. Multiple Linear Regression C. Sotto

Hypotheses:

H0 : E(Y ) = β0 + β1X1 + β2X2 + . . . + βp−1Xp−1


versus
HA : E(Y ) ≠ β0 + β1X1 + β2X2 + . . . + βp−1Xp−1

Test Statistic:

   F* = MSLF / MSPE = [SSLF/(c − p)] / [SSPE/(n − c)]    ~    F(c−p , n−c)   under H0 ,
where:
c = the number of groups with distinct
sets of levels for the X variables
SSLF = SSE − SSP E

Decision Rule:

F ∗ ≤ F1−α ; c−p , n−c ⇒ conclude H0


F ∗ > F1−α ; c−p , n−c ⇒ conclude HA

5.8.3 Remedial Measures

• inclusion of curvature effects (e.g., higher-order


terms) and/or interaction effects
• simple transformations on Y and/or on some X’s
• Box-Cox Transformation
164
Chapter 5. Multiple Linear Regression C. Sotto

5.9 Case Study: Dwaine Studios Data

Dwaine Studios, Inc.

n = 21 cities

Variables

Y : sales
X1 : persons aged 16 or younger in the community
X2 : per capita disposable personal income in the
community

Question:

Can we predict sales, Y , using X1 and X2?

Model:
Yi = β0 + β1Xi1 + β2Xi2 + εi

To Do:

1. Fit the model.


2. Estimate β0, β1 and β2.
3. Analyze the residuals.
4. Analyze the appropriateness of the model.
165
Chapter 5. Multiple Linear Regression C. Sotto

SAS Code

data DWAINE;
input X1 X2 Y;
cards;
68.5 16.7 174.4
45.2 16.8 164.4
91.3 18.2 244.2
47.8 16.3 154.6
46.9 17.3 181.6
66.1 18.2 207.5
49.5 15.9 152.8
52.0 17.2 163.2
48.9 16.6 145.4
38.4 16.0 137.2
87.9 18.3 241.9
72.8 17.1 191.1
88.4 17.4 232.0
42.9 15.8 145.3
52.5 17.8 161.1
85.7 18.4 209.7
41.3 16.5 146.4
51.7 16.3 144.0
89.6 18.1 232.6
82.7 19.1 224.1
52.3 16.0 166.5
;
run;

proc reg data=DWAINE;


model Y = X1 X2;
plot residual.*(predicted. X1 X2);
plot residual.*nqq./noline;
run; quit;

166
Chapter 5. Multiple Linear Regression C. Sotto

SAS Output
The REG Procedure
Model: MODEL1
Dependent Variable: Y

Number of Observations Read 21


Number of Observations Used 21

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 2 24015 12008 99.10 <.0001


Error 18 2180.92741 121.16263
Corrected Total 20 26196

Root MSE 11.00739 R-Square 0.9167


Dependent Mean 181.90476 Adj R-Sq 0.9075
Coeff Var 6.05118

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 -68.85707 60.01695 -1.15 0.2663


X1 1 1.45456 0.21178 6.87 <.0001
X2 1 9.36550 4.06396 2.30 0.0333

Estimated Regression Function


Ŷ = −68.8571 + 1.4546X1 + 9.3655X2

Fitted Values and Residuals

Ŷ = X β̂
e = Y − Ŷ
167
Chapter 5. Multiple Linear Regression C. Sotto

Appropriateness of the Model

• residuals versus fitted values


• normal probability plot
• residuals versus X1
• residuals versus X2

SAS Graphs

Figure 5.3: Residuals vs. Fitted Values for Dwaine Studios Example.


168
Chapter 5. Multiple Linear Regression C. Sotto

SAS Graphs

Figure 5.4: Residuals vs. Predictor X1 for Dwaine Studios Example.


Figure 5.5: Residuals vs. Predictor X2 for Dwaine Studios Example.


169
Chapter 5. Multiple Linear Regression C. Sotto

SAS Graphs

Figure 5.6: Normal Quantile Plot for Dwaine Studios Example.


170
Chapter 5. Multiple Linear Regression C. Sotto

ANOVA Table

Test the regression relation:

H0 : β1 = 0 and β2 = 0
versus
HA : β1 ≠ 0 or β2 ≠ 0 .

Test Statistic:
   F* = MSR / MSE = 99.10

F ∗ = 99.1 > F (0.95 ; 2, 18) = 3.55

We can say that average sales are related to size of the


target population and per capita income.

Can we use this model to predict?

We calculate the coefficient of multiple determination.

   R² = SSR/SSTO = 0.9167    and    Ra² = 0.9075

We can say that the reduction in the total variation in


sales is about 91% when accounting for the size of the
target population (X1) and per capita income (X2).
171
Chapter 5. Multiple Linear Regression C. Sotto

Estimation of Regression Parameters

Estimate β1 and β2 jointly with (1 − α) = 0.90.

s2(β̂ 1) = (0.2117)2 and s2(β̂ 2) = (4.0639)2

   B = t(1 − 0.10/(2·2) ; 18) = t(0.975 ; 18) = 2.101

1.455 ∓ 2.101(0.212) and 9.366 ∓ 2.101(4.064)

1.01 ≤ β1 ≤ 1.90 and 0.83 ≤ β2 ≤ 17.9

β1 and β2 are positive!

SAS Code
proc reg data=DWAINE alpha=0.05;
model Y = X1 X2 / clb;
run; quit;

SAS Output
Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t| 95% Confidence Limits

Intercept 1 -68.85707 60.01695 -1.15 0.2663 -194.94801 57.23387


X1 1 1.45456 0.21178 6.87 <.0001 1.00962 1.89950
X2 1 9.36550 4.06396 2.30 0.0333 0.82744 17.90356

172
Chapter 5. Multiple Linear Regression C. Sotto

Estimation of Mean Response

Estimate the mean sales of a city when Xh1 = 65.4 and


Xh2 = 17.6 with (1 − α) = 0.95.

95% Confidence Interval:

Ŷ h = −68.85 + 1.4545(65.4) + 9.365(17.6)


s2(Ŷ h) = 7.656

   t(1 − 0.05/2 ; 18) = t(0.975 ; 18) = 2.101

191.10 ∓ 2.101(2.77)

185.3 ≤ E(Yh) ≤ 196.9

SAS Code

data DWAINE2;
set DWAINE;
output;
if _n_=21 then do;
X1 = 65.4; X2 = 17.6; Y = .; output;
end;
run;

proc reg data=DWAINE2 alpha=0.05;


model Y = X1 X2 / clm;
run; quit;

173
Chapter 5. Multiple Linear Regression C. Sotto

SAS Output

The REG Procedure


Model: MODEL1
Dependent Variable: Y

Output Statistics

Dependent Predicted Std Error


Obs Variable Value Mean Predict 95% CL Mean Residual

1 174.4000 187.1841 3.8409 179.1146 195.2536 -12.7841


2 164.4000 154.2294 3.5558 146.7591 161.6998 10.1706
3 244.2000 234.3963 4.5882 224.7569 244.0358 9.8037
4 154.6000 153.3285 3.2331 146.5361 160.1210 1.2715
5 181.6000 161.3849 4.4300 152.0778 170.6921 20.2151
6 207.5000 197.7414 4.3786 188.5424 206.9404 9.7586
7 152.8000 152.0551 4.1696 143.2952 160.8150 0.7449
8 163.2000 167.8666 3.3310 160.8684 174.8649 -4.6666
9 145.4000 157.7382 2.9628 151.5136 163.9628 -12.3382
10 137.2000 136.8460 4.0074 128.4268 145.2653 0.3540
11 241.9000 230.3874 4.2012 221.5610 239.2137 11.5126
12 191.1000 197.1849 3.4109 190.0188 204.3510 -6.0849
13 232.0000 222.6857 5.3808 211.3810 233.9904 9.3143
14 145.3000 141.5184 4.1735 132.7502 150.2866 3.7816
15 161.1000 174.2132 5.0377 163.6294 184.7971 -13.1132
16 209.7000 228.1239 4.1214 219.4652 236.7826 -18.4239
17 146.4000 145.7470 3.7331 137.9041 153.5899 0.6530
18 144.0000 159.0013 3.2529 152.1672 165.8354 -15.0013
19 232.6000 230.9870 4.4176 221.7059 240.2681 1.6130
20 224.1000 230.3161 5.8120 218.1054 242.5267 -6.2161
21 166.5000 157.0644 4.0792 148.4944 165.6344 9.4356
22 . 191.1039 2.7668 185.2911 196.9168 .

Sum of Residuals 0
Sum of Squared Residuals 2180.92741
Predicted Residual SS (PRESS) 3002.92331

174
Chapter 5. Multiple Linear Regression C. Sotto

Prediction Limits for New Observations

Obtain 90% prediction limits for two new cities, A and


B, with the following values for the predictor variables:
XA1 = 65.4 and XA2 = 17.6

XB1 = 53.1 and XB2 = 17.7


So, (1 − α) = 0.90 and g = 2.

   Scheffé:      S² = 2 F(0.90 ; 2 , 18) = 5.24   ⇒   S = 2.29
   Bonferroni:   B = t(1 − 0.10/(2·2) ; 18) = t(0.975 ; 18) = 2.101

Bonferroni is preferred here.


                        City A          City B
   Ŷ                    191.10          174.15
   s²(Ŷ)                7.656           21.1623
   MSE                  121.1626        121.1626
   s²(Ŷ(new))           128.82          142.32
   s(Ŷ(new))            11.35           11.93

   167.3 ≤ YA(new) ≤ 214.9          149.1 ≤ YB(new) ≤ 199.2

175
Chapter 5. Multiple Linear Regression C. Sotto

SAS Code
data DWAINE3;
set DWAINE;
output;
if _n_=21 then do;
X1 = 65.4; X2 = 17.6; Y = .; output;
X1 = 53.1; X2 = 17.7; Y = .; output;
end;
run;

proc reg data=DWAINE3 alpha=0.05;


model Y = X1 X2 / cli;
run; quit;

SAS Output
The REG Procedure
Model: MODEL1
Dependent Variable: Y

Output Statistics

Dependent Predicted Std Error


Obs Variable Value Mean Predict 95% CL Predict Residual

1 174.4000 187.1841 3.8409 162.6910 211.6772 -12.7841


2 164.4000 154.2294 3.5558 129.9271 178.5317 10.1706
.......................................................................
.......................................................................
.......................................................................
20 224.1000 230.3161 5.8120 204.1647 256.4675 -6.2161
21 166.5000 157.0644 4.0792 132.4018 181.7270 9.4356
22 . 191.1039 2.7668 167.2589 214.9490 .
23 . 174.1494 4.5986 149.0867 199.2121 .

Sum of Residuals 0
Sum of Squared Residuals 2180.92741
Predicted Residual SS (PRESS) 3002.92331

176
Chapter 5. Multiple Linear Regression C. Sotto

5.10 Extra Sum of Squares

Consider a regression model with a number of predictors.

Now suppose we want to include one or a few more


predictors into the model.

The extra sum of squares provides a way of quantifying:

• the marginal reduction in the error sum of squares by


including these new predictor(s) to the set of
predictors that are already in the base model

• the marginal increase or gain in the regression sum of


squares when one or several additional predictors are
added to the regression model

For example, SSR(X2|X1) denotes the extra sum of


squares due to regression when adding the predictor X2
to a model already containing X1.

Similarly, SSR(X3, X4|X1, X2) represents the extra sum


of squares due to regression by adding the predictors X3
and X4 to a model containing only X1 and X2.

177
Chapter 5. Multiple Linear Regression C. Sotto

5.10.1 Example: Body Fat Data

Variables


Y : amount of body fat 










X1 : skinfold thickness 





n = 20 subjects
X2 : thigh circumference 










X3 : mid-arm circumference 

Model 1:

E(Yi) = β0 + β1Xi1
SSE(X1) = 143.12

Model 2:

E(Yi) = β0 + β1Xi1 + β2Xi2


SSE(X1, X2) = 109.95

Hence,

SSR(X2|X1) = SSE(X1) − SSE(X1, X2)


= 143.12 − 109.95
SSR(X2|X1) = 33.17

178
Chapter 5. Multiple Linear Regression C. Sotto

Note that:

SSR(X2|X1) = SSR(X1, X2) − SSR(X1)


= 385.44 − 352.27
SSR(X2|X1) = 33.17

   (marginal reduction in SSE  =  marginal increase in SSR)

Why?
SST O = SSR + SSE

More Examples



   SSR(X3|X1, X2) = SSE(X1, X2) − SSE(X1, X2, X3) = 109.95 − 98.41 = 11.54
                  = SSR(X1, X2, X3) − SSR(X1, X2) = 396.98 − 385.44 = 11.54

   SSR(X3, X2|X1) = SSE(X1) − SSE(X1, X2, X3) = 143.12 − 98.41 = 44.71
                  = SSR(X1, X2, X3) − SSR(X1)  = 396.98 − 352.27 = 44.71

179
Chapter 5. Multiple Linear Regression C. Sotto

5.10.2 Decomposition of SSR

The regression sum of squares, SSR, in multiple


regression can be decomposed in various ways into extra
sums of squares.

For 2 predictors:






   SSR(X1, X2) = SSR(X1|X2) + SSR(X2)
               = SSR(X2|X1) + SSR(X1)

Hence,

SST O = SSR(X1) + SSE(X1)


SST O = SSR(X1) + SSR(X2|X1) + SSE(X1, X2)
SST O = SSR(X1, X2) + SSE(X1, X2)

For 3 predictors, some decompositions for SSR are:

   SSR(X1, X2, X3) = SSR(X1) + SSR(X2|X1) + SSR(X3|X2, X1)
                   = SSR(X2) + SSR(X3|X2) + SSR(X1|X2, X3)
                   = SSR(X1) + SSR(X2, X3|X1)
180
Chapter 5. Multiple Linear Regression C. Sotto

SAS Code

*** bodyfat.sas ***;

data BODY;
input X1 X2 X3 Y;
cards;
19.5 43.1 29.1 11.9
24.7 49.8 28.2 22.8
30.7 51.9 37.0 18.7
29.8 54.3 31.1 20.1
19.1 42.2 30.9 12.9
25.6 53.9 23.7 21.7
31.4 58.5 27.6 27.1
27.9 52.1 30.6 25.4
22.1 49.9 23.2 21.3
25.5 53.5 24.8 19.3
31.1 56.6 30.0 25.4
30.4 56.7 28.3 27.2
18.7 46.5 23.0 11.7
19.7 44.2 28.6 17.8
14.6 42.7 21.3 12.8
29.5 54.4 30.1 23.9
27.7 55.3 25.7 22.6
30.2 58.6 24.6 25.4
22.7 48.2 27.1 14.8
25.2 51.0 27.5 21.1
;
run;

proc reg data=BODY;


MODEL1 : model Y = X1 ;
MODEL2 : model Y = X2 ;
MODEL3 : model Y = X1 X2 ;
MODEL4 : model Y = X1 X2 X3 ;
run; quit;

181
Chapter 5. Multiple Linear Regression C. Sotto

SAS Output

Model 1

The REG Procedure


Model: MODEL1
Dependent Variable: Y

Number of Observations Read 20


Number of Observations Used 20

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 352.26980 352.26980 44.30 <.0001


Error 18 143.11970 7.95109
Corrected Total 19 495.38950

Root MSE 2.81977 R-Square 0.7111


Dependent Mean 20.19500 Adj R-Sq 0.6950
Coeff Var 13.96271

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 -1.49610 3.31923 -0.45 0.6576


X1 1 0.85719 0.12878 6.66 <.0001

182
Chapter 5. Multiple Linear Regression C. Sotto

Model 2

The REG Procedure


Model: MODEL2
Dependent Variable: Y

Number of Observations Read 20


Number of Observations Used 20

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 381.96582 381.96582 60.62 <.0001


Error 18 113.42368 6.30132
Corrected Total 19 495.38950

Root MSE 2.51024 R-Square 0.7710


Dependent Mean 20.19500 Adj R-Sq 0.7583
Coeff Var 12.43002

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 -23.63449 5.65741 -4.18 0.0006


X2 1 0.85655 0.11002 7.79 <.0001

183
Chapter 5. Multiple Linear Regression C. Sotto

Model 3

The REG Procedure


Model: MODEL3
Dependent Variable: Y

Number of Observations Read 20


Number of Observations Used 20

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 2 385.43871 192.71935 29.80 <.0001


Error 17 109.95079 6.46769
Corrected Total 19 495.38950

Root MSE 2.54317 R-Square 0.7781


Dependent Mean 20.19500 Adj R-Sq 0.7519
Coeff Var 12.59305

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 -19.17425 8.36064 -2.29 0.0348


X1 1 0.22235 0.30344 0.73 0.4737
X2 1 0.65942 0.29119 2.26 0.0369

184
Chapter 5. Multiple Linear Regression C. Sotto

Model 4

The REG Procedure


Model: MODEL4
Dependent Variable: Y

Number of Observations Read 20


Number of Observations Used 20

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 3 396.98461 132.32820 21.52 <.0001


Error 16 98.40489 6.15031
Corrected Total 19 495.38950

Root MSE 2.47998 R-Square 0.8014


Dependent Mean 20.19500 Adj R-Sq 0.7641
Coeff Var 12.28017

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 117.08469 99.78240 1.17 0.2578


X1 1 4.33409 3.01551 1.44 0.1699
X2 1 -2.85685 2.58202 -1.11 0.2849
X3 1 -2.18606 1.59550 -1.37 0.1896

185
Chapter 5. Multiple Linear Regression C. Sotto

5.10.3 ANOVA Table with SSR Decomposition

For 3 predictors:

Source of Variation SS df MS

Regression SSR(X1, X2 , X3) 3 MSR(X1 , X2, X3 )

X1 SSR(X1) 1 MSR(X1 )

X2 |X1 SSR(X2|X1 ) 1 MSR(X2 |X1 )

X3 |X1 , X2 SSR(X3|X1 , X2) 1 MSR(X3 |X1 , X2)

Error SSE(X1, X2, X3) n − 4 MSE(X1, X2 , X3)

Total SST O n−1

Example: Body Fat Data

Source of Variation SS df MS

Regression 396.98 3 132.23


X1 352.27 1 352.27
X2|X1 33.17 1 33.17
X3|X1, X2 11.54 1 11.54
Error 98.41 16 6.15

Total 495.39 19

186
Chapter 5. Multiple Linear Regression C. Sotto

SAS Code

The option SS1 in the MODEL statement gives the extra


sum of squares of each predictor in the model, following
the sequence specified in the MODEL statement.

When SS1 option is specified, the Type I SS in the SAS


output gives the SS of each predictor given all other
predictors that precede it in the MODEL statement.

proc reg data=BODY;


MODELA : model Y = X1 X2 / ss1 ;
MODELB : model Y = X2 X1 / ss1 ;
run; quit;

For the above SAS code, the Type I SS in the SAS


output gives:

• for MODELA: first SSR(X1), then SSR(X2|X1)


• for MODELB: first SSR(X2), then SSR(X1|X2)

When the SS2 option is specified, the Type II SS in the


SAS output gives the SS of each predictor given all other
predictors are in the model.

proc reg data=BODY;


MODELC : model Y = X1 X2 / ss2;
run; quit;

187
Chapter 5. Multiple Linear Regression C. Sotto

SAS Output

Model: MODELA
Dependent Variable: Y

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t| Type I SS

Intercept 1 -19.17425 8.36064 -2.29 0.0348 8156.76050


X1 1 0.22235 0.30344 0.73 0.4737 352.26980
X2 1 0.65942 0.29119 2.26 0.0369 33.16891

---------------------------------------------------------------------------

Model: MODELB
Dependent Variable: Y

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t| Type I SS

Intercept 1 -19.17425 8.36064 -2.29 0.0348 8156.76050


X2 1 0.65942 0.29119 2.26 0.0369 381.96582
X1 1 0.22235 0.30344 0.73 0.4737 3.47289

---------------------------------------------------------------------------

Model: MODELC
Dependent Variable: Y

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t| Type II SS

Intercept 1 -19.17425 8.36064 -2.29 0.0348 34.01785


X1 1 0.22235 0.30344 0.73 0.4737 3.47289
X2 1 0.65942 0.29119 2.26 0.0369 33.16891

188
Chapter 5. Multiple Linear Regression C. Sotto

5.10.4 Extra Sum of Squares in Test for


Regression

In multiple linear regression, different types of tests for


the regression parameters can be conducted.

   • all βk       (β1, β2, . . . , βp−1 ;  p − 1 parameters)   ⇒  overall F-test
   • one βk       (βk ;  1 parameter)                          ⇒  t-test / TEST statement (partial F-test)
   • several βk   (βi, βj, . . . , βl ;  ≥ 2 parameters)       ⇒  general linear model F-test (using extra SS)

Test for a Single βk

   H0 : βk = 0    versus    HA : βk ≠ 0

Statistic:
   t* = β̂k / s(β̂k)    ~    t(n−p)   under H0

Consider k = 3.

   H0 : β3 = 0    versus    HA : β3 ≠ 0

General Linear Test Statistic:

   F* = [SSE(R) − SSE(F)] / (dfR − dfF)   ÷   SSE(F) / dfF
189
Chapter 5. Multiple Linear Regression C. Sotto

Using now the Extra Sum of Squares for terms in F ∗,


   F* = [SSE(X1, X2) − SSE(X1, X2, X3)] / [(n − 3) − (n − 4)]   ÷   SSE(X1, X2, X3) / (n − 4)

      = SSR(X3|X1, X2) / 1   ÷   SSE(X1, X2, X3) / (n − 4)

   F* = MSR(X3|X1, X2) / MSE(X1, X2, X3)

Example: Body Fat Data

   F* = SSR(X3|X1, X2) / 1   ÷   SSE(X1, X2, X3) / (n − 4)
      = 11.54 / 1   ÷   98.41 / 16
   F* = 1.88 < F(0.99; 1, 16) = 8.53

Therefore X3 can be dropped from the model.

Also, considering the t-test statistic for testing

   H0 : β3 = 0    versus    HA : β3 ≠ 0 ,

we obtain:
   t* = −2.186 / 1.596 = −1.37
   (t*)² = (−1.37)² = 1.8769 ≈ 1.88 = F*
190
Chapter 5. Multiple Linear Regression C. Sotto

Test for Several βk

Consider the case of 3 predictors and the following test of


hypotheses:
   H0 : β2 = β3 = 0    versus    HA : β2 ≠ 0 or β3 ≠ 0 .

Full Model:
Y = β0 + β1X1 + β2X2 + β3X3 + ε

Reduced Model:
Y = β0 + β1X1 + ε

Statistic:
   F* = [SSE(X1) − SSE(X1, X2, X3)] / [(n − 2) − (n − 4)]   ÷   SSE(X1, X2, X3) / (n − 4)

      = SSR(X2, X3|X1) / 2   ÷   SSE(X1, X2, X3) / (n − 4)

   F* = MSR(X2, X3|X1) / MSE(X1, X2, X3)

Example: Body Fat Data

   F* = 44.71 / 2   ÷   98.41 / 16 = 3.63 = F(0.95; 2, 16)

   F* lies exactly on the boundary  ⇒  borderline case
191
Chapter 5. Multiple Linear Regression C. Sotto

In general, we can test hypotheses of the form:


H0 : βq = βq+1 = · · · = βp−1 = 0
versus
HA : not all βk in H0 equal to zero .

Statistic:
   F* = MSR(Xq, Xq+1, . . . , Xp−1 | X1, X2, . . . , Xq−1) / MSE(X1, X2, . . . , Xp−1)

Apart from the test for significance of a subset of the


regression parameters, i.e.

(1) H0 : β2 = β3 = 0 vs. HA : β2 ≠ 0 or β3 ≠ 0,
other, more general forms of tests on the βk are also
possible. For instance, for 3 predictors:

(2) H0 : β1 = β2 vs. HA : β1 ≠ β2
(3) H0 : β2 + β3 = 0 vs. HA : β2 + β3 ≠ 0
(4) H0 : β2 + β3 = 5 vs. HA : β2 + β3 ≠ 5
(5) H0 : β1 = 3 and β3 = 5 vs. HA : β1 ≠ 3 or β3 ≠ 5

For all such tests, one can use the general linear F -test
   F* = { [SSE(R) − SSE(F)] / (dfR − dfF) } / { SSE(F) / dfF }    ~    F(dfR−dfF , dfF)   under H0 .
192
Chapter 5. Multiple Linear Regression C. Sotto

SAS Code

proc reg data=BODY;


model Y = X1 X2 X3;
TEST_1a : test X2, X3;
TEST_1b : test X2=0, X3=0;
TEST_2 : test X1=X2;
TEST_3a : test X2+X3;
TEST_3b : test X2+X3=0;
TEST_4 : test X2+X3=5;
TEST_5 : test X1=3, X3=5;
run; quit;

SAS Output

The REG Procedure


Model: MODEL1

Test TEST_1a Results for Dependent Variable Y

Mean
Source DF Square F Value Pr > F

Numerator 2 22.35741 3.64 0.0500


Denominator 16 6.15031

Test TEST_1b Results for Dependent Variable Y

Mean
Source DF Square F Value Pr > F

Numerator 2 22.35741 3.64 0.0500


Denominator 16 6.15031

193
Chapter 5. Multiple Linear Regression C. Sotto

Test TEST_2 Results for Dependent Variable Y

Mean
Source DF Square F Value Pr > F

Numerator 1 10.15473 1.65 0.2171


Denominator 16 6.15031

Test TEST_3a Results for Dependent Variable Y

Mean
Source DF Square F Value Pr > F

Numerator 1 8.98811 1.46 0.2443


Denominator 16 6.15031

Test TEST_3b Results for Dependent Variable Y

Mean
Source DF Square F Value Pr > F

Numerator 1 8.98811 1.46 0.2443


Denominator 16 6.15031

Test TEST_4 Results for Dependent Variable Y

Mean
Source DF Square F Value Pr > F

Numerator 1 35.64718 5.80 0.0285


Denominator 16 6.15031

Test TEST_5 Results for Dependent Variable Y

Mean
Source DF Square F Value Pr > F

Numerator 2 5274.58186 857.61 <.0001


Denominator 16 6.15031

194
Chapter 5. Multiple Linear Regression C. Sotto

5.11 Coefficient of Partial


Determination

• provides a descriptive measure of relationship


• measures the marginal contribution of one X variable
when all others are already in the model

For example, R²Y1|2 denotes the additional contribution of X1 given
that X2 is already in the model.

   R²Y1|2 = [SSE(X2) − SSE(X1, X2)] / SSE(X2) = SSR(X1|X2) / SSE(X2)

R²Y1|2 measures the proportionate reduction in the variation in Y
remaining after X2 is included in the model that is gained by also
including X1 in the model.

Note:
   R²Y2|1 = (rY2 − rY1 r12)² / [ (1 − r²Y1)(1 − r²12) ]

The (positive or negative) square root of the coefficient


of partial determination is called the coefficient of partial
correlation. It takes the same sign as the corresponding
regression coefficient in the fitted regression function.
   RY2|1 = ± √( R²Y2|1 )
195
Chapter 5. Multiple Linear Regression C. Sotto

More Examples

   R²Y1|23  = SSR(X1|X2, X3) / SSE(X2, X3)

   R²Y2|13  = SSR(X2|X1, X3) / SSE(X1, X3)

   R²Y3|12  = SSR(X3|X1, X2) / SSE(X1, X2)

   R²Y4|123 = SSR(X4|X1, X2, X3) / SSE(X1, X2, X3)

Example: Body Fat Data

   R²Y2|1  = SSR(X2|X1) / SSE(X1)         = 33.17 / 143.12 = 0.232

   R²Y3|12 = SSR(X3|X1, X2) / SSE(X1, X2) = 11.54 / 109.95 = 0.105

   R²Y1|2  = SSR(X1|X2) / SSE(X2)         = 3.47 / 113.42 = 0.031
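In SAS, squared partial correlations of this type can be requested with the
PCORR2 option of the MODEL statement; a minimal sketch for the body fat
data (each predictor is evaluated given all of the other predictors in the
MODEL statement, so the three-predictor model below yields, e.g., R²Y3|12):

   proc reg data=BODY;
      model Y = X1 X2 X3 / pcorr2;   /* squared partial correlations */
   run; quit;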

196
Chapter 5. Multiple Linear Regression C. Sotto

5.12 Standardized Multiple Regression

Sometimes, problems can be encountered when fitting a


GLRM (5.1):
• roundoff errors in solving the normal equations
⇒ normal equations calculations are sensitive to
rounding of data in intermediate steps, especially
when the number of predictors is small
⇒ such roundoff errors occur when taking (X ′X)−1
⇒ effects are particularly great when (X ′X) has a
determinant close to 0 and/or the elements of
(X ′X) differ substantially in magnitude

• lack of comparability of regression coefficients


⇒ some variables are measured in very small units,
while others are measured in large units
⇒ comparing regression coefficients requires
knowledge of the units of measurement of the
predictors
Example: Ŷ = 200 + 20, 000X1 + 0.2X2
Y in dollars
X1 in thousand dollars
X2 in cents

Possible Solution: apply correlation transformation


197
Chapter 5. Multiple Linear Regression C. Sotto

Correlation Transform

First, standardize, i.e. center and scale, the data:


   Ỹi = (Yi − Ȳ) / SY       and       X̃ik = (Xik − X̄k) / Sk ,
where SY and Sk are the standard deviations of Y and
Xk , respectively.

Then, apply the correlation transform:


   Yi* = Ỹi / √(n − 1)       and       Xik* = X̃ik / √(n − 1) .

Standardized Regression Model

   Yi* = β1* Xi1* + β2* Xi2* + . . . + βp−1* Xi,p−1* + εi*

• there is no intercept parameter
• (X′X) = rXX ⇒ correlation matrix of the X's
• the (estimated) parameters of the standardized regression model,
  β1*, β2*, . . . , βp−1*, are related to those of the original,
  unstandardized, regression model:

      βk = (SY / Sk) βk* ,    ∀ k = 1, 2, . . . , p − 1
      β0 = Ȳ − β1X̄1 − β2X̄2 − . . . − βp−1X̄p−1
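In SAS, the standardized estimates β̂k* can be obtained without transforming
the data, via the STB option of the MODEL statement; a minimal sketch for
the body fat data:

   proc reg data=BODY;
      model Y = X1 X2 X3 / stb;   /* prints standardized estimates
                                     next to the usual estimates */
   run; quit;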
198
Chapter 5. Multiple Linear Regression C. Sotto

5.13 Multicollinearity

In multiple regression, we examine

• the relative importance of the effects of the different


predictor variables
• the magnitude of the effect of a given predictor on
the response variable
• whether any predictor can be dropped from the model
because it has little or no effect on the response
variable
• whether any predictor(s) not yet included in the
model should be considered for possible inclusion

All of these are fairly easily answered when the predictor


variables are uncorrelated. But, a problem arises if the
predictor variables are correlated among themselves.

Multicollinearity or intercorrelation among predictor


variables is said to exist when the predictors are
correlated among themselves.
Note: The term multicollinearity is usually reserved for
instances when the correlation among predictors is very high.
Multicollinearity should not be conceived as something that
either “exists” or “does not.” Rather, multicollinearity exists in
degrees, which will determine how problematic the situation is.
199
Chapter 5. Multiple Linear Regression C. Sotto

5.13.1 Uncorrelated Predictors

Let us first consider the case when the predictor variables


are uncorrelated.
• Regression coefficients corresponding to each
predictor remain the same regardless of which other
predictors are included in the model.
⇒ effects ascribed to a predictor are the same no
matter which other predictors are in the model

Example: If X1 and X2 are uncorrelated and we


consider the following two models:

Model 1: Yi = β0 + β1X1 + εi

Model 2: Yi = β0 + β1′ X1 + β2X2 + εi



⇒ β̂1 and β̂1′ are the same

• The marginal contribution of one predictor in


reducing the error sum of squares when the other
predictors are in the model is exactly the same as
when this predictor is in the model alone.

Example: If X1 and X2 are uncorrelated,

SSR(X1|X2) = SSR(X1)

SSR(X2|X1) = SSR(X2)
200
Chapter 5. Multiple Linear Regression C. Sotto

5.13.2 Example: Crew Productivity Data

Variables

   Y  : crew productivity
   X1 : crew size                  (X1 and X2 uncorrelated)
   X2 : level of bonus pay

We consider fitting different models.

Model 1: Yi = β0 + β1X1i + β2X2i + εi

Model 2: Yi = β0 + β1X1i + εi

Model 3: Yi = β0 + β2X2i + εi

Take note of the estimated regression coefficients in each


of the models.

Model Predictors β̂ 1 β̂ 2
1 X1 , X2 5.375 9.25
2 X1 5.375 –
3 X2 – 9.25

201
Chapter 5. Multiple Linear Regression C. Sotto

SAS Code

data CREW;
input X1 X2 Y;
cards;
4 2 42
4 2 39
4 3 48
4 3 51
6 2 49
6 2 53
6 3 61
6 3 60
;
run;

proc corr data=CREW;


run;

proc reg data=CREW;


MODEL1 : model Y = X1 X2 ;
MODEL2 : model Y = X1 ;
MODEL3 : model Y = X2 ;
run; quit;

202
Chapter 5. Multiple Linear Regression C. Sotto

SAS Output

The CORR Procedure

3 Variables: X1 X2 Y

Simple Statistics

Variable N Mean Std Dev Sum

X1 8 5.00000 1.06904 40.00000


X2 8 2.50000 0.53452 20.00000
Y 8 50.37500 7.74481 403.00000

Simple Statistics

Variable Minimum Maximum

X1 4.00000 6.00000
X2 2.00000 3.00000
Y 39.00000 61.00000

Pearson Correlation Coefficients, N = 8


Prob > |r| under H0: Rho=0

X1 X2 Y

X1 1.00000 0.00000 0.74193


1.0000 0.0351

X2 0.00000 1.00000 0.63841


1.0000 0.0885

Y 0.74193 0.63841 1.00000


0.0351 0.0885

203
Chapter 5. Multiple Linear Regression C. Sotto

Model 1

The REG Procedure


Model: MODEL1
Dependent Variable: Y

Number of Observations Read 8


Number of Observations Used 8

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 2 402.25000 201.12500 57.06 0.0004


Error 5 17.62500 3.52500
Corrected Total 7 419.87500

Root MSE 1.87750 R-Square 0.9580


Dependent Mean 50.37500 Adj R-Sq 0.9412
Coeff Var 3.72704

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 0.37500 4.74045 0.08 0.9400


X1 1 5.37500 0.66380 8.10 0.0005
X2 1 9.25000 1.32759 6.97 0.0009

204
Chapter 5. Multiple Linear Regression C. Sotto

Model 2

The REG Procedure


Model: MODEL2
Dependent Variable: Y

Number of Observations Read 8


Number of Observations Used 8

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 231.12500 231.12500 7.35 0.0351


Error 6 188.75000 31.45833
Corrected Total 7 419.87500

Root MSE 5.60877 R-Square 0.5505


Dependent Mean 50.37500 Adj R-Sq 0.4755
Coeff Var 11.13404

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 23.50000 10.11136 2.32 0.0591


X1 1 5.37500 1.98300 2.71 0.0351

205
Chapter 5. Multiple Linear Regression C. Sotto

Model 3

The REG Procedure


Model: MODEL3
Dependent Variable: Y

Number of Observations Read 8


Number of Observations Used 8

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 171.12500 171.12500 4.13 0.0885


Error 6 248.75000 41.45833
Corrected Total 7 419.87500

Root MSE 6.43881 R-Square 0.4076


Dependent Mean 50.37500 Adj R-Sq 0.3088
Coeff Var 12.78177

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 27.25000 11.60774 2.35 0.0572


X2 1 9.25000 4.55293 2.03 0.0885

206
Chapter 5. Multiple Linear Regression C. Sotto

5.13.3 Correlated Predictors

We now look at the case of perfectly correlated predictor


variables.

Consider a response variable, Y , and two predictors, X1


and X2, for which
X2 = 1.5X1 + 3 .

How many planes can we fit?

When the (some) predictors in a regression model are


perfectly correlated,
• the data do not contain any random error component
• many different response functions lead to the same
perfectly fitted values for any (X1, X2) values that
follow the relation, but different fitted values for any
(X1, X2) values that do not follow the relation

In practice,
1. perfect correlation is difficult to find
2. multicollinearity does not inhibit our ability to obtain
a good fit
3. the interpretations of the regression coefficients are
not fully applicable and are somewhat less meaningful

207
Chapter 5. Multiple Linear Regression C. Sotto

Figure 5.7: Illustration of Effects of Multicollinearity.

208
Chapter 5. Multiple Linear Regression C. Sotto

5.13.4 Effects of Multicollinearity

1. Fitted Values, Inferences on Mean Response and


Predictions
• does not inhibit ability to obtain a good fit
• does not affect inferences on the mean response
• does not affect predictions for new observations
2. Estimated Regression Coefficients
• are dependent on which predictors are included in
the model
• tend to have large sampling variability
• tend to vary widely from one sample to another
• may be individually statistically not significant
even though a definite statistical relation exists
⇒ imprecise information about the true regression
coefficients
3. Extra Sum of Squares
• the marginal contribution of any one predictor in
reducing the error sum of squares varies depending
on which other predictors are already in the model
• reduction in the total variation ascribed to a
predictor must be viewed in the context of the
other correlated predictors included in the model
SSR(X1|X2) ≠ SSR(X1)
SSR(X2|X1) ≠ SSR(X2)
209
Chapter 5. Multiple Linear Regression C. Sotto

4. Coefficients of Partial Determination


• are also affected because they are functions of the
extra sums of squares
R²Y1 ≠ R²Y1|2    and    R²Y2 ≠ R²Y2|1
5. Standard Errors of the Estimated Regression
Coefficients
• are inflated by multicollinearity
6. Inferences on Regression Coefficients
• unreliable because of large variability
• it is possible that individual regression coefficients
are not significant, but are jointly significant (e.g.,
the overall F -test for the regression relation is
significant)
Figure 5.8: Typical Confidence Interval for the Partial Slope Coefficients
for Two Highly Correlated Independent Variables.

210
Chapter 5. Multiple Linear Regression C. Sotto

7. (X ′X) Matrix
• severe multicollinearity has the effect of making
the determinant of (X ′X) come close to zero

Hence, we can use all this information to detect


multicollinearity. How?
• correlation exceeding 0.80 (rule of thumb)
• drastic changes in the estimated parameters
• significant overall F -test, but insignificant partial
t-tests
• large values of s(β̂ k )
• regress each predictor against the other predictors and
check R2 from this regression
• etc....
(In Chapter 6, we look at a more formal way to detect
multicollinearity.)
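The next-to-last bullet can be carried out with a few extra PROC REG runs;
a minimal sketch for, say, the body fat predictors, regressing each
predictor on the remaining ones and inspecting the resulting R-Square:

   proc reg data=BODY;
      model X1 = X2 X3;   /* an R-Square close to 1 suggests X1 is nearly */
      model X2 = X1 X3;   /* a linear combination of the other predictors */
      model X3 = X1 X2;
   run; quit;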

5.13.5 Example: Body Fat Data

Variables

   Y  : amount of body fat
   X1 : skinfold thickness
   X2 : thigh circumference           (n = 20 subjects)
   X3 : mid-arm circumference

211
Chapter 5. Multiple Linear Regression C. Sotto

The correlation matrix for the predictors indicates high correlation
between X1 and X2.

            ⎡ 1.000   0.924   0.458 ⎤
   rXX  =   ⎢ 0.924   1.000   0.085 ⎥
            ⎣ 0.458   0.085   1.000 ⎦
Recall the 4 models we fitted earlier.

Model 1: Yi = β0 + β1X1i + εi
Model 2: Yi = β0 + β2X2i + εi
Model 3: Yi = β0 + β1X1i + β2X2i + εi
Model 4: Yi = β0 + β1X1i + β2X2i + β3X3i + εi

The estimated regression coefficients vary largely


depending on which predictors are included in the model.

Model Predictors β̂ 1 β̂ 2 s(β̂ 1) s(β̂ 2)

1 X1 0.8572 − 0.1288 −
2 X2 − 0.8566 − 0.1100
3 X1 , X2 0.2224 0.6594 0.3034 0.2912
4 X1 , X2 , X3 4.3341 −0.2859 3.0155 2.5820

212
Chapter 5. Multiple Linear Regression C. Sotto

The extra sum of squares, as well as the coefficient of


partial determination, for one predictor also depends on
which predictors are included in the model.
   SSR(X1)    = 352.27        R²Y1   = 0.71
   SSR(X1|X2) = 3.47          R²Y1|2 = 0.03

   SSR(X2)    = 381.97
   SSR(X2|X1) = 33.17

The fitted values do not vary much across the models.

   Model   Predictors     Xh                               Ŷh      s(Ŷh)
   1       X1             Xh1 = 25                         19.93   0.632
   3       X1, X2         Xh1 = 25, Xh2 = 50               19.36   0.624
   4       X1, X2, X3     Xh1 = 25, Xh2 = 50, Xh3 = 29     19.19   0.621

In Model 4 (p. 185), the overall F -test is significant


(p < 0.001) but none of the partial F -tests (i.e. the
individual t-tests) for each of the predictors is significant.

213
Chapter 5. Multiple Linear Regression C. Sotto

5.13.6 Example: Life Satisfaction Data

Variables
Y : measure of satisfaction with life, Y ∈ [1, 20]
X1 : family income in 1000 dollars
X2 : measure of occupational prestige, X2 ∈ [0, 100]
X3 : number of years of education
X4 : frequency of attendance to religious services
X5 : population of current residence

Some additional information:

For N = 665 cases,


rX1X2 = 0.73, rX1X3 = 0.57, rX2X3 = 0.58 .

Model:

Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + β4Xi4 + β5Xi5 + εi

Fitted Regression Function:

E(Y ) = 10.51 + 0.065X1 + 0.011X2 + 0.116X3 + 0.265X4 − 0.056X5

214
Chapter 5. Multiple Linear Regression C. Sotto

The table below shows that multicollinearity seems to be


present.

Bivariate Correlations Among the 5 Predictors and R2 -Values when


Regressing Each Independent Variable on the Other Four (n = 50)

About 71% of the total variability in X2 (occupational


prestige) is “explained” by the other 4 predictors.

Less than 10% of the total variability in X4 (religious


attendance) and in X5 (population) is “explained” by the
other 4 predictors.

There are fairly high correlations among X1, X2 and X3,


whereas X4 and X5 are relatively independent of each
other and the other three independent variables.

215
Chapter 6

Model Building, Diagnostics and Remedial Measures

Chapters 9, 10 & 11 in Kutner et al.

• Model Building Process

• Automatic Search Procedures for Model Selection

• Outlier Detection

• Diagnostics Techniques

• Remedial Measures

216
Chapter 6. Model Building, Diagnostics and Remedial Measures C. Sotto

6.1 Model Building Process

1. Data Collection and Preparation


2. Reduction of Explanatory or Predictor Variables
3. Model Refinement and Selection
4. Model Validation

(Read Sections 9.1, 9.5 and 9.6 of Kutner et al. for a


detailed discussion of these steps.)

Introductory Example: Surgical Unit Data

Objective

To predict survival time (Y ) of patients with a particular


type of liver operation.

Data

The following variables were recorded for 54 patients.

X1 : blood clotting score


X2 : prognostic index
X3 : enzyme function test score
X4 : liver function test
217
Chapter 6. Model Building, Diagnostics and Remedial Measures C. Sotto

Figure 6.1: Strategy for Building a Regression Model.

218
Chapter 6. Model Building, Diagnostics and Remedial Measures C. Sotto

Graphs
Figure 6.2: Scatter Plot Matrix for Surgical Unit Example.


*** Correlations ***

X1 X2 X3 X4 Time LogTime
X1 1.00000000 0.09011973 -0.14963411 0.5024157 0.3725187 0.3464042
X2 0.09011973 1.00000000 -0.02360544 0.3690256 0.5539760 0.5928888
X3 -0.14963411 -0.02360544 1.00000000 0.4164245 0.5802438 0.6651216
X4 0.50241567 0.36902563 0.41642451 1.0000000 0.7223266 0.7262058
Time 0.37251865 0.55397598 0.58024382 0.7223266 1.0000000 0.9130965
LogTime 0.34640419 0.59288884 0.66512160 0.7262058 0.9130965 1.0000000

219
Chapter 6. Model Building, Diagnostics and Remedial Measures C. Sotto

Dependent Variable: Time


Figure 6.3: Normal Quantile Plot for Surgical Unit Example.


Figure 6.4: Residual Plot for Surgical Unit Example.


220
Chapter 6. Model Building, Diagnostics and Remedial Measures C. Sotto

Dependent Variable: Log(Time)


Figure 6.5: Normal Quantile Plot for Surgical Unit Example.


Figure 6.6: Residual Plot for Surgical Unit Example.


221
Chapter 6. Model Building, Diagnostics and Remedial Measures C. Sotto

6.2 Criteria for Model Selection

Suppose X is a set of predictor variables. How can we


identify a good subset of predictor variables?

⇒ We need some criteria.

If we have four variables, let’s say, X1, X2, X3 and X4,


how many models can we fit?

   2⁴ = 16   (including an intercept-only model)

In general, for a set of k predictors, X1, X2, . . . , Xk ,


there are a total of

2ᵏ different models.

In order to make a choice, we need to have some criteria


for evaluating all these candidate models.

Model selection procedures, also known as subset


selection or variable selection procedures, have been
developed to identify a small group of regression models
that are “good” according to a specific criterion.

222
Chapter 6. Model Building, Diagnostics and Remedial Measures C. Sotto

Criteria for Comparing Regression Models

Different criteria can be used to compare several


regression models:

• Rp2 or SSEp

• Ra2 or M SEp

• Cp

• P RESSp

• AIC and SBC

In our discussion of each of the above criteria, we will


assume the following:

• There are (P − 1) potential variables in the pool,


implying P parameters in the full model.

• All the models considered contain an intercept, β0.

• Each subset model with (p − 1) predictors has p


parameters, 1 ≤ p ≤ P .

• The number of observations is larger than the number


of parameters under the full model, i.e. n > P .

223
Chapter 6. Model Building, Diagnostics and Remedial Measures C. Sotto

6.2.1 Rp2 or SSEp Criterion

• makes use of the coefficient of multiple


determination, R2 (hereinafter denoted as Rp2 )

• identifies subsets of predictors with high Rp2

• R²p is equivalent to the SSEp criterion since

      R²p = 1 − SSEp / SSTO

  ⇒ R²p varies inversely with SSEp

Since SSTO is constant for all possible models, looking for models
with a high R²p is equivalent to looking for models with small SSEp.

6.2.2 R²a,p or MSEp Criterion

• uses the adjusted coefficient of multiple determination, R²a, to
  account for the number of parameters in the model

      R²a,p = 1 − [(n − 1)/(n − p)] · SSEp/SSTO = 1 − MSEp / [SSTO/(n − 1)]

• R²a,p gives essentially the same information as MSEp

      R²a,p increases   ⇐⇒   MSEp decreases

224
Chapter 6. Model Building, Diagnostics and Remedial Measures C. Sotto

6.2.3 Mallows Cp Criterion

• identifies models with smaller bias in the regression

      Cp = SSEp / MSE(X1, X2, . . . , XP−1)  −  (n − 2p)
where:
M SE(X1, X2, . . . , XP −1) is the M SE for the
model containing all (P − 1) predictors; and,

SSEp is the error sum of the squares for the


specific fitted regression model

• no bias implies E(Cp) ≈ p

• by definition, under the full model, Cp = P

• in a plot of Cp vs. p
⇒ models with little bias fall near the line Cp = p
⇒ models with substantial bias fall considerably
above the line Cp = p
⇒ Cp values below the line Cp = p are interpreted as
showing no bias

• we search for models with


1. small values for Cp , and,
2. Cp is near p
225
Chapter 6. Model Building, Diagnostics and Remedial Measures C. Sotto

6.2.4 AIC and SBC Criteria

AIC: Akaike’s Information Criterion

SBC: Schwarz Bayesian Information Criterion


 
   AICp = n ln(SSEp / n) + 2p

   SBCp = n ln(SSEp / n) + p ln(n)

• measures of the goodness-of-fit of an estimated


statistical model

• AIC and BIC make use of both SSEp and p in


their formulae
⇒ first term decreases as p increases
⇒ second term increases as p increases

• competing models may be ranked according to their


AIC or SBC

• smaller AIC or SBC is better

• AIC and SBC differ only in the penalty term for the
number of parameters in the model
⇒ for n ≥ 8, SBC penalty > AIC penalty
⇒ SBC favors more parsimonious models
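Both criteria can be requested from PROC REG together with the
all-possible-subsets search used later in this chapter; a minimal sketch,
assuming the SURG data set of the surgical unit example that follows:

   proc reg data=SURG;
      model logTime = X1 X2 X3 X4 / selection=rsquare
                                    adjrsq cp aic sbc sse mse;
   run; quit;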
226
Chapter 6. Model Building, Diagnostics and Remedial Measures C. Sotto

6.2.5 P RESSp Criterion

• P RESS stands for prediction sum of squares

• measures how well the use of the fitted values for a


specific subset model can predict the observed
responses Yi

• differs from SSE in that the fitted value Ŷi for the
P RESS is obtained by deleting the ith case from the
data set

Yi : observed values
Ŷ i(i) : fitted value for the ith case obtained by
fitting a model without the ith case, i.e.
using only (n − 1) cases

Prediction Sum of Squares:

   PRESSp = Σᵢ₌₁ⁿ ( Yi − Ŷi(i) )² ,

whereas SSE is

   SSEp = Σᵢ₌₁ⁿ ( Yi − Ŷi )² .

• models with small P RESSp are chosen by this


criterion
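In PROC REG, the PRESS statistic for a fitted model is printed at the foot
of the Output Statistics table whenever predicted values are requested
(e.g., via the P, CLM or CLI options, as in the Dwaine Studios output
earlier: "Predicted Residual SS (PRESS)"); a minimal sketch for one
candidate subset, assuming the SURG data set of the next example:

   proc reg data=SURG;
      model logTime = X1 X2 X3 / p;   /* Output Statistics end with the
                                         PRESS statistic */
   run; quit;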

227
Chapter 6. Model Building, Diagnostics and Remedial Measures C. Sotto

6.2.6 Example: Surgical Unit Data

Recall that there are 4 predictor variables.

Thus, P = 5 and 1 ≤ p ≤ 5.

R2p , M SEp , Cp and P RESSp Values for All Possible Regression


Models for Surgical Unit Example

228
Chapter 6. Model Building, Diagnostics and Remedial Measures C. Sotto

Rp2 Plot

Figure 6.7: R2p Plot for Surgical Unit Example.

• model with (X1, X2, X3) gives the highest Rp2


• from the scatter plot matrix, it can be observed that
X4 is highly correlated with X1, X2, X3

229
Chapter 6. Model Building, Diagnostics and Remedial Measures C. Sotto

M SEp Plot

Figure 6.8: M SEp Plot for Surgical Unit Example.

230
Chapter 6. Model Building, Diagnostics and Remedial Measures C. Sotto

Cp Plot

Figure 6.9: Cp Plot for Surgical Unit Example.

231
Chapter 6. Model Building, Diagnostics and Remedial Measures C. Sotto

P RESSp Plot

Figure 6.10: P RESSp Plot for Surgical Unit Example.

232
Chapter 6. Model Building, Diagnostics and Remedial Measures C. Sotto

SAS Code

*** SURGICAL ***;

data SURG;
input X1 X2 X3 X4 X5 X6 X7 X8 TIME logTime;
cards;
6.7 62 81 2.59 50 0 1 0 695 6.544
5.1 59 66 1.70 39 0 0 0 403 5.999
7.4 57 83 2.16 55 0 0 0 710 6.565
6.5 73 41 2.01 48 0 0 0 349 5.854
7.8 65 115 4.30 45 0 0 1 2343 7.759
...........................................
...........................................
...........................................
...........................................
...........................................
3.9 82 103 4.55 50 0 1 0 1078 6.983
6.6 77 46 1.95 50 0 1 0 405 6.005
6.4 85 40 1.21 58 0 0 1 579 6.361
6.4 59 85 2.33 63 0 1 0 550 6.310
8.8 78 72 3.20 56 0 0 0 651 6.478
;
run;

proc reg data=SURG;


model logTIME = X1 X2 X3 X4 /
selection=rsquare
best=2 cp adjrsq;
run; quit;

233
Chapter 6. Model Building, Diagnostics and Remedial Measures C. Sotto

SAS Output

The SAS System

The REG Procedure


Model: MODEL1
Dependent Variable: logTime

R-Square Selection Method

Number of Observations Read 54


Number of Observations Used 54

Number in Adjusted
Model R-Square R-Square C(p) Variables in Model

1 0.4276 0.4166 66.4889 X3


1 0.4215 0.4104 67.7148 X4
---------------------------------------------------------------
2 0.6633 0.6501 20.5197 X2 X3
2 0.5995 0.5838 33.5041 X3 X4
---------------------------------------------------------------
3 0.7573 0.7427 3.3905 X1 X2 X3
3 0.7178 0.7009 11.4237 X2 X3 X4
---------------------------------------------------------------
4 0.7592 0.7396 5.0000 X1 X2 X3 X4

234
Chapter 6. Model Building, Diagnostics and Remedial Measures C. Sotto

SAS Options

label: MODEL dependent=regressors / BEST=n ;

The BEST= option of the MODEL statement is used with the RSQUARE,
ADJRSQ and CP model-selection methods. A small value of the BEST= option
greatly reduces the CPU time required for large problems.

If the BEST= option is omitted and the number of regressors is less than
eleven, all possible subsets are evaluated. If the BEST= option is omitted and
the number of regressors is greater than ten, the number of subsets selected is
at most equal to the number of regressors.

label: MODEL dependent=regressors / CP ;

The CP option of the MODEL statement computes Mallows’ Cp statistic for


each model selected. (Only available when SELECTION=RSQUARE, ADJRSQ
or CP.)

label: MODEL dependent=regressors / SELECTION=name ;

The SELECTION= option of the MODEL statement specifies the method used
to select the model, where name can be FORWARD (or F), BACKWARD (or
B), STEPWISE, MAXR, MINR, RSQUARE, ADJRSQ, CP or NONE (uses the
full model). Default: NONE

label: MODEL dependent=regressors / SLENTRY= value ;

The SLENTRY= or SLE= option of the MODEL statement specifies the


significance level for entry into the model used in the FORWARD and
STEPWISE methods. Defaults: 0.50 for FORWARD; 0.15 for STEPWISE.

label: MODEL dependent=regressors / SSE ;

The SSE option of the MODEL statement computes the error sum of squares
for each model selected. (Only available when SELECTION=RSQUARE,
ADJRSQ or CP.)

235
Chapter 6. Model Building, Diagnostics and Remedial Measures C. Sotto

6.3 Automatic Search Procedures for


Model Selection

When a large number k of predictors is being considered,


the number of possible models, 2ᵏ, can become extremely
large. In such a case, evaluation of the various criteria for
all possible models can be computationally intensive. To
simplify the task, a number of automatic search
procedures have been developed.

6.3.1 “Best” Subsets Algorithms

• only the “best” subsets of predictors according to


some specified criterion are identified
• does not require fitting all possible subset regression
models
• also identifies several “good” subsets for each possible
number of predictors
Example:
If the Cp criterion is to be employed and the 5 best
subsets according to this criterion are to be identified,
the "best" subsets algorithms search for the 5 subsets of
predictors with the smallest Cp values, using far fewer
computational resources than when all possible
subsets are evaluated, as in the sketch below.
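For illustration, a minimal SAS sketch of such a search (reusing the SURG data set read in earlier; the candidate predictors X1–X8 are assumed to be those in that data step):

proc reg data=SURG;
   model logTime = X1 X2 X3 X4 X5 X6 X7 X8 / selection=cp best=5 adjrsq sse;
run; quit;

With BEST=5, only the 5 subsets with the smallest Cp are reported, which greatly reduces the computational burden compared to evaluating all possible subsets.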

6.3.2 Stepwise Regression

• automatic search procedure that develops the “best”


subset of predictors sequentially
• develops a sequence of regression models, at each
step adding or deleting a predictor
• criterion for adding or deleting a predictor is based on
the reduction in SSE, or equivalently, on the
coefficient of partial correlation or t- or F -statistic
• differs from “best” subsets approach in that the
stepwise procedure ends with the identification of a
single regression model as “best”
⇒ major weakness of stepwise selection procedure
⇒ hides the fact that several other regression models
may also be “good”
• sometimes, the stepwise selection procedure identifies a
suboptimal regression model as “best”
⇒ the “goodness” of the resulting “best” model
should still be evaluated by a thorough
examination using a variety of diagnostics
⇒ the resulting “best” model might not be
scientifically meaningful
Stepwise regression, as well as other automatic
selection procedures, should be used with
extreme caution.

Step 1

• Calculate the coefficient of simple determination
  between Y and each xk in the following way:

      R²Y.k = SSR(xk) / SSTO .

• Select xk for which R²Y.k is maximum.
  ⇒ select xk such that SSR(xk) is maximal
  ⇒ select xk such that Fk is maximal, where

      Fk = SSR(xk) / MSE(xk)

Rule for Step 1

• Include, in the model, the xk predictor for which Fk is


maximal.
• Include xk if
Fk > predetermined F -value.
⇒ p-value corresponding to Fk is less than some
predetermined level of significance (SLE in SAS)
Note: Default in SAS: SLE=0.15.


Step 2

• Suppose xS1 is selected in Step 1.

• For each of the remaining predictors, obtain

      R²Y.k|S1 = SSR(xk | xS1) / SSE(xS1) .

• Select xk such that R²Y.k|S1 is maximum.
  ⇒ select xk such that Fk|S1 is maximal, where

      Fk|S1 = SSR(xk | xS1) / MSE(xS1, xk)

Rule for Step 2

• Include, in the model, the xk predictor for which Fk|S1


is maximal.
• Include xk if
Fk|S1 > predetermined F -value.
⇒ p-value corresponding to Fk|S1 is less than some
predetermined level of significance (SLE in SAS)
Note: Default in SAS: SLE=0.15.


Step 3

• Suppose xS1 is selected in Step 1 and xS2 is selected
  in Step 2.

• Calculate

      FS1|S2 = MSR(xS1 | xS2) / MSE(xS1, xS2) .

• Using FS1|S2, determine if xS1 can be removed from
  the model now that xS2 is included in the model.

Rule for Step 3

• Delete xS1 from the model if


FS1|S2 < predetermined F -value.
⇒ p-value corresponding to FS1|S2 is greater than
some predetermined level of significance
(SLS in SAS)
Note: Default in SAS: SLS=0.15.

This process continues until no further predictors can be


added or removed from the model.
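In SAS, the entire sequence of steps is requested with a single MODEL option; a minimal sketch (data set and variable names are placeholders), with the entry and staying significance levels written out explicitly:

proc reg data=mydata;
   model Y = X1 X2 X3 X4 / selection=stepwise slentry=0.15 slstay=0.15;
run; quit;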


6.3.3 Forward Selection

• simplified version of stepwise regression


• starts with no predictors in the model
• at each step, only one predictor at a time can enter
the model (based on p-values and SLE)
Note: Default in SAS: SLE=0.50.
• after a variable enters the model, it can no longer be
removed at a later step
• procedure continues until no other predictors can
enter the model

6.3.4 Backward Selection

• reverse of forward selection


• starts with all predictors in the model
• at each step, one predictor is removed from the model
(based on p-values and SLS)
Note: Default in SAS: SLS=0.10.
• after a variable is removed from the model, it can no
longer re-enter at a later step
• procedure continues until no other predictors can be
removed from the model
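A minimal sketch of both procedures in SAS (placeholder names), with the default significance levels made explicit:

proc reg data=mydata;
   forward:  model Y = X1 X2 X3 X4 / selection=forward  slentry=0.50;
   backward: model Y = X1 X2 X3 X4 / selection=backward slstay=0.10;
run; quit;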


6.3.5 Example: HALD Data

Consider the HALD data given in Montgomery & Peck


(1982), p. 256.

SAS Code

*** HALD DATA ***;

data HALD;
input X1 X2 X3 X4 Y;
cards;
7 26 6 60 78.5
1 29 15 52 74.3
11 56 8 20 104.3
11 31 8 47 87.6
7 52 6 33 95.9
11 55 9 22 109.2
3 71 17 6 102.7
1 31 22 44 72.5
2 54 18 22 93.1
21 47 4 26 115.9
1 40 23 34 83.8
11 66 9 12 113.3
10 68 8 12 109.4
;
run;


title "Cp SELECTION PROCEDURE";


proc reg data=HALD;
model Y = X1 X2 X3 X4 / selection=cp;
run; quit;

title "BEST SUBSETS SELECTION PROCEDURE";


proc reg data=HALD;
model Y = X1 X2 X3 X4 / cp adjrsq
selection=rsquare best=2;
run; quit;

title "STEPWISE SELECTION PROCEDURE";


proc reg data=HALD;
model Y = X1 X2 X3 X4 / selection=stepwise;
run; quit;

title "FORWARD SELECTION PROCEDURE";


proc reg data=HALD;
model Y = X1 X2 X3 X4 / selection=forward;
run; quit;

title "BACKWARD SELECTION PROCEDURE";


proc reg data=HALD;
model Y = X1 X2 X3 X4 / selection=backward;
run; quit;


SAS Output

Cp SELECTION PROCEDURE

The REG Procedure


Model: MODEL1
Dependent Variable: Y

C(p) Selection Method

Number of Observations Read 13


Number of Observations Used 13

Number in
Model C(p) R-Square Variables in Model

2 2.6782 0.9787 X1 X2
3 3.0182 0.9823 X1 X2 X4
3 3.0413 0.9823 X1 X2 X3
3 3.4968 0.9813 X1 X3 X4
4 5.0000 0.9824 X1 X2 X3 X4
2 5.4959 0.9725 X1 X4
3 7.3375 0.9728 X2 X3 X4
2 22.3731 0.9353 X3 X4
2 62.4377 0.8470 X2 X3
2 138.2259 0.6801 X2 X4
1 138.7308 0.6745 X4
1 142.4864 0.6663 X2
2 198.0947 0.5482 X1 X3
1 202.5488 0.5339 X1
1 315.1543 0.2859 X3


SAS Output

BEST SUBSETS SELECTION PROCEDURE

The REG Procedure


Model: MODEL1
Dependent Variable: Y

R-Square Selection Method

Number of Observations Read 13


Number of Observations Used 13

Number in Adjusted
Model R-Square R-Square C(p) Variables in Model

1 0.6745 0.6450 138.7308 X4


1 0.6663 0.6359 142.4864 X2
---------------------------------------------------------------
2 0.9787 0.9744 2.6782 X1 X2
2 0.9725 0.9670 5.4959 X1 X4
---------------------------------------------------------------
3 0.9823 0.9764 3.0182 X1 X2 X4
3 0.9823 0.9764 3.0413 X1 X2 X3
---------------------------------------------------------------
4 0.9824 0.9736 5.0000 X1 X2 X3 X4


SAS Output
STEPWISE SELECTION PROCEDURE

The REG Procedure


Model: MODEL1
Dependent Variable: Y

Number of Observations Read 13


Number of Observations Used 13

Stepwise Selection: Step 1

Variable X4 Entered: R-Square = 0.6745 and C(p) = 138.7308

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 1831.89616 1831.89616 22.80 0.0006


Error 11 883.86692 80.35154
Corrected Total 12 2715.76308

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 117.56793 5.26221 40108 499.16 <.0001


X4 -0.73816 0.15460 1831.89616 22.80 0.0006

Bounds on condition number: 1, 1


---------------------------------------------------------------------------

Stepwise Selection: Step 2

Variable X1 Entered: R-Square = 0.9725 and C(p) = 5.4959

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 2 2641.00096 1320.50048 176.63 <.0001


Error 10 74.76211 7.47621
Corrected Total 12 2715.76308

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 103.09738 2.12398 17615 2356.10 <.0001


X1 1.43996 0.13842 809.10480 108.22 <.0001
X4 -0.61395 0.04864 1190.92464 159.30 <.0001

Bounds on condition number: 1.0641, 4.2564


---------------------------------------------------------------------------


Stepwise Selection: Step 3

Variable X2 Entered: R-Square = 0.9823 and C(p) = 3.0182

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 3 2667.79035 889.26345 166.83 <.0001


Error 9 47.97273 5.33030
Corrected Total 12 2715.76308

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 71.64831 14.14239 136.81003 25.67 0.0007


X1 1.45194 0.11700 820.90740 154.01 <.0001
X2 0.41611 0.18561 26.78938 5.03 0.0517
X4 -0.23654 0.17329 9.93175 1.86 0.2054

Bounds on condition number: 18.94, 116.36


---------------------------------------------------------------------------

Stepwise Selection: Step 4

Variable X4 Removed: R-Square = 0.9787 and C(p) = 2.6782

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 2 2657.85859 1328.92930 229.50 <.0001


Error 10 57.90448 5.79045
Corrected Total 12 2715.76308

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 52.57735 2.28617 3062.60416 528.91 <.0001


X1 1.46831 0.12130 848.43186 146.52 <.0001
X2 0.66225 0.04585 1207.78227 208.58 <.0001

Bounds on condition number: 1.0551, 4.2205


---------------------------------------------------------------------------

All variables left in the model are significant at the 0.1500 level.

No other variable met the 0.1500 significance level for entry into the model.

Summary of Stepwise Selection

Variable Variable Number Partial Model


Step Entered Removed Vars In R-Square R-Square C(p) F Value Pr > F

1 X4 1 0.6745 0.6745 138.731 22.80 0.0006


2 X1 2 0.2979 0.9725 5.4959 108.22 <.0001
3 X2 3 0.0099 0.9823 3.0182 5.03 0.0517
4 X4 2 0.0037 0.9787 2.6782 1.86 0.2054


SAS Output
FORWARD SELECTION PROCEDURE

The REG Procedure


Model: MODEL1
Dependent Variable: Y

Number of Observations Read 13


Number of Observations Used 13

Forward Selection: Step 1

Variable X4 Entered: R-Square = 0.6745 and C(p) = 138.7308

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 1831.89616 1831.89616 22.80 0.0006


Error 11 883.86692 80.35154
Corrected Total 12 2715.76308

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 117.56793 5.26221 40108 499.16 <.0001


X4 -0.73816 0.15460 1831.89616 22.80 0.0006

Bounds on condition number: 1, 1


-----------------------------------------------------------------------------

Forward Selection: Step 2

Variable X1 Entered: R-Square = 0.9725 and C(p) = 5.4959

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 2 2641.00096 1320.50048 176.63 <.0001


Error 10 74.76211 7.47621
Corrected Total 12 2715.76308

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 103.09738 2.12398 17615 2356.10 <.0001


X1 1.43996 0.13842 809.10480 108.22 <.0001
X4 -0.61395 0.04864 1190.92464 159.30 <.0001

Bounds on condition number: 1.0641, 4.2564


-----------------------------------------------------------------------------


Forward Selection: Step 3

Variable X2 Entered: R-Square = 0.9823 and C(p) = 3.0182

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 3 2667.79035 889.26345 166.83 <.0001


Error 9 47.97273 5.33030
Corrected Total 12 2715.76308

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 71.64831 14.14239 136.81003 25.67 0.0007


X1 1.45194 0.11700 820.90740 154.01 <.0001
X2 0.41611 0.18561 26.78938 5.03 0.0517
X4 -0.23654 0.17329 9.93175 1.86 0.2054

Bounds on condition number: 18.94, 116.36


-----------------------------------------------------------------------------

No other variable met the 0.5000 significance level for entry into the model.

Summary of Forward Selection

Variable Number Partial Model


Step Entered Vars In R-Square R-Square C(p) F Value Pr > F

1 X4 1 0.6745 0.6745 138.731 22.80 0.0006


2 X1 2 0.2979 0.9725 5.4959 108.22 <.0001
3 X2 3 0.0099 0.9823 3.0182 5.03 0.0517


SAS Output
BACKWARD SELECTION PROCEDURE

The REG Procedure


Model: MODEL1
Dependent Variable: Y

Number of Observations Read 13


Number of Observations Used 13

Backward Elimination: Step 0

All Variables Entered: R-Square = 0.9824 and C(p) = 5.0000

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 4 2667.89944 666.97486 111.48 <.0001


Error 8 47.86364 5.98295
Corrected Total 12 2715.76308

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 62.40537 70.07096 4.74552 0.79 0.3991


X1 1.55110 0.74477 25.95091 4.34 0.0708
X2 0.51017 0.72379 2.97248 0.50 0.5009
X3 0.10191 0.75471 0.10909 0.02 0.8959
X4 -0.14406 0.70905 0.24697 0.04 0.8441

Bounds on condition number: 282.51, 2489.2


-----------------------------------------------------------------------------

Backward Elimination: Step 1

Variable X3 Removed: R-Square = 0.9823 and C(p) = 3.0182

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 3 2667.79035 889.26345 166.83 <.0001


Error 9 47.97273 5.33030
Corrected Total 12 2715.76308

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 71.64831 14.14239 136.81003 25.67 0.0007


X1 1.45194 0.11700 820.90740 154.01 <.0001
X2 0.41611 0.18561 26.78938 5.03 0.0517
X4 -0.23654 0.17329 9.93175 1.86 0.2054

Bounds on condition number: 18.94, 116.36


-----------------------------------------------------------------------------


Backward Elimination: Step 2

Variable X4 Removed: R-Square = 0.9787 and C(p) = 2.6782

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 2 2657.85859 1328.92930 229.50 <.0001


Error 10 57.90448 5.79045
Corrected Total 12 2715.76308

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 52.57735 2.28617 3062.60416 528.91 <.0001


X1 1.46831 0.12130 848.43186 146.52 <.0001
X2 0.66225 0.04585 1207.78227 208.58 <.0001

Bounds on condition number: 1.0551, 4.2205


-----------------------------------------------------------------------------

All variables left in the model are significant at the 0.1000 level.

Summary of Backward Elimination

Variable Number Partial Model


Step Removed Vars In R-Square R-Square C(p) F Value Pr > F

1 X3 3 0.0000 0.9823 3.0182 0.02 0.8959


2 X4 2 0.0037 0.9787 2.6782 1.86 0.2054


Stepwise Regression

Step 1

Consider each predictor.


                                                   F     p-value
Fx1 = SSR(x1)/MSE(x1) = 1450.076/115.062  =    12.60      0.0046
Fx2 = SSR(x2)/MSE(x2) = 1809.427/82.394   =    21.96      0.0007
Fx3 = SSR(x3)/MSE(x3) =  776.363/176.309  =     4.40      0.0598
Fx4 = SSR(x4)/MSE(x4) = 1831.896/80.352   =    22.80      0.0006

Since 0.0006 < SLE (=0.15), predictor x4 enters the


model.

Step 2

Predictor x4 is in the model.


                                                        F     p-value
Fx1|x4 = SSR(x1|x4)/MSE(x1,x4) = 809.105/7.476   =  108.22     0.0001
Fx2|x4 = SSR(x2|x4)/MSE(x2,x4) =  14.987/86.888  =    0.17     0.6867
Fx3|x4 = SSR(x3|x4)/MSE(x3,x4) = 708.129/17.574  =   40.30     0.0001

Since 0.0001 < SLE (=0.15), predictor x1 enters the


model.

Step 3

Can we remove x4 now that x1 is in the model?

      Fx4|x1 = SSR(x4|x1) / MSE(x1,x4) = 1190.925/7.476 = 159.29,

with p-value = 0.0001. Since 0.0001 is not bigger than SLS (=0.15),
we do not remove x4 from the model.

Step 4

Predictors x1 and x4 are in the model.


                                                            F     p-value
Fx2|x1,x4 = SSR(x2|x1,x4)/MSE(x1,x2,x4) = 26.789/5.330 = 5.026     0.0517
Fx3|x1,x4 = SSR(x3|x1,x4)/MSE(x1,x3,x4) = 23.926/5.648 = 4.235     0.0697

Since 0.0517 < SLE (=0.15), predictor x2 enters the


model.


Step 5

Can we remove x1 or x4 now that x2 is in the model?


                                                              F     p-value
Fx1|x2,x4 = SSR(x1|x2,x4)/MSE(x1,x2,x4) = 820.907/5.330 = 154.02     0.0001
Fx4|x1,x2 = SSR(x4|x1,x2)/MSE(x1,x2,x4) =   9.932/5.330 =   1.86     0.2054

Since the p-value for x4 is 0.2054 > SLS (=0.15), we


remove x4 from the model containing x1, x2, x4.

Step 6

Predictors x1 and x2 are in the model.

                                                           F     p-value
Fx3|x1,x2 = SSR(x3|x1,x2)/MSE(x1,x2,x3) = 9.794/5.346 = 1.833     0.2089
Fx4|x1,x2 = SSR(x4|x1,x2)/MSE(x1,x2,x4) = 9.932/5.330 = 1.86      0.2054

Since both p-values are > SLE (=0.15), no predictor


enters the model.

Note:
It is of course clear in advance that, at this step, x4 cannot
enter the model, since it was removed from the model in
the previous step based on the same Fx4|x1,x2 value.

6.3.6 Danger of Automatic Selection Procedures

SAS Code

data one;
   retain seed 31491711;
   nobs = 30;
   p    = 0.5;
   mu   = 20;
   var  = 100;
   /* The response y = mu + e is pure noise: it is generated     */
   /* independently of all 25 candidate predictors, so none of   */
   /* them is truly related to y.                                 */
   do i=1 to nobs;
      array x{25} x1-x25;
      e  = sqrt(var)*rannor(seed);
      x1 = sqrt(var)*rannor(seed);
      /* x2, ..., x24 form a correlated chain of predictors */
      do k = 2 to 24;
         x[k] = sqrt(var)*
                ( sqrt(0.35)*x[k-1]/sqrt(var) +
                  sqrt(0.65)*rannor(seed) ) ;
      end;
      x25 = ranbin(seed,1,p);   /* one binary predictor */
      y = mu + e;
      output;
   end;
run;

proc reg data=one;


model1: model y=x1-x25/ selection=stepwise;
model2: model y=x1-x25/ selection=backward;
model3: model y=x1-x25/ selection=forward;
run; quit;


SAS Output: Stepwise


The REG Procedure
Model: model1
Dependent Variable: y

Number of Observations Read 30


Number of Observations Used 30

Stepwise Selection: Step 1

Variable x21 Entered: R-Square = 0.0978 and C(p) = -0.0284

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 226.67954 226.67954 3.04 0.0925


Error 28 2091.17987 74.68500
Corrected Total 29 2317.85940

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 18.01303 1.59486 9527.15175 127.56 <.0001


x21 0.27484 0.15776 226.67954 3.04 0.0925

Bounds on condition number: 1, 1


---------------------------------------------------------------------------

Stepwise Selection: Step 2

Variable x22 Entered: R-Square = 0.2834 and C(p) = -3.3711

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 2 656.85943 328.42972 5.34 0.0111


Error 27 1660.99997 61.51852
Corrected Total 29 2317.85940

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 17.58780 1.45637 8971.92920 145.84 <.0001


x21 0.69146 0.21289 648.97779 10.55 0.0031
x22 -0.49660 0.18780 430.17990 6.99 0.0135

Bounds on condition number: 2.2107, 8.843


---------------------------------------------------------------------------


Stepwise Selection: Step 3

Variable x23 Entered: R-Square = 0.4097 and C(p) = -5.0060

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 3 949.53425 316.51142 6.01 0.0030


Error 26 1368.32515 52.62789
Corrected Total 29 2317.85940

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 16.68731 1.40011 7475.95627 142.05 <.0001


x21 0.71110 0.19708 685.15532 13.02 0.0013
x22 -0.86588 0.23386 721.45902 13.71 0.0010
x23 0.45163 0.19151 292.67482 5.56 0.0262

Bounds on condition number: 4.0075, 26.694


---------------------------------------------------------------------------

Stepwise Selection: Step 4

Variable x1 Entered: R-Square = 0.4612 and C(p) = -4.4885

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 4 1068.90210 267.22552 5.35 0.0030


Error 25 1248.95730 49.95829
Corrected Total 29 2317.85940

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 16.23272 1.39547 6760.00263 135.31 <.0001


x1 -0.23973 0.15509 119.36785 2.39 0.1347
x21 0.75034 0.19369 749.74708 15.01 0.0007
x22 -0.86495 0.22785 719.90355 14.41 0.0008
x23 0.42123 0.18763 251.80252 5.04 0.0339

Bounds on condition number: 4.0075, 39.986


---------------------------------------------------------------------------


Stepwise Selection: Step 5

Variable x14 Entered: R-Square = 0.5165 and C(p) = -4.0826

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 5 1197.25870 239.45174 5.13 0.0024


Error 24 1120.60070 46.69170
Corrected Total 29 2317.85940

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 16.03505 1.35434 6545.24532 140.18 <.0001


x1 -0.29898 0.15413 175.68893 3.76 0.0642
x14 0.27890 0.16821 128.35661 2.75 0.1103
x21 0.66180 0.19472 539.37204 11.55 0.0024
x22 -0.83316 0.22111 662.94902 14.20 0.0009
x23 0.53114 0.19312 353.17658 7.56 0.0111

Bounds on condition number: 4.0379, 60.313


---------------------------------------------------------------------------

Stepwise Selection: Step 6

Variable x24 Entered: R-Square = 0.5683 and C(p) = -3.5740

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 6 1317.34216 219.55703 5.05 0.0020


Error 23 1000.51724 43.50075
Corrected Total 29 2317.85940

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 16.27480 1.31518 6661.26705 153.13 <.0001


x1 -0.34130 0.15094 222.42062 5.11 0.0335
x14 0.43155 0.18655 232.77901 5.35 0.0300
x21 0.59625 0.19204 419.34091 9.64 0.0050
x22 -0.94208 0.22326 774.53446 17.81 0.0003
x23 0.82791 0.25817 447.35281 10.28 0.0039
x24 -0.30875 0.18583 120.08346 2.76 0.1102

Bounds on condition number: 5.883, 107.66


---------------------------------------------------------------------------


Stepwise Selection: Step 7

Variable x25 Entered: R-Square = 0.6349 and C(p) = -3.4887

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 7 1471.51236 210.21605 5.46 0.0010


Error 22 846.34704 38.47032
Corrected Total 29 2317.85940

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 13.55584 1.83696 2094.98506 54.46 <.0001


x1 -0.37165 0.14275 260.75952 6.78 0.0162
x14 0.59929 0.19442 365.52471 9.50 0.0054
x21 0.53675 0.18302 330.86444 8.60 0.0077
x22 -0.87857 0.21234 658.59389 17.12 0.0004
x23 0.93669 0.24879 545.31442 14.17 0.0011
x24 -0.35695 0.17640 157.51207 4.09 0.0553
x25 5.27134 2.63320 154.17020 4.01 0.0578

Bounds on condition number: 6.1777, 141.7


---------------------------------------------------------------------------

All variables left in the model are significant at the 0.1500 level.

No other variable met the 0.1500 significance level for entry into the
model.

Summary of Stepwise Selection

Variable Variable Number Partial Model


Step Entered Removed Vars In R-Square R-Square C(p) F Value Pr > F

1 x21 1 0.0978 0.0978 -0.0284 3.04 0.0925


2 x22 2 0.1856 0.2834 -3.3711 6.99 0.0135
3 x23 3 0.1263 0.4097 -5.0060 5.56 0.0262
4 x1 4 0.0515 0.4612 -4.4885 2.39 0.1347
5 x14 5 0.0554 0.5165 -4.0826 2.75 0.1103
6 x24 6 0.0518 0.5683 -3.5740 2.76 0.1102
7 x25 7 0.0665 0.6349 -3.4887 4.01 0.0578


SAS Output: Backward

The REG Procedure


Model: model2
Dependent Variable: y

Number of Observations Read 30


Number of Observations Used 30

Backward Elimination: Step 0

All Variables Entered: R-Square = 0.8610 and C(p) = 26.0000

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 25 1995.78720 79.83149 0.99 0.5782


Error 4 322.07220 80.51805
Corrected Total 29 2317.85940

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 13.77599 3.87828 1015.92294 12.62 0.0238


x1 -0.51352 0.62397 54.53704 0.68 0.4567
x2 0.28307 0.57284 19.66210 0.24 0.6471
x3 0.38552 0.88471 15.28912 0.19 0.6855
x4 -0.44563 0.82102 23.72062 0.29 0.6161
x5 -0.66788 0.85183 49.49819 0.61 0.4768
x6 0.59795 0.64599 68.98624 0.86 0.4070
x7 0.00455 0.47408 0.00741 0.00 0.9928
x8 -0.54948 0.55561 78.75123 0.98 0.3787
x9 0.11509 0.83022 1.54724 0.02 0.8964
x10 -0.14936 0.49199 7.42073 0.09 0.7766
x11 0.16215 0.79841 3.32126 0.04 0.8490
x12 -0.43384 0.92863 17.57390 0.22 0.6647
x13 0.56707 0.89891 32.04329 0.40 0.5624
x14 0.18036 0.52230 9.60166 0.12 0.7472
x15 -0.09812 0.61076 2.07822 0.03 0.8802
x16 -0.23379 0.67977 9.52392 0.12 0.7482
x17 0.15836 1.02967 1.90457 0.02 0.8852
x18 -0.25081 0.42056 28.63779 0.36 0.5830
x19 -0.14436 0.48469 7.14245 0.09 0.7807
x20 0.03451 0.62265 0.24734 0.00 0.9585
x21 1.09512 0.58873 278.59750 3.46 0.1364
x22 -0.53348 0.57828 68.52517 0.85 0.4085
x23 0.00346 0.94034 0.00109 0.00 0.9972
x24 0.19069 0.65494 6.82603 0.08 0.7854
x25 -0.78819 10.09376 0.49097 0.01 0.9415

Bounds on condition number: 46.908, 10956


---------------------------------------------------------------------------


Backward Elimination: Step 1

Variable x23 Removed: R-Square = 0.8610 and C(p) = 24.0000

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 24 1995.78611 83.15775 1.29 0.4225


Error 5 322.07329 64.41466
Corrected Total 29 2317.85940

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 13.77995 3.33224 1101.55859 17.10 0.0090


x1 -0.51487 0.45145 83.78415 1.30 0.3058
x2 0.28252 0.49452 21.02440 0.33 0.5925
x3 0.38754 0.61982 25.18119 0.39 0.5592
x4 -0.44591 0.73106 23.96501 0.37 0.5685
x5 -0.67013 0.52982 103.04851 1.60 0.2617
x6 0.59922 0.48713 97.47019 1.51 0.2734
x7 0.00524 0.38916 0.01167 0.00 0.9898
x8 -0.55030 0.45476 94.32496 1.46 0.2803
x9 0.11359 0.64723 1.98408 0.03 0.8676
x10 -0.14894 0.42788 7.80437 0.12 0.7420
x11 0.16421 0.51004 6.67661 0.10 0.7605
x12 -0.43641 0.54533 41.25299 0.64 0.4599
x13 0.56965 0.50061 83.40803 1.29 0.3067
x14 0.17927 0.38417 14.02661 0.22 0.6604
x15 -0.09790 0.54352 2.08974 0.03 0.8641
x16 -0.23235 0.49644 14.10989 0.22 0.6595
x17 0.15602 0.72375 2.99351 0.05 0.8378
x18 -0.24999 0.31925 39.49852 0.61 0.4690
x19 -0.14543 0.34565 11.40323 0.18 0.6914
x20 0.03574 0.47027 0.37197 0.01 0.9424
x21 1.09468 0.51599 289.92086 4.50 0.0873
x22 -0.53204 0.38117 125.49503 1.95 0.2216
x24 0.19280 0.28324 29.84645 0.46 0.5263
x25 -0.81649 5.83523 1.26117 0.02 0.8942

Bounds on condition number: 28.969, 6219.8


---------------------------------------------------------------------------

---------------------------------------------------------------------------

---------------------------------------------------------------------------

---------------------------------------------------------------------------


Backward Elimination: Step 14

Variable x4 Removed: R-Square = 0.7676 and C(p) = 0.6912

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 11 1779.09449 161.73586 5.40 0.0008


Error 18 538.76491 29.93138
Corrected Total 29 2317.85940

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 13.43590 1.31839 3108.65753 103.86 <.0001


x1 -0.41606 0.14962 231.44490 7.73 0.0123
x3 0.44520 0.21451 128.93363 4.31 0.0525
x5 -0.75425 0.16996 589.48352 19.69 0.0003
x6 0.53361 0.17857 267.26287 8.93 0.0079
x8 -0.38296 0.13728 232.92123 7.78 0.0121
x12 -0.37114 0.16337 154.47786 5.16 0.0356
x13 0.72142 0.20605 366.92288 12.26 0.0025
x18 -0.29199 0.10747 220.93273 7.38 0.0141
x21 1.07806 0.17259 1167.88878 39.02 <.0001
x22 -0.68403 0.14223 692.25894 23.13 0.0001
x24 0.22227 0.11210 117.68499 3.93 0.0629

Bounds on condition number: 3.8093, 295.94


---------------------------------------------------------------------------

All variables left in the model are significant at the 0.1000 level.

Summary of Backward Elimination

Variable Number Partial Model


Step Removed Vars In R-Square R-Square C(p) F Value Pr > F

1 x23 24 0.0000 0.8610 24.0000 0.00 0.9972


2 x7 23 0.0000 0.8610 22.0002 0.00 0.9898
3 x20 22 0.0002 0.8609 20.0052 0.01 0.9338
4 x25 21 0.0005 0.8604 18.0182 0.02 0.8842
5 x15 20 0.0018 0.8587 16.0690 0.10 0.7587
6 x9 19 0.0015 0.8572 14.1109 0.09 0.7676
7 x17 18 0.0025 0.8547 12.1834 0.18 0.6836
8 x10 17 0.0028 0.8518 10.2648 0.21 0.6525
9 x11 16 0.0034 0.8484 8.3634 0.28 0.6080
10 x19 15 0.0070 0.8414 6.5646 0.60 0.4527
11 x14 14 0.0052 0.8362 4.7141 0.46 0.5093
12 x16 13 0.0207 0.8155 3.3110 1.90 0.1884
13 x2 12 0.0186 0.7969 1.8456 1.61 0.2226
14 x4 11 0.0294 0.7676 0.6912 2.46 0.1353


SAS Output: Forward

The REG Procedure


Model: model3
Dependent Variable: y

Number of Observations Read 30


Number of Observations Used 30

Forward Selection: Step 1

Variable x21 Entered: R-Square = 0.0978 and C(p) = -0.0284

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 226.67954 226.67954 3.04 0.0925


Error 28 2091.17987 74.68500
Corrected Total 29 2317.85940

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 18.01303 1.59486 9527.15175 127.56 <.0001


x21 0.27484 0.15776 226.67954 3.04 0.0925

Bounds on condition number: 1, 1


---------------------------------------------------------------------------

Forward Selection: Step 2

Variable x22 Entered: R-Square = 0.2834 and C(p) = -3.3711

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 2 656.85943 328.42972 5.34 0.0111


Error 27 1660.99997 61.51852
Corrected Total 29 2317.85940

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 17.58780 1.45637 8971.92920 145.84 <.0001


x21 0.69146 0.21289 648.97779 10.55 0.0031
x22 -0.49660 0.18780 430.17990 6.99 0.0135

Bounds on condition number: 2.2107, 8.843


---------------------------------------------------------------------------
---------------------------------------------------------------------------
---------------------------------------------------------------------------


Forward Selection: Step 13

Variable x15 Entered: R-Square = 0.7839 and C(p) = 4.2205

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 13 1816.99303 139.76869 4.46 0.0030


Error 16 500.86638 31.30415
Corrected Total 29 2317.85940

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 14.18281 1.76755 2015.49501 64.38 <.0001


x1 -0.26072 0.14026 108.16811 3.46 0.0815
x5 -0.51862 0.19589 219.41470 7.01 0.0176
x6 0.57018 0.21443 221.34691 7.07 0.0171
x8 -0.36099 0.14850 184.99511 5.91 0.0272
x14 0.42445 0.23019 106.42948 3.40 0.0838
x15 0.13204 0.18223 16.43460 0.52 0.4792
x16 -0.24622 0.15164 82.53182 2.64 0.1240
x18 -0.13388 0.12898 33.72613 1.08 0.3147
x21 0.80534 0.19650 525.83505 16.80 0.0008
x22 -0.83625 0.21956 454.10664 14.51 0.0015
x23 0.58102 0.25714 159.82581 5.11 0.0382
x24 -0.12607 0.17692 15.89479 0.51 0.4864
x25 1.39129 2.74770 8.02601 0.26 0.6195

Bounds on condition number: 8.1101, 622.09


---------------------------------------------------------------------------

No other variable met the 0.5000 significance level for entry into the model.

Summary of Forward Selection

Variable Number Partial Model


Step Entered Vars In R-Square R-Square C(p) F Value Pr > F

1 x21 1 0.0978 0.0978 -0.0284 3.04 0.0925


2 x22 2 0.1856 0.2834 -3.3711 6.99 0.0135
3 x23 3 0.1263 0.4097 -5.0060 5.56 0.0262
4 x1 4 0.0515 0.4612 -4.4885 2.39 0.1347
5 x14 5 0.0554 0.5165 -4.0826 2.75 0.1103
6 x24 6 0.0518 0.5683 -3.5740 2.76 0.1102
7 x25 7 0.0665 0.6349 -3.4887 4.01 0.0578
8 x5 8 0.0161 0.6510 -1.9534 0.97 0.3356
9 x6 9 0.0225 0.6735 -0.6013 1.38 0.2541
10 x8 10 0.0353 0.7088 0.3824 2.30 0.1455
11 x16 11 0.0558 0.7646 0.7769 4.26 0.0536
12 x18 12 0.0122 0.7768 2.4247 0.93 0.3478
13 x15 13 0.0071 0.7839 4.2205 0.52 0.4792


6.4 Diagnostic Methods

Outliers or outlying cases are observations in the data set


that are well separated from the rest of the data.

Should outliers be retained or eliminated?

Example
Figure 6.11: Scatter Plot for Regression with One Predictor Variable
Illustrating Outlying Cases.

• 1 and 2 are outlying with respect to Y


• 2, 3 and 4 are outlying with respect to X
• 3 and 4 are somewhat outlying with respect to Y

Which ones seem influential?


• 3 and 4 more than 2

6.4.1 Identifying Outlying Y Observations

• Residuals
      ei = Yi − Ŷi

• Semistudentized Residuals
      ei* = ei / √MSE

• Studentized Residuals
      ri = ei / s(ei) ,   where: s(ei) = √[ MSE (1 − hii) ]
  Note: hii is the ith diagonal element of the
  hat matrix, H = X(X′X)⁻¹X′.

• Deleted Residuals
      di = Yi − Ŷi(i) = ei / (1 − hii)
  Note: Ŷi(i) is the predicted value for the ith
  case obtained from fitting the same regression
  model without the ith case.

• Studentized Deleted Residuals
      ti = di / s(di) = ei √[ (n − p − 1) / (SSE (1 − hii) − ei²) ]  ∼  t(n−p−1)


Test for Outliers (in Y )

In general, observations whose studentized deleted


residuals are large in absolute value can be considered as
having outlying Y observations.

To test this formally, the Bonferroni test procedure can


be used.

• Since each case will be tested, g ≡ n.

• Critical value: t(1 − α/(2n) ; n − p − 1).

• Compare each |ti| with the critical value.

      |ti| > t(1 − α/(2n) ; n − p − 1)  ⇒  case i is outlying in Y
      |ti| ≤ t(1 − α/(2n) ; n − p − 1)  ⇒  case i is not outlying in Y

Note: Sometimes, it will be sufficient to test only


the case with the largest |ti|. If this case is not
outlying, then neither are any of the other cases.
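A minimal SAS sketch of this Bonferroni check (data set, predictors, p and alpha are placeholders); the studentized deleted residuals ti are obtained with the RSTUDENT= keyword of the OUTPUT statement:

proc reg data=mydata;
   model Y = X1 X2 X3;
   output out=diag rstudent=tres;               /* ti for each case */
run; quit;

data outlyingY;
   set diag nobs=n;
   p     = 4;                                   /* number of parameters */
   alpha = 0.10;
   tcrit = tinv(1 - alpha/(2*n), n - p - 1);    /* Bonferroni critical value */
   if abs(tres) > tcrit;                        /* keep cases outlying in Y */
run;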


6.4.2 Identifying Outlying X Observations

Recall the hat matrix, H, with diagonal elements hii,


where 0 ≤ hii ≤ 1. If p is the number of parameters in
the model, then

      ∑ hii = p ,   where the sum runs over i = 1, . . . , n.

In the context of diagnostic techniques, hii is called the


leverage (in terms of the X values) of the ith case.
• the hii can indicate, in a multivariate setting, whether
or not a case is outlying with respect to its X values
• hii is a measure of the distance between the X values
for case i and the means of the X values for all cases
      large hii  =⇒  case i is distant from the center of all X observations

• also useful to determine if an estimation or prediction


for a new case entails hidden extrapolation (hnew,new )

A leverage value hii is usually considered to be large if it
is more than twice as large as the mean leverage, i.e.

      hii > 2p/n .

Rule of Thumb (for reasonably large data sets):

      0.2 ≤ hii ≤ 0.5  =⇒  moderate leverage
      hii > 0.5        =⇒  high leverage
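A sketch of this rule in SAS (placeholder names), using the H= keyword of the OUTPUT statement to obtain the leverages hii:

proc reg data=mydata;
   model Y = X1 X2 X3;
   output out=diag h=lev;            /* hii for each case */
run; quit;

data highleverage;
   set diag nobs=n;
   p = 4;                            /* number of parameters */
   if lev > 2*p/n;                   /* outlying in X by the rule of thumb */
run;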

6.4.3 Identifying Influential Cases

After identifying cases that are outlying with respect to


their Y and/or X values, the next step is to ascertain
whether or not these outlying cases are influential.

A case is considered to be influential if its exclusion


causes major changes in the fitted regression function.

6.4.3.1 Influence on a Single Fitted Value

DFFITS

      DFFITSi = (Ŷi − Ŷi(i)) / √[ MSE(i) hii ] = ti √[ hii / (1 − hii) ]

• measures how much the estimated/fitted value, Ŷ i,


changes if the ith case is excluded

Rule of Thumb:

      |DFFITSi| > 1          =⇒  for small to medium data sets
      |DFFITSi| > 2 √(p/n)   =⇒  for large data sets


6.4.3.2 Influence on All Fitted Values

Cook’s Distance

      Di = ∑(j=1..n) (Ŷj − Ŷj(i))² / (p MSE) = [ ei² / (p MSE) ] [ hii / (1 − hii)² ]

• measures the influence of the ith case on all the n


fitted values

Rule of Thumb:

      Di < F(0.1 ; p , n−p)  or  Di < F(0.2 ; p , n−p)   =⇒  non-influential
      Di > F(0.5 ; p , n−p)                              =⇒  influential

Note: Sometimes, it will be sufficient to test only the


case with the largest Cook’s Distance Di. Compare
the latter with the 10th or 20th percentile of the
Fp , n−p distribution. If this case is not influential, then
neither are any of the other cases.

An index influence plot, which plots Cook’s Distance


measure, Di, against the case number, i, is often used to
identify the cases in the data set that have large Di.
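A sketch of this screening in SAS (placeholder names), using the COOKD= keyword of the OUTPUT statement and the FINV function for the F percentiles:

proc reg data=mydata;
   model Y = X1 X2 X3;
   output out=diag cookd=cd;         /* Di for each case */
run; quit;

data influential;
   set diag nobs=n;
   p = 4;                            /* number of parameters */
   if cd > finv(0.50, p, n-p);       /* Di above the median of F(p, n-p) */
run;

The index influence plot is then simply a plot of cd against the case number.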


6.4.3.3 Influence on the Regression Coefficients

DFBETAS

      DFBETASk(i) = (β̂k − β̂k(i)) / √[ MSE(i) ckk ] ,
      where: ckk = kth diagonal element of (X′X)⁻¹

• measures the influence of the ith case on the


regression coefficients

Rule of Thumb:

      |DFBETASk(i)| > 1      =⇒  for small data sets
      |DFBETASk(i)| > 2/√n   =⇒  for large data sets

SAS Code

/* To Obtain Influence Statistics */

proc reg data=mydata;


model Y = X1 X2 X3 / influence;
ods output OutputStatistics=mystats;
run; quit;
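As a complement, the DFFITS values can also be captured with the OUTPUT statement and screened directly; a sketch with placeholder names (the case-by-case DFBETAS appear in the INFLUENCE output captured above in mystats):

proc reg data=mydata;
   model Y = X1 X2 X3;
   output out=diag dffits=dft;       /* DFFITS for each case */
run; quit;

data largefits;
   set diag nobs=n;
   p = 4;                            /* number of parameters */
   if abs(dft) > 2*sqrt(p/n);        /* cutoff for large data sets */
run;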


6.4.4 Examples

DF F IT S, Cook’s Distances Di and DF BET AS


for Body Fat Example with Two Predictor Variables

Various Diagnostics for Outlying Cases


for Surgical Unit Example with Three Predictor Variables


6.4.5 Multicollinearity Diagnostics

Some key problems typically arise when the predictors in a


regression model are highly correlated among themselves:
• adding or deleting a predictor changes the regression
coefficients
• the extra sum of squares associated with a predictor
varies, depending upon which other predictors are
already included in the model
• the estimated standard deviations of the regression
coefficients are large
• the estimated regression coefficients are not
statistically significant individually even though a
definite statistical relation exists between the response
and the set of predictors
The presence of serious multicollinearity can also be
indicated by the following informal diagnostics:
• large coefficients of simple pairwise correlation, rXiXj
• nonsignificant results in individual tests on the
regression coefficients of important predictors
• estimated regression coefficients with an algebraic
sign that is opposite of what is expected from
theoretical considerations or prior experience
• wide confidence intervals for the regression
coefficients of important predictors

Variance Inflation Factor

• formal method of detecting the presence of


multicollinearity
• measures how much the variances of the estimated
regression coefficients are inflated as compared to
when the predictors are not linearly related

      (VIF)k = (1 − Rk²)⁻¹ ,   k = 1, . . . , p − 1

      where:
      Rk² = coefficient of multiple determination
            when Xk is regressed on all the
            other (p − 2) predictor variables

      Rk² = 0        =⇒  (VIF)k = 1
      Rk² = 1        =⇒  (VIF)k = ∞
      Rk² ∈ (0, 1)   =⇒  (VIF)k > 1

Rule of Thumb:

      max (VIF)k > 10                       ⇒  severe multicollinearity

      mean VIF = ∑ (VIF)k / (p − 1)  >> 1   ⇒  severe multicollinearity
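In SAS, the VIFs are requested with the VIF option of the MODEL statement; a minimal sketch (placeholder names):

proc reg data=mydata;
   model Y = X1 X2 X3 X4 / vif tol;  /* VIF and tolerance (= 1/VIF) per predictor */
run; quit;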


6.5 Case Study: SENIC Data

Study on the Efficacy of Nosocomial Infection Control

Objective

To determine whether infection surveillance and control


programs have reduced the rates of nosocomial
(hospital-acquired) infection in USA.

Variables
ID : identification number
LSTAY : average length of stay of all patients
AGE : average age of patients
INFRISK : average estimated probability of acquiring
infection in the hospital
RURATIO : ratio of cultures performed to number of
patients without symptoms
RUXRAY : ratio of number of X-rays performed to
number of patients
NBEDS : average number of beds
MEDSCH : 1 = yes, 2 = no
REGION : geographic region, codes 1, 2, 3 and 4
AVECEN : average number of patients per day
during study period
NNURSE : average of number of nurses
FACILI : percent of 35 potential facilities


Data

Random sample of 113 hospitals selected from 338.


Full Data Set

ID LSTAY AGE INFRISK RURATIO RUXRAY NBEDS MEDSCH REGION AVECEN NNURSE FACILI

1 7.13 55.7 4.1 9.0 39.6 279 2 4 207 241 60.0


2 8.82 58.2 1.6 3.8 51.7 80 2 2 51 52 40.0
................................................................................
................................................................................
................................................................................
56 11.15 56.5 3.9 7.7 73.9 281 2 1 217 199 51.4
57 7.14 59.0 3.7 2.6 75.8 70 2 4 37 35 31.4

58 7.65 47.1 4.3 16.4 65.7 318 2 4 265 314 51.4


59 10.73 50.6 3.9 19.3 101.0 445 1 2 374 345 51.4
................................................................................
................................................................................
................................................................................
112 17.94 56.2 5.9 26.4 91.8 835 1 1 791 407 62.9
113 9.41 59.5 3.1 20.6 91.7 29 2 3 20 22 22.9

Analysis

• Length of stay (Y ) is to be predicted.

• Consider all variables, except medical affiliation and


region, as potential predictors.

• Find an appropriate model with response log10(Y )


and no interactions.

• Consider cases 57–113 to build the model.



Guide for the Analysis

1. Obtain the scatter plot matrix. Also obtain the


correlation matrix of the X variables. Is there any
evidence of strong linear pairwise associations among
the predictor variables here?
2. Obtain the three best subsets according to the Cp
criterion. Which of these subset models appears to
have the smallest bias?
3. Use the model containing age, routine chest X-ray
ratio, and average daily census to study the issues in
the succeeding questions.
4. Obtain the studentized deleted residuals and prepare
a dot plot of these residuals. Are any outliers present?
Use the Bonferroni test procedure with α = .01.
State the decision rule and conclusion.
5. Obtain the diagonal elements of the hat matrix.
Using the rule of thumb, identify any outlying X
observations.
6. Cases 62, 75, 102, 104, 106 and 112 are moderately
outlying with respect to their X values, and case 87
is reasonably far outlying with respect to its Y value.
Obtain the DFFITS, DFBETAS, and Cook’s
distance values for these cases to assess their influence.
What do you conclude?


7. Fit the model to the validation data set (obs 1-56).


Compare the estimated regression coefficients and
their estimated standard deviations with those
obtained in the previous steps. Also compare the error
mean squares and coefficients of multiple
determination. Does the model fitted to the
validation data set yield similar estimates as the
model fitted to model-building data set?
8. Calculate the mean squared prediction error and
compare it to the M SE obtained from the
model-building data set. Is there any evidence of a
substantial bias problem in M SE here?
9. Combine the model-building and validation data sets
and fit the selected regression model to the combined
data. Are the estimated regression coefficients and
their standard deviations appreciably different from
those for the model-building data set? Should you
expect any difference in the estimates? Explain.

On the following pages you will find some useful information to
answer these questions.


Figure 6.12: Scatter Plot Matrix for Senic Example.


Correlation Analysis

The CORR Procedure

9 Variables: AGE INFRISK RURATIO RUXRAY NBEDS AVECEN NNURSE FACILI LOGSTAY

Simple Statistics

Variable N Mean Std Dev Sum Minimum Maximum

AGE 57 53.17368 4.68320 3031 38.80000 65.90000


INFRISK 57 4.04211 1.27852 230.40000 1.30000 6.60000
RURATIO 57 13.52632 8.30152 771.00000 1.60000 42.00000
RUXRAY 57 78.99825 17.98636 4503 45.70000 133.50000
NBEDS 57 241.80702 187.88355 13783 29.00000 835.00000
AVECEN 57 184.49123 165.59278 10516 20.00000 791.00000
NNURSE 57 157.47368 127.40124 8976 14.00000 528.00000
FACILI 57 41.81754 15.49668 2384 5.70000 68.60000
LOGSTAY 57 0.96354 0.07752 54.92198 0.82607 1.25382

Pearson Correlation Coefficients, N = 57


Prob > |r| under H0: Rho=0

AGE INFRISK RURATIO RUXRAY NBEDS AVECEN NNURSE FACILI LOGSTAY

AGE 1.00000 0.02518 -0.10113 0.16099 -0.19787 -0.17221 -0.23643 -0.16352 0.17064
0.8525 0.4542 0.2316 0.1401 0.2002 0.0766 0.2242 0.2044

INFRISK 0.02518 1.00000 0.44783 0.33396 0.49007 0.50085 0.53009 0.45334 0.47137
0.8525 0.0005 0.0111 0.0001 <.0001 <.0001 0.0004 0.0002

RURATIO -0.10113 0.44783 1.00000 0.19482 0.16780 0.20362 0.23884 0.23954 0.25483
0.4542 0.0005 0.1464 0.2121 0.1287 0.0736 0.0727 0.0557

RUXRAY 0.16099 0.33396 0.19482 1.00000 0.06682 0.08554 0.06020 0.12833 0.36377
0.2316 0.0111 0.1464 0.6214 0.5269 0.6564 0.3414 0.0054

NBEDS -0.19787 0.49007 0.16780 0.06682 1.00000 0.99000 0.90893 0.76448 0.57431
0.1401 0.0001 0.2121 0.6214 <.0001 <.0001 <.0001 <.0001

AVECEN -0.17221 0.50085 0.20362 0.08554 0.99000 1.00000 0.90389 0.72942 0.60799
0.2002 <.0001 0.1287 0.5269 <.0001 <.0001 <.0001 <.0001

NNURSE -0.23643 0.53009 0.23884 0.06020 0.90893 0.90389 1.00000 0.70706 0.47005
0.0766 <.0001 0.0736 0.6564 <.0001 <.0001 <.0001 0.0002

FACILI -0.16352 0.45334 0.23954 0.12833 0.76448 0.72942 0.70706 1.00000 0.40391
0.2242 0.0004 0.0727 0.3414 <.0001 <.0001 <.0001 0.0018

LOGSTAY 0.17064 0.47137 0.25483 0.36377 0.57431 0.60799 0.47005 0.40391 1.00000
0.2044 0.0002 0.0557 0.0054 <.0001 <.0001 0.0002 0.0018


Stepwise Procedure

The REG Procedure


Model: MODEL1
Dependent Variable: LOGSTAY

Stepwise Selection: Step 1


Variable AVECEN Entered: R-Square = 0.3696 and C(p) = 16.2329

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 0.12440 0.12440 32.25 <.0001


Error 55 0.21214 0.00386
Corrected Total 56 0.33655

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 0.91103 0.01238 20.90126 5418.84 <.0001


AVECEN 0.00028463 0.00005012 0.12440 32.25 <.0001

Bounds on condition number: 1, 1


---------------------------------------------------------------------------

Stepwise Selection: Step 2


Variable RUXRAY Entered: R-Square = 0.4676 and C(p) = 7.4790

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 2 0.15736 0.07868 23.71 <.0001


Error 54 0.17919 0.00332
Corrected Total 56 0.33655

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 0.80642 0.03513 1.74895 527.05 <.0001


RUXRAY 0.00135 0.00042955 0.03295 9.93 0.0027
AVECEN 0.00027205 0.00004666 0.11282 34.00 <.0001

Bounds on condition number: 1.0074, 4.0295


---------------------------------------------------------------------------


Stepwise Selection: Step 3


Variable AGE Entered: R-Square = 0.5192 and C(p) = 3.8112

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 3 0.17472 0.05824 19.07 <.0001


Error 53 0.16182 0.00305
Corrected Total 56 0.33655

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 0.61043 0.08881 0.14424 47.24 <.0001


AGE 0.00388 0.00163 0.01737 5.69 0.0207
RUXRAY 0.00117 0.00041881 0.02402 7.87 0.0070
AVECEN 0.00029261 0.00004558 0.12585 41.22 <.0001

Bounds on condition number: 1.0647, 9.4505


---------------------------------------------------------------------------

All variables left in the model are significant at the 0.1500 level.

No other variable met the 0.1500 significance level for entry into the model.

Summary of Stepwise Selection

Variable Variable Number Partial Model


Step Entered Removed Vars In R-Square R-Square C(p) F Value Pr > F

1 AVECEN 1 0.3696 0.3696 16.2329 32.25 <.0001


2 RUXRAY 2 0.0979 0.4676 7.4790 9.93 0.0027
3 AGE 3 0.0516 0.5192 3.8112 5.69 0.0207


Cp Selection Procedure

The REG Procedure


Model: MODEL1
Dependent Variable: LOGSTAY

C(p) Selection Method

Number of Observations Read 57


Number of Observations Used 57

Number in
Model C(p) R-Square Variables in Model

3 3.8112 0.5192 AGE RUXRAY AVECEN


4 3.8638 0.5369 AGE RUXRAY AVECEN NNURSE
4 4.2696 0.5332 AGE RUXRAY NBEDS AVECEN


Model Fitted to the Model-Building Data Set

The REG Procedure


Model: MODEL1
Dependent Variable: LOGSTAY

Number of Observations Read 57


Number of Observations Used 57

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 3 0.17472 0.05824 19.07 <.0001


Error 53 0.16182 0.00305
Corrected Total 56 0.33655

Root MSE 0.05526 R-Square 0.5192


Dependent Mean 0.96354 Adj R-Sq 0.4919
Coeff Var 5.73471

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 0.61043 0.08881 6.87 <.0001


AGE 1 0.00388 0.00163 2.38 0.0207
RUXRAY 1 0.00117 0.00041881 2.81 0.0070
AVECEN 1 0.00029261 0.00004558 6.42 <.0001


Model Fitted to the Validation Data Set

The REG Procedure


Model: MODEL1
Dependent Variable: LOGSTAY

Number of Observations Read 56


Number of Observations Used 56

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 3 0.09125 0.03042 7.20 0.0004


Error 52 0.21980 0.00423
Corrected Total 55 0.31105

Root MSE 0.06501 R-Square 0.2934


Dependent Mean 0.99113 Adj R-Sq 0.2526
Coeff Var 6.55961

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 0.61887 0.12477 4.96 <.0001


AGE 1 0.00399 0.00211 1.89 0.0638
RUXRAY 1 0.00152 0.00043724 3.48 0.0010
AVECEN 1 0.00015680 0.00006216 2.52 0.0148


Model Fitted to the Combined Data Set

The REG Procedure


Model: MODEL1
Dependent Variable: LOGSTAY

Number of Observations Read 113


Number of Observations Used 113

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 3 0.26759 0.08920 24.21 <.0001


Error 109 0.40151 0.00368
Corrected Total 112 0.66909

Root MSE 0.06069 R-Square 0.3999


Dependent Mean 0.97721 Adj R-Sq 0.3834
Coeff Var 6.21073

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 0.62718 0.07383 8.50 <.0001


AGE 1 0.00353 0.00129 2.74 0.0072
RUXRAY 1 0.00143 0.00029679 4.83 <.0001
AVECEN 1 0.00023645 0.00003743 6.32 <.0001


Diagnostics

                 Hat Diag            |-------------- Dfbetas --------------|
CASE   Rstudent      H      DFFITS   INTERCEP      AGE      RUXR     AVECEN      CookD

57 -1.61670 0.05527 -0.39102 0.19064 -0.25427 0.06783 0.14369 0.03710


58 -1.20093 0.05470 -0.28890 -0.22183 0.18308 0.09451 -0.05475 0.02069
59 -0.07913 0.06950 -0.02162 -0.00107 0.00624 -0.01330 -0.01005 0.00012
60 1.27433 0.04456 0.27520 -0.14471 0.10354 0.16313 -0.02834 0.01871
61 1.78937 0.06918 0.48783 -0.12868 0.27988 -0.30737 -0.09714 0.05712
62 0.28445 0.14214 0.11579 0.01003 0.00729 -0.06092 0.09412 0.00341
63 -1.72618 0.12879 -0.66370 0.52919 -0.50589 -0.18656 0.13603 0.10616
64 0.69675 0.05565 0.16915 0.02567 -0.05470 0.11711 -0.08658 0.00722
65 -1.82620 0.08388 -0.55259 -0.36315 0.35045 0.06767 -0.25593 0.07312
66 -0.55429 0.07307 -0.15563 -0.03342 0.00973 0.07596 -0.11230 0.00614
67 0.61142 0.04429 0.13162 0.05621 -0.07890 0.07809 -0.01717 0.00438
68 -1.40640 0.03663 -0.27422 0.10767 -0.05838 -0.16102 -0.06825 0.01846
69 -0.95946 0.05103 -0.22250 0.02194 -0.00033 -0.04447 -0.16695 0.01240
70 -0.68806 0.03177 -0.12463 0.00392 -0.00022 -0.04824 0.07005 0.00392
71 -1.01376 0.04788 -0.22734 -0.08857 0.10349 -0.10056 0.14674 0.01291
72 -0.80976 0.05497 -0.19529 -0.08386 0.02046 0.12691 0.07985 0.00960
73 0.38530 0.03075 0.06863 0.02520 -0.00877 -0.03812 0.01919 0.00120
74 0.82324 0.02574 0.13381 0.03228 -0.04371 0.06215 -0.03957 0.00450
75 0.48949 0.21262 0.25436 0.22177 -0.24151 0.06909 -0.06596 0.01641
76 -1.56149 0.05461 -0.37528 -0.25516 0.25363 -0.08125 0.22029 0.03428
77 0.61442 0.05499 0.14821 0.07773 -0.08596 0.05839 -0.09353 0.00556
78 -1.42471 0.13302 -0.55807 0.10403 -0.10146 0.07331 -0.51981 0.07638
79 -0.30938 0.02399 -0.04850 0.01682 -0.02420 0.00897 -0.00015 0.00060
80 0.62247 0.05418 0.14898 -0.10516 0.10536 0.03075 -0.02043 0.00561
81 1.77584 0.10585 0.61101 0.48981 -0.39940 -0.25444 0.15321 0.08969
82 -0.95934 0.04714 -0.21337 -0.09923 0.13101 -0.12241 0.06853 0.01140
83 -0.61888 0.03806 -0.12311 -0.05061 0.01205 0.07745 0.03309 0.00383
84 0.77179 0.08216 0.23092 0.03181 0.05394 -0.19331 -0.03740 0.01343
85 -0.14766 0.06639 -0.03938 0.00539 -0.01841 0.02678 0.00941 0.00039
86 0.30234 0.02444 0.04786 0.02287 -0.02087 0.00674 -0.01810 0.00058
87 -2.14501 0.03542 -0.41105 0.02528 -0.03054 0.02192 -0.29085 0.03955
88 0.46082 0.02421 0.07258 -0.02268 0.01650 0.02766 0.01469 0.00134
89 0.13331 0.02789 0.02258 -0.00265 0.00035 0.01198 -0.00710 0.00013
90 1.12156 0.04275 0.23701 0.07002 -0.05308 -0.05458 0.15337 0.01398
91 0.29537 0.03926 0.05971 0.02203 -0.02572 0.02585 -0.03511 0.00091
92 0.37811 0.03463 0.07161 -0.01250 0.02788 -0.02005 -0.03157 0.00130
93 0.46896 0.03133 0.08434 0.00881 -0.00088 0.00660 -0.05507 0.00180
94 -0.40342 0.02956 -0.07041 0.00495 -0.01210 -0.00386 0.04009 0.00126
95 0.79702 0.03954 0.16172 0.07213 -0.09082 0.08352 -0.06240 0.00658
96 -0.18689 0.03313 -0.03460 0.00931 -0.01169 -0.00444 0.01748 0.00030
97 -0.20892 0.02253 -0.03171 -0.00757 -0.00018 0.01466 -0.00020 0.00026
98 1.51279 0.03694 0.29626 -0.02234 -0.04178 0.20442 0.03919 0.02142
99 -0.23943 0.05080 -0.05539 -0.02297 0.00407 0.04177 0.00810 0.00078
100 -0.13464 0.09272 -0.04304 -0.00307 -0.00440 0.02390 -0.03266 0.00047


101 1.19483 0.03066 0.21250 0.04289 -0.02690 0.02600 -0.13863 0.01120


102 0.79004 0.14862 0.33008 0.16650 -0.24133 0.23232 -0.09119 0.02743
103 -1.91836 0.05154 -0.44718 0.19872 -0.17595 -0.18446 0.20987 0.04759
104 0.67181 0.28817 0.42745 -0.34801 0.25481 0.26606 0.09923 0.04616
105 0.45559 0.04346 0.09711 0.02306 0.00703 -0.07275 0.02417 0.00239
106 1.75650 0.15665 0.75703 -0.43652 0.62564 -0.40003 -0.03148 0.13785
107 -0.70227 0.08197 -0.20985 -0.08549 0.00949 0.17263 0.04159 0.01112
108 0.24131 0.09027 0.07601 0.00874 0.01922 -0.06134 -0.01949 0.00147
109 -0.26309 0.13059 -0.10197 0.03096 -0.00198 -0.07317 -0.05124 0.00265
110 1.35824 0.04382 0.29075 0.20912 -0.16840 -0.04134 -0.15350 0.02080
111 -0.91106 0.04168 -0.19000 0.05168 -0.10168 0.08767 0.04794 0.00905
112 1.88869 0.28763 1.20011 -0.46431 0.37224 0.05100 1.13188 0.34343
113 0.34762 0.06698 0.09314 -0.05308 0.05159 0.02833 -0.04019 0.00221

Sum of Residuals 0
Sum of Squared Residuals 0.16182
Predicted Residual SS (PRESS) 0.19371

The SAS System

* ** * ** * * *** * * ** * ***** * ****** ***** * ** * * ** *

-|---------|---------|---------|---------|---------|---------|---------|---------|---------|
-2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0

Studentized Residual without Current Obs

NOTE: 16 obs hidden.

MSPR for the Validation Data Set

      MSPR = ∑(i=1..n*) (Yi − Ŷi)² / n* = 0.2582713 / 56 = 0.004612
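One way to compute MSPR in SAS (a sketch; the data set names build and valid are placeholders for the model-building and validation halves) is to give the validation cases a missing response, so that they are predicted but do not influence the fit:

data both;
   set build
       valid(in=inval);
   isval = inval;
   Yfit  = LOGSTAY;
   if isval then Yfit = .;           /* validation cases: predicted, not fitted */
run;

proc reg data=both;
   model Yfit = AGE RUXRAY AVECEN;
   output out=pred p=yhat;
run; quit;

data sqerr;
   set pred;
   if isval;
   se = (LOGSTAY - yhat)**2;         /* squared prediction errors */
run;

proc means data=sqerr mean;          /* the mean of se is MSPR */
   var se;
run;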


Regression Results Based on Model-Building and Validation Data Sets

Statistics Model-Building Validation

β̂ 0 0.61043 0.61887
s(β̂ 0) 0.08881 0.12477
β̂ AGE 0.00388 0.00399
s(β̂ AGE) 0.00163 0.00211
β̂ RUXRAY 0.00117 0.00152
s(β̂ RUXRAY) 0.00041881 0.00043724
β̂ AVECEN 0.00029261 0.00015680
s(β̂ AVECEN) 0.00004558 0.00006216

SSE 0.16182 0.21980


P RESS 0.19371 ———–
M SE 0.00305 0.00423
R2 0.5192 0.2934
M SP R ———– 0.00461


6.6 Remedial Measures

When diagnostics indicate that a regression model is not appropriate in some aspect(s), remedial measures need to be taken.

Sometimes, a suitable transformation can (see the sketch after this list):


• linearize the regression relation
• make the error distributions more nearly normal
• make the variances of the errors more nearly equal
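As a minimal sketch (data set and variable names hypothetical), such transformations are applied in a DATA step before refitting the model:

   data mydata_t;
      set mydata;
      log_y  = log(y);     /* may linearize the relation and stabilize the variance */
      sqrt_y = sqrt(y);    /* milder variance-stabilizing transformation            */
      inv_x  = 1 / x;      /* reciprocal transformation of a predictor              */
   run;

   proc reg data=mydata_t;
      model log_y = x;     /* refit using the transformed response                  */
   run;
   quit;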

Other remedial measures include:


1. Weighted Least Squares (WLS) Regression
• for unequal error variances
2. Ridge Regression
• for multicollinearity
3. Robust Regression
• for influential cases
4. Regression with Autocorrelated Errors
• for non-independent error terms
5. Nonlinear Regression
• for more complex response surfaces

6.6.1 Weighted Least Squares

• an alternative to least squares regression when the error variance is not constant, i.e. under heteroscedasticity
• assumes a more general form of the General Linear Regression Model (5.1)
     GLRM (using LSE) ⇒ εi ∼ N(0, σ²)
     WLS Regression   ⇒ εi ∼ N(0, σi²)
• different observations do not have the same variance,
and thus, have different reliabilities
⇒ observations with smaller variances provide more
reliable information about the regression function
than those with larger variances
• each observation is given a weight that is inversely proportional to its variance, i.e. wi = 1/σi²
⇒ wi reflects the amount of information contained in
the observation Yi
⇒ observations with larger variances receive less
weight than those with smaller variances in fitting
the regression function

(Read Section 11.1 of Kutner et al. for more details.)
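In SAS, WLS estimates can be requested through the WEIGHT statement of PROC REG. The sketch below is illustrative only: the data set name is hypothetical and it assumes that an estimate of each error variance (sigma2_hat) is already available.

   data mydata_w;
      set mydata;
      w = 1 / sigma2_hat;             /* w_i = 1 / sigma_i^2                           */
   run;

   proc reg data=mydata_w;
      weight w;                       /* larger variance => smaller weight in the fit  */
      model y = age ruxray avecen;
   run;
   quit;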



6.6.2 Ridge Regression

• addresses multicollinearity problems by modifying the method of LS to allow biased estimators of the regression coefficients
⇒ an estimator with only a small bias, but which is substantially more precise than an unbiased estimator, might be preferred over the latter
Figure 6.13: Biased Estimator with Small Variance May Be
Preferable to Unbiased Estimator with Large Variance.

• ridge standardized regression estimators are obtained by introducing a biasing constant into the LS normal equations for the correlation-transformed data
• ridge regression coefficients are biased but tend to be more stable (i.e. less variable) than ordinary LSEs

(Read Section 11.2 of Kutner et al. for more details.)
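In SAS, ridge traces can be obtained from PROC REG via the RIDGE= option, which refits the model over a grid of biasing constants; a minimal sketch (data set name hypothetical):

   proc reg data=mydata outest=ridge_est outvif
            ridge=0 to 0.10 by 0.01;      /* grid of biasing constants c            */
      model y = age ruxray avecen;
   run;
   quit;

   proc print data=ridge_est;             /* inspect how the coefficients (and VIFs)
                                             stabilize as the biasing constant grows */
   run;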



6.6.3 Robust Regression

• procedures that dampen the influence of outlying cases, as compared to ordinary LSE, to provide a better fit for the majority of the cases

• useful when a known, smooth regression function is to be fitted to data that are “noisy,” with a number of outlying cases, so that the assumption of a normal distribution for the error terms is not appropriate

Robust Regression Methods:

1. Least Absolute Residuals (LAR) or Least Absolute Deviations (LAD) Regression
• based on minimizing the sum of the absolute
deviations of the Y observations from their means
• places less emphasis on outlying observations as
compared to LSE, since absolute deviations rather
than squared ones are used
2. Iteratively Reweighted Least Squares (IRLS) Robust
Regression
• uses WLS estimation, but with weights based on
how far outlying a case is, rather than on the error
variances


3. Least Median of Squares (LMS) Regression


• based on minimizing the median of the squared
deviations of the Y observations from their means
4. Other Approaches
• trimming one or several of the extreme squared
deviations before applying the LS criterion
• rank-based regression methods

(Read Section 11.3 of Kutner et al. for more details.)
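In SAS, an IRLS-type robust fit is available through M-estimation in PROC ROBUSTREG, and LAR/LAD regression can be obtained as median (0.5-quantile) regression in PROC QUANTREG; a minimal sketch (data set name hypothetical):

   /* IRLS-type robust regression via M-estimation */
   proc robustreg data=mydata method=m;
      model y = age ruxray avecen;
   run;

   /* LAR / LAD regression as median regression */
   proc quantreg data=mydata;
      model y = age ruxray avecen / quantile=0.5;
   run;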

6.6.4 Regression with Autocorrelated Errors

• useful for regression problems that involve time series data, typically encountered in business or economic regression applications

• in such applications, the error terms are frequently correlated positively over time, i.e. are said to be autocorrelated or serially correlated

• specialized regression models that account for autocorrelation should be used for these situations

(Read Chapter 12 of Kutner et al. for more details.)
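In SAS/ETS, such models can be fitted with PROC AUTOREG, which combines the regression function with an autoregressive error structure; a minimal sketch (data set and predictors hypothetical):

   proc autoreg data=mydata;
      /* regression with AR(1) errors, estimated by maximum likelihood;
         DWPROB requests the Durbin-Watson test for autocorrelation      */
      model y = x1 x2 / nlag=1 method=ml dwprob;
   run;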


6.6.5 Nonlinear Regression

• for some applications, an empirically indicated or a theoretically justified nonlinear regression model is more appropriate than a linear one

• such models can be fitted using nonlinear regression techniques

Examples:
Growth from birth to maturity in human subjects
typically is nonlinear in nature, characterized by
rapid growth shortly after birth, pronounced
growth during puberty, and a leveling off sometime
before adulthood.
Dose-response relationships tend to be nonlinear
with little or no change in response for low dose
levels of a drug, followed by rapid S-shaped
changes occurring in the more active dose region,
and finally with dose response leveling off as it
reaches a saturated level.

(Read Chapter 13 of Kutner et al. for more details.)


Figure 6.14: Plot of Exponential Response Function.
\[
Y_i = \gamma_0 + \gamma_1 \exp(\gamma_2 X_i) + \varepsilon_i,
\qquad
E(Y_i) = \gamma_0 + \gamma_1 \exp(\gamma_2 X_i)
\]
With \(\gamma_0 = 100\), \(\gamma_1 = -50\) and \(\gamma_2 = -2\):
\[
E(Y) = 100 + (-50)\exp(-2X)
\]

Figure 6.15: Plot of Logistic Response Function.
\[
Y_i = \frac{\gamma_0}{1 + \gamma_1 \exp(\gamma_2 X_i)} + \varepsilon_i,
\qquad
E(Y_i) = \frac{\gamma_0}{1 + \gamma_1 \exp(\gamma_2 X_i)}
\]
With \(\gamma_0 = 10\), \(\gamma_1 = 20\) and \(\gamma_2 = -2\):
\[
E(Y) = \frac{10}{1 + 20\exp(-2X)}
\]
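Such response functions can be fitted in SAS with PROC NLIN, which requires starting values for the parameters; the sketch below fits the exponential response function above, using the plotted parameter values as starting values (data set and variable names hypothetical).

   proc nlin data=mydata;
      parms g0=100 g1=-50 g2=-2;        /* starting values for gamma0, gamma1, gamma2 */
      model y = g0 + g1*exp(g2*x);      /* E(Y) = gamma0 + gamma1*exp(gamma2*X)       */
   run;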

Appendix A

P-Value

In a statistical test of hypotheses, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming the null hypothesis is true.

If the test is two-sided, we can define the p-value on the two tails of the distribution:

2-Tailed P-Value:
PHo (|T | ≥ |to|) = PHo (T ≥ |to|) + PHo (T ≤ −|to|)
= 2PHo (T ≥ |to|), for symmetric distn’s.
For one-sided tests, the p-value is defined on just one tail:

1-Tailed P-Value:
PHo (T ≥ |to|) or PHo (T ≤ −|to|).

The choice of which p-value to use is determined by the form of the alternative hypothesis.

Traditionally, one rejects the null hypothesis if the p-value is smaller than or equal to some pre-specified significance level, α, usually set at 5% or 1%.

When the null hypothesis is rejected, the result is said to be statistically significant.

The significance level, α, is the chance of rejecting the null hypothesis when it is true, i.e. the probability of making a Type I error.
• a significance level of 0.05 would deem as extraordinary any results that are only 5% likely (or less), given that the null hypothesis is true
• in this case, the null hypothesis would be rejected at the 5% (significance) level whenever the p-value is less than 0.05

Comparing the p-value obtained from a given data set to a significance level will yield either of two results:
1. the null hypothesis is rejected
   • this does not imply that the alternative hypothesis is true
2. the null hypothesis cannot be rejected at that significance level
   • this does not imply that the null hypothesis is true


There are several common misunderstandings about p-values.

1. The p-value is not the probability that the null hypothesis is true.

2. The p-value is not the probability of falsely rejecting the null hypothesis. (This error is a version of the so-called prosecutor’s fallacy.)

3. The p-value is not the probability that a replicating experiment would not yield the same conclusion.

4. The quantity (1 − p-value) is not the probability of the alternative hypothesis being true.

5. The significance level of the test is not determined by the p-value.

6. The p-value does not indicate the size or importance of the observed effect (compare with effect size). The two do vary together however – the larger the effect, the smaller the p-value will be, other things being equal.


Example 1:
to = 2.5 df = 8 α = 0.05
t1−α/2 ; df = t1−0.025 ; 8 = 2.306
t1−α ; df = t1−0.05 ; 8 = 1.860

2-tailed p-value:
P (|T | ≥ 2.5) = 2P (T ≥ 2.5) = 2×0.0185 = 0.0370
1-tailed p-value:
P (T ≥ 2.5) = 0.0185

2-sided test: Ho : θ = 0 vs. Ha : θ ≠ 0


2.5 > 2.306 =⇒ Reject Ho : θ = 0
p = 0.0370 < 0.05 =⇒ Reject Ho : θ = 0

1-sided test: Ho : θ = 0 vs. Ha : θ > 0 (or < 0)


2.5 > 1.860 =⇒ Reject Ho : θ = 0
p = 0.0185 < 0.05 =⇒ Reject Ho : θ = 0


Example 2:
to = 2.0 df = 8 α = 0.05
t1−α/2 ; df = t1−0.025 ; 8 = 2.306
t1−α ; df = t1−0.05 ; 8 = 1.860

2-tailed p-value:
P (|T | ≥ 2.0) = 2P (T ≥ 2.0) = 2×0.0403 = 0.0806
1-tailed p-value:
P (T ≥ 2.0) = 0.0403

2-sided test: Ho : θ = 0 vs. Ha : θ ≠ 0

2.0 ≯ 2.306 =⇒ Do not reject Ho : θ = 0
p = 0.0806 > 0.05 =⇒ Do not reject Ho : θ = 0

1-sided test: Ho : θ = 0 vs. Ha : θ > 0 (or < 0)


2.0 > 1.860 =⇒ Reject Ho : θ = 0
p = 0.0403 < 0.05 =⇒ Reject Ho : θ = 0
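The tail probabilities used in Examples 1 and 2 can be reproduced in SAS with the PROBT function (the CDF of the t distribution); a minimal sketch:

   data _null_;
      do t0 = 2.5, 2.0;                    /* observed test statistics, df = 8   */
         p1 = 1 - probt(t0, 8);            /* 1-tailed p-value:  P(T >= t0)      */
         p2 = 2 * (1 - probt(t0, 8));      /* 2-tailed p-value:  2 P(T >= |t0|)  */
         put t0= 4.1 p1= 6.4 p2= 6.4;
      end;
   run;
   /* should reproduce 0.0185 / 0.0370 for t0 = 2.5 and 0.0403 / 0.0806 for t0 = 2.0 */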

Appendix B

Matrix Expressions

Simple Linear Regression:

\[
X = \begin{pmatrix}
1 & X_1 \\
1 & X_2 \\
\vdots & \vdots \\
1 & X_i \\
\vdots & \vdots \\
1 & X_n
\end{pmatrix}
\qquad
Y = \begin{pmatrix}
Y_1 \\ Y_2 \\ \vdots \\ Y_i \\ \vdots \\ Y_n
\end{pmatrix}
\]

\[
X'X = \begin{pmatrix}
n & \sum X_i \\
\sum X_i & \sum X_i^2
\end{pmatrix}
\qquad
X'Y = \begin{pmatrix}
\sum Y_i \\ \sum X_i Y_i
\end{pmatrix}
\]

\[
Y'JY = \Bigl(\sum Y_i\Bigr)^2
\qquad
Y'Y = \sum Y_i^2
\]

\[
(X'X)^{-1} = \begin{pmatrix}
\dfrac{\sum X_i^2}{n\sum (X_i-\bar{X})^2} & \dfrac{-\sum X_i}{n\sum (X_i-\bar{X})^2} \\[2ex]
\dfrac{-\sum X_i}{n\sum (X_i-\bar{X})^2} & \dfrac{n}{n\sum (X_i-\bar{X})^2}
\end{pmatrix}
\]

Multiple Linear Regression:

\[
X = \begin{pmatrix}
1 & X_{11} & X_{12} & \cdots & X_{1,p-1} \\
1 & X_{21} & X_{22} & \cdots & X_{2,p-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & X_{i1} & X_{i2} & \cdots & X_{i,p-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & X_{n1} & X_{n2} & \cdots & X_{n,p-1}
\end{pmatrix}
\qquad
Y = \begin{pmatrix}
Y_1 \\ Y_2 \\ \vdots \\ Y_i \\ \vdots \\ Y_n
\end{pmatrix}
\]

\[
X'X = \begin{pmatrix}
n & \sum X_{i1} & \sum X_{i2} & \sum X_{i3} & \cdots & \sum X_{i,p-1} \\
\sum X_{i1} & \sum X_{i1}^2 & \sum X_{i1}X_{i2} & \sum X_{i1}X_{i3} & \cdots & \sum X_{i1}X_{i,p-1} \\
\sum X_{i2} & \sum X_{i2}X_{i1} & \sum X_{i2}^2 & \sum X_{i2}X_{i3} & \cdots & \sum X_{i2}X_{i,p-1} \\
\sum X_{i3} & \sum X_{i3}X_{i1} & \sum X_{i3}X_{i2} & \sum X_{i3}^2 & \cdots & \sum X_{i3}X_{i,p-1} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
\sum X_{i,p-1} & \sum X_{i,p-1}X_{i1} & \sum X_{i,p-1}X_{i2} & \sum X_{i,p-1}X_{i3} & \cdots & \sum X_{i,p-1}^2
\end{pmatrix}
\]

\[
X'Y = \begin{pmatrix}
\sum Y_i \\ \sum Y_i X_{i1} \\ \sum Y_i X_{i2} \\ \vdots \\ \sum Y_i X_{i,p-1}
\end{pmatrix}
\]

For either SLR or MLR,

\[
SSTO = Y'Y - \left(\frac{1}{n}\right) Y'JY
\qquad
SSE = Y'Y - \hat{\beta}' X'Y
\qquad
SSR = \hat{\beta}' X'Y - \left(\frac{1}{n}\right) Y'JY
\]
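These expressions can be evaluated directly in SAS PROC IML; the sketch below (data set and variable names hypothetical) builds X and Y and computes the sums of squares exactly as defined above.

   proc iml;
      use mydata;
      read all var {x1 x2} into Xp;        /* predictor columns                  */
      read all var {y}     into Y;         /* response vector                    */
      close mydata;

      n = nrow(Y);
      X = j(n, 1, 1) || Xp;                /* prepend the column of ones         */
      J = j(n, n, 1);                      /* n x n matrix of ones               */

      b    = inv(X`*X) * X`*Y;             /* least squares estimates            */
      SSTO = Y`*Y - (1/n) * Y`*J*Y;
      SSE  = Y`*Y - b`*X`*Y;
      SSR  = b`*X`*Y - (1/n) * Y`*J*Y;

      print b SSTO SSE SSR;
   quit;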

Bibliography

Berry, W.D. and Feldman, S. (1985). Multiple Regression in Practice. Sage University Paper Series on Quantitative Applications in the Social Sciences, 07-050. Newbury Park, CA: Sage.

Chambers, J.M., Cleveland, W.S., Kleiner, B. and Tukey, P.A. (1983). Graphical Methods for Data Analysis. Belmont: Wadsworth.

Christensen, R. (1998). Analysis of Variance, Design and Regression: Applied Statistical Methods. Boca Raton, FL: Chapman and Hall.

Draper, N. and Smith, H. (1998). Applied Regression Analysis (3rd ed). NY: John Wiley.

Fox, J. (1997). Applied Regression Analysis, Linear Models, and Related Methods. Thousand Oaks, CA: Sage.

Kutner, M., Nachtsheim, C., Neter, J. and Li, W. (2005). Applied Linear Statistical Models (5th ed). NY: McGraw-Hill Irwin.

Montgomery, D. and Peck, E.A. (1992). Introduction to Linear Regression Analysis (2nd ed). NY: John Wiley.

Mosteller, F. and Tukey, J.W. (1977). Data Analysis and Regression: A Second Course in Statistics. Reading, MA: Addison-Wesley.

SAS Institute Inc. (2008). SAS 9.2 Help and Documentation. Cary, NC: SAS Institute Inc.

Sokal, R.R. and Rohlf, F.J. (1993). Biometry: The Principles and Practice of Statistics in Biological Research (3rd ed). NY: Freeman.

Weisberg, S. (1980). Applied Linear Regression. NY: John Wiley.

