03 - Simple Linear Regression
Estimating the Model Coefficients
Given a set of x and y data, there are a number of ways which could be used to formulate the line that best characterizes those data points.
When we conduct regression analysis, we select the line that minimizes the sum of squared vertical differences between the points and the line.
We often refer to this technique as “ordinary least squares” (OLS) regression.

Compare two lines, the first upward sloping, the second horizontal
Data points: (1,2), (2,4), (3,1.5), (4,3.2)
Line 1 (upward sloping)
Sum of squared differences = (2 - 1)^2 + (4 - 2)^2 + (1.5 - 3)^2 + (3.2 - 4)^2 = ____
Line 2 (horizontal)
Sum of squared differences = (2 - 2.5)^2 + (4 - 2.5)^2 + (1.5 - 2.5)^2 + (3.2 - 2.5)^2 = ____
[Figure: scatter plot of the four points with Line 1 and the horizontal Line 2 at y = 2.5; x axis from 1 to 4.]
The smaller the sum of squared differences, the better the fit of the line to the data.
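The two sums of squared differences can be checked numerically. The sketch below is an illustration rather than part of the original slides: the point coordinates and the line heights (1, 2, 3, 4 for Line 1 and 2.5 for Line 2) are read off the arithmetic shown on the slide, and the helper name `sum_sq_diff` is made up for this example.

```python
# Observed points from the slide: (x, y)
points = [(1, 2.0), (2, 4.0), (3, 1.5), (4, 3.2)]

def sum_sq_diff(points, line):
    """Sum of squared vertical differences between the points and y = line(x)."""
    return sum((y - line(x)) ** 2 for x, y in points)

def line1(x):
    # Upward sloping line: heights 1, 2, 3, 4 at x = 1..4 (assumed from the slide)
    return float(x)

def line2(x):
    # Horizontal line at y = 2.5
    return 2.5

ss1 = sum_sq_diff(points, line1)
ss2 = sum_sq_diff(points, line2)
print(f"Line 1: {ss1:.2f}")
print(f"Line 2: {ss2:.2f}")
```

Under these assumptions the horizontal line has the smaller sum, so it fits these four points better than Line 1.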
Example – Dog Show
Dog    # of biscuits (xi)   Show score (yi)   xi - x̄   yi - ȳ   (xi - x̄)*(yi - ȳ)   (xi - x̄)^2
1      2                    4
2      6                    8
3      4                    7
4      8                    9
Total
Avg
cov(x, y) = ____     sx^2 = ____

[Figure: scatter plot of show score (y axis, ticks 2 to 8) against number of biscuits (x axis, ticks 2 to 8) with the fitted line.]
Independent variable: x
Dependent variable: y
This is the ____ of the line. For each additional biscuit given to Chip, the estimated score Chip receives increases by __.
This is the intercept. Don’t interpret as the estimated score Chip receives if he is given ____ biscuits.
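The worksheet columns for the dog show data can be filled in with a few lines of code. This is a minimal sketch, assuming the standard least-squares formulas (slope = cov(x, y) / sx^2, intercept = ȳ - slope * x̄); the variable names are chosen for this illustration only.

```python
# Dog show data from the table: biscuits (x) and show score (y)
x = [2, 6, 4, 8]
y = [4, 8, 7, 9]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# "Total" row of the last two worksheet columns
sum_xy_dev = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sum_xx_dev = sum((xi - x_bar) ** 2 for xi in x)

cov_xy = sum_xy_dev / (n - 1)   # sample covariance of x and y
var_x = sum_xx_dev / (n - 1)    # sample variance of x (sx^2)

b1 = cov_xy / var_x             # slope
b0 = y_bar - b1 * x_bar         # intercept

print(f"x_bar = {x_bar}, y_bar = {y_bar}")
print(f"cov(x, y) = {cov_xy:.3f}, sx^2 = {var_x:.3f}")
print(f"b1 = {b1:.3f}, b0 = {b0:.3f}")
```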
Example – Car Sale Price
[Figure: fitted line of price against odometer reading. The line meets the vertical axis at 6533; there is no data near an odometer reading of 0.]
The intercept is b0 = 6533.
Do not interpret the intercept as the “Price of cars that have not been driven”.
This is the slope of the line. For each additional mile on the odometer, the price decreases by an average of $0.0312.

Example – Armani’s Pizza
Armani’s Pizza is considering locating at the OSU campus. To do their financial analysis, they first need to estimate sales for their product.
They have data from their existing 10 locations on other college campuses.
Model Assessment
Once our model is estimated, the next step is to assess how successful we have been in accomplishing our objective.
Recall that our principal objective in regression analysis is to understand the behavior of the dependent variable.
There are three means by which we will make this assessment:
1. R^2 (Coefficient of Determination)
2. F-Test for Overall Validity of the Model
   ― this test becomes much more important when using multiple regression.
3. T-test for the Slope
   ― using b1 (estimate of the slope)

1. R^2 (Coefficient of Determination) to Evaluate Model
When we want to measure the strength of the linear relationship, we use the coefficient of determination.
To understand the significance of this coefficient note:
SST = SSR + SSE
• Sum of Squares Total (SST): overall variability in y
• Sum of Squares Regression (SSR): the regression model
• Sum of Squares for Error (SSE): the error

[Figure: two data points (x1,y1) and (x2,y2) of a certain sample are shown, with y1 and y2 marked on the vertical axis.]
Variation in y (SST) = SSR + SSE
R^2 measures the proportion of the variation in y that is explained by the variation in x.
Example – Car Sale Price
Find the Coefficient of Determination for the Car Sale Price Example. What does it tell you about the model?
Solution
• Using the computer
―From the regression output we have

Example – Armani Pizza
Find the Coefficient of Determination for the Armani’s Pizza example; what does it tell you about the model? You are given values of SSR = 81702.499, and SSE = 6331.901.
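Because SST = SSR + SSE and R^2 is the proportion of the variation in y that is explained, the Armani's Pizza figures can be turned into R^2 directly. The sketch below assumes the usual definition R^2 = SSR / SST; the variable names are illustrative.

```python
# Values given for the Armani's Pizza example
ssr = 81702.499   # sum of squares regression
sse = 6331.901    # sum of squares for error

sst = ssr + sse           # total variation in y
r_squared = ssr / sst     # proportion of the variation in y explained by the model

print(f"SST = {sst:.3f}")
print(f"R^2 = {r_squared:.4f}")
```

An R^2 close to 1 indicates that most of the variation in sales across the existing stores is accounted for by the model.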
Example – Armani Pizza
Solving with Excel

2. The F-Test for Overall Validity of the Model
We will rely much more heavily on this test when we introduce multiple regression. This is because when doing simple regression this F-test exactly duplicates the results of the t-test for slope we will do next. Because it approaches the question differently, however, it remains useful.
The multiple regression version of this test asks:
• Is there at least one independent variable linearly related to the dependent variable?
To answer the question, we test the hypothesis:
H0: β1 = β2 = … = βk = 0
H1: At least one βi is not equal to zero.
If at least one βi is not equal to zero, the model is valid.
In simple regression, the test simplifies to:
H0: β1 = 0
H1: β1 ≠ 0.
If you reject H0 in favor of H1, the model is valid.
F Distribution
Non-negative, positively skewed distribution.
The center of this distribution is close to 1. The reciprocal of a value below 1 on the left side of the distribution gives the equivalent value above 1 on the right side of the distribution (with the numerator and denominator degrees of freedom interchanged), and vice versa.
Degrees of freedom are k for the numerator and n-k-1 for the denominator.
[Figure: F density starting at 0, with the upper-tail area α to the right of the critical value Fα,k,n-k-1.]

Hypothesis Testing
To test these hypotheses we perform an analysis of variance procedure.
The F test
• Construct the F statistic
  F = MSR / MSE, where MSR = SSR/k and MSE = SSE/(n-k-1)
• k is the number of independent variables used in the model
• Rejection region
  F > Fα,k,n-k-1
SST = SSR + SSE. Large F results from a large SSR. Then, much of the variation in y is explained by the regression model. The null hypothesis should be rejected; thus, the model is valid.
There is an important caveat to this test. It is always an upper-tail test (despite what H1 would suggest).
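As an illustration of the F statistic, the Armani's Pizza sums of squares can be plugged into F = MSR / MSE. This is a sketch under stated assumptions: n = 10 is taken from the 10 existing locations, k = 1 because there is a single independent variable, α = 0.05 is assumed, and the critical value 5.32 is the tabulated F value for 1 and 8 degrees of freedom at that α.

```python
# Armani's Pizza: sums of squares from the example, n = 10 existing locations
ssr, sse = 81702.499, 6331.901
n, k = 10, 1                    # k = 1 independent variable (simple regression)

msr = ssr / k                   # mean square for regression
mse = sse / (n - k - 1)         # mean square for error
f_stat = msr / mse

# Assumption: alpha = 0.05; 5.32 is the F-table value for (1, 8) degrees of freedom
f_crit = 5.32
reject_h0 = f_stat > f_crit     # upper-tail rejection region

print(f"MSR = {msr:.3f}, MSE = {mse:.3f}")
print(f"F = {f_stat:.2f}, reject H0: {reject_h0}")
```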
3. T-Test for the slope
When no linear relationship exists between two variables, the regression line should be horizontal.
[Figure: scatter plot with no linear relationship and a horizontal regression line.]

Hypothesis Testing
We can test if the slope is non-zero in simple regression.
• We can draw inference about β1 from b1 by testing
―H0: β1 = 0
―H1: β1 ≠ 0 (or < 0, or > 0)
• The test statistic is
  t = (b1 - β1) / s_b1, with n - 2 degrees of freedom, where s_b1 is the standard error of b1.

• H0: β1 = 0
• H1: β1 ≠ 0
Using the computer
There is _________ evidence to infer that the odometer reading affects the auction selling price.
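The t-test for the slope can be worked by hand on a small data set. The sketch below reuses the four dog show observations from earlier in these notes and assumes the standard textbook formulas for the standard error of estimate and the standard error of the slope; it also computes the F statistic so the two tests can be compared. Variable names are illustrative.

```python
import math

# Dog show data: biscuits (x) and show score (y)
x = [2, 6, 4, 8]
y = [4, 8, 7, 9]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx                  # slope estimate
b0 = y_bar - b1 * x_bar         # intercept estimate

# Sums of squares
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = sst - sse

s_e = math.sqrt(sse / (n - 2))  # standard error of estimate
s_b1 = s_e / math.sqrt(sxx)     # standard error of the slope
t_stat = (b1 - 0) / s_b1        # test of H0: beta1 = 0, with n - 2 d.f.

f_stat = (ssr / 1) / (sse / (n - 2))   # F statistic with k = 1

print(f"t = {t_stat:.3f}, t^2 = {t_stat ** 2:.3f}, F = {f_stat:.3f}")
```

In simple regression the squared t statistic equals the F statistic, which is why the two tests lead to the same conclusion.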
Example – Armani Pizza
Solving with Excel
How does the result from the t-test for the slope compare to the F-test for the overall validity of the model?

Example – Artificial Data
Run regressions on the following data sets. Calculate slope, intercept and R^2 in each case. Which model is the best?
1. Normality of errors
[Figure: a partial list of standard residuals.]

2. Constant Variance of Errors (homoskedasticity)
The residuals, our estimate of the errors, seem to have an approximately equal “spread” around the regression line.
[Figure: scatter of points around the line ŷ, and a plot of residuals against ŷ.]
3. Independence of errors (Non-Autocorrelation)
Patterns in the appearance of the residuals over time indicate that autocorrelation exists.
[Figure: two plots of residuals against time.]
Left panel: note the runs of positive residuals, replaced by runs of negative residuals.
Right panel: note the oscillating behavior of the residuals around zero.

4. No unnecessary outliers
An outlier is an observation that is unusually small or large.
Several possibilities need to be investigated when an outlier is observed:
• There was an error in recording the value.
• The point does not belong in the sample.
• The observation is valid.
Identify outliers from the scatter diagram.
It is customary to suspect an observation is an outlier if its |standard residual| > 2.
Once identified, the cause will determine the appropriate response. If it is either the first or second situation, delete it, and re-estimate your model. If, on the other hand, it appears to be a valid observation, keep it in, but recognize the impact it is having on your model.
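The |standard residual| > 2 rule of thumb can be applied mechanically. The sketch below uses the dog show data from earlier and assumes one common convention, dividing each residual by the standard error of estimate; spreadsheet and statistics packages may scale residuals slightly differently, so treat the numbers as illustrative.

```python
import math

# Dog show data: biscuits (x) and show score (y)
x = [2, 6, 4, 8]
y = [4, 8, 7, 9]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Residuals: observed y minus fitted y
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Assumed convention: standardized residual = residual / standard error of estimate
std_residuals = [e / s_e for e in residuals]

# Observations flagged by the |standard residual| > 2 rule
suspects = [(xi, yi) for xi, yi, r in zip(x, y, std_residuals) if abs(r) > 2]

print([round(r, 3) for r in std_residuals])
print("Suspected outliers:", suspects)
```

A flagged point is only a candidate: the cause still has to be investigated before deciding whether to delete or keep it.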
Applying the Regression Equation
Once we have assessed how well the model fits the data, and verified the assumptions hold, we are then ready to use our model to:
• Estimate the value of the dependent variable based on the value of the independent variable.
• Establish confidence intervals and prediction intervals for our estimates of the value of the dependent variable.
• Interpret the slope of the regression, and thereby gain insight into the relationship between our independent variable and our dependent variable.

Estimating the value of y
Predict the selling price of a three-year-old Taurus with 40,000 miles on the odometer.
Predict Chip’s score in the dog show if he is given 5 biscuits.
Predict the amount of annual pizza sales for Armani at OSU, assuming 36,000 students.
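A point estimate is obtained by substituting the given x into the fitted line. The sketch below uses the car coefficients stated earlier in these notes (intercept 6533, price decreasing by 0.0312 per mile) and, for the dog show, the least-squares values b0 = 3 and b1 = 0.8 computed from the four dogs in the table (they are not printed on the slides). The Armani's Pizza coefficients are not given in these notes, so that prediction is left out. The helper name `predict` is made up for this example.

```python
def predict(b0, b1, x):
    """Point estimate y_hat = b0 + b1 * x from a fitted simple regression."""
    return b0 + b1 * x

# Car example: intercept and slope as stated in the notes
taurus_price = predict(6533, -0.0312, 40_000)

# Dog show: least-squares coefficients computed from the table data (assumption
# in the sense that the slides leave these values as blanks to fill in)
chip_score = predict(3.0, 0.8, 5)

print(f"Estimated Taurus price: {taurus_price:.0f}")
print(f"Chip's estimated score: {chip_score:.1f}")
```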
Prediction interval (and confidence interval)
We frequently want more than just a point estimate for our estimate of the dependent variable.
Two types of intervals can be used to discover how closely the predicted value will match the true value of y.
• Prediction interval - for a particular value of y,
• Confidence interval - for the expected value of y – we won’t cover this one
The prediction interval
  ŷ ± t(α/2, n-2) * s_e * sqrt(1 + 1/n + (xg - x̄)^2 / ((n-1) sx^2))
  where xg is the given value of x and s_e is the standard error of estimate.

Example – Car Sale Price (“prediction” interval)
Provide an interval estimate for the bidding price on a Ford Taurus with 40,000 miles on the odometer.
Solution
• The dealer would like to predict the price of a single car
The prediction interval (95%) = ____, using the critical value t.025,98
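A prediction interval can be worked through on the dog show data for a dog given 5 biscuits. This is a sketch, assuming the standard textbook formula ŷ ± t * s_e * sqrt(1 + 1/n + (xg - x̄)^2 / Σ(xi - x̄)^2) and a 95% level; the critical value 4.303 is the t-table value for 2 degrees of freedom, hard-coded here to keep the example free of external libraries.

```python
import math

# Dog show data: biscuits (x) and show score (y)
x = [2, 6, 4, 8]
y = [4, 8, 7, 9]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

sxx = sum((xi - x_bar) ** 2 for xi in x)    # equals (n - 1) * sx^2
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_e = math.sqrt(sse / (n - 2))              # standard error of estimate

x_g = 5                                     # given number of biscuits
y_hat = b0 + b1 * x_g                       # point estimate

# Assumption: 95% interval; 4.303 is the t-table value t(.025, n - 2 = 2)
t_crit = 4.303
half_width = t_crit * s_e * math.sqrt(1 + 1 / n + (x_g - x_bar) ** 2 / sxx)
lower, upper = y_hat - half_width, y_hat + half_width

print(f"{y_hat:.2f} +/- {half_width:.2f} -> ({lower:.2f}, {upper:.2f})")
```

With only four observations the interval is very wide, which is to be expected from so small a sample.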
Example – Armani Pizza
Provide a 95% interval for your Armani’s Pizza estimate from
above (36,000 students). You are interested in an interval
estimate for a single store, and are given the following: