07 Regression
Regression Analysis
Definitions, Principles and Practice
Regression Analysis:
What is it?
The construction and evaluation of models used to generate predictions
of continuous numeric values.
Complement of classification, which generates predictions of nominal
values.
Predictor variables may be either nominal or numeric.
Commonly Used Regression Modelers
Multiple Linear Regression (MLR)
A statistical approach to model construction that typically uses the
Ordinary Least Squares (OLS) method to generate estimates.
Widely used by statisticians
What you learned in your statistics or econometrics course
Contributions of individual predictors are additive.
Not designed to detect predictor interactions.
Adaptations of MLR
Usually employed to model non-linear relationships
Predictor variable values are transformed before submission to the
modeler
◦ x², x³, log(x)
◦ 2nd Order Polynomial – squared term included
◦ 3rd Order Polynomial – squared and cubed terms included
◦ x1*x2 in an attempt to model interaction
Basic MLR Construction
Understanding the Equation
Y = b0 + b1X + e
b0 is the intercept
◦ It is the predicted value of Y when X is zero
◦ The intercept is shared by all cases (there is only one)
e is the residual
◦ It is the difference between the observed Y and the predicted Y
◦ Each case has a different residual
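As a concrete illustration of these terms, here is a minimal sketch (made-up data, NumPy only; nothing here is taken from the slides) that estimates b0 and b1 by ordinary least squares and shows the single shared intercept and the per-case residuals.

```python
import numpy as np

# Hypothetical data: X values chosen by the researcher, Y observed
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# OLS estimates for Y = b0 + b1*X + e
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()   # one intercept, shared by every case

Y_hat = b0 + b1 * X             # predicted Y for each case
e = Y - Y_hat                   # residuals: one per case

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print("residuals:", np.round(e, 3))
```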
Assumptions
X is “fixed” (chosen by researcher)
◦ The values of a predictor represented in the study capture all possible values of interest in X at the population level
◦ This can also take on a second interpretation: that the effect is the same for all individuals
◦ In essence, the residuals from the regression equation are error and not due to sampling
Residuals are normally distributed (needed for hypothesis testing and confidence intervals)
◦ When you have multivariate normality, you automatically meet this assumption, but multivariate normality is more restrictive than this assumption
◦ Non-normal residuals generally occur when one of the previous assumptions is violated
Basic MLR Construction
To perform an OLS regression you must have (see the sketch after this list):
◦ 1 continuous target (dependent) variable
◦ 1 or more independent variables
  ◦ Continuous variables
  ◦ Categorical variables – require special coding
  ◦ Interactions – coded using variable products
  ◦ Non-linear predictors – coded using squares, cubes, etc.
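A minimal sketch of this recipe, assuming a small hypothetical DataFrame (the column names price, sqft, and zone are illustrative, not from the slides); the dummy code, interaction product, and squared term are built by hand and passed to scikit-learn's LinearRegression.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data
df = pd.DataFrame({
    "price": [200, 250, 310, 180, 400, 220],      # continuous target
    "sqft":  [1000, 1200, 1500, 900, 2000, 1100],
    "zone":  ["A", "B", "A", "A", "B", "B"],      # categorical predictor
})

X = pd.get_dummies(df[["sqft", "zone"]], columns=["zone"], drop_first=True)  # special coding
X["sqft_x_zoneB"] = X["sqft"] * X["zone_B"]   # interaction coded as a variable product
X["sqft_sq"] = X["sqft"] ** 2                 # non-linear predictor coded as a square

model = LinearRegression().fit(X, df["price"])
print("intercept:", model.intercept_)
print(dict(zip(X.columns, model.coef_)))
```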
Using Categorical Predictors
Categorical predictors require dummy codes expressed as 0 or 1
2-Category Case
  Category   Dummy
  No         0
  Yes        1

3-Category Case
  Category     Dummy 1   Dummy 2
  Republican   1         0
  Democrat     0         1
  Other        0         0
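A short sketch of how this coding might be produced in pandas (the column name party is hypothetical); dropping one dummy column, here Other, makes it the all-zeros reference category, matching the table above.

```python
import pandas as pd

df = pd.DataFrame({"party": ["Republican", "Democrat", "Other", "Democrat"]})

# One-hot encode, then drop "Other" so it becomes the reference category
dummies = pd.get_dummies(df["party"]).drop(columns="Other").astype(int)
print(dummies)
#    Democrat  Republican
# 0         0           1
# 1         1           0
# 2         0           0
# 3         1           0
```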
The Dummy Coefficient
Y = b0 + b1(D1) + b2(D2) + e
The intercept is the predicted Y for the Other political affiliation
b1 is the change in Y from Other to Republican
b2 is the change in Y from Other to Democrat
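A worked example with made-up coefficients (purely illustrative): suppose the fitted equation is Y = 40 + 5(D1) − 3(D2).

```python
# Hypothetical fitted values: b0 = 40, b1 = 5 (D1 = Republican), b2 = -3 (D2 = Democrat)
b0, b1, b2 = 40, 5, -3

print("Other:     ", b0)        # 40 -> the intercept is the prediction for Other
print("Republican:", b0 + b1)   # 45 -> Other plus b1
print("Democrat:  ", b0 + b2)   # 37 -> Other plus b2
```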
Measures of Model Quality
In general, we compare the performance of our
model versus the simple model.
In classification, the simple model uses the
prevalence of classes in the full dataset.
◦ if the full dataset contains 70% class A and 30% class B, the simple model always predicts class A and the expected error rate is 30%
◦ if the error rate of our model is less than 30%, then we have a better model
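A tiny sketch of that baseline with hypothetical labels: the simple model predicts the majority class for every case, and its error rate is the benchmark to beat.

```python
from collections import Counter

labels = ["A"] * 7 + ["B"] * 3                # hypothetical: 70% class A, 30% class B

majority_class, count = Counter(labels).most_common(1)[0]
baseline_error = 1 - count / len(labels)
print(majority_class, baseline_error)         # A 0.3 -> our model must beat 30% error
```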
The Simple Regression Model
In regression analysis the simple model prediction is assumed to be the
response attribute mean in the training dataset.
The measures of error are SSE and MSE:
MSE = SSE / n
To put the error back in the original units:
RMSE = sqrt(MSE)
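A minimal sketch of these baseline measures for a hypothetical vector of training responses.

```python
import numpy as np

y = np.array([3.0, 5.0, 4.0, 8.0, 6.0])   # hypothetical response values

y_bar = y.mean()                  # the simple model predicts the mean for every case
sse = np.sum((y - y_bar) ** 2)    # SSE: sum of squared errors of the simple model
mse = sse / len(y)                # MSE = SSE / n
rmse = np.sqrt(mse)               # RMSE: error expressed in the original units of y
print(sse, mse, rmse)
```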
Regression Model SSE
In our regression model, SSE is computed as the sum of squared residuals: SSE = Σ(observed Y – predicted Y)²
Model Comparison
SSE and MSE are unit-of-measure dependent
◦ different for measurement in feet versus inches versus meters
R² – a unit-free measure of how much better our model is versus the simple model (prediction as the mean)
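A sketch of that comparison with hypothetical observed and predicted values: R² = 1 − SSE_model / SSE_simple, which is unit-free because the units cancel in the ratio.

```python
import numpy as np

y      = np.array([3.0, 5.0, 4.0, 8.0, 6.0])   # hypothetical observed values
y_pred = np.array([3.2, 4.6, 4.3, 7.5, 6.4])   # hypothetical model predictions

sse_model  = np.sum((y - y_pred) ** 2)     # error of our regression model
sse_simple = np.sum((y - y.mean()) ** 2)   # error of the simple (mean) model

r2 = 1 - sse_model / sse_simple            # same value whether y is in feet, inches, or meters
print(round(r2, 3))
```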
Ranges of R²
R² indicates the percentage of the variance in the dependent variable (Y) that is explained by the predictors in the model
R² = 1.0 – the model predicts without error (very unlikely)
R² = 0 – the model does no better than the simple model
R² is only a useful measure if the model overall is significant
Adjusted R²
In general, a model that uses fewer predictors to achieve the same R² is considered superior; adjusted R² captures this by penalizing R² for each additional predictor.
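One standard form of the adjustment (a textbook formula, not taken from the slides): with n cases and p predictors, adjusted R² shrinks R² as predictors are added, so the same fit with fewer predictors scores higher.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with n cases and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(0.80, n=100, p=5))    # ~0.789
print(adjusted_r2(0.80, n=100, p=20))   # ~0.749 -> penalized for the extra predictors
```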
Other Measures
MLR metrics
◦ F statistic – is the MLR model significant overall?
◦ p-values – is the individual predictor significant?
If the F stat is not significant, nothing else
matters.
If the p value is not significant, use caution in
interpreting that coefficient.
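A sketch of where these numbers appear, using hypothetical simulated data and statsmodels (its OLS results expose the overall F statistic and the per-coefficient p-values).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                     # two hypothetical predictors
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=100)    # only the first predictor matters

results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.fvalue, results.f_pvalue)   # overall model F statistic and its p-value
print(results.pvalues)                    # p-value for each coefficient (const, x1, x2)
print(results.summary())                  # full table showing both sets of tests
```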
Model Construction
Feature Selection – which predictors should
be included?
◦ correlation matrix is good starting point
◦ avoid including highly correlated predictors
◦ in MLR, coefficient estimates become unstable when predictors are highly correlated
◦ as a general rule: if the correlation between two predictors is over .80, include only the predictor with the higher correlation to the response attribute
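A small sketch of that screening step with simulated data (the column names are hypothetical): x1 and x2 are nearly duplicates, so only the one more correlated with y would be kept.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=200),   # highly correlated with x1 (> .80)
    "x3": rng.normal(size=200),
    "y":  2 * x1 + rng.normal(size=200),
})

corr = df.corr()
print(corr.round(2))          # correlation matrix as the starting point
print(corr["y"].drop("y"))    # keep whichever of x1/x2 correlates more strongly with y
```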
Predicting Y
The equation can be used for predicting Y with
known values of X
Simply plug values of X into the equation and it will give you a “best guess” predicted value for Y
It will not be exact. The difference is residual error
(e).
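A short sketch of plugging new X values into a fitted equation (hypothetical data, scikit-learn).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])   # hypothetical known X values
y_train = np.array([2.0, 4.1, 5.9, 8.2])

model = LinearRegression().fit(X_train, y_train)

X_new = np.array([[2.5], [5.0]])     # values to plug into the equation
print(model.predict(X_new))          # "best guess" Y; the observed Y will differ by e
```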
Other Issues
Choose the right modeler
◦ Look beyond the model summary statistics
◦ Look for interaction between predictors
Steps for Regression
Load the source data into a data frame
Explore correlations and aggregations for possible predictors and multicollinearity
Use scatterplots to check the regression assumptions
Dummy code (one-hot encode) any categorical predictors
Create regression model
Evaluate metrics
Predict
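A compact end-to-end sketch of these steps, assuming a hypothetical file homes.csv with columns price (target), sqft, and zone (the file and column names are illustrative).

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("homes.csv")                        # 1. load into a data frame
print(df.corr(numeric_only=True))                    # 2. correlations / multicollinearity
# df.plot.scatter(x="sqft", y="price")               # 3. scatterplot to eyeball assumptions

X = pd.get_dummies(df[["sqft", "zone"]], drop_first=True)   # 4. dummy code categoricals
y = df["price"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)           # 5. create the regression model
pred = model.predict(X_te)                           # 7. predict
print(r2_score(y_te, pred), mean_squared_error(y_te, pred))   # 6. evaluate metrics
```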
Walkthrough – Regression in Python
scikit-learn vs. statsmodels – two different ways to do the same thing
Linear Regression
Lasso Regression
K Nearest Neighbor (KNN)
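A brief sketch contrasting the modelers on the same simulated data: statsmodels gives the inferential summary (F statistic, p-values), while scikit-learn's LinearRegression, Lasso, and KNeighborsRegressor share one fit/score interface.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(size=150)

# statsmodels: coefficients plus the full inferential summary
print(sm.OLS(y, sm.add_constant(X)).fit().summary())

# scikit-learn: prediction-focused API, same pattern for all three modelers
for model in (LinearRegression(), Lasso(alpha=0.1), KNeighborsRegressor(n_neighbors=5)):
    model.fit(X, y)
    print(type(model).__name__, round(model.score(X, y), 3))   # R^2 on the training data
```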