Intermediate Analytics - Regression - Week 1
Analytics
ALY6015
Northeastern University
By: Behzad Abdi
Meet Your Instructor
Regression vs. Classification vs. Clustering

Regression:
• Goal: To predict a continuous value based on given input data.
• Type of Output: A numerical value (e.g., predicting price, temperature, or weight).
• Example: Predicting house prices based on size, location, and year of construction.
• Regression aims to predict a specific numerical result.

Classification:
• Goal: To assign data to specific categories based on labelled data.
• Type of Output: A discrete value or category (e.g., yes/no, dog/cat).
• Example: Detecting spam emails.
• Classification assigns new data to a specific category.

Clustering:
• Goal: To group data into similar clusters based on patterns and similarities, without predefined labels.
• Type of Output: Grouping data based on internal similarities.
• Example: Grouping customers into similar segments.
• Clustering is unsupervised learning, where data is not assigned to predefined categories; the model automatically discovers the groups.
Model Validation and Evaluation:
• Independent variable: also called a predictor variable.
• Dependent variable: also called a response variable.
Relationships Between Variables
• Inferential statistics help determine if relationships
exist between numerical variables.
• Examples:
• Sales volume and advertising spending
• Study hours and exam scores
• Age and blood pressure
Regression:
• Describes the nature of relationships
(positive/negative, linear/nonlinear).
• Helps predict one variable based on another.
Correlation Coefficient
• Measures the strength and direction of the relationship
between two variables.
• Pearson Correlation:
$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$
Correlation Coefficient Interpretation:
• r ranges from −1 to +1.
• Values near +1 or −1 indicate a strong positive or negative linear relationship; values near 0 indicate a weak or nonexistent linear relationship.
Visualizing Correlation
Scatter Plots illustrate the relationship between variables.
Is there a significant linear relationship between the variables, or is the value of r due to chance?
The relationship between the variables may be caused by a third variable (lurking variable).
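As an illustration, here is a minimal Python sketch (using SciPy; the data are made up) that computes Pearson's r and the p-value testing whether the observed r could be due to chance:

```python
import numpy as np
from scipy import stats

# Hypothetical data: study hours vs. exam scores
hours = np.array([2, 4, 5, 7, 8, 10, 11, 13])
scores = np.array([55, 60, 62, 70, 73, 80, 84, 90])

# Pearson's r and the p-value for H0: no linear relationship (rho = 0)
r, p_value = stats.pearsonr(hours, scores)
print(f"r = {r:.3f}, p-value = {p_value:.4f}")

# A small p-value (e.g., < 0.05) suggests the correlation
# is unlikely to be due to chance alone.
```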
1. Definition: Linear regression models the relationship between a dependent variable and one or more independent variables.
2. Goal: To find the best-fitting straight line through a set of points.
Introduction to Linear Regression
1. Simple Linear Regression:
$$ y = \beta_0 + \beta_1 x + \varepsilon $$
Example: Predicting a student's exam score (y) based on their study hours (x).
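A minimal sketch of this model in Python, using statsmodels with hypothetical data (the variable names and values are illustrative only):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: hours studied (x) and exam scores (y)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([52, 55, 61, 64, 70, 74, 78, 85])

# Fit y = b0 + b1*x by ordinary least squares
X = sm.add_constant(x)            # adds the intercept column
model = sm.OLS(y, X).fit()

print(model.params)               # estimated b0 (intercept) and b1 (slope)
print(model.predict([[1, 9]]))    # predicted score for a student studying 9 hours
```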
Improving the Results of a Linear Regression Model
1. Removing Outliers:
• Explanation: Outliers (data points that differ significantly from the others) can skew the model and reduce its accuracy.
• Why it's effective: Removing or managing outliers allows the model to align more closely with the bulk of the data and perform better.
• Example: If a student with very few study hours (e.g., 1 hour) achieves a very high score (e.g., 95), this point could be an outlier that needs to be removed or examined further; see the sketch below.
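One common way to flag such points is to screen residuals from a first-pass fit. A minimal Python sketch with hypothetical data (the 2-standard-deviation cutoff is a judgment call, not a fixed rule):

```python
import numpy as np

# Hypothetical scores with one suspicious point (1 hour studied, score 95)
hours  = np.array([1, 2, 3, 4, 5, 6, 7, 8])
scores = np.array([95, 55, 61, 64, 70, 74, 78, 85])

# First-pass fit, then standardize the residuals
b1, b0 = np.polyfit(hours, scores, 1)
residuals = scores - (b0 + b1 * hours)
z = (residuals - residuals.mean()) / residuals.std()

keep = np.abs(z) < 2   # common rule of thumb; the threshold is a judgment call
print("kept points:", hours[keep], scores[keep])
```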
2. Multicollinearity
Definition:
• Multicollinearity occurs when two or more predictors in a regression model are highly
correlated.
• This makes it difficult for the model to distinguish their individual effects on the
dependent variable.
Common Methods:
1. Correlation Matrix:
• High correlation between predictors (e.g., r > 0.8 or r < −0.8) is a warning sign.
2. Variance Inflation Factor (VIF):
• Measures how much the variance of a predictor's coefficient estimate is inflated due to multicollinearity:
$$ \mathrm{VIF}_j = \frac{1}{1 - R_j^2} $$
where R²ⱼ is the R² from regressing predictor j on the other predictors.
• Rule of Thumb: VIF > 10 indicates severe multicollinearity.
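A minimal Python sketch of the VIF check, using statsmodels' variance_inflation_factor on deliberately correlated hypothetical data:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; x1 and x2 are deliberately near-duplicates
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # highly correlated with x1
x3 = rng.normal(size=100)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF for each predictor (skip the constant at index 0)
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))
# VIF well above 10 for x1 and x2 signals severe multicollinearity
```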
Resolving Multicollinearity
R-Squared for Goodness of Fit
Definition:
• Measures how well a regression model explains the variability in the dependent variable based on the independent variable(s).
Key Points:
• Higher R² values indicate a better fit.
• Used to evaluate the performance of regression models.
R-Squared Formula:
$$ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} $$
Limitations of R-Squared:
• Not Always Predictive: A high R² does not guarantee good predictions for new data.
• Sensitive to Overfitting: Adding more variables can artificially increase R².
• Correlation, Not Causation: R² only measures association, not causality.
Adjusted R-Square:
$$ R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1} $$
where n is the number of observations and p is the number of predictors.
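Both quantities are reported directly by statsmodels. A minimal sketch on hypothetical data (the second predictor is pure noise, so adjusted R² is pulled down relative to R²):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: y depends only on the first of two predictors
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = 3 + 2 * X[:, 0] + rng.normal(size=50)

results = sm.OLS(y, sm.add_constant(X)).fit()
print("R-squared:         ", results.rsquared)
print("Adjusted R-squared:", results.rsquared_adj)
# Adjusted R-squared penalizes the irrelevant second predictor,
# so it comes out slightly lower than the plain R-squared.
```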
Akaike Information Criterion (AIC):
Definition: AIC estimates in-sample prediction error and compares model quality within the same dataset.
Formula:
$$ \mathrm{AIC} = 2k - 2\ln(\hat{L}) $$
where k is the number of estimated parameters and L̂ is the maximized likelihood.
Key Points:
• Lower AIC indicates a better model.
• Only valid for comparing models from the same dataset.
• Does not measure absolute model quality.
Bayesian Information Criterion (BIC):
Key Points:
• Heavily penalizes complexity to favor simpler models.
• Lower BIC indicates a better model.
Formula:
$$ \mathrm{BIC} = k\ln(n) - 2\ln(\hat{L}) $$
where n is the number of observations.
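Fitted statsmodels results expose both criteria as .aic and .bic. A minimal comparison sketch on hypothetical data:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: y depends on x1 only; x2 is noise
rng = np.random.default_rng(2)
x1 = rng.normal(size=80)
x2 = rng.normal(size=80)
y = 1 + 2 * x1 + rng.normal(size=80)

m1 = sm.OLS(y, sm.add_constant(x1)).fit()                         # y ~ x1
m2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()  # y ~ x1 + x2

print("model 1: AIC =", m1.aic, " BIC =", m1.bic)
print("model 2: AIC =", m2.aic, " BIC =", m2.bic)
# Both criteria penalize the extra, uninformative predictor;
# the simpler model 1 typically scores lower, with BIC penalizing
# the added complexity more heavily than AIC.
```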
Mallow’s Cp:
Definition: Compares precision and bias of the full model to models with subsets of
predictors.
Key Points:
• Compute Cp for all candidate variable combinations.
• The best model has a Cp value closest to p + 1 (the number of predictors plus one).
Formula:
$$ C_p = \frac{SSE_p}{MSE_{full}} - n + 2(p + 1) $$
where SSEₚ is the error sum of squares of the subset model with p predictors and MSE_full is the mean squared error of the full model.
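Statsmodels does not report Cp directly, so this minimal sketch computes it by hand from the formula above (hypothetical data; p = 2 predictors in the subset):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data with three candidate predictors; only two matter
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 3))
y = 2 + 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=60)
n = len(y)

full = sm.OLS(y, sm.add_constant(X)).fit()
mse_full = full.mse_resid                 # estimates sigma^2 from the full model

# Cp for the subset using only the first two predictors (p = 2)
sub = sm.OLS(y, sm.add_constant(X[:, :2])).fit()
p = 2
cp = sub.ssr / mse_full - n + 2 * (p + 1)
print("Cp =", cp, " target p + 1 =", p + 1)
```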
Comparison of Adjusted R-Square, AIC, BIC, and Mallow's Cp:
Detailed Use Cases:
1. Adjusted R-Square:
• When to Use: When comparing multiple linear regression models with different
numbers of predictors.
• When Not to Use: When working with nonlinear models or datasets with many
predictors.
4. Mallow’s Cp:
• When to Use: When selecting the best subset of predictors in linear regression.
• When Not to Use: For nonlinear models or when there are too many predictors.
Stepwise Regression
Definition: An automated variable-selection procedure that adds or removes predictors one step at a time based on their statistical significance.
1. Forward Selection:
• Start with no variables.
• Add predictors one by one based on their significance in the model.
2. Backward Elimination:
• Start with all predictors.
• Remove the least significant predictor at each step.
3. Stepwise Selection:
• Combine Forward Selection and Backward Elimination.
• Variables can be added or removed at each step.
Example: Predicting House Prices with Stepwise Regression
Predictors:
• House Size (X1)
• Number of Rooms (X2)
• House Age (X3)
• Wall Color (X4)
1. Forward Selection:
• Start with no predictors.
• Add House Size (X1) because it has the most significant impact.
• Add Number of Rooms (X2) if it improves model performance.
• Continue until no additional predictors improve the model.
2. Backward Elimination:
• Start with all predictors: X1, X2, X3, X4.
• Remove Wall Color (X4) if it has the least significance.
• Repeat until all remaining predictors are significant.
3. Stepwise Selection:
• Combine both methods. Variables can be added or removed based on
their impact on the model.
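A minimal sketch of forward selection in Python using scikit-learn's SequentialFeatureSelector. Note that this selects by cross-validated score rather than the p-values of classical stepwise regression, and the house data here are simulated so that wall color carries no signal:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

# Hypothetical house data: size, rooms, age, wall color (encoded 0/1)
rng = np.random.default_rng(4)
size  = rng.uniform(50, 250, 200)
rooms = rng.integers(1, 7, 200).astype(float)
age   = rng.uniform(0, 60, 200)
color = rng.integers(0, 2, 200).astype(float)
price = 2000 * size + 15000 * rooms - 800 * age + rng.normal(0, 20000, 200)

X = np.column_stack([size, rooms, age, color])
names = np.array(["size", "rooms", "age", "wall_color"])

# Forward selection: greedily add predictors while they improve the CV score
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward")
selector.fit(X, price)
print("selected:", names[selector.get_support()])  # wall_color is typically dropped
```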
Advantages and Disadvantages of Stepwise
Regression:
• Advantages: Automates variable selection and tends to produce simpler, more interpretable models.
• Disadvantages: Can overfit, is sensitive to the order in which variables enter the model, and does not guarantee finding the best overall subset of predictors.
2. Adding More Independent Variables:
• Explanation: Considering only study hours as the independent variable might be insufficient. You can add other variables, such as the amount of sleep, the number of tutoring classes, or even the level of student stress.
• Why it's effective: Adding more independent variables can help the model consider more
factors that affect the score, thereby creating a more accurate model.
• Example: Suppose you add the amount of sleep to the model. Now your model might look like this:
$$ \text{Score} = \beta_0 + \beta_1(\text{Study Hours}) + \beta_2(\text{Sleep Hours}) + \varepsilon $$
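A minimal sketch of this two-predictor model in statsmodels (the coefficients and data are hypothetical):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: exam score as a function of study hours and sleep hours
rng = np.random.default_rng(5)
study = rng.uniform(0, 12, 100)
sleep = rng.uniform(4, 9, 100)
score = 40 + 3.5 * study + 2.0 * sleep + rng.normal(0, 5, 100)

# Score = b0 + b1*StudyHours + b2*SleepHours + error
X = sm.add_constant(np.column_stack([study, sleep]))
model = sm.OLS(score, X).fit()
print(model.params)        # b0, b1, b2
print(model.rsquared_adj)  # compare against the one-variable model
```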
Tools for Validation:
• Popular Techniques:
• Train-Validation Split: Directly splitting data into training and validation sets.
• K-Fold Cross-Validation: Dividing data into K subsets for repeated training and validation.
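A minimal sketch of both techniques with scikit-learn (hypothetical data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, KFold, cross_val_score

# Hypothetical data: y depends on the first of two predictors
rng = np.random.default_rng(6)
X = rng.normal(size=(120, 2))
y = 1 + 2 * X[:, 0] + rng.normal(size=120)

# Train-validation split: hold out 25% of the data for validation
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
print("validation R^2:", model.score(X_val, y_val))

# K-fold cross-validation: 5 rotating train/validation splits
scores = cross_val_score(LinearRegression(), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0),
                         scoring="r2")
print("5-fold R^2 scores:", scores.round(3))
```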