Intermediate Analytics-Regression-Week 1

The document provides an overview of intermediate analytics concepts, focusing on regression, classification, and clustering techniques. It discusses model validation, evaluation methods, and the importance of understanding relationships between variables through correlation and regression analysis. Additionally, it covers linear regression, including simple and multivariate regression, model selection criteria, and techniques like stepwise regression for improving model accuracy.


Intermediate Analytics
ALY6015
Northeastern University
By: Behzad Abdi
Meet Your Instructor



Introduction
Artificial Intelligence: Human Simulation in Machines

Regression
• Goal: To predict a continuous value based on given input data.
• Type of Output: A numerical value (e.g., predicting price, temperature, or weight).
• Example: Predicting house prices based on size, location, and year of construction.
• Feature: Regression is used for problems where the output is a numerical value, and the model aims to predict a specific numerical result.

Classification
• Goal: To assign data to specific categories based on labelled data.
• Type of Output: A discrete value or category (e.g., yes/no, dog/cat).
• Example: Detecting spam emails or classifying fruit as an apple or an orange.
• Feature: In classification, the model learns from labelled data and assigns new data to a specific category.

Clustering
• Goal: To group data into similar clusters based on patterns and similarities, without predefined labels.
• Type of Output: Grouping of data based on internal similarities.
• Example: Grouping customers based on purchasing behavior without knowing predefined categories.
• Feature: Clustering is used in unsupervised learning, where data is not assigned to predefined categories; the model discovers the groups automatically.
Model Validation and Evaluation:
 Independent variable, also called a predictor variable.
 Dependent variable, also called a response variable.
Relationships Between Variables
• Inferential statistics help determine if relationships
exist between numerical variables.

• Examples:
• Sales volume and advertising spending
• Study hours and exam scores
• Age and blood pressure

• Techniques: Correlation and Regression Analysis.


Main Questions
1. Are two or more variables related?

2. If so, what is the strength of the relationship?

3. What type of relationship exists?

4. What kind of predictions can be made from the relationship?


Correlation vs. Regression
 Correlation:
• Measures whether variables are related.
• Determines the strength of the relationship
using a correlation coefficient.

 Regression:
• Describes the nature of relationships
(positive/negative, linear/nonlinear).
• Helps predict one variable based on another.
Correlation Coefficient
• Measures the strength and direction of the relationship between two variables.

• It ranges between −1 and +1:
• +1: Perfect positive correlation.
• −1: Perfect negative correlation.
• 0: No correlation.

• Pearson Correlation:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² ]
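As a quick check of the formula, here is a minimal NumPy sketch on made-up study-hours/exam-score data; `np.corrcoef` returns the same value.

```python
# A minimal NumPy check of the Pearson formula above, using
# hypothetical study-hours/exam-score data for illustration.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)        # hours studied (hypothetical)
y = np.array([52, 58, 65, 70, 74, 81], dtype=float)  # exam scores (hypothetical)

dx, dy = x - x.mean(), y - y.mean()
r = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

print(r)                        # manual computation
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in agrees
```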
Correlation Coefficient Interpretation:
Visualizing Correlation
Scatter Plots illustrate the relationship between variables.
The Significance Test of the Correlation Coefficient
 Is there a significant linear relationship between the variables, or is the value of r due to chance?
 Hypothesis testing is used to determine whether r is statistically significant.
 Assumptions:
• Variables x and y are linearly related.
• Variables x and y are random variables.
• The variables have a bivariate normal distribution.
o For any given x, y values have a bell-shaped distribution.
o For any given y, x values have a bell-shaped distribution.
Hypothesis-Testing Procedure (Traditional Method):
Steps:

1. State the hypotheses:
• Null hypothesis (H0): ρ = 0 (no correlation).
• Alternative hypothesis (H1): ρ ≠ 0 (significant correlation).
(ρ is the population correlation coefficient that r estimates.)

2. Compute the test value:

t = r · √( (n − 2) / (1 − r²) ), with d.f. = n − 2

3. Compare the test value with the critical values from the t-distribution table.

4. Make the decision (Reject or Fail to Reject H0).

5. Summarize the results.
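A short Python sketch of steps 2–4, using the t test value above (the r, n, and α values are hypothetical):

```python
# Traditional method: compute the t test value for r and compare it
# with the two-tailed critical value (r, n, alpha are hypothetical).
import math
from scipy import stats

r, n, alpha = 0.897, 6, 0.05
df = n - 2

t_value = r * math.sqrt(df / (1 - r**2))   # t = r * sqrt((n-2)/(1-r^2))
t_crit = stats.t.ppf(1 - alpha / 2, df)    # critical value from the t table

if abs(t_value) > t_crit:
    print(f"t = {t_value:.3f} > {t_crit:.3f}: reject H0 (r is significant)")
else:
    print(f"t = {t_value:.3f} <= {t_crit:.3f}: fail to reject H0")
```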


P-Value Method:
Steps:

1. State the hypotheses.
2. Find the test value using the t test.
3. Compute the P-value (e.g., from Table F or a calculator).
4. Compare the P-value to the α level (e.g., 0.05).
5. Summarize the results.
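The same decision, sketched with SciPy's `pearsonr`, which returns r together with the two-tailed p-value so no table lookup is needed (same hypothetical data as before):

```python
# P-value method with SciPy: pearsonr returns r and the two-tailed
# p-value of the significance test (the data below are hypothetical).
from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5, 6]
y = [52, 58, 65, 70, 74, 81]

r, p = pearsonr(x, y)
alpha = 0.05
print(f"r = {r:.3f}, p = {p:.4f}:",
      "reject H0" if p < alpha else "fail to reject H0")
```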
Using the Table of Critical Values of r:

 Table I provides critical r values for specific α levels and degrees of freedom.
 Steps:
1. State the hypotheses.
2. Find critical values from Table I.
3. Compare r to critical values.
4. Make a decision.
Limitations

 The relationship between the variables may be caused by a third variable (lurking variable).

 Correlation ≠ Causation: a strong correlation does not, by itself, establish a cause-and-effect relationship.

 Effect of Outliers: outliers can distort the correlation value.

 Linear Relationships Only: Pearson correlation measures only linear relationships.


Introduction to Linear Regression
What is Linear Regression?

1. Definition: Linear regression models the relationship between a dependent variable and one or more independent variables.
2. Goal: To find the best-fitting straight line through a set of points.
Introduction to Linear Regression
1. Simple Linear Regression:

• Only one independent variable (input).
• Goal: establish a linear relationship between the input and the output variable.
• Represented by the equation: y = mx + c

Where:
• y is the dependent variable (output).
• x is the independent variable (input).
• m is the slope of the line (the rate at which y changes with respect to x).
• c is the intercept (the point where the line crosses the y-axis).
Linear Regression
1. Simple Linear Regression:

Example: Predicting a student's exam score (y) based on their studied hours (x).

The independent variable: the number of hours studied


The dependent variable: the exam score
Linear Regression
• Fit a line through the points such that the error, or residual (the vertical distance of each point from the line), is as small as possible.
• Each error can be positive or negative, so a simple sum of all the errors will be (approximately) zero: positive and negative errors cancel.
• So we square the errors and minimize their sum: SSE = Σ(yᵢ − ŷᵢ)².
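A minimal NumPy sketch of this idea on hypothetical study-hours data: the raw residuals of the fitted line sum to roughly zero, while the sum of squared residuals is the quantity least squares minimizes.

```python
# Least-squares fit on hypothetical data: raw residuals cancel out,
# so the line is chosen to minimize the sum of SQUARED residuals.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([52, 58, 65, 70, 74, 81], dtype=float)

m, c = np.polyfit(x, y, deg=1)       # least-squares slope and intercept
residuals = y - (m * x + c)

print(f"y = {m:.2f}x + {c:.2f}")
print("sum of residuals:", round(residuals.sum(), 10))    # ~0, errors cancel
print("sum of squared residuals:", (residuals**2).sum())  # the minimized SSE
```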
Improving the results of a linear regression model

1. Removing Outliers:

• Explanation: Outliers (data points that are significantly different from others) can skew the
model and reduce its accuracy.

• Why it's effective: Removing or managing outliers allows the model to align more with most
of the data and perform better.

• Example: If a student with very few study hours (e.g., 1 hour) achieves a very high score (e.g.,
95), this could be an outlier and need to be removed or examined further.
Improving the results of a linear regression model

2. Multicollinearity

Definition:
• Multicollinearity occurs when two or more predictors in a regression model are highly
correlated.
• This makes it difficult for the model to distinguish their individual effects on the
dependent variable.

Impact on the Model:


• Unstable Coefficients: Changes in data can lead to large variations in predictor coefficients.
• Reduced Interpretability: Difficult to determine the importance of individual predictors.
• Inflated Standard Errors: Wider confidence intervals, making predictors appear insignificant.
Improving the results of a linear regression model
 Detecting Multicollinearity

Common Methods:
1. Correlation Matrix:
• High correlation between predictors (commonly |r| > 0.8) is a warning sign.
2. Variance Inflation Factor (VIF):
• Measures how much the variance of a predictor's coefficient is inflated due to multicollinearity.
• Rule of Thumb: VIF > 10 (some use VIF > 5) indicates severe multicollinearity.
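A sketch of both detection methods on a deliberately collinear synthetic dataset, using pandas for the correlation matrix and statsmodels' `variance_inflation_factor` (the column names and data are made up for illustration):

```python
# Detecting multicollinearity two ways on synthetic predictors:
# a correlation matrix and per-predictor VIF values.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
size = rng.normal(150, 30, 200)
rooms = size / 25 + rng.normal(0, 0.5, 200)   # strongly correlated with size
age = rng.normal(20, 5, 200)
X = pd.DataFrame({"size": size, "rooms": rooms, "age": age})

# 1. Correlation matrix: look for high |r| between predictors
print(X.corr().round(2))

# 2. VIF: computed per predictor on a design matrix with an intercept
Xc = sm.add_constant(X)
for i, col in enumerate(Xc.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(Xc.values, i), 1))
```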

 Resolving Multicollinearity

Strategies to Handle Multicollinearity:

1. Remove One of the Correlated Predictors:


• Identify redundant predictors and exclude them.
2. Combine Predictors:
• Create a new variable (e.g., principal component analysis or a
combined index).
3. Regularization Methods:
• Use techniques like Ridge Regression or Lasso to reduce the impact
of correlated predictors.
Linear Regression
Multivariate Regression

• In real-life use cases, there is often more than one independent variable.

• The concept of having multiple independent variables is called multivariate regression:

y = b₀ + b₁x₁ + b₂x₂ + … + bₖxₖ
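A minimal sketch of such a model with statsmodels OLS; the synthetic data and the two predictors (hours studied, hours slept, echoing the earlier exam-score example) are assumptions for illustration.

```python
# Multivariate regression sketch: y = b0 + b1*x1 + b2*x2 on synthetic data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
hours = rng.uniform(0, 10, 100)
sleep = rng.uniform(4, 9, 100)
score = 40 + 4 * hours + 2 * sleep + rng.normal(0, 3, 100)

X = sm.add_constant(np.column_stack([hours, sleep]))  # intercept + 2 predictors
model = sm.OLS(score, X).fit()

print(model.params)     # estimated b0, b1, b2
print(model.rsquared)   # fit quality on the training data
```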
Model Selection and Comparison
Techniques:
• R Square
• Adjusted R Square
• AIC (Akaike information criterion)
• BIC (Bayesian information criterion)
• Mallow’s Cp
What is R-Squared

 Definition:
• Measures how well a regression model explains the variability in the dependent variable based on the
independent variable(s).

 Range: values range from 0 to 1:


• High: Large proportion of variability explained by the model.
• Low: Model does not explain much of the variability.

 Example: Predicting house prices based on square footage:

• R² = 0.85: 85% of the variation in house prices is explained by square footage.
• R² = 0.20: only 20% of the variation is explained.

 Key Points:
• Higher values indicate a better fit.
• Used to evaluate the performance of regression models.
R-Squared Formula:

R² = SSreg / SStot = 1 − SSres / SStot = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²
Linear Regression: R-Squared for Goodness of Fit

R-squared = 1510.01 / 1547.55 ≈ 0.98
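When starting from raw data rather than precomputed sums of squares, the same quantity can be computed directly from the definition; a small NumPy sketch with hypothetical data (not the numbers on the slide):

```python
# R-squared from its definition, R^2 = 1 - SSres/SStot, on hypothetical data.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([52, 58, 65, 70, 74, 81], dtype=float)

m, c = np.polyfit(x, y, 1)            # least-squares line
y_hat = m * x + c

ss_res = ((y - y_hat) ** 2).sum()     # unexplained variation
ss_tot = ((y - y.mean()) ** 2).sum()  # total variation
print(1 - ss_res / ss_tot)            # R-squared
```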


Advantages and Limitations of R-Squared:
 Advantages of R-Squared:
• Easy to Interpret: Provides a simple measure of how well the model fits the data.
• Useful for Model Comparison: Higher values suggest better models (on the same dataset).

 Limitations of R-Squared:
• Not Always Predictive: A high R² does not guarantee good predictions for new data.
• Sensitive to Overfitting: Adding more variables can artificially increase R².
• Correlation, Not Causation: R² only measures association, not causality.
Adjusted R-Square:

• To address the limitation of overfitting, it penalizes the model for including unnecessary variables:

Adjusted R² = 1 − (1 − R²) · (n − 1) / (n − p − 1)

where n is the number of observations and p is the number of predictors.
Akaike Information Criterion (AIC):

 Definition: AIC estimates in-sample prediction error and compares model quality within
the same dataset.

 Key Points:
• Lower AIC indicates a better model.
• Only valid for comparing models from the same dataset.
• Does not measure absolute model quality.
Bayesian Information Criterion (BIC):

 Definition: A penalized-likelihood criterion derived from Bayesian probability, closely related to AIC.

 Key Points:
• Heavily penalizes complexity to favor simpler models.
• Lower BIC indicates a better model.

 Formula: BIC = k·ln(n) − 2·ln(L̂), where k is the number of parameters, n the sample size, and L̂ the maximized likelihood.
Mallow’s Cp:

 Definition: Compares precision and bias of the full model to models with subsets of
predictors.

 Key Points:
• Calculates Cp for all variable combinations.
• Best model has Cp value closest to (number of predictors + 1).

 Formula: Cp = SSEp / MSEfull − n + 2(p + 1), where SSEp is the error sum of squares of the subset model with p predictors and MSEfull is the mean squared error of the full model; a good subset has Cp close to p + 1.
Comparison of Adjusted R-Square, AIC, BIC, and Mallow's Cp:
Detailed Use Cases:
1. Adjusted R-Square:
• When to Use: When comparing multiple linear regression models with different
numbers of predictors.
• When Not to Use: When working with nonlinear models or datasets with many
predictors.

2. AIC (Akaike Information Criterion):


• When to Use: When working with more complex models, such as GLMs or time-series
models.
• When Not to Use: When simpler models are preferred, or the dataset is very large.

3. BIC (Bayesian Information Criterion):


• When to Use: When datasets are large and simpler models are desirable.
• When Not to Use: When high accuracy is needed, even if the model is more complex.

4. Mallow’s Cp:
• When to Use: When selecting the best subset of predictors in linear regression.
• When Not to Use: For nonlinear models or when there are too many predictors.

 How to Decide Which Method to Use:

• Linear Models: Use Adjusted R-Square and Mallow’s Cp.


• Complex Models or Small Datasets: Use AIC.
• Large Datasets or Simplicity Preferred: Use BIC.
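As a concrete illustration, here is a sketch that fits three candidate models to one synthetic dataset and reads off adjusted R², AIC, and BIC from statsmodels (the feature sets and data are made up):

```python
# Comparing candidate models on the same synthetic dataset; statsmodels
# reports adjusted R-squared, AIC, and BIC on each fitted OLS model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200
x1, x2, junk = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
y = 3 + 2 * x1 + 0.5 * x2 + rng.normal(size=n)   # junk is pure noise

candidates = {
    "x1 only":        np.column_stack([x1]),
    "x1 + x2":        np.column_stack([x1, x2]),
    "x1 + x2 + junk": np.column_stack([x1, x2, junk]),
}
for name, X in candidates.items():
    fit = sm.OLS(y, sm.add_constant(X)).fit()
    # lower AIC/BIC and higher adjusted R2 suggest the better model
    print(f"{name}: adjR2={fit.rsquared_adj:.3f}  AIC={fit.aic:.1f}  BIC={fit.bic:.1f}")
```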
Stepwise Regression:

 Definition:

• Stepwise Regression is a modeling technique that automatically decides which independent variables (features) to include or remove from the model.
• Goal: Create a simpler and more effective model by reducing unnecessary predictors.

 Stepwise Regression Methods:

1. Forward Selection:
• Start with no variables.
• Add predictors one by one based on their significance in the model.

2. Backward Elimination:
• Start with all predictors.
• Remove the least significant predictor at each step.

3. Stepwise Selection:
• Combine Forward Selection and Backward Elimination.
• Variables can be added or removed at each step.
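A minimal forward-selection sketch; the `forward_select` helper is hypothetical (not a library routine) and uses AIC as the inclusion criterion, one common choice alongside p-values. Backward elimination follows the same pattern in reverse.

```python
# Forward selection (hypothetical helper): start with no predictors and
# greedily add the one that lowers AIC most, stopping when none improves it.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_select(X: pd.DataFrame, y) -> list:
    selected, remaining = [], list(X.columns)
    best_aic = sm.OLS(y, np.ones(len(y))).fit().aic   # intercept-only baseline
    while remaining:
        # AIC of the model with each remaining candidate added
        scores = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().aic
                  for c in remaining}
        best_col = min(scores, key=scores.get)
        if scores[best_col] >= best_aic:
            break                                     # no candidate improves AIC
        best_aic = scores[best_col]
        selected.append(best_col)
        remaining.remove(best_col)
    return selected

# Hypothetical house-price predictors mirroring the example on the next slide:
rng = np.random.default_rng(3)
X = pd.DataFrame({"size": rng.normal(size=100), "rooms": rng.normal(size=100),
                  "age": rng.normal(size=100), "wall_color": rng.normal(size=100)})
y = 5 + 3 * X["size"] + 1.5 * X["rooms"] + rng.normal(size=100)
print(forward_select(X, y))   # typically ['size', 'rooms']
```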
Example: Predicting House Prices with Stepwise Regression

 Predictors:
• House Size (X1)
• Number of Rooms (X2)
• House Age (X3)
• Wall Color (X4)

1. Forward Selection:
• Start with no predictors.
• Add House Size (X1) because it has the most significant impact.
• Add Number of Rooms (X2) if it improves model performance.
• Continue until no additional predictors improve the model.

2. Backward Elimination:
• Start with all predictors: X1, X2, X3, X4.
• Remove Wall Color (X4) if it has the least significance.
• Repeat until all remaining predictors are significant.

3. Stepwise Selection:
• Combine both methods. Variables can be added or removed based on
their impact on the model.
Advantages and Disadvantages of Stepwise Regression:

• Advantages:

• Simplifies the model.


• Removes unnecessary or insignificant predictors.
• Fast and automatic.

• Disadvantages:

• Risk of Omitting Important Predictors:


• Can exclude significant variables, especially if predictors are highly correlated.
• Sensitive to Data:
• Results may vary with different datasets.
Linear Regression
Improving the results of a linear regression model

Adding More Independent Variables:

• Explanation: Considering only study hours as the independent variable might be insufficient.
You can add other variables, such as the amount of sleep, the number of tutoring classes, or
even the level of student stress.

• Why it's effective: Adding more independent variables can help the model consider more
factors that affect the score, thereby creating a more accurate model.

• Example: Suppose you add the amount of sleep to the model. Now your model might look like this:

Score = b₀ + b₁ × (study hours) + b₂ × (sleep hours)
Tools for Validation:
• Popular Techniques:

• Train-Validation Split: Directly splitting data into training and validation sets.

• K-Fold Cross-Validation: Dividing data into K subsets for repeated training and validation.

• Stratified K-Fold: Preserves class distribution during splitting.


K-Fold Example:

Imagine your dataset consists of the following 10 samples:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Using 5-Fold Cross-Validation, the data is split into 5 folds:
• Fold 1: [1, 2]
• Fold 2: [3, 4]
• Fold 3: [5, 6]
• Fold 4: [7, 8]
• Fold 5: [9, 10]

Iteration 1:
• Training Set: [3, 4, 5, 6, 7, 8, 9, 10]
• Validation Set: [1, 2]

Iteration 2:
• Training Set: [1, 2, 5, 6, 7, 8, 9, 10]
• Validation Set: [3, 4]

Iteration 3:
• Training Set: [1, 2, 3, 4, 7, 8, 9, 10]
• Validation Set: [5, 6]

Iteration 4:
• Training Set: [1, 2, 3, 4, 5, 6, 9, 10]
• Validation Set: [7, 8]

Iteration 5:
• Training Set: [1, 2, 3, 4, 5, 6, 7, 8]
• Validation Set: [9, 10]
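The same split can be reproduced with scikit-learn's `KFold` (a sketch; indices are 0-based, so sample "1" is index 0):

```python
# The slide's 5-fold split reproduced with scikit-learn's KFold.
import numpy as np
from sklearn.model_selection import KFold

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
kf = KFold(n_splits=5, shuffle=False)   # contiguous folds, as in the example

for i, (train_idx, val_idx) in enumerate(kf.split(data), start=1):
    print(f"Iteration {i}: train={data[train_idx]}, validation={data[val_idx]}")
```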
Polynomial Regression
