Intermediate Analytics - Regression - Week 1
Analytics
ALY6015
Northeastern University
By: Behzad Abdi
Meet Your Instructor
Regression vs. Classification vs. Clustering

Regression:
• Goal: To predict a continuous value based on given input data.
• Type of Output: A numerical value (e.g., predicting price, temperature, or weight).
• Example: Predicting house prices based on size, location, and year of construction.
• Regression aims to predict a specific numerical result.

Classification:
• Goal: To assign data to specific categories based on labelled data.
• Type of Output: A discrete value or category (e.g., yes/no, dog/cat).
• Example: Detecting spam emails.
• Classification assigns new data to a specific category.

Clustering:
• Goal: To group data into similar clusters based on patterns and similarities, without predefined labels.
• Type of Output: Grouping data based on internal similarities.
• Example: Grouping customers into similar segments.
• Clustering is unsupervised learning, where data is not assigned to predefined categories; the model automatically discovers the groups.
Model Validation and Evaluation:
• Independent variable: also called a predictor variable.
• Dependent variable: also called a response variable.
Relationships Between Variables
• Inferential statistics help determine if relationships
exist between numerical variables.
• Examples:
• Sales volume and advertising spending
• Study hours and exam scores
• Age and blood pressure
Regression:
• Describes the nature of relationships
(positive/negative, linear/nonlinear).
• Helps predict one variable based on another.
Correlation Coefficient
• Measures the strength and direction of the relationship
between two variables.
• Pearson Correlation:
$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$
Correlation Coefficient Interpretation:
• r ranges from −1 to +1.
• Values near +1 or −1 indicate a strong positive or negative linear relationship; values near 0 indicate a weak or nonexistent linear relationship.
Visualizing Correlation
Scatter Plots illustrate the relationship between variables.
Is there a significant linear relationship between the variables, or is the value of r due to chance?
The relationship between the variables may be caused by a third variable (lurking variable).
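As an illustration, here is a minimal Python sketch (using SciPy; the data are made up) that computes Pearson's r and the p-value testing whether the observed r could be due to chance:

```python
import numpy as np
from scipy import stats

# Hypothetical data: study hours vs. exam scores
hours = np.array([2, 4, 5, 7, 8, 10, 11, 13])
scores = np.array([55, 60, 62, 70, 73, 80, 84, 90])

# Pearson's r and the p-value for H0: no linear relationship (rho = 0)
r, p_value = stats.pearsonr(hours, scores)
print(f"r = {r:.3f}, p-value = {p_value:.4f}")

# A small p-value (e.g., < 0.05) suggests the correlation
# is unlikely to be due to chance alone.
```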
1. Definition: Linear regression models the relationship between a dependent variable and one or more independent variables.
2. Goal: To find the best-fitting straight line through a set of points.
Introduction to Linear Regression
1. Simple Linear Regression:
$$ y = \beta_0 + \beta_1 x + \varepsilon $$
Example: Predicting a student's exam score (y) based on their study hours (x).
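A minimal sketch of this model in Python, using statsmodels with hypothetical data (the variable names and values are illustrative only):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: hours studied (x) and exam scores (y)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([52, 55, 61, 64, 70, 74, 78, 85])

# Fit y = b0 + b1*x by ordinary least squares
X = sm.add_constant(x)            # adds the intercept column
model = sm.OLS(y, X).fit()

print(model.params)               # estimated b0 (intercept) and b1 (slope)
print(model.predict([[1, 9]]))    # predicted score for a student studying 9 hours
```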
Improving the Results of a Linear Regression Model
1. Removing Outliers:
• Explanation: Outliers (data points that differ significantly from the others) can skew the model and reduce its accuracy.
• Why it's effective: Removing or managing outliers allows the model to align more closely with the bulk of the data and perform better.
• Example: If a student with very few study hours (e.g., 1 hour) achieves a very high score (e.g., 95), this point could be an outlier that needs to be removed or examined further; see the sketch below.
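One common way to flag such points is to screen residuals from a first-pass fit. A minimal Python sketch with hypothetical data (the 2-standard-deviation cutoff is a judgment call, not a fixed rule):

```python
import numpy as np

# Hypothetical scores with one suspicious point (1 hour studied, score 95)
hours  = np.array([1, 2, 3, 4, 5, 6, 7, 8])
scores = np.array([95, 55, 61, 64, 70, 74, 78, 85])

# First-pass fit, then standardize the residuals
b1, b0 = np.polyfit(hours, scores, 1)
residuals = scores - (b0 + b1 * hours)
z = (residuals - residuals.mean()) / residuals.std()

keep = np.abs(z) < 2   # common rule of thumb; the threshold is a judgment call
print("kept points:", hours[keep], scores[keep])
```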
2. Multicollinearity
Definition:
• Multicollinearity occurs when two or more predictors in a regression model are highly
correlated.
• This makes it difficult for the model to distinguish their individual effects on the
dependent variable.
Common Methods:
1. Correlation Matrix:
• High correlation between predictors (e.g., r > 0.8 or r < −0.8) is a warning sign.
2. Variance Inflation Factor (VIF):
• Measures how much the variance of a predictor's coefficient estimate is inflated due to multicollinearity:
$$ \mathrm{VIF}_j = \frac{1}{1 - R_j^2} $$
where R²ⱼ is the R² from regressing predictor j on the other predictors.
• Rule of Thumb: VIF > 10 indicates severe multicollinearity.
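A minimal Python sketch of the VIF check, using statsmodels' variance_inflation_factor on deliberately correlated hypothetical data:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; x1 and x2 are deliberately near-duplicates
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # highly correlated with x1
x3 = rng.normal(size=100)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF for each predictor (skip the constant at index 0)
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))
# VIF well above 10 for x1 and x2 signals severe multicollinearity
```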
Resolving Multicollinearity
R-Squared for Goodness of Fit
Definition:
• Measures how well a regression model explains the variability in the dependent variable based on the independent variable(s).
Key Points:
• Higher R² values indicate a better fit.
• Used to evaluate the performance of regression models.
R-Squared Formula:
$$ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} $$
Limitations of R-Squared:
• Not Always Predictive: A high R² does not guarantee good predictions for new data.
• Sensitive to Overfitting: Adding more variables can artificially increase R².
• Correlation, Not Causation: R² only measures association, not causality.
Adjusted R-Square:
$$ R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1} $$
where n is the number of observations and p is the number of predictors.
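Both quantities are reported directly by statsmodels. A minimal sketch on hypothetical data (the second predictor is pure noise, so adjusted R² is pulled down relative to R²):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: y depends only on the first of two predictors
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = 3 + 2 * X[:, 0] + rng.normal(size=50)

results = sm.OLS(y, sm.add_constant(X)).fit()
print("R-squared:         ", results.rsquared)
print("Adjusted R-squared:", results.rsquared_adj)
# Adjusted R-squared penalizes the irrelevant second predictor,
# so it comes out slightly lower than the plain R-squared.
```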
Akaike Information Criterion (AIC):
Definition: AIC estimates in-sample prediction error and compares model quality within the same dataset.
Formula:
$$ \mathrm{AIC} = 2k - 2\ln(\hat{L}) $$
where k is the number of estimated parameters and L̂ is the maximized likelihood.
Key Points:
• Lower AIC indicates a better model.
• Only valid for comparing models from the same dataset.
• Does not measure absolute model quality.
Bayesian Information Criterion (BIC):
Key Points:
• Heavily penalizes complexity to favor simpler models.
• Lower BIC indicates a better model.
Formula:
$$ \mathrm{BIC} = k\ln(n) - 2\ln(\hat{L}) $$
where n is the number of observations.
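Fitted statsmodels results expose both criteria as .aic and .bic. A minimal comparison sketch on hypothetical data:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: y depends on x1 only; x2 is noise
rng = np.random.default_rng(2)
x1 = rng.normal(size=80)
x2 = rng.normal(size=80)
y = 1 + 2 * x1 + rng.normal(size=80)

m1 = sm.OLS(y, sm.add_constant(x1)).fit()                         # y ~ x1
m2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()  # y ~ x1 + x2

print("model 1: AIC =", m1.aic, " BIC =", m1.bic)
print("model 2: AIC =", m2.aic, " BIC =", m2.bic)
# Both criteria penalize the extra, uninformative predictor;
# the simpler model 1 typically scores lower, with BIC penalizing
# the added complexity more heavily than AIC.
```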
Mallow’s Cp:
Definition: Compares precision and bias of the full model to models with subsets of
predictors.
Key Points:
• Compute Cp for all candidate variable combinations.
• The best model has a Cp value closest to p + 1 (the number of predictors plus one).
Formula:
$$ C_p = \frac{SSE_p}{MSE_{full}} - n + 2(p + 1) $$
where SSEₚ is the error sum of squares of the subset model with p predictors and MSE_full is the mean squared error of the full model.
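Statsmodels does not report Cp directly, so this minimal sketch computes it by hand from the formula above (hypothetical data; p = 2 predictors in the subset):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data with three candidate predictors; only two matter
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 3))
y = 2 + 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=60)
n = len(y)

full = sm.OLS(y, sm.add_constant(X)).fit()
mse_full = full.mse_resid                 # estimates sigma^2 from the full model

# Cp for the subset using only the first two predictors (p = 2)
sub = sm.OLS(y, sm.add_constant(X[:, :2])).fit()
p = 2
cp = sub.ssr / mse_full - n + 2 * (p + 1)
print("Cp =", cp, " target p + 1 =", p + 1)
```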
Comparison of Adjusted R-Square, AIC, BIC, and Mallow's Cp:
Detailed Use Cases:
1. Adjusted R-Square:
• When to Use: When comparing multiple linear regression models with different
numbers of predictors.
• When Not to Use: When working with nonlinear models or datasets with many
predictors.
4. Mallow’s Cp:
• When to Use: When selecting the best subset of predictors in linear regression.
• When Not to Use: For nonlinear models or when there are too many predictors.
Stepwise Regression
Definition: An automated variable-selection procedure that adds or removes predictors one step at a time based on their statistical significance.
1. Forward Selection:
• Start with no variables.
• Add predictors one by one based on their significance in the model.
2. Backward Elimination:
• Start with all predictors.
• Remove the least significant predictor at each step.
3. Stepwise Selection:
• Combine Forward Selection and Backward Elimination.
• Variables can be added or removed at each step.
Example: Predicting House Prices with Stepwise Regression
Predictors:
• House Size (X1)
• Number of Rooms (X2)
• House Age (X3)
• Wall Color (X4)
1. Forward Selection:
• Start with no predictors.
• Add House Size (X1) because it has the most significant impact.
• Add Number of Rooms (X2) if it improves model performance.
• Continue until no additional predictors improve the model.
2. Backward Elimination:
• Start with all predictors: X1, X2, X3, X4.
• Remove Wall Color (X4) if it has the least significance.
• Repeat until all remaining predictors are significant.
3. Stepwise Selection:
• Combine both methods. Variables can be added or removed based on
their impact on the model.
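A minimal sketch of forward selection in Python using scikit-learn's SequentialFeatureSelector. Note that this selects by cross-validated score rather than the p-values of classical stepwise regression, and the house data here are simulated so that wall color carries no signal:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

# Hypothetical house data: size, rooms, age, wall color (encoded 0/1)
rng = np.random.default_rng(4)
size  = rng.uniform(50, 250, 200)
rooms = rng.integers(1, 7, 200).astype(float)
age   = rng.uniform(0, 60, 200)
color = rng.integers(0, 2, 200).astype(float)
price = 2000 * size + 15000 * rooms - 800 * age + rng.normal(0, 20000, 200)

X = np.column_stack([size, rooms, age, color])
names = np.array(["size", "rooms", "age", "wall_color"])

# Forward selection: greedily add predictors while they improve the CV score
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward")
selector.fit(X, price)
print("selected:", names[selector.get_support()])  # wall_color is typically dropped
```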
Advantages and Disadvantages of Stepwise
Regression:
• Advantages: Automates variable selection and tends to produce simpler, more interpretable models.
• Disadvantages: Can overfit, is sensitive to the order in which variables enter the model, and does not guarantee finding the best overall subset of predictors.
2. Adding More Independent Variables:
• Explanation: Considering only study hours as the independent variable might be insufficient. You can add other variables, such as the amount of sleep, the number of tutoring classes, or even the level of student stress.
• Why it's effective: Adding more independent variables can help the model consider more
factors that affect the score, thereby creating a more accurate model.
• Example: Suppose you add the amount of sleep to the model. Now your model might look like this:
$$ \text{Score} = \beta_0 + \beta_1(\text{Study Hours}) + \beta_2(\text{Sleep Hours}) + \varepsilon $$
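A minimal sketch of this two-predictor model in statsmodels (the coefficients and data are hypothetical):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: exam score as a function of study hours and sleep hours
rng = np.random.default_rng(5)
study = rng.uniform(0, 12, 100)
sleep = rng.uniform(4, 9, 100)
score = 40 + 3.5 * study + 2.0 * sleep + rng.normal(0, 5, 100)

# Score = b0 + b1*StudyHours + b2*SleepHours + error
X = sm.add_constant(np.column_stack([study, sleep]))
model = sm.OLS(score, X).fit()
print(model.params)        # b0, b1, b2
print(model.rsquared_adj)  # compare against the one-variable model
```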
Tools for Validation:
• Popular Techniques:
• Train-Validation Split: Directly splitting data into training and validation sets.
• K-Fold Cross-Validation: Dividing data into K subsets for repeated training and validation.
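A minimal sketch of both techniques with scikit-learn (hypothetical data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, KFold, cross_val_score

# Hypothetical data: y depends on the first of two predictors
rng = np.random.default_rng(6)
X = rng.normal(size=(120, 2))
y = 1 + 2 * X[:, 0] + rng.normal(size=120)

# Train-validation split: hold out 25% of the data for validation
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
print("validation R^2:", model.score(X_val, y_val))

# K-fold cross-validation: 5 rotating train/validation splits
scores = cross_val_score(LinearRegression(), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0),
                         scoring="r2")
print("5-fold R^2 scores:", scores.round(3))
```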