1 Regression Analysis

Uploaded by

Ayush Raj
Copyright
© © All Rights Reserved

BUSINESS KNOWLEDGE

Example: Cart Abandonment Analysis

Problem:
A high fraction of your online customers add products to their cart but do not
purchase them.

Business knowledge that will be helpful:

1. Discussions with the marketing team
2. Discussions with the product team
3. A dry run of the online purchasing process to understand the customer journey
4. Research on industry reports regarding cart abandonment
5. Any previous work in your own or another organization regarding cart abandonment

Start-Tech Academy
Data Exploration
The next step is to use the acquired business knowledge to search for relevant data.

Steps
1. Identify data need
2. Plan data request
3. Quality checks

1. Internal data
   Data collected by your organization
   E.g. usage data, sales data, promotion data
2. External data
   Data acquired from external data sources
   E.g. census data, external vendor data, scraped data

Data Exploration
Example: Cart Abandonment Analysis

1. Input from the marketing team
   50% of our customers come from email marketing, 30% from organic search and the remaining 20% from AdWords marketing.
   -> Gather the source website data for all customers
2. Input from the product team
   We have a 3-step purchase process: cart review, address/personal details, payment.
   -> Gather the cart abandonment location for all customers
3. Input from industry reports regarding cart abandonment
   Customers tend to keep high-value items in their cart for a long duration.
   -> Gather data on the total cart value of all customers
4. Input from the dry run
   Encountered a survey link asking customers to rate the website experience.
   -> Gather survey data for all customers

DATA DICTIONARY
The next step is to understand the data. You should know each variable's definition and distribution, along with each table's unique identifiers and foreign keys.

A comprehensive data dictionary should include:

1. Definition of predictors
2. Unique identifier of each table (primary keys)
3. Foreign keys or matching keys between tables
4. Explanation of values in the case of categorical variables

https://youtu.be/76Y6Tg1glrQ

DATA DICTIONARY
Example: House Pricing Dataset

The data set contains 506 observations of house prices from different towns.
For each house price, data is available on 18 other variables on which the price
is suspected to depend.

Variable      Description
price         Value of the house
crime_rate    Crime rate in that neighborhood
resid_area    Proportion of residential area in the town
air_qual      Quality of air in that neighborhood
room_num      Average number of rooms in houses of that locality
age           Age of the house construction in years
dist1         Distance from employment hub 1
dist2         Distance from employment hub 2
dist3         Distance from employment hub 3
dist4         Distance from employment hub 4
teachers      Number of teachers per thousand population in the town
poor_prop     Proportion of poor population in the town
airport       Is there an airport in the city? (Yes/No)
n_hos_beds    Number of hospital beds per 1000 population in the town
n_hot_rooms   Number of hotel rooms per 1000 population in the town
waterbody     Type of natural fresh water source in the city (lake/river/both/none)
rainfall      Yearly average rainfall in centimeters
bus_ter       Is there a bus terminal in the city? (Yes/No)
parks         Proportion of land assigned as parks and green areas in the town

UNIVARIATE ANALYSIS
Univariate analysis is the simplest form of analyzing data. “Uni” means “one”, so the analysis looks at only one variable at a time. It doesn’t deal with causes or relationships (unlike regression); its main purpose is to describe: it takes data, summarizes it and finds patterns in it.

Ways to describe patterns found in univariate data

1. Central tendency
   1. Mean
   2. Mode
   3. Median
2. Dispersion
   1. Range
   2. Variance
   3. Maximum, minimum
   4. Quartiles (including the interquartile range)
   5. Standard deviation
3. Count / null count

EDD (EXTENDED DATA DICTIONARY)

Example
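The EDD example itself did not survive in this text. As a rough sketch, assuming an extended data dictionary is a per-variable table of the univariate statistics listed for univariate analysis, it could be built with pandas as follows. The sample values and the helper name `extended_data_dictionary` are hypothetical; only the column names come from the data dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for a few House Pricing columns.
df = pd.DataFrame({
    "price": [24.0, 21.6, 34.7, 33.4, 36.2, 28.7],
    "room_num": [6.6, 6.4, 7.2, 7.0, np.nan, 6.4],
    "airport": ["YES", "NO", "NO", "YES", "NO", "YES"],
})


def extended_data_dictionary(frame):
    """Summarize every numeric column: count, mean, std, percentiles, nulls."""
    numeric = frame.select_dtypes(include="number")
    # describe() gives count, mean, std, min, the requested percentiles and max.
    edd = numeric.describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.99]).T
    edd["null_count"] = numeric.isna().sum()
    edd["range"] = edd["max"] - edd["min"]
    edd["variance"] = numeric.var()
    edd["iqr"] = edd["75%"] - edd["25%"]
    return edd


edd = extended_data_dictionary(df)
print(edd)

# For a categorical variable, the counts per level play the same role.
print(df["airport"].value_counts(dropna=False))
```

Comparing the 99th percentile with the maximum, or the 1st percentile with the minimum, in such a table is one quick way to spot candidate outliers before plotting.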

Missing Value Imputation
Real-world data often has missing values. Data can have missing values for a number of reasons, such as observations that were not recorded and data corruption.

Impact
• Handling missing data is important because many machine learning algorithms do not support data with missing values.

Solution
• Remove rows with missing data from your dataset.
• Impute missing values with mean/median values in your dataset.

Note
• Use business knowledge to take a separate approach for each variable.
• It is advisable to impute rather than remove when the sample size is small or a large proportion of observations have missing values.

Methods
1. Impute with zero
   • Impute missing values with zero
2. Impute with median/mean/mode
   • For numerical variables, impute missing values with the mean or median
   • For categorical variables, impute missing values with the mode
3. Segment-based imputation
   • Identify relevant segments
   • Calculate the mean/median/mode of each segment
   • Impute each missing value according to its segment
   • For example, rainfall hardly varies between cities in a particular state, so we can impute the missing rainfall value of a city with the average of that state

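The imputation methods above can be sketched with pandas. This is a minimal illustration on made-up rows; the `state` column is an assumption added only to show segment-based imputation and is not part of the data dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; "state" is an assumed segment column.
df = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B", "B"],
    "rainfall": [30.0, np.nan, 34.0, 50.0, 54.0, np.nan],
    "n_hos_beds": [5.0, 7.0, np.nan, 9.0, 6.0, 8.0],
    "waterbody": ["lake", None, "river", "lake", "lake", "both"],
})

# Numerical variable: impute with the median of the column.
df["n_hos_beds"] = df["n_hos_beds"].fillna(df["n_hos_beds"].median())

# Categorical variable: impute with the mode (most frequent level).
df["waterbody"] = df["waterbody"].fillna(df["waterbody"].mode()[0])

# Segment-based: impute rainfall with the mean rainfall of the same state.
state_mean = df.groupby("state")["rainfall"].transform("mean")
df["rainfall"] = df["rainfall"].fillna(state_mean)

print(df)
```

`transform("mean")` returns one value per row (the mean of that row's state), which is what lets `fillna` pick the right segment average for each missing entry.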
Outlier Treatment
"Outlier" is a term commonly used by analysts and data scientists. An outlier is an observation that lies far away from, and diverges from, the overall pattern in a sample.

Reasons
• Data entry errors
• Measurement errors
• Sampling errors, etc.

Impact
• Outliers increase the error variance and reduce the power of statistical tests

Solution
• Detect outliers using the EDD and visualization methods such as scatter plots, histograms or box plots
• Impute outliers

Outlier Treatment

Example

                     Without Outlier          With Outlier
Data                 6,6,6,4,4,5,5,5,5,7,7    6,6,6,4,4,5,5,5,5,7,7,300
Mean                 5.45                     30.0
Median               5                        5.5
Mode                 5                        5
Standard deviation   1.04                     85.03
Variance             1.07                     7230.73

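The effect of the single value 300 can be checked directly. The sketch below recomputes the summary statistics for both data sets with Python's standard `statistics` module, assuming the sample (n - 1) formulas for standard deviation and variance; the helper name `summarize` is illustrative.

```python
import statistics

without_outlier = [6, 6, 6, 4, 4, 5, 5, 5, 5, 7, 7]
with_outlier = without_outlier + [300]


def summarize(data):
    """Return the summary statistics used in the example, rounded to 2 places."""
    return {
        "mean": round(statistics.mean(data), 2),
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        # stdev/variance use the sample (n - 1) denominator.
        "stdev": round(statistics.stdev(data), 2),
        "variance": round(statistics.variance(data), 2),
    }


for label, data in [("without", without_outlier), ("with", with_outlier)]:
    print(label, summarize(data))
```

The median and mode barely move, while the mean, standard deviation and variance change drastically, which is why the mean-based statistics are the ones to watch when looking for outliers.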
Outlier Treatment

Methods
1. Capping and flooring
   • Treat all values above 3 × P99 or below 0.3 × P1 as outliers, where P1 and P99 are the 1st and 99th percentiles
   • Replace them with 3 × P99 and 0.3 × P1 respectively
   • You can use any multiplier instead of 3, as per your business requirement
2. Exponential smoothing
   • Extrapolate the curve between P95 and P99 and cap all values falling outside it to the value generated by the curve
   • Similarly, extrapolate the curve between P5 and P1
3. Sigma approach
   • Identify outliers by capturing all values falling outside μ ± xσ
   • You can use any multiplier as x, as per your business requirement

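As a minimal sketch of the capping/flooring and sigma approaches, assuming non-negative data (so that 0.3 × P1 lies below P1) and NumPy's default linear-interpolation percentiles: the function names and the sample data are hypothetical.

```python
import numpy as np


def cap_and_floor(values, upper_mult=3.0, lower_mult=0.3):
    """Clip values to the interval [lower_mult * P1, upper_mult * P99]."""
    values = np.asarray(values, dtype=float)
    p1, p99 = np.percentile(values, [1, 99])
    return np.clip(values, lower_mult * p1, upper_mult * p99)


def sigma_outliers(values, x=3.0):
    """Return the values lying outside mean +/- x * (sample standard deviation)."""
    values = np.asarray(values, dtype=float)
    mu = values.mean()
    sigma = values.std(ddof=1)
    return values[np.abs(values - mu) > x * sigma]


# Hypothetical data: the integers 1..100 plus one extreme value.
data = np.array(list(range(1, 101)) + [10000])

capped = cap_and_floor(data)
flagged = sigma_outliers(data)

print("max before/after capping:", data.max(), capped.max())
print("flagged by the 3-sigma rule:", flagged)
```

Note that on very small samples the 99th percentile is itself pulled up by the outlier, so capping at a multiple of P99 may leave the extreme value untouched; the sigma rule has the same weakness because the outlier inflates σ.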
Bivariate Analysis
Bivariate analysis is the simultaneous analysis of two variables (attributes). It explores the relationship between two variables: whether an association exists and how strong it is, or whether there are differences between the two variables and how significant these differences are.

Scatter Plot
• A scatter plot indicates the type (linear or non-linear) and strength of the relationship between two variables
• We will use scatter plots to decide how to transform variables

Correlation
• Linear correlation quantifies the strength of a linear relationship between two numerical variables.
• When there is no correlation between two variables, there is no tendency for the values of one quantity to increase or decrease with the values of the second quantity.
• Correlation is used to drop unusable variables

Scatter plots


Variable Transformation
Transform your existing variables to extract more information from them.

Creating new variables

Identify
• Use your business knowledge and bivariate analysis to decide which variables to modify

Methods
• Use the mean/median of variables conveying a similar type of information
• Create ratio variables that are more relevant to the business
• Transform variables by taking the log, exponential, roots, etc.

Transformation
Depending on the shape of the scatter plot (the plots are not recoverable from the source), one of the following can be used:

• Take e^x instead of x
• Take log(1 + x) instead of x
• Take a power of x, such as x^n, instead of x

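Two of the transformation methods can be sketched on the House Pricing columns. The rows below are made up; `avg_dist` appears in the regression equation later in the deck, and it is assumed here (not stated in the source) to be the mean of `dist1` to `dist4`. Applying log(1 + x) to `crime_rate` is likewise only an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows using column names from the data dictionary.
df = pd.DataFrame({
    "dist1": [4.35, 4.99, 5.03],
    "dist2": [3.81, 4.70, 4.86],
    "dist3": [4.18, 5.12, 5.01],
    "dist4": [4.01, 5.06, 4.97],
    "crime_rate": [0.006, 0.027, 8.98],
})

# Mean of variables conveying similar information: one distance instead of four.
dist_cols = ["dist1", "dist2", "dist3", "dist4"]
df["avg_dist"] = df[dist_cols].mean(axis=1)
df = df.drop(columns=dist_cols)

# log(1 + x) compresses a long right tail; log1p is defined at x = 0.
df["crime_rate_log"] = np.log1p(df["crime_rate"])

print(df)
```

After a transformation like this, re-plotting the new variable against the response is the natural check that the relationship has become closer to linear.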
Correlation
Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate together. A positive correlation indicates the extent to which those variables increase or decrease in parallel; a negative correlation indicates the extent to which one variable increases as the other decreases.

Examples
Some examples of data that have a high correlation:
• Your caloric intake and your weight.
• The amount of time you study and your GPA.

Some examples of data that have a low correlation (or none at all):
• A dog’s name and the type of dog biscuit they prefer.
• The cost of a car wash and how long it takes to buy a soda inside the station.

The Correlation Coefficient

Definition
• A correlation coefficient puts a numerical value on the relationship.
• Correlation coefficients take values between -1 and 1.
• A value of 0 means there is no linear relationship between the variables at all.
• A value of -1 or 1 means there is a perfect negative or positive correlation.

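The slides do not say which coefficient is meant; assuming the usual Pearson correlation coefficient, it can be computed from centered data as below. The function name `pearson_r` and the sample values are illustrative.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx @ dy) / np.sqrt((dx @ dx) * (dy @ dy)))


# Hypothetical values.
hours_studied = [1.0, 2.0, 3.0, 4.0, 5.0]
score = [52.0, 55.0, 61.0, 64.0, 70.0]

print("increasing together:", pearson_r(hours_studied, score))
print("perfect negative:   ", pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))

# NumPy's built-in gives the same number in the off-diagonal cell.
print("np.corrcoef check:  ", np.corrcoef(hours_studied, score)[0, 1])
```

Because the coefficient only measures linear association, a value near 0 does not rule out a strong non-linear relationship, which is one reason to look at the scatter plot as well.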
Correlation vs Causation
Causation: the relation between something that happens and the thing that causes it. The first thing that happens is the cause and the second thing is the effect. A correlation between two variables does not by itself show that one causes the other.

Source: http://www.tylervigen.com/spurious-correlations

The Correlation Matrix

Definition
• A correlation matrix is a table showing correlation coefficients between variables.
• Each cell in the table shows the correlation between two variables.
• A correlation matrix is used as a way to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses.

Application
• To summarize a large amount of data where the goal is to see patterns.
• To identify collinearity in the data.

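A correlation matrix can be produced with pandas `DataFrame.corr()`. The sketch below uses synthetic data in which `air_qual` is deliberately constructed from `parks`, purely to create a collinear pair; that relationship and the 0.8 threshold are assumptions for illustration, not properties of the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic columns; names borrowed from the data dictionary.
room_num = rng.normal(6.3, 0.7, n)
parks = rng.normal(0.05, 0.01, n)
air_qual = 10 * parks + rng.normal(0, 0.01, n)   # built to track parks closely
price = 9 * room_num + rng.normal(0, 3, n)

df = pd.DataFrame({
    "price": price,
    "room_num": room_num,
    "parks": parks,
    "air_qual": air_qual,
})

corr = df.corr()          # Pearson correlation between every pair of columns
print(corr.round(2))

# List each pair of variables whose absolute correlation exceeds the threshold.
threshold = 0.8           # assumed cut-off
high = [
    (a, b, round(corr.loc[a, b], 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high)
```

A highly correlated pair that involves the response (here `price` and `room_num`) marks a useful predictor, while a highly correlated pair of predictors (here `parks` and `air_qual`) is a collinearity warning.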
Multicollinearity

Definition
• Multicollinearity exists whenever two or more of the predictors in a regression model are moderately or highly correlated.

Effects
• Multicollinearity causes the signs as well as the magnitudes of the partial regression coefficients to change from one sample to another.
• Multicollinearity makes it difficult to assess the relative importance of the independent variables in explaining the variation in the dependent variable.

Solution
• Remove highly correlated independent variables by looking at the correlation matrix and the VIF (variance inflation factor).

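The slides name VIF without defining it. A common definition is VIF_j = 1 / (1 - R_j²), where R_j² is the R² obtained by regressing predictor j on all the other predictors. The sketch below implements that definition with NumPy on synthetic data; the function name `vif` and the data are illustrative, and a perfectly collinear predictor would make the denominator zero.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of the predictor matrix X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress predictor j on the remaining predictors plus an intercept.
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        rss = residuals @ residuals
        tss = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - rss / tss
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)


rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # nearly a copy of x1
x3 = rng.normal(size=200)              # unrelated to the others

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs.round(1))
```

A VIF close to 1 means the predictor is nearly uncorrelated with the rest; rules of thumb often treat values above 5 or 10 as problematic, though the cut-off is a judgment call.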
Dummy Variable
A dummy variable, or indicator variable, is an artificial variable created to represent an attribute with two or more distinct categories/levels.

Why
• Regression analysis treats all independent (X) variables in the analysis as numerical.
• Nominal variables, or variables that describe a characteristic using two or more categories, are commonplace in regression research, but are not always usable in their categorical form.
• Dummy coding is a way of incorporating nominal variables into regression analysis.

How
• We can make a separate column, or variable, for each category (one category can be left out as the baseline, as explained below).
• Each new variable takes the value 0 or 1 depending on the value of the categorical variable.

Example

Student   Favorite class   Science   Math
1         Science          1         0
2         Science          1         0
3         English          0         0
4         Math             0         1

Things to keep in mind
• The number of dummy variables necessary to represent a single attribute variable is equal to the number of levels (categories) in that variable minus one.
• We cannot code the categories as science = 1, math = 2 and English = 3, because there is no such thing as an increase in favorite class: math is not higher than science, and is not lower than English either. And even if there were an ordering, we could not quantify the increase.

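The favorite-class example can be reproduced with `pandas.get_dummies`. This is one possible approach, not necessarily the one used in the course; `drop_first=True` drops the alphabetically first level, which here happens to be English, matching the table.

```python
import pandas as pd

students = pd.DataFrame({
    "student": [1, 2, 3, 4],
    "favorite_class": ["Science", "Science", "English", "Math"],
})

# One 0/1 column per category, minus the first (baseline) category.
dummies = pd.get_dummies(students["favorite_class"], drop_first=True, dtype=int)
encoded = pd.concat([students, dummies], axis=1)

print(encoded)
```

The English student has 0 in both dummy columns, so English acts as the baseline against which the Science and Math coefficients would be interpreted in a regression.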
Linear Regression
Linear regression is a linear approach to modelling the relationship between a dependent variable and one or more independent variables.

Here are a few important questions that we might seek to address:

1. Prediction question
   How accurately can I predict the price of a house, given the values of all the variables?
2. Inferential question
   How accurately can we estimate the effect of each of these variables on the house price?

Simple Linear Regression
Simple linear regression is an approach for predicting a quantitative response Y on the basis of a single predictor variable X. It assumes that there is approximately a linear relationship between X and Y.

Model Equation
$Y \approx \beta_0 + \beta_1 X$
• $\beta_0$ is known as the intercept
• $\beta_1$ is known as the slope
Together, $\beta_0$ and $\beta_1$ are known as the model coefficients or parameters.

For the House Price data
• X will represent room_num
• Y will represent price
$\text{price} \approx \beta_0 + \beta_1 \times \text{room\_num}$

From our training data we will obtain the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$.

Simple Linear Regression

Estimating the Coefficients
• Our goal is to obtain coefficient estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ such that the linear model fits the available data well
• Total number of rows (data points): $n = 506$
• Data: $(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_{506}, y_{506})$
• Let's call the calculated value of $y$ $\hat{y}$:
  $\hat{y}_1 = \hat{\beta}_0 + \hat{\beta}_1 x_1$
  $\hat{y}_2 = \hat{\beta}_0 + \hat{\beta}_1 x_2$
  $\ldots$
  $\hat{y}_{506} = \hat{\beta}_0 + \hat{\beta}_1 x_{506}$

Residual
• The difference between the ith observed response value and the ith response value predicted by our linear model is known as the residual
  $e_i = y_i - \hat{y}_i$

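Simple Linear Regression

Residual sum of squares (RSS)

$RSS = e_1^2 + e_2^2 + \cdots + e_n^2$

The least squares approach chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the RSS.

Using some calculus, one can show that the minimizers are

$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

where $\bar{x}$ and $\bar{y}$ are the sample means of the $x_i$ and $y_i$.
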
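The least squares estimates can be computed directly from the standard closed-form expressions, slope = Σ(x_i - x̄)(y_i - ȳ) / Σ(x_i - x̄)² and intercept = ȳ - slope · x̄. The sketch below does this with NumPy on five made-up points; the function name and the data are illustrative, not taken from the House Price data.

```python
import numpy as np


def fit_simple_linear_regression(x, y):
    """Return (beta0_hat, beta1_hat) minimizing the residual sum of squares."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0 = y_bar - beta1 * x_bar
    return float(beta0), float(beta1)


# Hypothetical data points.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

beta0_hat, beta1_hat = fit_simple_linear_regression(x, y)

y_hat = beta0_hat + beta1_hat * x     # fitted values
residuals = y - y_hat                 # e_i = y_i - y_hat_i
rss = float(np.sum(residuals ** 2))   # residual sum of squares

print("intercept:", beta0_hat, "slope:", beta1_hat, "RSS:", rss)
```

Any other choice of intercept and slope gives a larger RSS on the same data, which is what "least squares" means.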
Simple Linear Regression

For our model
[Figure: model output, not recoverable from the source]

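Simple Linear Regression

Assessing the Accuracy
We assume that the true relationship between X and Y takes the form $Y = f(X) + \varepsilon$ for some unknown function $f$, where $\varepsilon$ is a mean-zero random error term.

If $f$ is to be approximated by a linear function, then we can write this relationship as

$Y = \beta_0 + \beta_1 X + \varepsilon$

• $\beta_0$ is known as the intercept
• $\beta_1$ is known as the slope
• $\varepsilon$ is an error term

The population regression line is $Y = \beta_0 + \beta_1 X$, the best linear approximation to the true relationship. The sample regression line is the least squares line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ estimated from the observed data.
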
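Simple Linear Regression

Standard Error
$\sigma^2 = \mathrm{Var}(\varepsilon)$
$\sigma^2$ is not known, but it can be estimated from the data. The estimate of $\sigma$ is known as the residual standard error (RSE).

In Coefficients
The standard error of the slope estimate is given by $SE(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$.

There is approximately a 95% chance that the interval

$\left[\hat{\beta}_1 - 2 \cdot SE(\hat{\beta}_1),\ \hat{\beta}_1 + 2 \cdot SE(\hat{\beta}_1)\right]$

will contain the true value of $\beta_1$.
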
Simple Linear Regression

Hypothesis Tests
Is there any relationship between X and Y?

$Y = \beta_0 + \beta_1 X$

• If $\beta_1$ is zero, there is no relationship
  $H_0$: There is no relationship between X and Y
  $H_a$: There is some relationship between X and Y
  $H_0: \beta_1 = 0$
  $H_a: \beta_1 \neq 0$
• To reject $H_0$, we calculate the t-statistic
  $t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}$
• We also compute the probability of observing any value equal to $|t|$ or larger, assuming $\beta_1 = 0$
• We call this probability the p-value
• A small p-value (typically less than 5% or 1%) indicates an association between the predictor and the response

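The standard error, the approximate 95% interval, the t-statistic and the p-value for the slope can be sketched as follows. This assumes the usual formulas for simple linear regression (RSE = sqrt(RSS / (n - 2)), SE of the slope = RSE / sqrt(Σ(x_i - x̄)²), and a t-distribution with n - 2 degrees of freedom) and uses SciPy for the t-distribution; the five data points are made up.

```python
import numpy as np
from scipy import stats

# Hypothetical data points.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

# Least squares estimates.
sxx = np.sum((x - x.mean()) ** 2)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0 = y.mean() - beta1 * x.mean()

# Residual standard error estimates sigma.
residuals = y - (beta0 + beta1 * x)
rse = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# Standard error of the slope and the approximate 95% interval.
se_beta1 = rse / np.sqrt(sxx)
ci_95 = (beta1 - 2 * se_beta1, beta1 + 2 * se_beta1)

# t-statistic for H0: beta1 = 0 and its two-sided p-value.
t_stat = beta1 / se_beta1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print("slope:", beta1, "SE:", se_beta1)
print("approx. 95% interval:", ci_95)
print("t:", t_stat, "p-value:", p_value)
```

With so few points the factor 2 in the interval is only a rough stand-in for the exact t quantile, so the interval should be read as approximate.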
Simple Linear Regression

Quality of Fit
The quality of a linear regression fit is typically assessed using two related quantities: the residual standard error (RSE) and the $R^2$ statistic.

Residual Standard Error

$RSE = \sqrt{\frac{RSS}{n - 2}}$

• RSE is the average amount that the response will deviate from the true regression line
• RSE provides an absolute measure of lack of fit of the model to the data

$R^2$

$R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$

• $R^2$ is the proportion of variance explained
• $R^2$ always takes on a value between 0 and 1
• $R^2$ is independent of the scale of Y
• TSS is the total sum of squares, $\sum_{i=1}^{n} (y_i - \bar{y})^2$
• RSS is the residual sum of squares

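Both quality-of-fit measures follow from the residuals. The sketch below assumes the standard definitions RSE = sqrt(RSS / (n - 2)) and R² = 1 - RSS / TSS, fits the line with `numpy.polyfit`, and uses made-up data.

```python
import numpy as np

# Hypothetical data points.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

# polyfit with deg=1 returns the least squares [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

rss = float(np.sum((y - y_hat) ** 2))       # residual sum of squares
tss = float(np.sum((y - y.mean()) ** 2))    # total sum of squares

rse = (rss / (n - 2)) ** 0.5                # in the units of y
r_squared = 1.0 - rss / tss                 # unit-free, between 0 and 1

print("RSE:", rse)
print("R^2:", r_squared)

# Rescaling y changes the RSE but leaves R^2 unchanged.
slope_k, intercept_k = np.polyfit(x, 1000 * y, deg=1)
rss_k = float(np.sum((1000 * y - (intercept_k + slope_k * x)) ** 2))
tss_k = float(np.sum((1000 * y - (1000 * y).mean()) ** 2))
r_squared_k = 1.0 - rss_k / tss_k
print("R^2 after rescaling y:", r_squared_k)
```

The last lines illustrate why RSE is called an absolute measure while R² is independent of the scale of Y.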
Multiple Linear Regression
In multiple linear regression, more than one predictor variable is used to predict the response variable.

The relationship for multiple linear regression can be written as

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon$

• $\beta_0$ is known as the intercept
• $p$ is the number of predictors
• $\varepsilon$ is an error term

For our model, the equation is
$\text{price} = \beta_0 + \beta_1\,\text{crime\_rate} + \beta_2\,\text{poor\_prop} + \cdots + \beta_{16}\,\text{avg\_dist}$

Multiple Linear Regression

Estimating Regression Coefficients

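The estimation details for this slide are not recoverable from the source. Assuming the coefficients are again chosen by least squares, a minimal NumPy sketch with two predictors looks like this. The data are synthetic, generated from coefficients picked only for illustration, and the column names are borrowed from the data dictionary.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 506

# Synthetic predictors and response; the "true" coefficients are arbitrary.
room_num = rng.normal(6.3, 0.7, n)
poor_prop = rng.normal(12.0, 7.0, n)
price = -1.0 + 5.0 * room_num - 0.6 * poor_prop + rng.normal(0, 2.0, n)

# Design matrix: a column of ones for the intercept, then one column per predictor.
X = np.column_stack([np.ones(n), room_num, poor_prop])

# Least squares solution of X @ beta = price.
beta_hat, *_ = np.linalg.lstsq(X, price, rcond=None)

residuals = price - X @ beta_hat
rss = float(residuals @ residuals)

print("intercept, room_num, poor_prop:", beta_hat.round(3))
print("RSS:", round(rss, 1))
```

Each coefficient is interpreted as the expected change in the response for a one-unit change in that predictor with the other predictors held fixed, which is where multicollinearity starts to matter.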
