1 Regression Analysis
Start-Tech Academy
Data Exploration
The next step is to use the acquired business knowledge to search for relevant data.
1. Internal Data
Data collected by your organization
E.g. usage data, sales data, promotion data
Cart Abandonment Analysis
1. Input from the marketing team
50% of our customers come from email marketing, 30% from organic search and the remaining 20% from
AdWords marketing
-> Gather the source website data for all customers
2. Input from the product team
We have a 3-step purchase process: cart review, address/personal details, payment
-> Gather the cart abandonment location for all customers
3. Input from industry reports regarding cart abandonment
Customers tend to keep high-value items in their cart for a long duration
-> Gather data on the total cart value of all customers
4. Input from the dry run
Encountered a survey link asking customers to rate the website experience
-> Gather survey data for all customers
DATA DICTIONARY
The next step is to understand the data. You should know each variable's definition and distribution, along with each table's
unique identifiers and foreign keys.
House Pricing Dataset
The data set contains 506 observations of house prices from different towns.
For each house price, data is available on 18 other variables on
which the price is suspected to depend.
price Value of the house
crime_rate Crime rate in that neighborhood
resid_area Proportion of residential area in the town
air_qual Quality of air in that neighborhood
room_num Average number of rooms in houses of that locality
age Age of the house construction, in years
dist1 Distance from employment hub 1
dist2 Distance from employment hub 2
dist3 Distance from employment hub 3
dist4 Distance from employment hub 4
teachers Number of teachers per thousand population in the town
poor_prop Proportion of poor population in the town
airport Is there an airport in the city? (Yes/No)
n_hos_beds Number of hospital beds per 1000 population in the town
n_hot_rooms Number of hotel rooms per 1000 population in the town
waterbody What type of natural fresh water source is there in the city (lake/ river/ both/ none)
rainfall The yearly average rainfall in centimeters
bus_ter Is there a bus terminal in the city? (Yes/No)
parks Proportion of land assigned as parks and green areas in the town
UNIVARIATE ANALYSIS
Univariate analysis is the simplest form of analyzing data. “Uni” means “one”, so in other words your data has only one
variable. It doesn’t deal with causes or relationships (unlike regression), and its major purpose is to describe: it takes
data, summarizes that data and finds patterns in the data.
EDD (EXTENDED DATA DICTIONARY)
Example
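An EDD of this kind can be sketched with pandas. The snippet below is a minimal illustration, not the course's own code: the rows are hypothetical values, and only the column names (price, room_num, n_hos_beds, airport) come from the data dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for a few columns of the house pricing data.
df = pd.DataFrame({
    "price": [24.0, 21.6, 34.7, 33.4, 36.2, 28.7],
    "room_num": [6.5, 6.4, 7.1, 6.9, 7.1, 6.4],
    "n_hos_beds": [5.4, 7.3, np.nan, 9.2, 8.3, 5.9],
    "airport": ["YES", "NO", "NO", "YES", "NO", "YES"],
})

# Numeric variables: count, mean, spread and selected percentiles.
edd_numeric = df.describe(percentiles=[0.01, 0.05, 0.5, 0.95, 0.99]).T

# Add the number of missing values per numeric variable.
edd_numeric["missing"] = df[edd_numeric.index].isna().sum()

# Categorical variables: frequency of each level.
airport_counts = df["airport"].value_counts()

print(edd_numeric)
print(airport_counts)
```

The percentile columns (1%, 5%, 95%, 99%) are the ones typically compared against the minimum and maximum when looking for outliers, and the missing column points to variables that may need imputation.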
Missing Value Imputation
Real-world data often has missing values. Data can have missing values for a number of reasons such as observations
that were not recorded and data corruption.
Impact
• Handling missing data is important, as many machine learning algorithms do
not support data with missing values.
Solution
• Remove rows with missing data from your dataset.
• Impute missing values with mean/median values in your dataset.
Note
• Use business knowledge to take a separate approach for each variable.
• It is advisable to impute rather than remove when the sample size is small or
a large proportion of observations have missing values.
Methods
1. Impute with zero
• Impute missing values with zero
2. Impute with median/mean/mode
• For numerical variables, impute missing values with the mean or median
• For categorical variables, impute missing values with the mode
3. Segment-based imputation
• Identify relevant segments
• Calculate the mean/median/mode of each segment
• Impute the missing value according to its segment
• For example, we can say rainfall hardly varies for cities in a particular
state
• In this case, we can impute the missing rainfall value of a city with the
average of that state
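The three imputation methods can be sketched in pandas as follows. This is an illustration under assumptions: the data is hypothetical, and the state column does not exist in the house pricing dataset; it is introduced here only to show segment-based imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical data; "state" is an assumed segment column for illustration.
df = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B", "B"],
    "rainfall": [20.0, 22.0, np.nan, 50.0, np.nan, 54.0],
    "n_hos_beds": [5.0, np.nan, 7.0, 9.0, 8.0, 6.0],
    "waterbody": ["lake", "river", None, "lake", "lake", "none"],
})

# 1. Impute with zero.
beds_zero = df["n_hos_beds"].fillna(0)

# 2. Impute with the median (numerical) or the mode (categorical).
beds_median = df["n_hos_beds"].fillna(df["n_hos_beds"].median())
water_mode = df["waterbody"].fillna(df["waterbody"].mode()[0])

# 3. Segment-based imputation: fill each gap with the mean rainfall
#    of the other cities in the same state.
state_mean = df.groupby("state")["rainfall"].transform("mean")
rain_segment = df["rainfall"].fillna(state_mean)
```

Each result is kept in its own variable so the methods can be compared side by side; in practice one method is chosen per variable, based on business knowledge.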
Outlier Treatment
Outlier is a term commonly used by analysts and data scientists. An outlier is an observation that lies far away
from, and diverges from, the overall pattern in a sample.
Reasons
• Data entry errors
• Measurement errors
• Sampling errors, etc.
Impact
• Outliers increase the error variance and reduce the power of statistical tests
Solution
• Detect outliers using EDD and visualization methods such as scatter plots,
histograms or box plots
• Impute outliers
Methods
1. Capping and Flooring
• Impute all the values above 3 × P99 and below 0.3 × P1
• Impute them with the values 3 × P99 and 0.3 × P1 respectively
• You can use any multiplier instead of 3, as per your business
requirement
2. Exponential smoothing
• Extrapolate the curve between P95 and P99 and cap all the values falling
outside to the value generated by the curve
• Similarly, extrapolate the curve between P5 and P1
3. Sigma Approach
• Identify outliers by capturing all the values falling outside $\mu \pm x\sigma$
• You can use any multiplier as x, as per your business requirement
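Capping and flooring and the sigma approach can be sketched with NumPy as below. The values are hypothetical (the numbers 1 to 99 plus one extreme observation), and the multipliers 3 and 0.3 are simply the ones quoted above; exponential smoothing is not shown.

```python
import numpy as np

# Hypothetical variable: 1, 2, ..., 99 plus one extreme observation.
values = np.append(np.arange(1.0, 100.0), 1000.0)

# 1. Capping and flooring at 3 * P99 and 0.3 * P1.
p1, p99 = np.percentile(values, [1, 99])
upper, lower = 3 * p99, 0.3 * p1
capped = np.clip(values, lower, upper)

# 3. Sigma approach: flag values outside mu +/- x * sigma.
x_mult = 3  # multiplier x, chosen per business requirement
mu, sigma = values.mean(), values.std()
is_outlier = np.abs(values - mu) > x_mult * sigma

print("cap:", upper, "floor:", lower)
print("flagged by sigma approach:", values[is_outlier])
```

With this sample only the extreme observation is affected: it is pulled down to the cap by the first method and flagged by the second.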
Bivariate Analysis
Bivariate analysis is the simultaneous analysis of two variables (attributes). It explores the concept of relationship
between two variables, whether there exists an association and the strength of this association, or whether there are
differences between two variables and the significance of these differences.
Scatter Plot
• A scatter plot indicates the type (linear or non-linear) and strength of the
relationship between two variables
• We will use scatter plots to decide how to transform variables
Correlation
• Linear correlation quantifies the strength of a linear relationship between
two numerical variables.
• When there is no correlation between two variables, there is no tendency
for the values of one quantity to increase or decrease with the values of the
second quantity.
• Correlation is used to drop unusable variables
Variable Transformation
Transform your existing variables to extract more information out of them
Identify
• Use your business knowledge and bivariate analysis to decide which variables to modify
Methods
• Use the mean/median of variables conveying a similar type of information
• Create ratio variables that are more relevant to the business
• Transform variables by taking the log, exponential, roots, etc.
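Two of these methods can be sketched on the house pricing columns. The rows are hypothetical; averaging dist1 to dist4 into avg_dist matches the variable name used later in the regression equation, while choosing crime_rate for the log transform is an assumption made only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows using column names from the data dictionary.
df = pd.DataFrame({
    "dist1": [4.35, 4.99, 5.03],
    "dist2": [3.81, 4.70, 4.86],
    "dist3": [4.18, 5.12, 5.01],
    "dist4": [4.01, 5.06, 4.97],
    "crime_rate": [0.006, 0.027, 8.98],
})

# Mean of variables conveying a similar type of information.
dist_cols = ["dist1", "dist2", "dist3", "dist4"]
df["avg_dist"] = df[dist_cols].mean(axis=1)

# Log transform; log1p computes log(1 + x), which is defined at x = 0.
df["log_crime_rate"] = np.log1p(df["crime_rate"])

# The four distance columns can now be dropped in favour of avg_dist.
df = df.drop(columns=dist_cols)
```

Replacing four near-duplicate distance columns with one average also reduces the number of highly correlated predictors passed to the model.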
Transformation
If the scatter plot shows a curved relationship, take $x^n$ or $\sqrt[n]{x}$ instead of $x$
Correlation
Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate together. A
positive correlation indicates the extent to which those variables increase or decrease in parallel; a negative correlation
indicates the extent to which one variable increases as the other decreases.
Examples
Some examples of data that have a high correlation:
• Your caloric intake and your weight.
• The amount of time you study and your GPA.
Some examples of data that have a low correlation (or none at all):
• A dog’s name and the type of dog biscuit they prefer.
• The cost of a car wash and how long it takes to buy a soda inside the
station.
The Correlation Coefficient
Definition
• A correlation coefficient is a way to put a value on the relationship.
• Correlation coefficients take values between -1 and 1.
• A value of 0 means there is no linear relationship between the variables at all,
• while -1 or 1 means that there is a perfect negative or positive correlation.
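As a small worked example, assuming the coefficient meant here is Pearson's r (the slides do not name it), the value can be computed directly from its definition and checked against NumPy. The data points are hypothetical.

```python
import numpy as np

# Hypothetical paired observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson's r: the sum of products of deviations, divided by the
# square root of the product of the sums of squared deviations.
dx, dy = x - x.mean(), y - y.mean()
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is r(x, y).
r_numpy = np.corrcoef(x, y)[0, 1]

# A perfectly decreasing linear relationship gives r = -1.
r_neg = np.corrcoef(x, -2 * x + 3)[0, 1]
```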
Correlation vs Causation
Causation: the relation between something that happens and the thing that causes it. The first thing that happens is
the cause and the second thing is the effect.
Source: http://www.tylervigen.com/spurious-correlations
The Correlation Matrix
Definition
• A correlation matrix is a table showing the correlation coefficients between variables.
• Each cell in the table shows the correlation between two variables.
• A correlation matrix is used to summarize data, as an input into a more
advanced analysis, and as a diagnostic for advanced analyses.
Application
• To summarize a large amount of data where the goal is to see patterns.
• To identify collinearity in the data
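A correlation matrix can be produced with pandas in one call. The sketch below uses hypothetical rows with column names from the data dictionary, and the 0.8 cutoff for listing highly correlated predictor pairs is an arbitrary illustrative threshold, not a value from the course.

```python
import pandas as pd

# Hypothetical rows using column names from the data dictionary.
df = pd.DataFrame({
    "price": [24.0, 21.6, 34.7, 33.4, 36.2, 28.7],
    "room_num": [6.5, 6.4, 7.1, 6.9, 7.1, 6.4],
    "poor_prop": [4.9, 9.1, 4.0, 2.9, 5.3, 5.2],
    "parks": [0.049, 0.046, 0.045, 0.047, 0.045, 0.051],
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

# Predictor pairs whose absolute correlation exceeds the cutoff.
cutoff = 0.8  # illustrative threshold (assumption)
predictors = corr.drop(index="price", columns="price")
high_pairs = [
    (a, b)
    for a in predictors.index
    for b in predictors.columns
    if a < b and abs(predictors.loc[a, b]) > cutoff
]

print(corr.round(2))
print(high_pairs)
```

The matrix is symmetric with ones on the diagonal, so only one triangle needs to be read; the `a < b` condition keeps each pair once.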
Multicollinearity
Definition
• Multicollinearity exists whenever two or more of the predictors in a regression
model are moderately or highly correlated.
Effects
• Multicollinearity causes the signs as well as the magnitudes of
the partial regression coefficients to change from one sample to another.
• Multicollinearity makes it difficult to assess the relative importance of the
independent variables in explaining the variation in the dependent
variable.
Solution
• Remove highly correlated independent variables by looking at the correlation
matrix and the variance inflation factor (VIF)
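The VIF of a predictor is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing that predictor on all the others. A minimal NumPy sketch follows; the function name vif and the simulated data are assumptions for illustration, and libraries such as statsmodels offer an equivalent routine.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Simulated predictors: c is almost a copy of a, b is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)

vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

The two nearly identical columns get very large VIFs while the unrelated one stays close to 1, which is the signal used to decide which variable to drop.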
Dummy Variable
A Dummy variable or Indicator Variable is an artificial variable created to represent an attribute with two or
more distinct categories/levels.
Why
• Regression analysis treats all independent (X) variables in the analysis as
numerical.
• Nominal variables, or variables that describe a characteristic using two or more
categories, are commonplace in regression research, but are not always usable
in their categorical form.
• Dummy coding is a way of incorporating nominal variables into regression
analysis
How
• We can make a separate column, or variable, for each category.
• These new variables take the value 0 or 1 depending on the value of the
categorical variable
ID  Subject  Science  Math
1   Science  1        0
2   Science  1        0
3   English  0        0
4   Math     0        1
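The four rows above can be reproduced with pandas. Note that they use one fewer dummy column than there are categories: English has no column of its own and acts as the baseline, encoded as zeros in both. The sketch below assumes the column name subject, which is not in the original.

```python
import pandas as pd

# The four example rows; "subject" is an assumed column name.
df = pd.DataFrame(
    {"subject": ["Science", "Science", "English", "Math"]},
    index=[1, 2, 3, 4],
)

# drop_first=True drops the first category in sorted order ("English"),
# which becomes the baseline encoded as all zeros.
dummies = pd.get_dummies(df["subject"], drop_first=True, dtype=int)

print(dummies)
```

Dropping one category avoids perfect collinearity between the dummy columns and the intercept of the regression.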
Linear Regression
Linear regression is a linear approach to modelling the relationship between a dependent variable and one or more
independent variables.
Questions
1. Prediction question
How accurately can I predict the price of a house, given the values of all
variables?
2. Inferential question
How accurately can we estimate the effect of each of these variables on the
house price?
Simple Linear Regression
Simple linear regression is an approach for predicting a quantitative response Y on the basis of a single predictor
variable X. It assumes that there is approximately a linear relationship between X and Y.
Model Equation
$Y \approx \beta_0 + \beta_1 X$
$\beta_0$ is known as the intercept
$\beta_1$ is known as the slope
Together, $\beta_0$ and $\beta_1$ are known as the model coefficients or parameters.
For the house price data
• X will represent room_num
• Y will represent price
$\text{price} \approx \beta_0 + \beta_1 \times \text{room\_num}$
Estimating the Coefficients
• Our goal is to obtain coefficient estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ such that the linear
model fits the available data well
• Total number of rows (data points): $n = 506$
• Data: $(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_{506}, y_{506})$
• Let's call the calculated $y$ value $\hat{y}$
$\hat{y}_1 = \hat{\beta}_0 + \hat{\beta}_1 x_1$
$\hat{y}_2 = \hat{\beta}_0 + \hat{\beta}_1 x_2$
$\vdots$
$\hat{y}_{506} = \hat{\beta}_0 + \hat{\beta}_1 x_{506}$
• The difference between the ith observed response value and the
ith response value predicted by our linear model is known as the residual
$e_i = y_i - \hat{y}_i$
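A common way to make the model "fit the data well" is least squares, which picks the coefficients that minimise the sum of squared residuals. The slides do not spell out the criterion, so treat that as an assumption. The sketch below applies the closed-form least-squares solution to hypothetical (room_num, price) pairs.

```python
import numpy as np

# Hypothetical (room_num, price) pairs.
x = np.array([5.0, 6.0, 6.5, 7.0, 8.0])
y = np.array([15.0, 21.0, 24.0, 30.0, 38.0])

# Closed-form least-squares estimates for a single predictor.
x_dev = x - x.mean()
beta1_hat = (x_dev * (y - y.mean())).sum() / (x_dev ** 2).sum()
beta0_hat = y.mean() - beta1_hat * x.mean()

# Fitted values and residuals e_i = y_i - y_hat_i.
y_hat = beta0_hat + beta1_hat * x
residuals = y - y_hat

print("intercept:", beta0_hat, "slope:", beta1_hat)
print("residuals:", residuals)
```

With an intercept in the model, the least-squares residuals always sum to zero, which is a quick sanity check on the fit.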
Assessing the Accuracy
We assume that the true relationship between X and Y takes the form Y = f(X) + ε for some unknown function f, where
ε is a mean-zero random error term.
$Y = \beta_0 + \beta_1 X + \varepsilon$
$\beta_0$ is known as the intercept
$\beta_1$ is known as the slope
$\varepsilon$ is an error term
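Standard Error in Coefficients
$\sigma^2 = \mathrm{Var}(\varepsilon)$
$\sigma^2$ is not known, but it can be estimated from the data. The estimate of $\sigma$ is known as
the residual standard error (RSE).
There is approximately a 95% chance that the interval $[\hat{\beta}_1 - 2\,\mathrm{SE}(\hat{\beta}_1),\ \hat{\beta}_1 + 2\,\mathrm{SE}(\hat{\beta}_1)]$
contains the true value of $\beta_1$.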
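The standard errors can be sketched numerically using the usual textbook formulas for simple linear regression, $\mathrm{SE}(\hat{\beta}_1)^2 = \sigma^2 / \sum_i (x_i - \bar{x})^2$ and $\mathrm{SE}(\hat{\beta}_0)^2 = \sigma^2 \left[ 1/n + \bar{x}^2 / \sum_i (x_i - \bar{x})^2 \right]$, with $\sigma$ replaced by the RSE. These formulas are not shown in the slides, and the data is hypothetical.

```python
import numpy as np

# Hypothetical (room_num, price) pairs.
x = np.array([5.0, 6.0, 6.5, 7.0, 8.0, 6.2, 7.4, 5.6])
y = np.array([15.0, 21.0, 24.0, 30.0, 38.0, 23.0, 31.0, 19.0])
n = len(x)

# Least-squares estimates.
sxx = ((x - x.mean()) ** 2).sum()
beta1_hat = ((x - x.mean()) * (y - y.mean())).sum() / sxx
beta0_hat = y.mean() - beta1_hat * x.mean()

# RSE estimates sigma; n - 2 because two coefficients were estimated.
rss = ((y - (beta0_hat + beta1_hat * x)) ** 2).sum()
rse = np.sqrt(rss / (n - 2))

# Standard errors of the two coefficient estimates.
se_beta1 = rse / np.sqrt(sxx)
se_beta0 = rse * np.sqrt(1.0 / n + x.mean() ** 2 / sxx)

# Approximate 95% confidence interval for the slope.
ci_beta1 = (beta1_hat - 2 * se_beta1, beta1_hat + 2 * se_beta1)
print("slope:", beta1_hat, "approx. 95% CI:", ci_beta1)
```

If the interval for the slope excludes zero, the data supports a relationship between the predictor and the response.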
Quality of Fit
The quality of a linear regression fit is typically assessed using two related
quantities: the residual standard error (RSE) and the $R^2$ statistic.
RSE
• RSE is the average amount that the response will deviate from the true
regression line
• RSE is also considered a measure of the lack of fit of the model to the data
The RSE provides an absolute measure of lack of fit of the model to the data, measured in the units of Y.
$R^2$
• $R^2$ is the proportion of variance explained
• $R^2$ always takes a value between 0 and 1
• $R^2$ is independent of the scale of Y
$R^2 = \dfrac{TSS - RSS}{TSS} = 1 - \dfrac{RSS}{TSS}$
• TSS: total sum of squares
• RSS: residual sum of squares
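Both quantities can be computed from RSS and TSS. The sketch below uses hypothetical data and a helper named fit_quality (an assumed name), and it also illustrates the scale point: multiplying Y by 1000 multiplies the RSE by 1000 but leaves $R^2$ unchanged.

```python
import numpy as np

def fit_quality(x, y):
    """Fit y on x by least squares and return (RSE, R^2)."""
    slope, intercept = np.polyfit(x, y, 1)
    y_hat = intercept + slope * x
    rss = ((y - y_hat) ** 2).sum()       # residual sum of squares
    tss = ((y - y.mean()) ** 2).sum()    # total sum of squares
    rse = np.sqrt(rss / (len(x) - 2))
    return rse, 1.0 - rss / tss

# Hypothetical (room_num, price) pairs.
x = np.array([5.0, 6.0, 6.5, 7.0, 8.0, 6.2, 7.4, 5.6])
y = np.array([15.0, 21.0, 24.0, 30.0, 38.0, 23.0, 31.0, 19.0])

rse, r2 = fit_quality(x, y)

# Same data with the response expressed in different units.
rse_scaled, r2_scaled = fit_quality(x, 1000 * y)

print("RSE:", rse, "R^2:", r2)
print("after rescaling Y -> RSE:", rse_scaled, "R^2:", r2_scaled)
```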
Multiple Linear Regression
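In multiple linear regression, more than one predictor variable is used to predict the response variable.
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon$
$\beta_0$ is known as the intercept
$p$ is the number of predictors
$\varepsilon$ is an error term
For our model, the equation is
$\text{price} = \beta_0 + \beta_1\,\text{crime\_rate} + \beta_2\,\text{poor\_prop} + \cdots + \beta_{16}\,\text{avg\_dist} + \varepsilon$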
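The coefficients of such a model are commonly estimated by least squares, exactly as in the single-predictor case. The sketch below is an illustration on simulated data, not the course's own fit: the three predictors, the "true" coefficients and the noise level are all assumptions chosen so the estimates can be checked.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Simulated predictors standing in for, e.g., crime_rate, poor_prop, avg_dist.
X = rng.normal(size=(n, 3))

# Assumed "true" coefficients (intercept first) used to simulate the response.
true_beta = np.array([22.0, -1.5, -3.0, 0.8])
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.5, size=n)

# Design matrix with a leading column of ones for the intercept.
design = np.column_stack([np.ones(n), X])

# Least-squares estimates of beta_0 ... beta_p.
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

# Fitted values and residuals.
y_hat = design @ beta_hat
residuals = y - y_hat

print("estimated coefficients:", beta_hat.round(3))
```

Each estimated coefficient is read as the average change in the response for a one-unit increase in that predictor, holding the other predictors fixed.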