PREDICTIVE ANALYTICS
Digital Notes
Compiled by
Dr. I. J. Raghavendra, Associate Professor, MBA-MRCET
Prof. G. Naveen Kumar, HOD, MBA-MRCET
SUBJECT EXPERT
Dr. P.NAGAJYOTHI
MBA, PhD
Assistant Professor,
Department of Business Management,
Malla Reddy College of Engineering & Technology.
PREDICTIVE ANALYTICS
This is an elective paper under the Business Analytics specialization of the MBA course. Business Analytics has become one of the most important skills that every business school student should acquire to be successful in a management career.
Course Aim: To know various predictive data analysis models and use analytical tools to solve real-life business problems, and to understand basic forecasting techniques used to predict future values.
Learning Outcome: The students should be able to assess the suitability of predictive models for effective business decisions.
The students will be able to use valid and reliable ways to collect, analyze, and visualize data, and thereby utilize it in decision making.
To enhance their skills in linear and logistic regression.
To apply forecasting techniques in making effective business decisions.
UNIT-I: Simple Regression Analysis: Concept - Fundamentals of Regression Analysis - Requirements in Regression Model Building - Model Diagnostics - Interpretation of Regression Results for Management Decision.
Multiple Regression Analysis: Concept - Significance of Multiple Regression Analysis - Structure of Model Estimation - Testing Rule of Multiple Regression Analysis.
UNIT-II: Non-linear Regression and Regression Modelling
Non-Linear Regression Analysis: Concept - Types of Non-linear Regression Models - Model Transformation - Difference between Linear and Non-linear Regression Models.
Diagnostics of Regression Modelling: Model Diagnostics - Multicollinearity - Autocorrelation
UNIT-III: Dummy Modelling and Panel Data Model
Dummy Modelling: Dummy Independent Variable Modelling - Linear Probability Model - Logit Model - Probit Model
Panel Data Model: Concept - Panel Data Models - Fixed Effects Model - Random Effects Model - Forms of Panel Data Models - Applications of Panel Data Models.
UNIT-1
The use of predictive analytics has grown consistently over the last five years. Predictive analytics (also known as advanced analytics) is increasingly being linked to business intelligence.
Understanding Predictive Analytics
Let us take an example of a certain organization that wants to know what its profit will be after a
few years in the business, given the current trends in sales, the customer base in different
locations, etc. Predictive analytics will use the given variables, together with techniques such as data mining and artificial intelligence, to predict the future profit or any other factor the organization is interested in.
What is Predictive Analytics?
Predictive analytics is a significant analytical approach used by many firms to assess risk,
forecast future business trends, and predict when maintenance is required. Data scientists use
historical data as their source and utilize various regression models and machine learning
techniques to detect patterns and trends in the data.
Here are a few examples of how businesses are using predictive analytics:
Customer Service
Businesses may better estimate demand by utilizing advanced and effective analytics and
business intelligence. Consider a hotel company that wants to estimate how many people will
stay in a certain area this weekend so that they can guarantee they have adequate employees and
resources to meet demand.
Higher Education
Predictive analytics applications in higher education include enrollment management,
fundraising, recruiting, and retention. Predictive analytics offers a significant advantage in each
of these areas by offering intelligent insights that would otherwise be neglected.
A prediction algorithm can rate each student and tell administrators ways to serve students
during the duration of their enrollment using data from a student's high school years.
Models can give crucial information to fundraisers regarding the optimal times and strategies for
reaching out to prospective and current donors.
Supply Chain
Forecasting is an important concern in manufacturing because it guarantees that resources in a
supply chain are used optimally. Inventory management and the shop floor, for example, are critical spokes of the supply chain wheel that require accurate forecasts to function.
Predictive modeling is frequently used to clean and improve the data utilized for such estimates.
Modeling guarantees that additional data, including data from customer-facing activities, may be
consumed by the system, resulting in a more accurate prediction.
Insurance
Insurance firms evaluate policy applicants to assess the chance of having to pay out for a future
claim based on the existing risk pool of comparable policyholders, as well as previous
occurrences that resulted in payments. Actuaries frequently utilize models that compare
attributes to data about previous policyholders and claims.
Software Testing
Predictive analytics can help you enhance your operations throughout the full software testing
life cycle.
Simplify the process of interpreting massive volumes of data generated during software testing
by using that data to model outcomes. You can keep your release schedule on track by
monitoring timelines and utilizing predictive modeling to estimate how delays will affect the
project. By identifying these difficulties and their causes, you will be able to make course
corrections in individual areas before the entire project is delayed.
Predictive analytics can assess your clients' moods by researching social media and spotting
trends, allowing you to anticipate any reaction before it occurs.
So far we have discussed what predictive analytics is, along with examples. Moving forward, let us look at its applications in different industries.
3. Manufacturing:
1. Predictive analytics is helpful when combined with machine data in order to help in tracking
and comparing machines’ performance and equipment maintenance status and predicting
which particular machine will fail.
2. Predictive analytics insights can lead to a decrease in shipping and transportation expenses by accounting for all the factors involved in transferring manufactured products to different places under the proper system.
3. Applying predictions to supply chain and sales data helps in making better purchasing decisions and ensures that expensive raw materials are not purchased unless required. This data can also be used to align manufacturing processes with consumer demand.
4. Finance:
1. Prevention of credit card fraud by flagging unusual transactions,
2. Credit scoring to determine whether to approve or deny loan applications,
3. Most importantly, analyzing customers' churn data, enabling banks to approach potential churners before they are likely to switch to other institutions.
4. Measuring credit risk, maximizing cross-sell/up-sell opportunities and retaining valuable
customers.
5. Healthcare:
1. Predictive analytics can help medical practitioners by analyzing data on global disease statistics, drug interactions, and individual patients' diagnostic histories to provide advanced care and conduct more effective medical practice.
2. Applying predictive analytics to clinics' past appointment data helps in identifying probable no-shows or late cancellations more accurately, thus saving time and resources.
The health insurance industry also uses predictive analytics to detect claims fraud and to discover patients most at risk of chronic or incurable disease, which helps companies find suitable interventions.
CONCEPT OF ASSOCIATION
Association rule learning is a type of unsupervised learning technique that checks for the dependency of one data item on another data item and maps them accordingly so that the relationship can be used profitably. It tries to find interesting relations or associations among the variables of a dataset. It is based on different rules for discovering the interesting relations between variables in the database.
The association rule learning is one of the very important concepts of machine learning, and it is
employed in Market Basket analysis, Web usage mining, continuous production, etc. Here
market basket analysis is a technique used by the various big retailers to discover the
associations between items. We can understand it by taking an example of a supermarket, as in a
supermarket, all products that are purchased together are put together.
For example, if a customer buys bread, he is also likely to buy butter, eggs, or milk, so these products are stored on the same shelf or nearby.
Here the "if" element is called the antecedent, and the "then" statement is called the consequent. A relationship in which we find an association between two single items is known as single cardinality. Association rule learning is all about creating rules, and as the number of items increases, the cardinality also increases accordingly. So, to measure the associations between thousands of data items, there are several metrics. These metrics are given below:
Support
Confidence
Lift
The basic technique aims to find the relationships and establish the patterns among the items purchased. In much simpler terms, we can compare it to an if-then clause, e.g., "if bread is purchased, then butter is also purchased." In shorthand notation, bread → butter, which translates to "the items on the right are likely to be ordered along with the items on the left."
Antecedent: the items on the LEFT, i.e., the items which the customer buys.
Consequent: the items on the RIGHT, i.e., the items which the customer buys along with the antecedent.
Support: the probability that the antecedent event will occur, i.e., that the customer will buy bread.
Confidence: the probability that the consequent will occur given the antecedent, i.e., that the customer will buy butter given that bread has been bought.
Lift: the lift of a rule is the ratio of the support of the left-hand side of the rule (e.g., sandwich) co-occurring with the right-hand side (e.g., tea), divided by the probability that the left-hand side and the right-hand side co-occur if the two are independent.
Support
Support is the frequency of an itemset, i.e., how frequently it appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X. It can be written as:
Support(X) = Freq(X) / T
Confidence
Confidence indicates how often the rule has been found to be true, i.e., how often items X and Y occur together in the dataset when the occurrence of X is already given. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X:
Confidence(X → Y) = Freq(X ∪ Y) / Freq(X)
Lift
Lift measures the strength of a rule. It is the ratio of the observed support to the support expected if X and Y were independent of each other:
Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
It has three possible values:
If Lift= 1: The probability of occurrence of antecedent and consequent is independent of each
other.
Lift>1: It determines the degree to which the two item sets are dependent to each other.
Lift<1: It tells us that one item is a substitute for other items, which means one item has a
negative effect on another.
A lift greater than 1 suggests that the presence of the antecedent increases the chances that the consequent will also occur in the same transaction.
A lift below 1 indicates that purchasing the antecedent reduces the chances of purchasing the consequent in the same transaction. Note: this could indicate that the items are seen by customers as substitutes for each other.
When the lift is 1, purchasing the antecedent makes no difference to the chances of purchasing the consequent.
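As an illustration of these metrics, the short Python sketch below computes support, confidence, and lift for a hypothetical bread → butter rule over a small, made-up list of transactions:

```python
# Minimal sketch: computing support, confidence and lift by hand
# for the rule {bread} -> {butter} over a toy transaction list.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"butter", "milk"},
    {"bread", "butter", "eggs"},
]

n = len(transactions)
freq_bread = sum("bread" in t for t in transactions)
freq_butter = sum("butter" in t for t in transactions)
freq_both = sum({"bread", "butter"} <= t for t in transactions)

support_bread = freq_bread / n                 # Support(X) = Freq(X) / T
support_both = freq_both / n                   # Support(X ∪ Y)
confidence = freq_both / freq_bread            # Confidence(X -> Y)
lift = support_both / (support_bread * (freq_butter / n))  # Lift(X -> Y)

print(f"support={support_both:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```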
Apriori Algorithm
The Apriori algorithm is a standard algorithm for association rule learning in data mining. It is used for mining frequent itemsets and deriving the relevant association rules from them. It is designed to operate on a database of transactions.
Eclat Algorithm
The Eclat algorithm is another association rule learning algorithm. It can be applied to perform itemset mining, which helps us to discover frequent patterns in data. For example, if a consumer buys shoes, he is also likely to buy socks.
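For a hands-on illustration, the sketch below uses the third-party mlxtend library (an assumption; the notes do not prescribe a specific tool) to mine frequent itemsets with Apriori and derive rules with support, confidence, and lift:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Toy transactions (made up for illustration)
transactions = [["bread", "butter", "milk"],
                ["bread", "butter"],
                ["bread", "eggs"],
                ["butter", "milk"],
                ["bread", "butter", "eggs"]]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = te.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=te.columns_)

# Frequent itemsets with support >= 0.5, then rules filtered by confidence
frequent = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```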
A scatter plot of variable Y against variable X reveals three characteristics of their relationship:
1. The direction
2. The strength
3. The linearity
In the example scatter plot, variable Y and variable X possess a strong positive linear relationship. Hence, we can project a straight line that describes the data in the most accurate way possible. If the relationship between variable X and variable Y is strong and linear, then we conclude that this particular independent variable X is an effective input variable for predicting the dependent variable Y.
To check the degree of linear association between variable X and variable Y, we use the correlation coefficient (r), which gives a numerical value of the correlation between two variables. The correlation between two variables can be strong, moderate, or weak. The higher the value of r, the higher the preference given to that particular input variable X for predicting the output variable Y. A few properties of r are listed as follows:
1. Range of r: -1 to +1
2. Perfect positive relationship: +1
3. Perfect negative relationship: -1
4. No linear relationship: 0
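A quick way to compute r is shown below with NumPy; the x and y arrays are made-up example values:

```python
import numpy as np

# Hypothetical paired observations of X and Y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation coefficient r = {r:.3f}")
```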
The term "regression" was coined by Francis Galton in 1877 Father of regression Carl F.
Gauss (1777-1855).
Definition: A numerical target attribute A collection of data objects also characterized
by the target attribute.
The regression task finds a model that allows predicting the target variable value of new
objects through y=f (x1 , x2 , … xn )
Regression analysis can be classified based on
Number of explanatory variables
Simple regression: single explanatory variable
Multiple regression: includes any number of explanatory variables
Types of relationship
Linear regression: straight-line relationship
Non-linear: implies curved relationships (e.g., logarithmic relationships)
Simple linear regression: y = β0 + β1x
The regression line provides an interpretable model of the phenomenon under analysis:
y: estimated (or predicted) value
β0: estimate of the regression intercept; the intercept represents the estimated value of y when x equals 0
β1: estimate of the regression slope
x: independent variable
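As a small sketch using made-up data, the estimates of β0 and β1 can be obtained in Python with scipy.stats.linregress:

```python
import numpy as np
from scipy import stats

# Hypothetical data: advertising spend (x) vs. sales (y)
x = np.array([10, 15, 20, 25, 30, 35, 40])
y = np.array([25, 31, 38, 44, 52, 57, 66])

# Ordinary least squares fit of y = b0 + b1 * x
result = stats.linregress(x, y)
print(f"intercept b0 = {result.intercept:.2f}")
print(f"slope     b1 = {result.slope:.2f}")
print(f"r         = {result.rvalue:.3f}")

# Predicted value for a new x
x_new = 28
print(f"predicted y at x={x_new}: {result.intercept + result.slope * x_new:.2f}")
```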
Assumptions of the regression model:
There is a linear relationship between the dependent variable and the independent variables.
The independent variables are not too highly correlated with each other.
The yi observations are selected independently and randomly from the population.
Residuals should be normally distributed with a mean of 0 and constant variance.
There should be proper specification of the model in multiple regression. This means
that only relevant variables must be included in the model and the model should be
reliable.
Linearity must be assumed; the model should be linear in nature.
Normality must be assumed in multiple regression. This means that in multiple
regression, variables must have normal distribution.
Homoscedasticity must be assumed; the variance is constant across all levels of the
predicted variable.
Homoscedasticity means "having the same scatter." For it to exist in a set of data, the points must be about the same distance from the regression line. The opposite is heteroscedasticity ("different scatter"), where points are at widely varying distances from the regression line.
A multiple linear regression model is a linear equation that has the general form: y = b1x1 + b2x2 + … + c, where y is the dependent variable, x1, x2, … are the independent variables, and c is the (estimated) intercept.
In this example dataset, the 'Number of weekly riders' is the dependent variable, which depends on the 'Price per week ($)', 'Population of city', 'Monthly income of riders ($)', and 'Average parking rates per month ($)'.
Let us assign the variables:
Price per week ($) – x1
Population of city – x2
Monthly income of riders ($) – x3
Average parking rates per month ($)- x4
Number of weekly riders – y
The linear model would be of the form: y = ax1 + bx2 + cx3 + dx4 + e where a, b, c, d are the
respective coefficients and e is the intercept.
There are two different ways to create the linear model in Microsoft Excel. In this article, we will take a look at the Regression function included in the Data Analysis ToolPak.
After the Data Analysis ToolPak has been enabled, you will be able to see it on the Ribbon,
under the Data tab:
Click Data Analysis to open the Data Analysis ToolPak, and select Regression from the Analysis
tools that are displayed.
Right on top are the Regression Statistics. Here we are interested in the following measures:
Multiple R, which is the coefficient of linear correlation
Adjusted R Square, which is the R Square (coefficient of determination) adjusted for more than
one independent variable
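Outside Excel, the same kind of model can be estimated in Python with the statsmodels library; the numbers below are made-up placeholders for the riders dataset, which is not reproduced in these notes:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical stand-in for the weekly riders dataset
data = pd.DataFrame({
    "price_per_week": [15, 15, 20, 20, 25, 25, 30, 30],
    "population":     [1800000, 1790000, 1780000, 1778000, 1750000, 1740000, 1725000, 1720000],
    "monthly_income": [5800, 6200, 6400, 6550, 6580, 6620, 6800, 7000],
    "parking_rate":   [50, 50, 60, 60, 60, 70, 75, 80],
    "weekly_riders":  [192000, 190400, 191200, 177600, 176800, 178400, 180800, 175200],
})

X = sm.add_constant(data[["price_per_week", "population", "monthly_income", "parking_rate"]])
y = data["weekly_riders"]

model = sm.OLS(y, X).fit()
# The summary reports R-squared and Adjusted R-squared, analogous to
# Excel's "Multiple R" (its square root) and "Adjusted R Square"
print(model.summary())
```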
One Model Building Strategy
The first step
Decide on the type of model that is needed in order to achieve the goals of the study. In general,
there are five reasons one might want to build a regression model. They are:
For predictive reasons — that is, the model will be used to predict the response variable from a
chosen set of predictors.
For theoretical reasons — that is, the researcher wants to estimate a model based on a known
theoretical relationship between the response and predictors.
For control purposes — that is, the model will be used to control a response variable by
manipulating the values of the predictor variables.
For inferential reasons — that is, the model will be used to explore the strength of the
relationships between the response and the predictors.
For data summary reasons — that is, the model will be used merely as a way to summarize a
large set of data by a single equation.
On a univariate basis, check for outliers, gross data errors, and missing values.
Study bivariate relationships to reveal other outliers, to suggest possible transformations, and to
identify possible multicollinearities.
I can't possibly over-emphasize the importance of this step. There's not a data analyst out there
who hasn't made the mistake of skipping this step and later regretting it when a data point was
found in error, thereby nullifying hours of work.
The training set, with at least 15-20 error degrees of freedom, is used to estimate the model.
The validation set is used for cross-validation of the fitted model.
The fifth step
Using the training set, identify several candidate models:
Select the models based on the criteria we learned, as well as the number and nature of the
predictors.
Evaluate the selected models for violation of the model conditions.
If none of the models provide a satisfactory fit, try something else, such as collecting more data,
identifying different predictors, or formulating a different type of model.
The seventh and final step
Select the final model:
Compare the competing models by cross-validating them against the validation data.
The model with a smaller mean square prediction error (or larger cross-validation R2) is a better
predictive model.
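A minimal sketch of that final comparison step, assuming two already-chosen candidate models and a held-out validation set (scikit-learn is used here purely for convenience and is not prescribed by the notes):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up data: one response and three candidate predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Split into a training set and a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate model 1: all three predictors; candidate model 2: first two only
m1 = LinearRegression().fit(X_train, y_train)
m2 = LinearRegression().fit(X_train[:, :2], y_train)

# Mean square prediction error (MSPE) on the validation data
mspe1 = np.mean((y_val - m1.predict(X_val)) ** 2)
mspe2 = np.mean((y_val - m2.predict(X_val[:, :2])) ** 2)
print(f"MSPE model 1: {mspe1:.3f}  MSPE model 2: {mspe2:.3f}")
# The model with the smaller MSPE is the better predictive model.
```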
UNIT-2
Nonlinear regression refers to a regression analysis where the regression model portrays a
nonlinear relationship between a dependent variable and independent variables. In other words,
the relationship between predictor and response variable follows a nonlinear pattern.
The simplest statistical relationship between a dependent variable Y and one or more independent or predictor variables X1, X2, … is

Y = β0 + β1X1 + β2X2 + … + ε

where ε represents a random deviation from the mean relationship represented by the rest of the model. With a single predictor, the model is a straight line. With more than one predictor, the model is a plane or hyperplane. While such models are adequate for representing many relationships (at least over a limited range of the predictors), there are many cases when a more complicated model is required.
In Statgraphics, there are several procedures for fitting nonlinear models. The models that may be fit include:
2. Polynomial models: models involving one or more predictor variables which include higher-order terms such as β1,1X1² or β1,2X1X2.
3. Models that are nonlinear in the parameters: models in which the partial derivatives of Y with respect to the predictor variables involve the unknown parameters.
In their classic book on regression analysis titled Applied Regression Analysis, Draper
and Smith show a data set containing 44 samples of a product in which the active
ingredient was chlorine. Researchers wanted to model the loss of chlorine as a function
of the number of weeks since the sample was produced. As is evident in the scatterplot
below, chlorine decays with time.
In order to get a quick feel for the shape of the relationship, a robust Lowess smooth
may be added to the plot:
Lowess stands for "Locally Weighted Scatterplot Smoothing" and was developed by Bill
Cleveland. It smooths the scatter plot by fitting a linear regression at many points along
the X axis, weighting observations according to their distance from that point. The
procedure is then applied a second time after down-weighting observations that were
far removed from the result of the first smooth. It may be seen that there is significant
nonlinearity in the relationship between chlorine and weeks.
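A robust Lowess smooth of this kind can also be produced in Python with statsmodels; the chlorine-style numbers below are invented for illustration, not the Draper and Smith data:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Made-up decay-style data: weeks since production vs. available chlorine
weeks = np.array([8, 10, 12, 14, 16, 20, 24, 28, 32, 36, 40, 42])
chlorine = np.array([0.49, 0.47, 0.45, 0.44, 0.43, 0.42, 0.41, 0.40, 0.40, 0.39, 0.39, 0.38])

# frac controls the span of the local regressions; it=1 adds one
# robustness iteration that down-weights points far from the first smooth
smoothed = lowess(chlorine, weeks, frac=0.6, it=1)
print(smoothed[:5])  # columns: sorted x values and smoothed y values
```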
The Simple Regression procedure in Statgraphics gives a choice of many nonlinear functions that may be fit to this data.
Each function has a form such that, after transforming Y, X, or both appropriately, the model will be linear in the parameters. For example, the multiplicative model takes the form

Y = αX^β

The one caveat in such an approach is that the error term ε is assumed to be additive after the model has been linearized.
To help select a good nonlinear model, Statgraphics will fit all of the models and sort them in decreasing order of R-squared.
After sorting, the two best-fitting models are very similar:
Y² = β0 + β1/X
Y = β0 + β1/X (the Reciprocal-X model)
When I'm building empirical models and the results of 2 models are very similar, I
usually pick the simpler of the two. Fitting a Reciprocal-X model to this data gives the
following curve:
In addition to fitting the general relationship well, this model has the pleasing property of
reaching an asymptotic value of 0.368053 when weeks becomes very large.
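A Reciprocal-X fit can be reproduced by regressing Y on 1/X, since the model is linear in 1/X; this sketch uses invented chlorine-style numbers rather than the Draper and Smith data:

```python
import numpy as np
from scipy import stats

# Invented decay-style data (weeks, chlorine fraction)
weeks = np.array([8, 10, 12, 16, 20, 24, 28, 32, 36, 40])
chlorine = np.array([0.49, 0.47, 0.45, 0.43, 0.42, 0.41, 0.40, 0.40, 0.39, 0.39])

# Reciprocal-X model: Y = b0 + b1 * (1/X), which is linear in 1/X
inv_weeks = 1.0 / weeks
fit = stats.linregress(inv_weeks, chlorine)
print(f"b0 (asymptote as X grows large) = {fit.intercept:.4f}")
print(f"b1 = {fit.slope:.4f}")

# As X becomes very large, 1/X -> 0, so predictions approach b0
```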
Draper and Smith noted the 2 apparent outliers at weeks = 18. The Statgraphics Table
of Unusual Residuals shows that the Studentized residuals for those observations both
exceed 2.4:
In particular, row #17 is 3.66 standard deviations from its predicted value. However,
since they could find no assignable cause that would justify removing those points,
Draper and Smith left them in the dataset.
Rather than transforming Y and/or X, we might try fitting a polynomial to the data instead. For example, a second-order polynomial would take the form

Y = β0 + β1X + β2X²
Since polynomials are able to approximate the shape of many curves, they might give a
good fit.
By specifying a non-zero value for the shift parameter, the origin of the polynomial is shifted to a different value of X, which can prevent the powers from becoming so large that they overflow the variables created to hold them when performing calculations. Since the maximum value of X is not large in our sample data, the shift parameter may be set equal to 0.
For the chlorine data, a fourth-order polynomial fits the data quite well.
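A polynomial of any order can be fit with numpy.polyfit; the sketch below fits a fourth-order polynomial to the same invented numbers used in the reciprocal-X sketch:

```python
import numpy as np

# Invented chlorine-style data (weeks, chlorine fraction)
weeks = np.array([8, 10, 12, 16, 20, 24, 28, 32, 36, 40])
chlorine = np.array([0.49, 0.47, 0.45, 0.43, 0.42, 0.41, 0.40, 0.40, 0.39, 0.39])

# Fit a fourth-order polynomial Y = b0 + b1*X + b2*X^2 + b3*X^3 + b4*X^4
coeffs = np.polyfit(weeks, chlorine, deg=4)
poly = np.poly1d(coeffs)

# Interpolation inside the data range is usually safe...
print(poly(25))
# ...but extrapolation beyond the data range can behave erratically,
# which is why the Reciprocal-X model is preferred for extrapolation.
print(poly(80))
```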
In fact, if we overlay the Reciprocal-X model and the fourth-order polynomial in the
StatGallery, the predictions are very similar throughout the range of the data:
However, beyond the range of the data the polynomial will behave erratically. While the
polynomial is suitable if we are only doing interpolation, the Reciprocal-X model would
be preferred if extrapolation is required.
From a statistical point of view, the 4th order polynomial may be more complicated than
is required. Statgraphics creates a table that may be used to help determine what order
of polynomial is needed to sufficiently capture the relationship between Y and X. Called
the Conditional Sums of Squares table, it tests the statistical significance of each term in
the polynomial when it is added to a polynomial of one degree less:
For example, when X² is added to a linear model, the P-Value for β2 equals 0.0000, implying that it significantly improves the fit. When X³ is added to a second-order model, the P-Value for β3 equals 0.1207, implying that it does not significantly improve the fit at the 10% significance level. In this case, the P-Values suggest that a second-order polynomial would be sufficient. However, a plot of the fitted model might give one pause.
Even if only using the model for interpolation, the curvature in the interval between 30
and 40 weeks is disconcerting.
Linear regression requires a linear model. But what does that really mean?
A model is linear when each term is either a constant or the product of a parameter and a predictor variable. A linear equation is constructed by adding the results for each term. This constrains the equation to just one basic form:

Y = β0 + β1X1 + β2X2 + … + βkXk

In statistics, a regression equation (or function) is linear when it is linear in the parameters.
While the equation must be linear in the parameters, you can transform the predictor
variables in ways that produce curvature. For instance, you can include a squared variable
to produce a U-shaped curve.
Y = β0 + β1X1 + β2X1²
This model is still linear in the parameters even though the predictor variable is squared.
You can also use log and inverse functional forms that are linear in the parameters to
produce different types of curves.
Here is an example of a linear regression model that uses a squared term to fit the curved
relationship between BMI and body fat percentage.
While a linear equation has one basic form, nonlinear equations can take many different
forms. The easiest way to determine whether an equation is nonlinear is to focus on the
term “nonlinear” itself. Literally, it’s not linear. If the equation doesn’t meet the criteria above
for a linear equation, it’s nonlinear.
That covers many different forms, which is why nonlinear regression provides the most
flexible curve-fitting functionality. Here are several examples from Minitab’s nonlinear
function catalog. Thetas represent the parameters and X represents the predictor in the
nonlinear functions. Unlike linear regression, these functions can have more than one
parameter per predictor variable.
Here is an example of a nonlinear regression model of the relationship between density and
electron mobility.
Linear and nonlinear regression are actually named after the functional form of the models
that each analysis accepts. I hope the distinction between linear and nonlinear equations is
clearer and that you understand how it’s possible for linear regression to model curves! It
also explains why you’ll see R-squared displayed for some curvilinear models even
though it’s impossible to calculate R-squared for nonlinear regression.
In a linear regression equation, each term must be one of the following:
The constant
A parameter multiplied by an independent variable (IV)
Then, you build the equation by only adding the terms together. These rules limit the form to just one type:

Y = β0 + β1X1 + β2X2 + … + βkXk
This type of regression equation is linear in the parameters. However, it is possible to model
curvature with this type of model. While the function must be linear in the parameters, you can
raise an independent variable by an exponent to fit a curve. For example, if you square an
independent variable, the model can follow a U-shaped curve.
Y = β0 + β1X1 + β2X1²
The regression example below models the relationship between body mass index (BMI)
and body fat percent. In a different blog post, I use this model to show how to make
predictions with regression analysis. It is a linear model that uses a quadratic (squared)
term to model the curved relationship.
MULTICOLLINEARITY:
Collinearity (and multicollinearity) means that the predictor variables, also known as independent variables, aren't so independent.
Collinearity is a situation where two features are linearly associated (highly correlated) and are both used as predictors for the target. It is often measured using Pearson's correlation coefficient. Collinearity between more than two predictors is also possible (and often the case).
The term multicollinearity was first used by Ragnar Frisch. Multicollinearity is a special case of
collinearity where a feature exhibits a linear relationship with two or more features. We can also
have a situation where more than two features are correlated and, at the same time, have no high
correlation pairwise.
Partial multicollinearity is ubiquitous in multiple regression. Two random variables will almost
always correlate at some level in a sample, even if they share no fundamental relationship in the
larger population. In other words, multicollinearity is a matter of degree.
We can see a strong correlation between 'Index' and 'Height'/'Weight' (as expected). We can also notice that 'Weight' has much more impact on 'Index' than 'Height'. That should also be intuitive and expected.
Clustermap
A clustermap shows not only all the correlations between variables, but also group (cluster) relationships.
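A correlation heatmap and clustermap of this kind can be drawn with pandas and seaborn; the Gender/Height/Weight/Index data below are invented stand-ins for the example being described:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Invented stand-in for the body-measurements dataset
rng = np.random.default_rng(1)
height = rng.normal(170, 10, 200)
weight = 0.9 * height - 90 + rng.normal(0, 8, 200)
df = pd.DataFrame({
    "Gender": rng.integers(0, 2, 200),
    "Height": height,
    "Weight": weight,
    "Index": (weight / (height / 100) ** 2).round(),  # BMI-like index
})

corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")   # pairwise correlations
plt.show()
sns.clustermap(corr, annot=True)                 # correlations grouped into clusters
plt.show()
```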
The variance inflation factor (VIF) for the i-th independent variable is computed from the R² obtained by regressing that variable on all the other independent variables:
VIF_i = 1 / (1 − R_i²)
So, the closer the R² value is to 1, the higher the value of VIF and the higher the multicollinearity between this independent variable and the others.
VIF = 1: no correlation between the independent variable and the other variables.
VIF exceeding 5 or 10 indicates high multicollinearity between this independent variable and the others.
Feature   VIF
Gender    2.028864
Height    11.623103
Weight    10.688377
'Height' and 'Weight' have high values of VIF, indicating that these two variables are highly correlated. This is expected, as the height of a person does influence their weight. Hence, considering these two features together still leads to a model with high multicollinearity.
In regression analysis, multicollinearity exists when two or more of the variables demonstrate a
linear relationship between them. The VIF measures by how much the linear correlation of a
given regressor with the other regressors increases the variance of its coefficient estimate with
respect to the baseline case of no correlation.
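A VIF table like the one above can be computed with statsmodels, as sketched below on invented Gender/Height/Weight data:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical body-measurement data (Gender, Height, Weight)
rng = np.random.default_rng(1)
height = rng.normal(170, 10, 200)
df = pd.DataFrame({
    "Gender": rng.integers(0, 2, 200),
    "Height": height,
    "Weight": 0.9 * height - 90 + rng.normal(0, 8, 200),
})

# VIF_i = 1 / (1 - R_i^2), computed for each predictor (a constant is included)
X = sm.add_constant(df)
vif = pd.DataFrame({
    "Feature": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
# Drop the constant's row before interpreting the remaining VIFs
print(vif[vif["Feature"] != "const"])
```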
Example: two identical (or almost identical) variables. Weight in pounds and weight in
kilos, or investment income and savings/bond income.
AUTOCORRELATION:
Autocorrelation measures the correlation between a time series and a lagged copy of itself. The autocorrelation function produces a value between -1 and +1. A positive autocorrelation means that as the series increases, its lagged values tend to show a proportionate increase. A negative autocorrelation means that an increase in the series corresponds to a proportionate decrease in the lagged series. The closer the value falls to plus or minus one, the stronger the correlation in either direction.
For example, a weather scientist might use this function when analyzing the minimum daily
temperature recorded in a city using a data set from the last 10 years. They insert the data into a
statistical modeling program and perform an autocorrelation analysis. The program produces a
graph that shows how the minimum daily temperature in the city has changed over the last 10
years, indicating an increase in the daily minimum temperature with a high degree of confidence.
This degree of confidence shows that the positive correlation between minimum temperature and
time is likely not the result of random chance.
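In Python, sample autocorrelations for a series like this can be computed with pandas or statsmodels, as in this sketch with invented daily temperature data:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf

# Invented daily minimum temperatures with a mild upward trend plus noise
rng = np.random.default_rng(2)
temps = pd.Series(10 + 0.002 * np.arange(3650) + rng.normal(0, 2, 3650))

# Lag-1 autocorrelation directly from pandas
print(f"lag-1 autocorrelation: {temps.autocorr(lag=1):.3f}")

# First ten autocorrelations from the autocorrelation function
print(acf(temps, nlags=10))
```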
When can you use the autocorrelation function: The autocorrelation function has various uses
in many industries that rely on time-related statistical models. Here are a few industries that use
autocorrelation functions, with examples of how industry professionals may use these functions
to complete their work:
Physics and engineering
Autocorrelation functions have various applications in physics and engineering. In particular,
these functions help scientists measure and understand patterns in the behaviors of sound waves
and light. For example, a physicist might use this type of function when studying patterns in how
light scatters when moving through a particular medium, like air or a liquid. They may also use
this function to study sonic concepts like pitch, frequency and tempo. An astrophysicist may use
autocorrelation functions to understand how wavelengths travel through space, with
consideration for how physical principles like gravity affect their behavior.
Meteorology
Meteorologists and climate researchers frequently use autocorrelation functions. They use this
function to understand how weather patterns change over time and how different variables
influence these trends. For example, meteorologists use historical data patterns to predict
changes in future weather conditions. These scientists create statistical models using
autocorrelation functions to assess how weather trends like precipitation and temperature and
natural phenomena like hurricanes have changed over time and how they may continue to
change in the future. Understanding weather patterns is important for predicting emergency
weather conditions and natural disasters so people can prepare ahead for these events.
Finance
Autocorrelation functions also apply to financial modeling. Stock analysts often use
autocorrelation functions to assess trends in a stock's value over time and use that data to predict
its future value. Another application of autocorrelation functions in finance is its use in technical
analysis. A technical analyst can use autocorrelation functions to understand how past prices for
a security may influence its future value. For example, if an autocorrelation function reveals that
a stock has accumulated significant gains over two or more days, it's reasonable to predict that
the stock may continue to gain in the following days.
One common type of autocorrelation function is the Durbin-Watson test. This statistic uses
regression analysis to identify autocorrelation in a time series. When you apply it, the Durbin-
Watson test assesses the degree of correlation between variables in a time series in a range of
zero to four. Results closer to zero indicate a stronger positive correlation between the variables,
while values closer to four show a stronger negative pattern of correlation. If the value falls near two, it suggests little autocorrelation. Although the Durbin-Watson test is common in financial analysis, it may be less common in other industries.
The Durbin Watson statistic is a test for autocorrelation in a regression model's output.
The DW statistic ranges from zero to four, with a value of 2.0 indicating zero
autocorrelation.
Values below 2.0 mean there is positive autocorrelation and above 2.0 indicates negative
autocorrelation.
The Durbin–Watson statistic, while displayed by many regression analysis programs, is not
applicable in certain situations.
Consider the following six (x, y) data pairs:
Pair One = (10, 1,100)
Pair Two = (20, 1,200)
Pair Three = (35, 985)
Pair Four = (40, 750)
Pair Five = (50, 1,215)
Pair Six = (45, 1,000)
Using the methods of a least squares regression to find the "line of best fit," the equation for the
best fit line of this data is:
Y=−2.6268x+1,129.2
The first step in calculating the Durbin Watson statistic is to calculate the expected "y" values using the line of best fit equation. For this data set, the expected "y" values are:
ExpectedY(1)=(−2.6268×10)+1,129.2=1,102.9
ExpectedY(2)=(−2.6268×20)+1,129.2=1,076.7
ExpectedY(3)=(−2.6268×35)+1,129.2=1,037.3
ExpectedY(4)=(−2.6268×40)+1,129.2=1,024.1
ExpectedY(5)=(−2.6268×50)+1,129.2=997.9
ExpectedY(6)=(−2.6268×45)+1,129.2=1,011
Next, the differences of the actual "y" values versus the expected "y" values, the errors, are
calculated:
Error(1)=(1,100−1,102.9)=−2.9
Error(2)=(1,200−1,076.7)=123.3
Error(3)=(985−1,037.3)=−52.3
Error(4)=(750−1,024.1)=−274.1
Error(5)=(1,215−997.9)=217.1
Error(6)=(1,000−1,011)=−11
Next, the differences between each error and the previous error are calculated (these differences are then squared):
Difference(1)=(123.3−(−2.9))=126.2
Difference(2)=(−52.3−123.3)=−175.6
Difference(3)=(−274.1−(−52.3))=−221.9
Difference(4)=(217.1−(−274.1))=491.3
Difference(5)=(−11−217.1)=−228.1
Finally, the Durbin Watson statistic is the quotient of the sum of the squared differences and the sum of the squared errors:
Durbin Watson = 389,406.71 / 140,330.81 = 2.77
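The same worked example can be reproduced in a few lines of Python using the six data pairs above:

```python
import numpy as np

# The six (x, y) pairs from the worked example
x = np.array([10, 20, 35, 40, 50, 45])
y = np.array([1100, 1200, 985, 750, 1215, 1000])

# Residuals from the stated line of best fit y_hat = -2.6268*x + 1129.2
y_hat = -2.6268 * x + 1129.2
errors = y - y_hat

# Durbin-Watson: sum of squared successive differences over sum of squared errors
dw = np.sum(np.diff(errors) ** 2) / np.sum(errors ** 2)
print(f"Durbin-Watson statistic: {dw:.2f}")  # approximately 2.77
```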
Durbin-Watson Test
The Durbin-Watson test is a statistical test used to determine whether or not there is a serial
correlation in a data set. It tests the null hypothesis of no serial correlation against the alternative
positive or negative serial correlation hypothesis. The test is named after James Durbin and
Geoffrey Watson, who developed it in 1950.
The Durbin-Watson Statistic (DW) is approximated by:
DW=2(1−r)
Where:
r is the sample correlation between regression residuals from one period and the previous period.
The test statistic can take on values ranging from 0 to 4.
A value of 2 indicates no serial correlation, a value between 0 and 2 indicates a positive serial
correlation, and a value between 2 and 4 indicates a negative serial correlation:
If there is no autocorrelation, the regression errors will be uncorrelated, and thus DW=2
DW=2(1−r)= 2(1−0)=2
For positive serial autocorrelation, DW<2. For example, if serial correlation of the regression
residuals = 1, DW=2(1−1)=0
For negative autocorrelation, DW>2. For example, if serial correlation of the regression residual
= −1, DW=2(1−(−1))=4.
To decide whether to reject the null hypothesis of no serial correlation, we compare the calculated value d* with critical values. Unfortunately, we cannot know the exact critical value, but we can narrow down the range of possible values.
Define dl as the lower value and du as the upper value:
If the DW statistic is less than dl, we reject the null hypothesis of no positive serial correlation.
If the DW statistic is greater than (4-dl), we reject the null hypothesis, indicating a significant
negative serial correlation.
If the DW statistic falls between dl and du, the test results are inconclusive.
If the DW statistic is greater than du, we fail to reject the null hypothesis of no positive serial
correlation.
Consider a regression output with two independent variables that generate a DW statistic of
0.654. Assume that the sample size is 15. Test for serial correlation of the error terms at the 5%
significance level.
Solution
From the Durbin-Watson table with n=15 and k=2, we see that dl=0.95 and du=1.54.
Since d=0.654<0.95=dl, we reject the null hypothesis and conclude that there is significant
positive autocorrelation.
Consider a regression model with 80 observations and two independent variables. Assume that
the correlation between the error term and the first lagged value of the error term is 0.18.
Solution
The test statistic is:
DW ≈ 2(1 − r) = 2(1 − 0.18) = 1.64
The critical values from the Durbin-Watson table with n = 80 and k = 2 are dl = 1.59 and du = 1.69. Because 1.69 > 1.64 > 1.59, the DW statistic lies between dl and du, so the test for serial correlation is inconclusive.
Lagged Series and Lag Plots: Lagging a time series means to shift its values forward one or
more time steps, or equivalently, to shift the times in its index backward one or more steps. In
either case, the effect is that the observations in the lagged series will appear to have happened
later in time.
(Example table: a DataFrame with columns y, y_lag_1, and y_lag_2, indexed by Date; the lag columns contain the series shifted by one and two time steps.)
We could use y_lag_1 and y_lag_2 as features to predict the target y. This would forecast the future unemployment rate as a function of the unemployment rate in the prior two months.
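Lag features like y_lag_1 and y_lag_2 are typically created with pandas' shift method, as in this sketch on an invented monthly series:

```python
import pandas as pd

# Invented monthly unemployment-rate-style series
dates = pd.date_range("2023-01-01", periods=6, freq="MS")
y = pd.Series([3.9, 4.1, 4.0, 4.2, 4.4, 4.3], index=dates, name="y")

df = pd.DataFrame({"y": y})
df["y_lag_1"] = df["y"].shift(1)  # value from one period earlier
df["y_lag_2"] = df["y"].shift(2)  # value from two periods earlier
print(df)
```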
AutoML generates lags with respect to the forecast horizon. The example in this section
illustrates this concept. Here, we use a forecast horizon of three and target lag order of one.
Consider the following monthly time series:
First, we generate the lag feature for the horizon ℎ=1 only. As you continue reading, it will
become clear why we use individual horizons in each table.
Table 2 is generated from Table 1 by shifting the y_t column down by a single observation. We've added a column named Origin that has the dates that the lag features originate from. Next, we generate the lagging feature for the forecast horizon h=2 only.
Table 3 is generated from Table 1 by shifting the y_t column down by two observations.
Finally, we will generate the lagging feature for the forecast horizon h=3 only.
Next, we concatenate Tables 1, 2, and 3 and rearrange the rows. The result is in the following
table:
In the final table, we've changed the name of the lag column to yt−1 to reflect that the lag is
generated with respect to a specific horizon. The table shows that the lags we generated with
respect to the horizon can be mapped to the conventional ways of generating lags in the previous
tables.
Regression Diagnostics: Diagnostics for regression models are tools that assess a model's compliance with its assumptions and investigate whether there is a single observation or a group of observations that is not well represented by the model. These tools allow researchers to evaluate whether a model appropriately represents the data of their study.
Model Assumptions
The model fitting is just the first part of the story for regression analysis since this is all based on
certain assumptions. Regression diagnostics are used to evaluate the model assumptions and
investigate whether or not there are observations with a large, undue influence on the analysis.
Again, the assumptions for linear regression are linearity, independence of the observations, normality of the residuals, and homoscedasticity (constant variance), as listed in Unit-1. In addition to checking these assumptions, diagnostics look for unusual observations:
Outliers: an outlier is defined as an observation that has a large residual. In other words,
the observed value for the point is very different from that predicted by the regression
model.
Leverage points: A leverage point is defined as an observation that has a value of x that
is far away from the mean of x.
Influential observations: An influential observation is defined as an observation that
changes the slope of the line. Thus, influential points have a large influence on the fit of
the model. One method to find influential points is to compare the fit of the model with
and without each observation.
The diagnostic plots show residuals in four different ways. Let’s take a look at the first type of
plot:
1. Residuals vs Fitted
This plot shows if residuals have non-linear patterns. There could be a non-linear relationship
between predictor variables and an outcome variable, and the pattern could show up in this plot
if the model doesn’t capture the non-linear relationship. If you find equally spread residuals
around a horizontal line without distinct patterns, that is a good indication you don’t have non-
linear relationships.
Let’s look at residual plots from a ‘good’ model and a ‘bad’ model. The good model data are
simulated in a way that meets the regression assumptions very well, while the bad model data are
not.
I don’t see any distinctive pattern in Case 1, but I see a parabola in Case 2, where the non-linear
relationship was not explained by the model and was left out in the residuals.
2. Normal Q-Q
This plot shows if residuals are normally distributed. Do residuals follow a straight line well or
do they deviate severely? It’s good if residuals are lined well on the straight dashed line.
3. Scale-Location
It’s also called a Spread-Location plot. This plot shows if residuals are spread equally along the
ranges of predictors. This is how you can check the assumption of equal variance
(homoscedasticity). It’s good if you see a horizontal line with equally (randomly) spread points.
In Case 1, the residuals appear randomly spread, whereas in Case 2, the residuals begin to spread wider along the x-axis as the fitted values increase. Because the residuals spread wider and wider, the red smooth line is not horizontal and shows a steep angle in Case 2.
4. Residuals vs Leverage
This plot helps us to find influential cases (i.e., subjects) if there are any. Not all outliers are
influential in linear regression analysis (whatever outliers mean). Even though data have extreme
values, they might not be influential to determine a regression line. That means the results
wouldn’t be much different if we either include or exclude them from analysis. They follow the
trend in the majority of cases and they don’t really matter; they are not influential. On the other
hand, some cases could be very influential even if they look to be within a reasonable range of
the values. They could be extreme cases against a regression line and can alter the results if we
exclude them from analysis. Another way to put it is that they don’t get along with the trend in
the majority of the cases.
Unlike the other plots, this time patterns are not relevant. We watch out for outlying values at the
upper right corner or at the lower right corner. Those spots are the places where cases can be
influential against a regression line. Look for cases outside of the dashed lines. When cases are
outside of the dashed lines (meaning they have high "Cook’s distance" scores), the cases are
influential to the regression results. The regression results will be altered if we exclude those
cases.
Case 1 is the typical look when there is no influential case, or cases. You can barely see Cook’s
distance lines (red dashed lines) because all cases are well inside of the Cook’s distance lines. In
Case 2, a case is far beyond the Cook’s distance lines (the other residuals appear clustered on the
left because the second plot is scaled to show larger area than the first plot).
The four plots show potential problematic cases with the row numbers of the cases in the data
set.
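Comparable diagnostic plots (residuals vs fitted, normal Q-Q, scale-location) can be drawn in Python from a fitted statsmodels model, as in this sketch on invented data:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Invented data with a roughly linear relationship
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 100)
y = 3 + 2 * x + rng.normal(0, 1.5, 100)

model = sm.OLS(y, sm.add_constant(x)).fit()
resid, fitted = model.resid, model.fittedvalues

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].scatter(fitted, resid)                      # Residuals vs Fitted
axes[0].axhline(0, color="red")
axes[0].set_title("Residuals vs Fitted")
sm.qqplot(resid, line="45", fit=True, ax=axes[1])   # Normal Q-Q
axes[1].set_title("Normal Q-Q")
axes[2].scatter(fitted, np.sqrt(np.abs(resid)))     # Scale-Location
axes[2].set_title("Scale-Location")
plt.tight_layout()
plt.show()

# Influence measures such as Cook's distance (used in Residuals vs Leverage)
print(model.get_influence().cooks_distance[0][:5])
```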
Linear regression is a method we can use to quantify the relationship between one or more
predictor variables and a response variable.
However, sometimes we wish to use categorical variables as predictor variables. These are variables that take on names or labels and can fit into categories, such as eye color, gender, or marital status.
When using categorical variables, it doesn't make sense to just assign values like 1, 2, 3 to values like "blue", "green", and "brown", because it doesn't make sense to say that green is twice as colorful as blue or that brown is three times as colorful as blue.
Instead, the solution is to use dummy variables. These are variables that we create specifically for
regression analysis that take on one of two values: zero or one.
Dummy Variables: Numeric variables used in regression analysis to represent categorical data
that can only take on one of two values: zero or one.
The number of dummy variables we must create is equal to k-1 where k is the number of
different values that the categorical variable can take on.
The following examples illustrate how to create dummy variables for different datasets.
Suppose we have the following dataset and we would like to use gender and age to
predict income:
To use gender as a predictor variable in a regression model, we must convert it into a dummy
variable.
Since it is currently a categorical variable that can take on two different values (“Male” or
“Female”), we only need to create k-1 = 2-1 = 1 dummy variable.
To create this dummy variable, we can choose one of the values (“Male” or “Female”) to
represent 0 and the other to represent 1.
In general, we usually represent the most frequently occurring value with a 0, which would be
“Male” in this dataset.
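In Python, this conversion is usually done with pandas, as sketched below on a tiny made-up dataset (column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical dataset with categorical gender and marital status columns
df = pd.DataFrame({
    "gender":  ["Male", "Female", "Male", "Female", "Male"],
    "marital": ["Single", "Married", "Divorced", "Single", "Married"],
    "age":     [23, 31, 45, 29, 52],
    "income":  [42000, 55000, 61000, 48000, 73000],
})

# Gender (k = 2): a single dummy, with "Male" (the more frequent value) as the 0 baseline
df["gender_dummy"] = (df["gender"] == "Female").astype(int)

# Marital status (k = 3): k - 1 = 2 dummies, with "Single" as the baseline (0 on both)
dummies = pd.get_dummies(df["marital"])[["Married", "Divorced"]].astype(int)
df = pd.concat([df.drop(columns=["gender", "marital"]), dummies], axis=1)
print(df)
```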
We could then use Age and Gender Dummy as predictor variables in a regression model.
Suppose we have the following dataset and we would like to use marital status and age to
predict income:
To use marital status as a predictor variable in a regression model, we must convert it into a
dummy variable.
Since it is currently a categorical variable that can take on three different values (“Single”,
“Married”, or “Divorced”), we need to create k-1 = 3-1 = 2 dummy variables.
To create these dummy variables, we can let "Single" be our baseline value since it occurs most often. Thus, we create a "Married" dummy (equal to 1 if the individual is married, 0 otherwise) and a "Divorced" dummy (equal to 1 if the individual is divorced, 0 otherwise); a single individual is coded 0 on both.
We could then use Age, Married, and Divorced as predictor variables in a regression model.
Suppose we fit a multiple linear regression model using the dataset in the previous example
with Age, Married, and Divorced as the predictor variables and Income as the response variable.
We can use the fitted regression equation to find the estimated income for an individual based on their age and marital status. For example, an individual who is 35 years old and married is estimated to have an income of $68,264.
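The coefficients interpreted below imply a fitted equation of roughly Income = 14,276 + 1,471.67(Age) + 2,479.75(Married) − 8,397.40(Divorced); note that the intercept shown here is back-solved from the $68,264 example rather than stated in the notes. A quick check in Python:

```python
# Approximate fitted equation implied by the coefficients discussed below;
# the intercept (~14,276) is back-solved from the $68,264 example, not given in the notes.
def predicted_income(age, married=0, divorced=0):
    return 14276 + 1471.67 * age + 2479.75 * married - 8397.40 * divorced

# 35-year-old married individual -> approximately $68,264
print(round(predicted_income(35, married=1)))
```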
Intercept: The intercept represents the average income for a single individual who is zero years
old. Obviously you can’t be zero years old, so it doesn’t make sense to interpret the intercept by
itself in this particular regression model.
Age: Each one year increase in age is associated with an average increase of $1,471.67 in
income. Since the p-value (.00) is less than .05, age is a statistically significant predictor of
income.
Married: A married individual, on average, earns $2,479.75 more than a single individual. Since
the p-value (0.80) is not less than .05, this difference is not statistically significant.
Divorced: A divorced individual, on average, earns $8,397.40 less than a single individual.
Since the p-value (0.53) is not less than .05, this difference is not statistically significant.
Since both dummy variables were not statistically significant, we could drop marital status as a predictor from the model because it doesn't appear to add any predictive value for income.
A dummy variable is a binary variable that takes a value of 0 or 1. One adds such variables to a
regression model to represent factors which are of a binary nature i.e. they are either observed or
not observed.
For representing a Yes/No property: To indicate whether a data point has a certain property.
For example, a dummy variable can be used to indicate whether a car engine is of type
‘Standard’ or ‘Turbo’. Or if a participant in a drug trial belongs to the placebo group or the
treatment group.
For representing a categorical value: A related use of dummies is to indicate which one of a
set of categorical values a data point belongs to. For example, a vehicle’s body style could be
one of convertible, hatchback, coupe, sedan, or wagon. In this case, we would add five dummy
variables to the data set, one for each of the 5 body styles and we would ‘one hot encode’ this
five element vector of dummies. Thus, the vector [0, 1, 0, 0, 0] would represent all hatchbacks in
the data set.
For representing a seasonal period: A dummy variable can be added to represent each one of
the possibly many seasonal periods contained in the data. For example, the flow of traffic
through intersections often exhibits seasonality at an hourly level (they are highest during the
morning and evening rush hours) and also a weekly period (lowest on Sundays). Adding dummy
variables to the data for each of the two seasonal periods will allow you explain away much of
the variation in the traffic flow that is attributable to daily and weekly variations.
For representing Fixed Effects: While building regression models for panel data sets, dummies
can be used to represent ‘unit-specific’ and ‘time-specific’ effects, especially in a Fixed Effects
regression model.
For representing Treatment Effects: In a treatment effects model, a dummy variable can be
used to represent the effect of both time (i.e. the effect before and after treatment is applied), the
effect of group membership (whether the participant received the treatment or the placebo), and
the effect of the interaction between the time and group memberships.
In regression discontinuity designs: This is best explained with an example. Imagine a data set
of monthly employment rate numbers that contains a sudden, sharp increase in the
unemployment rate caused by a brief and severe recession. For this data, a regression model used
for modeling the unemployment rate can deploy a dummy variable to estimate the expected
impact of the recession on the unemployment rate.
1. A dummy variable takes on 1 and 0 only. The number 1 and 0 have no numerical
(quantitative) meaning. The two numbers are used to represent groups. In short dummy variable
is categorical (qualitative).
(a) For instance, we may have a sample (or population) that includes both female and male. Then
a dummy variable can be defined as D = 1 for female and D = 0 for male. Such a dummy
variable divides the sample into two subsamples (or two sub-populations): one for female and
one for male.
(b) A dummy variable follows the Bernoulli distribution, which is characterized by the parameter p:
D = 1 with probability p, and D = 0 with probability 1 − p.
Consider the regression Y = β0 + β1D + u. Then
Y = β0 + u when D = 0, and Y = β0 + β1 + u when D = 1.
Taking expectations,
E(Y | D = 0) = β0 and E(Y | D = 1) = β0 + β1,
so that
β0 = E(Y | D = 0)
β1 = E(Y | D = 1) − E(Y | D = 0)
Therefore β0 is the mean of Y conditional on D = 0 (the mean of Y in the subpopulation with D = 0), and β1 is the difference between the conditional means of Y for the two groups.
A Logit model, also known as logistic regression, is a statistical technique used to analyze the
relationship between one or more independent variables and a binary dependent variable. The
dependent variable takes on only two possible values, typically coded as 0 or 1, representing a
"failure" or a "success", respectively.
Social Sciences: Logit model can be used to model the likelihood of a certain behavior or
outcome occurring in a social context, such as the likelihood of an individual engaging in risky
behavior or the likelihood of a community experiencing a social issue like homelessness or
substance abuse.
Political Science and Public Opinion: Logit model can be used to model voting behavior and
public opinion on certain issues. By identifying the factors that influence voting and opinion,
policymakers can better understand and respond to the concerns of their constituents.
A Probit model is a statistical model that is used to analyze the relationship between one or more
independent variables and a binary dependent variable, with the difference being that it assumes
that the errors of the dependent variable follow a normal distribution rather than a logistic
distribution.
Here are some common applications of Probit model:
Economics and Finance: Probit model can be used to model financial decision-making, such as
the likelihood of a borrower defaulting on a loan or the likelihood of an investor making a certain
investment. This can help investors and financial institutions to better understand and manage
risk.
Marketing and Consumer Research: Probit model can be used to model consumer behavior, such
as the likelihood of a customer making a purchase or choosing a certain brand. This can help
businesses to better target their marketing campaigns and optimize their product offerings.
Health Outcomes and Policy: Probit model can be used to model the likelihood of a patient
experiencing a certain health outcome, such as the likelihood of a patient developing a certain
disease. This can help clinicians to predict and prevent adverse health outcomes and inform
clinical decision-making.
Environmental and Agricultural Sciences: Probit model can be used to model the likelihood of a
certain outcome occurring in a natural environment, such as the likelihood of a species becoming
extinct or the likelihood of a crop being affected by a certain pest. This can help policymakers to
design and implement effective environmental and agricultural policies.
Social Sciences: Probit model can be used to model the likelihood of a certain behavior or
outcome occurring in a social context, such as the likelihood of an individual engaging in risky
behavior or the likelihood of a community experiencing a social issue like poverty or inequality.
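Both the Logit and the Probit model can be estimated in Python with statsmodels, as in this sketch on invented binary-outcome data:

```python
import numpy as np
import statsmodels.api as sm

# Invented data: one predictor x and a binary outcome y (0 = failure, 1 = success)
rng = np.random.default_rng(4)
x = rng.normal(size=200)
p = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))   # true success probability
y = rng.binomial(1, p)

X = sm.add_constant(x)

logit_fit = sm.Logit(y, X).fit(disp=0)    # errors assumed to follow a logistic distribution
probit_fit = sm.Probit(y, X).fit(disp=0)  # errors assumed to follow a normal distribution

print(logit_fit.params)   # intercept and slope on the log-odds scale
print(probit_fit.params)  # intercept and slope on the probit (z) scale

# Predicted probability of "success" at x = 1 from each model
print(logit_fit.predict([1, 1.0]), probit_fit.predict([1, 1.0]))
```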
Different people can have different perspectives on how to forecast: one may take the mean of all observations, another the mean of the two most recent observations, another may give more weight to the current observation and less to the past, and yet another may use interpolation. There are many different methods to forecast future values.
While forecasting time series values, three important components need to be taken care of, and the main task of time series forecasting is to model these three components.
1) Seasonality
Seasonality means that, in a particular domain, there are certain months (or periods) in which the output value peaks compared with other months. For example, if you observe the data of tour and travel companies for the past three years, you will see that demand is very high in November and December because of the holiday and festival season. While forecasting time series data, we need to capture this seasonality.
2) Trend
The trend describes whether the series is, on the whole, increasing or decreasing over time; in other words, whether the value of the organization's sales (or whatever is being measured) is rising or falling over the period, once seasonality is accounted for.
3) Unexpected Events
Unexpected events are sudden, dynamic changes in an organization or in the market that cannot be anticipated from the data. For example, during the recent pandemic the Sensex and Nifty charts showed a huge fall in stock prices, which is an unexpected event arising from the environment.
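As a minimal sketch of separating these components (assuming the statsmodels and pandas libraries; the monthly sales series below is simulated purely for illustration), a classical additive decomposition can be run as follows:

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

months = pd.date_range("2019-01-01", periods=48, freq="MS")      # four years of monthly dates
trend = np.linspace(100, 160, 48)                                 # gradually rising level
seasonal = 20 * np.sin(2 * np.pi * (months.month - 1) / 12)       # yearly seasonal swing
noise = np.random.default_rng(3).normal(0, 5, 48)
sales = pd.Series(trend + seasonal + noise, index=months)

result = seasonal_decompose(sales, model="additive", period=12)
print(result.trend.dropna().head())                               # estimated trend component
print(result.seasonal.head(12))                                   # estimated monthly seasonal effects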
Quantitative variables have numerical values that can be plotted over time. Qualitative variables, on the other hand, have categorical or non-numerical values. For instance, a website's name, the type of product sold, or the region where a company operates are qualitative variables. Typically, they are not plotted on a time-series chart as they do not have a numerical
value that can be represented on a continuous scale. However, we can use them to categorize/group
the data for analysis.
To interpret a time-series plot, you must understand the data patterns over time. These are the
key factors to consider when interpreting a time-series plot:
Seasonality: It refers to recurring patterns/cycles that occur over a particular period. The patterns
can be weekly, monthly, quarterly, or annual. For instance, winter coat sales typically show a
seasonal pattern as they increase in the fall and winter and decrease in the spring and summer.
Trend: It refers to the general direction the data moves in over time. A trend can either be
upward, downward or remain constant. A positive trend indicates that the values are increasing
over time, while a negative trend shows the values are decreasing over time. A horizontal trend
suggests that the values remain constant over time.
Outliers: They refer to values that lie outside the usual pattern of the data. External factors or
random events can cause outliers and can significantly impact the interpretation of the time-
series plot.
Level: It refers to the average data value over the entire time period in a time-series plot. A
higher level indicates that the values are higher overall and vice versa.
What is Trend Analysis?
Trend analysis is a technique used in technical analysis that attempts to predict future stock
price movements based on recently observed trend data.
A trend is a general direction the market is taking during a specified period of time. Trends can
be both upward and downward, relating to bullish and bearish markets, respectively. While
there is no specified minimum amount of time required for a direction to be considered a trend,
the longer the direction is maintained, the more notable the trend.
There are three main types of market trend for analysts to consider:
1. Upward trend: An upward trend, also known as a bull market, is a sustained period of
rising prices in a particular security or market. Upward trends are generally seen as a
sign of economic strength and can be driven by factors such as strong demand, rising
profits, and favorable economic conditions.
2. Downward trend: A downward trend, also known as a bear market, is a sustained
period of falling prices in a particular security or market. Downward trends are generally
seen as a sign of economic weakness and can be driven by factors such as weak demand,
declining profits, and unfavorable economic conditions.
3. Sideways trend: A sideways trend, also known as a rangebound market, is a period of
relatively stable prices in a particular security or market. Sideways trends can be
characterized by a lack of clear direction, with prices fluctuating within a relatively
narrow range.
Traders commonly use the following strategies to take advantage of trends:
Moving Averages: These strategies involve entering long positions when a short-term moving average crosses above a long-term moving average, and entering short positions when a short-term moving average crosses below a long-term moving average (a code sketch of this crossover rule follows this list).
Momentum Indicators: These strategies involve entering into long positions when a
security is trending with strong momentum and exiting long positions when a security
loses momentum. Often, the relative strength index (RSI) is used in these strategies.
Trendlines & Chart Patterns: These strategies involve entering long positions when a
security is trending higher and placing a stop-loss below key trendline support levels. If
the stock starts to reverse, the position is exited for a profit.
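The following is a minimal sketch of the moving-average crossover rule (assuming pandas and NumPy; the simulated price series and the 20-day and 50-day windows are illustrative choices, not recommendations):

import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
prices = pd.Series(100 + np.cumsum(rng.normal(0.1, 1, 250)))   # simulated daily closing prices

short_ma = prices.rolling(window=20).mean()                    # short-term moving average
long_ma = prices.rolling(window=50).mean()                     # long-term moving average

signal = np.where(short_ma > long_ma, 1, -1)                   # 1 = long position, -1 = short position
print(pd.Series(signal, index=prices.index).tail())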
Trend analysis can offer several advantages for investors and traders. It is a powerful tool for
investors and traders as it can help identify opportunities for buying or selling securities,
minimize risk, improve decision-making, and enhance portfolio performance.
Trend analysis can be based on a variety of data points, including financial statements,
economic indicators, and market data, and there are several different methods that can be used
to analyze trends, including technical analysis and fundamental analysis. By providing a deeper
understanding of the factors that are driving trends in data, trend analysis can help investors and
traders make more informed and confident decisions about their investments.
Disadvantages
Trend analysis can have some potential disadvantages as a tool for making investment
decisions. One of these disadvantages is that the accuracy of the analysis depends on the quality
of the data being used. If the data is incomplete, inaccurate, or otherwise flawed, the analysis
may be misleading or inaccurate.
Another potential disadvantage is that trend analysis is based on historical data, which means it
can only provide a limited perspective on the future. While trends in data can provide useful
insights, it's important to remember that the future is not necessarily predetermined by the past,
and unexpected events or changes in market conditions can disrupt trends. Trend analysis is also
focused on identifying patterns in data over a given period of time, which means it may not
consider other important factors that could impact the performance of a security or market.
Most financial, investment, and business decisions are taken on the basis of forecasts of future changes and demand in the financial domain.
Time series analysis and forecasting are essential processes for explaining the dynamic and influential behaviour of financial markets. By examining financial data, an expert can produce the forecasts required for important financial applications in several areas such as risk evaluation, option pricing and trading, portfolio construction, etc.
For example, time series analysis has become the intrinsic part of financial analysis and
can be used in predicting interest rates, foreign currency risk, volatility in stock markets
and many more. Policymakers and business experts use financial forecasting to make
decisions about production, purchases, market sustainability, allocation of resources, etc.
In investment, this analysis is employed to track the price fluctuations and price of a
security over time. For instance, the price of a security can be recorded;
For the short term, such as the observation per hour for a business day, and
For the long term, such as observation at the month end for five years.
Time series analysis is extremely useful for observing how a given asset, security, or economic variable changes over time. For example, it can be used to examine how the changes associated with a chosen data point compare with shifts in other data points over the same period.
Medicine has evolved into a data-driven field, and time series analysis continues to contribute enormously to developments in medical knowledge.
Case study
Consider the case of combining time series analysis with case-based reasoning (CBR), a medical reasoning method, and data mining. These synergies are essential for pre-processing and feature mining from time series data and can be used to study the progress of patients over time.
However, time series analysis has emerged only recently and incrementally in the epidemiology domain, because time series approaches demand record-keeping systems in which records are linked over time and collected precisely at regular intervals.
Once governments put sufficient scientific instruments in place to accumulate good, lengthy temporal data, healthcare applications of time series analysis delivered strong prognostic value for the industry as well as for individuals' health diagnoses.
Medical Instruments
Time series analysis made its way into medicine with the advent of medical devices that record physiological signals over time, such as the electrocardiogram (ECG) and the electroencephalogram (EEG). These inventions created more opportunities for medical practitioners to deploy time series for medical diagnosis.
With the advent of wearable sensors and smart electronic healthcare devices, now persons
can take regular measurements automatically with minimal inputs, resulting in a good
collection of longitudinal medical data for both sick and healthy individuals consistently.
Another contemporary application where time series plays a significant role is in different areas of astronomy and astrophysics.
Astronomy relies heavily on plotting objects and trajectories and on making accurate measurements, and because of this, astronomers have long been proficient in using time series to calibrate instruments and study the objects of their interest.
Time series data has had an intrinsic impact on our knowledge and measurement of the universe and has a long history in astronomy; for example, sunspot time series were recorded in China as early as 800 BC, making sunspots one of the best-recorded natural phenomena.
Modern astronomers use time series data, for example:
To discover variable stars that are used to infer stellar distances, and
To observe transitory events such as supernovae to understand the mechanism of the changing of the universe with time.
Such discoveries are the result of constant monitoring of live streams of time series data on the wavelengths and intensities of light, which allows astronomers to catch events as they are occurring.
In the last few decades, data-driven astronomy has introduced novel areas of research such as astroinformatics and astrostatistics; these paradigms involve major disciplines such
as statistics, data mining, machine learning and computational intelligence. And here, the
role of time series analysis would be detecting and classifying astronomical objects swiftly
along with the characterization of novel phenomena independently.
In ancient times, the Greek philosopher Aristotle studied weather phenomena with the aim of identifying causes and effects in weather changes. Later, scientists began to accumulate weather-related data using the barometer to measure the state of atmospheric conditions; they recorded weather data at hourly or daily intervals and kept the records at different locations.
Over time, customized weather forecasts began to be printed in newspapers, and with later advances in technology, today's forecasts go well beyond general weather conditions.
Modern weather stations are equipped with highly functional devices and are interconnected with one another to accumulate weather data at different geographical locations and to forecast weather conditions at any moment as required.
Time series forecasting helps businesses make informed decisions; because the process analyzes past data patterns, it is useful for forecasting future possibilities and events in the following ways:
Reliability: When the data spans a broad range of time intervals, with a large number of observations over a long period, time series forecasting is highly reliable. It provides illuminating information by exploiting data observations at various time intervals.
Growth: Time series is a highly suitable tool for evaluating overall financial performance and growth, including endogenous growth. Basically, endogenous growth is the progress generated within an organization's internal human capital that results in economic growth. For example, the impact of any policy variable can be studied by applying time series forecasting.
Trend estimation: Time series methods can be used to discover trends; for example, they inspect data observations to identify when measurements reflect a decrease or increase in sales of a particular product.
Seasonal patterns: Variances in recorded data points can reveal seasonal patterns and fluctuations that serve as a basis for forecasting. The information obtained is significant for markets whose products fluctuate seasonally and helps organizations plan product development and delivery requirements.
An autoregressive model is a process used to predict the future based on accumulated data from
the past. It is possible because there is a correlation between the two. Such a model can represent
any random procedure where the output is dependent on any previous values.
This model is often used to predict the future trend in stock prices by analyzing past
performance. Thus, it assumes that the future result will be similar to the previous years.
However, this is only sometimes acceptable because, due to continuous global technological and
economic changes, there is no guarantee that the future will reflect the past.
The autoregressive (AR) model predicts the future based on past data or information. It helps in stock price forecasting on the assumption that prices in previous periods genuinely reflect what will happen going forward.
In an autoregressive model time series is calculated based on the correlation of past and future
data. So, it is a statistical method for any fundamental or technical analysis. But the downside of
this model is the assumption that all forces or factors that affected past performance will remain
the same, which is unrealistic since change is inevitable in all fields. There is a rapid
transformation all around due to endless innovation taking place.
A vector autoregressive model, for instance, consists of multiple variables that attempt to
correlate a variable’s present values with its past values and the system’s past data of other
variables. Thus, it is a multivariate model. If an AR model is univariate, it is impossible to get a
two-way result between the variables.
The use of the autoregressive process to make forecasts is very significant. However, these
models are also stochastic, meaning they have an element of uncertainty. Any unforeseen contingency or sudden shift in the economy will significantly affect future values, which means the result will never be perfectly accurate. Nevertheless, it is possible to get close to the actual outcome.
A first-order autoregressive model assumes that the immediately previous value decides the
current value. However, there might be cases that the present value will depend on two previous
values. Thus, in an autoregressive model, time series plays an important role and is used
depending on the situation and desired result.
Formula
In this model, lagged values of Xt serve as the predictor variables; that is, past or current values affect future outcomes. The first-order model, AR(1), is written as:
Xt = C + ϕ1Xt-1 + ϵt
where,
Xt-1 = the value of X in the previous period (year/month/week). If "t" is the current period, then "t-1" is the previous one.
ϕ1 = the coefficient multiplying Xt-1. For a stationary process, its value lies between -1 and 1.
ϵt = the error term, i.e. the difference between the actual value at period t and the model's predicted value (ϵt = Xt – X̂t)
p = the order of the model. Thus, AR (1) is a first-order autoregressive model; the second- and third-order models would be AR (2) and AR (3), respectively.
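The following is a minimal sketch of fitting an AR(1) model (assuming the statsmodels library; the simulated series with C = 2 and ϕ1 = 0.6 is purely illustrative):

import numpy as np
from statsmodels.tsa.ar_model import AutoReg

rng = np.random.default_rng(5)
x = np.zeros(200)
for t in range(1, 200):                               # simulate an AR(1) process: Xt = 2 + 0.6*Xt-1 + error
    x[t] = 2 + 0.6 * x[t - 1] + rng.normal(0, 1)

model = AutoReg(x, lags=1).fit()                      # fit a first-order autoregressive model
print(model.params)                                   # estimated C and phi1
print(model.predict(start=len(x), end=len(x) + 4))    # forecast the next five values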
Examples
Example #1
John is an investor in the stock market. He analyses stocks based on past data related to the
company’s performance and statistics. John believes that the performance of the stocks in the
previous years strongly correlates with the future, which is beneficial to
making investment decisions.
He uses an autoregressive model with price data for the previous five years. The result gives him
an estimate for future prices depending on the assumption that sellers and buyers follow the
market movements and accordingly make investment decisions.
Example #2
The concept of AR models has gained importance in the information technology field. Google has proposed Autoregressive Diffusion Models (ARDMs), which encompass and generalize models that can work with any ordering of the data. The model can be trained to achieve the desired result, and the method can therefore generate outcomes under any order.
Example #3
The autoregression process can also be helpful in the veterinary field, where the main focus is on the occurrence of a disease over time. In this case, the primary source of information is the systems used to monitor and track the details of animal disease. This data is analyzed and correlated using the model to understand the possibility of any disease occurrence. However, the model has limited use in the veterinary field due to limited data availability and the need for suitable software to generate the best results.
The autoregressive (AR) model can be contrasted with the moving average (MA) model as follows:
Autoregressive (AR) model:
o The values at various past time slots impact the current value.
o It puts data from previous time periods into the regression equation to obtain the next value.
o The correlation between observations of the time series decreases as the time gap increases.
Moving average (MA) model:
o External factors (past shocks) affect the current period.
o It states that the next value will be based on an average of past values (error terms).
o The correlation between observations of the time series at different points in time is zero.
Predictive analytics is a powerful technique that ‘predicts’ the future, in a sense. It can help
answer key questions, such as how many products a business could sell in the next three months
and how much profit it is likely to make.
Using sales as an example, it’s essential to know past sales data in order to predict future sales.
The past sales data and cleaned data from descriptive analytics are mixed to create a dataset to
train an ML model.
The built model predicts future sales, say, for the next few months. The predicted quantities sold
and profits made are compared with the actual numbers sold and profits made. The actual profits
could be more or less than what was predicted. The model is refined to overcome such
limitations and improve the accuracy of predictions.
Types of analytics
There are four types of analytics: descriptive, diagnostic, predictive, and prescriptive.
Descriptive analytics deals with the cleaning, relating, summarizing, and visualizing of
given data to identify patterns.
Diagnostic analytics deals with analyzing why something is happening. For example,
investigating the reason behind the decline or growth of revenue.
Predictive analytics involves predicting future outcomes or unknown events using
machine learning and statistical algorithms.
Prescriptive analytics uses descriptive and predictive sources to assist with decision-
making.
There are many scenarios where there may be an abundance of data. However, there may be no
algorithms available to train machines to perform certain tasks. In this case, we want the
machines to learn from the data and apply the learning to unseen inputs. This is referred to as
machine learning. For example, if we want to know employee churn rate, we can use a machine
learning model that has been trained on past data to predict if an employee will leave.
ML is used when we cannot explicitly estimate all the possible cases of an event occurring and
write a piece of code for each. For example, what are the rules to predict if content posted on a
video-hosting platform is for kids or adults? How do we predict the genre of a new show? There
are millions of videos uploaded every day. Examining and analyzing each of them manually is
impossible. This is where ML comes into play as the algorithms can process enormous amounts
of structured (data in rows and columns) and unstructured (images, videos, text with emoticons,
etc.) data.
We begin by understanding and defining the problem statement, and deciding on the required
datasets on which to perform predictive analytics.
Example: There is a grocery store. Our objective is to predict the sales of groceries for the next
six months. Here, past sales data of how many groceries were sold and the resulting profits of the
last five years will be the dataset.
Once we know what sort of dataset is needed to perform predictive analytics using machine
learning, we gather all the necessary details that constitute the dataset. We need to ensure that the
historical data is collected from an authorized source.
Using the grocery store example, we can ask the accountant for records of past sales logged in
worksheets or billing software. We collect data spanning the past five years.
The raw dataset obtained will have some missing data, redundancies, and errors. Since we cannot
train the model for predictive analytics directly with such noisy data, we need to clean it. Known
as preprocessing, this step involves refining the dataset by eradicating unnecessary and duplicate
data.
Exploratory data analysis (EDA) involves exploring the dataset thoroughly in order to identify trends, discover anomalies, and check assumptions. It summarizes a dataset's main characteristics and often uses data visualization techniques.
Based on the patterns observed in step 4, we build a predictive statistical machine learning
model, trained with the cleaned dataset obtained after step 3. This machine learning algorithm
helps us perform predictive analytics to foresee the future of our grocery store business. The
model can be implemented using Python, R, or MATLAB.
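As a minimal sketch in Python (using scikit-learn and made-up monthly sales figures; the grocery numbers below are illustrative assumptions, not real data), a simple predictive model for the next six months could look like this:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(11)
months = np.arange(60).reshape(-1, 1)                          # 60 months (5 years) of history
sales = 2000 + 15 * months.ravel() + rng.normal(0, 100, 60)    # hypothetical monthly grocery sales

model = LinearRegression().fit(months, sales)                  # train on the cleaned historical data
future = np.arange(60, 66).reshape(-1, 1)                      # the next six months
print(model.predict(future))                                   # predicted sales for months 61 to 66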
Hypothesis testing
Hypothesis testing can be performed using a standard statistical model. It includes two
hypotheses, null and alternate. We either reject or fail to reject the null hypothesis.
Example: A new 'buy one, get one free' scheme is implemented where customers buy a packet of soap and get a face wash for free. Consider the two cases below:
Case 1: Sales after the scheme are not significantly different from sales before the scheme (the null hypothesis).
Case 2: Sales after the scheme are significantly higher than before the scheme (the alternate hypothesis).
If the first case is true, we fail to reject the null hypothesis as there is no improvement. If the second case is true, we reject the null hypothesis.
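A minimal sketch of such a test (assuming the scipy library and simulated daily sales before and after the scheme; the numbers are illustrative only) is a two-sample t-test:

import numpy as np
from scipy import stats

rng = np.random.default_rng(21)
sales_before = rng.normal(500, 40, 30)          # daily soap sales before the scheme (simulated)
sales_after = rng.normal(530, 40, 30)           # daily soap sales after the scheme (simulated)

t_stat, p_value = stats.ttest_ind(sales_after, sales_before, alternative="greater")
print(t_stat, p_value)                          # reject the null hypothesis if p_value < 0.05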
This is a crucial step wherein we check the efficiency of the model by testing it with unseen
input datasets. Depending on the extent to which it makes correct predictions, the model is
retrained and evaluated.
The model is made available for use in a real-world environment by deploying it on a cloud
computing platform so that users can utilize it. Here, the model will make predictions on real-
time inputs from the users.
Now that the model is functioning in the real world, we need to verify its performance. Model
monitoring refers to examining how the model predicts actual datasets. If any improvement must
be made, the dataset is expanded and the model is rebuilt and redeployed.
Predictive analytics continues to be improved with machine learning algorithms. The eight use
cases discussed below illustrate how.
E-commerce/retail
Predictive analytics achieved through machine learning helps retailers understand customers’
preferences. It works by analyzing users’ browsing patterns and how frequently a product is
clicked on in a website. For example, when we purchase a t-shirt on an e-commerce site, similar
shirts are suggested the next time we log in. Sometimes, we may be recommended several
specific items that are often purchased together for x amount of money. Such personalized
recommendations help retailers retain customers. Predictive analytics also helps maintain
inventory by foreseeing and informing sellers about stockouts.
Customer service
Predictive analytics using machine learning can also detect dissatisfied customers and help
sellers design products aimed to retain existing customers and attract new ones.
Medical diagnosis
Machine learning models that are trained on large and varied datasets can study patient
symptoms comprehensively to provide faster and more accurate diagnoses. Performing
predictive analytics on the reasons behind past hospital readmissions can also improve care.
Further, hospitals can use predictive analytics to provide better care by anticipating increases in demand for hospital beds or staff shortages. For example, if the number of COVID
cases for the next month can be predicted and the rise in the number of severely infected can be
forecasted, hospitals can make arrangements to deal with such a scenario more efficiently.
Predictive analytics of historical data of customer behavior and market trends can help
businesses understand the demands of prospective customers. Companies can achieve higher
targets by streamlining their sales and marketing activities into a data-based undertaking.
Demand forecasting also helps businesses estimate the demand for certain products in the future.
Financial services
Predictive analytics using machine learning helps detect fraudulent activities in the financial
sector. Fraudulent transactions are identified by training machine learning algorithms with past
datasets. The models find risky patterns in these datasets and learn to predict and deter fraud.
Cybersecurity
Machine learning algorithms can analyze web traffic in real-time. When an unusual pattern is
observed, advanced statistical methods of predictive analytics foresee and prevent cyber-attacks.
They also automatically collect attack-related data and generate useful reports on a cyber-attack,
thereby reducing the need for manpower.
Manufacturing
Machine learning and predictive analytics help manufacturers monitor machines and notify them
when crucial components need to be repaired or replaced. They can also predict market
fluctuations, reduce the number of accidents, improve key performance indicators (KPIs), and
enhance overall production quality.
Predictive analytics using machine learning identifies employee churn rate and keeps human
resources (HR) departments informed of the same. Models can be trained with datasets that have
details such as an employee's monthly income, allowances, increments, insurance, and so on.
The models learn from past records of ex-employees and find patterns to understand the reasons
for leaving. They then predict if new employees are likely to resign or not, empowering HR to
minimize the risk.
Regression analysis
Regression analysis estimates the relationship between a dependent variable and one or more independent variables, and is typically used to predict continuous outcomes such as sales, prices, or demand.
Decision trees
Decision trees are classification models that place data into different categories based on distinct
variables. The method is best used when trying to understand an individual's decisions. The
model looks like a tree, with each branch representing a potential choice, with the leaf of the
branch representing the result of the decision. Decision trees are typically easy to understand and
work well when a dataset has several missing variables.
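A minimal sketch of a decision tree classifier (using scikit-learn and its built-in iris dataset as a stand-in example; the depth limit of 3 is an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))                        # the branches (choices) and leaves (outcomes) of the tree
print(tree.predict(X[:5]))                      # predicted categories for the first five records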
Neural networks
Neural networks are machine learning methods that are useful in predictive analytics when
modeling very complex relationships. Essentially, they are powerhouse pattern recognition
engines. Neural networks are best used to determine nonlinear relationships in datasets,
especially when no known mathematical formula exists to analyze the data. Neural networks can
be used to validate the results of decision trees and regression models.
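As a minimal sketch of a neural network learning a nonlinear relationship (using scikit-learn's MLPClassifier on a synthetic two-class problem; the network size and other settings are illustrative assumptions):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)    # a nonlinearly separable two-class problem
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0)
net.fit(X_train, y_train)
print(net.score(X_test, y_test))                # accuracy on unseen data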
Every business seeks to grow. But only a handful of companies that successfully actualize this
vision do so through data-based decision making. And to make these informed decisions,
companies have been using machine learning-based predictive analytics.
Predictive analytics is predicting future outcomes based on historical and current data. It uses
various statistical and data modeling techniques to analyze past data, identify trends, and help
make informed business decisions. While previously, machine learning and predictive analytics
were viewed as two entirely different and unrelated concepts, the increasing demands of
effective data analytics have brought machine learning algorithms to intertwine with predictive
analytics. Today, predictive analytics extensively uses machine learning for data modeling due to
its ability to accurately process vast amounts of data and recognize patterns.
In this piece, we’ll learn in detail how machine learning analytics is helping companies predict
the future and make informed decisions.
The neural network is a system of hardware and software modeled on the central nervous system of humans, used to estimate functions that depend on vast amounts of unknown inputs. Neural
networks are specified by three things – architecture, activity rule, and learning rule.
According to Kaz Sato, Staff Developer Advocate at Google Cloud Platform, “A neural network
is a function that learns the expected output for a given input from training datasets”. A neural
network is an interconnected group of nodes. Each processing node has its small sphere of
knowledge, including what it has seen and any rules it was initially programmed with or
developed for itself.
In short, neural networks are adaptive and modify themselves as they learn from subsequent inputs. For example, consider a neural network that performs image recognition for 'humans'. The system is trained with a large number of samples of human and non-human images. The resulting network works as a function that takes an image as input and outputs the label 'human' or 'non-human'.
What are the key differences between predictive analytics and machine learning?
As noted, predictive analytics uses advanced mathematics to examine patterns in current and past
data in order to predict the future.
Machine learning is a tool that automates predictive modeling by generating training algorithms
to look for patterns and behaviors in data without explicitly being told what to look for.
Benefits and challenges of using predictive analytics and machine learning for businesses
Machine learning algorithms can produce more accurate predictions, create cleaner data and
empower predictive analytics to work faster and provide more insight with less oversight.
Having a strong predictive analysis model and clean data fuels the machine learning application.
While a combination of predictive analytics and ML does not necessarily provide more
applications, it does mean that the application can be trusted more. Splitting hairs between the
two shows that these terms are actually hierarchical and that when combined, they complete one
another to strengthen the enterprise.
Challenges: While the techniques associated with both predictive analytics and ML are
becoming embedded in software and result in so-called "one-click" forecasting, enterprises will
face the usual challenges associated with getting value out of data, starting with the data.
Corporate data is error-prone, inconsistent and incomplete. Finding the right data and preparing
it for processing is time consuming. Expertise in deploying and interpreting the predictive
models is scarce. Moreover, predictive analytics software is expensive, and so is the processing
required to make effective models. Lastly, machine learning technologies are evolving at a rapid
pace, requiring continuous scrutiny on how and when to upgrade to newer approaches.
Random Forest is a famous machine learning algorithm that uses supervised learning methods.
You can apply it to both classification and regression problems. It is based on ensemble learning,
which integrates multiple classifiers to solve a complex issue and increases the model's
performance.
In layman's terms, Random Forest is a classifier that contains several decision trees on various
subsets of a given dataset and takes the average to enhance the predicted accuracy of that dataset.
Instead of relying on a single decision tree, the random forest collects the results from each tree and predicts the final output based on the majority vote of those predictions.
The Working of the Random Forest Algorithm is quite intuitive. It is implemented in two phases:
The first is to combine N decision trees with building the random forest, and the second is to
make predictions for each tree created in the first phase.
Step 1: Select random subsets (samples) of data points from the training set.
Step 2: Create a decision tree for each chosen subset of data points.
Step 3: Choose the number N of decision trees you want to build, and repeat steps 1 and 2.
Step 4: For a new data point, obtain a prediction from every tree; the final output is based on majority voting for classification or averaging for regression.
Example - Consider the following scenario: a dataset containing images of several kinds of fruit is given to a Random Forest classifier. Each decision tree is trained on a different subset of the dataset, and during the training phase each tree generates its own prediction result. When a new data point appears, the Random Forest classifier predicts the final decision based on the majority of those outcomes.
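A minimal sketch of a random forest (using scikit-learn; the built-in iris dataset stands in for the fruit-image example, and 100 trees is an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)   # an ensemble of 100 decision trees
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))             # accuracy of the majority-vote predictions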
Although a random forest is a collection of decision trees, its behavior differs significantly.
We will differentiate Random Forest from Decision Trees based on 3 Important parameters:
Overfitting, Speed, and Process.
1. Overfitting - Overfitting is far less of a problem than with a single decision tree, since random forests are built from random subsets of the data and the final output is based on averaging or majority voting.
2. Speed - The Random Forest algorithm is relatively slower than a single decision tree.
3. Process - A random forest samples the data at random, builds a decision tree on each sample, and averages (or votes on) the results, rather than following the rules of a single tree.
Ensemble Learning
The ensemble methods in machine learning combine the insights obtained from multiple learning
models to facilitate accurate and improved decisions.
Example 1: If you are planning to buy an air-conditioner, would you enter a showroom and buy
the air-conditioner that the salesperson shows you? The answer is probably no. In this day and
age, you are likely to ask your friends, family, and colleagues for an opinion, do research on
various portals about different models, and visit a few review sites before making a purchase
decision. In a nutshell, you would not come to a conclusion directly. Instead, you would try to
make a more informed decision after considering diverse opinions and reviews. In the case of
ensemble learning, the same principle applies.
In learning models, noise, variance, and bias are the major sources of error. The ensemble
methods in machine learning help minimize these error-causing factors, thereby ensuring the
accuracy and stability of machine learning (ML) algorithms.
Example 2: Assume that you are developing an app for the travel industry. It is obvious that
before making the app public, you will want to get crucial feedback on bugs and potential
loopholes that are affecting the user experience. What are your available options for obtaining
critical feedback? 1) Soliciting opinions from your parents, spouse, or close friends. 2) Asking
your co-workers who travel regularly and then evaluating their response. 3) Rolling out your
travel and tourism app in beta to gather feedback from non-biased audiences and the travel
community.
In each case, you are taking into account different views and ideas from a wide range of people to fix issues that are limiting the user experience. Ensemble neural networks and ensemble algorithms do precisely the same thing.
Ex: Imagine a group of blindfolded people playing the touch-and-tell game, where they are asked
to touch and explore a mini donut factory that none of them has ever seen before. Since they are blindfolded, their version of what a mini donut factory looks like will vary, depending on the parts of the equipment they touch. Now, suppose they are personally asked to describe what they
touched. In that case, their individual experiences will give a precise description of specific parts
of the mini donut factory. Still, collectively, their combined experiences will provide a highly
detailed account of the entire equipment.
Ensemble methods in machine learning employ a set of models and take advantage of the
blended output, which, compared to a solitary model, will most certainly be a superior option
when it comes to prediction accuracy.
Mode: In statistical terminology, "mode" is the number or value that most often appears in a
dataset of numbers or values. In this ensemble technique, machine learning professionals use a
number of models for making predictions about each data point. The predictions made by
different models are taken as separate votes. Subsequently, the prediction made by most models
is treated as the ultimate prediction.
The Mean/Average: In the mean/average ensemble technique, data analysts take the average
predictions made by all models into account when making the ultimate prediction.
Let's take, for instance, one hundred people rated the beta release of your travel and tourism app
on a scale of 1 to 5, where 15 people gave a rating of 1, 28 people gave a rating of 2, 37 people
gave a rating of 3, 12 people gave a rating of 4, and 8 people gave a rating of 5.
The average in this case is [(1 * 15) + (2 * 28) + (3 * 37) + (4 * 12) + (5 * 8)] / 100 = 270 / 100 = 2.7
The Weighted Average: In the weighted average ensemble method, data scientists assign
different weights to all the models in order to make a prediction, where the assigned weight
defines the relevance of each model. As an example, let's assume that out of 100 people who
gave feedback for your travel app, 70 are professional app developers, while the other 30 have no
experience in app development. In this scenario, the weighted average ensemble technique will
give more weight to the feedback of app developers compared to others.
The three main classes of ensemble learning methods are bagging, stacking, and boosting, and
it is important to both have a detailed understanding of each method and to consider them on
your predictive modeling project.
Bagging involves fitting many decision trees on different samples of the same dataset and
averaging the predictions.
Stacking involves fitting many different model types on the same data and using another model to learn how best to combine their predictions.
Boosting involves adding ensemble members sequentially that correct the predictions made by
prior models and outputs a weighted average of the predictions.
In summary, boosting works by biasing the training data toward the examples that are hard to predict, iteratively adding ensemble members that correct the predictions of prior models, and combining the predictions using a weighted average of the models.
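A minimal sketch contrasting a bagging ensemble with a simple "mode"-style voting ensemble (using scikit-learn and its built-in breast-cancer dataset; the choice of base models and the 50 estimators are illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# bagging: many decision trees fitted on different bootstrap samples of the same dataset
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# voting: different model types combined by majority vote (the "mode" of their predictions)
voting = VotingClassifier([("lr", LogisticRegression(max_iter=5000)),
                           ("dt", DecisionTreeClassifier(random_state=0))], voting="hard")

print(cross_val_score(bagging, X, y, cv=5).mean())     # average accuracy of the bagging ensemble
print(cross_val_score(voting, X, y, cv=5).mean())      # average accuracy of the voting ensemble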
A support vector machine (SVM) is a machine learning algorithm that uses supervised
learning models to solve complex classification, regression, and outlier detection problems
by performing optimal data transformations that determine boundaries between data
points based on predefined classes, labels, or outputs. SVMs are widely adopted across
disciplines such as healthcare, natural language processing, signal processing applications,
and speech & image recognition fields.
A computer’s ability to learn from data without explicit programming is called machine
learning.
Supervised Learning
Supervised learning refers to a data set with known outcomes. If it is unsupervised, there are no
known outcomes and you won’t have the categories or classes necessary for the machine to
learn.
There are two major types of machine learning algorithms in the supervised learning category: classification algorithms and regression algorithms.
What is SVM?
SVM is a type of classification algorithm that classifies data based on its features. An SVM
will classify any new element into one of the two classes.
Once you give it some inputs, the algorithm will segregate and classify the data and then create the outputs. When you feed it new data (for example, an unknown fruit), the algorithm will classify it into one of the learned classes, e.g., "apple" versus "orange".
Before separating anything using high-level mathematics, let’s look at an unknown value, which
is new data being introduced into the dataset without a predesignated classification.
You can draw several possible boundaries that separate the two groups, so you need to find the line of best fit that clearly separates them; the correct line will help you classify the new data point.
You can find the best line by computing the maximum margin from equidistant support vectors. Support vectors are the data points from each class that lie closest to the separating line; the best line is the one that maximizes the margin, i.e. the distance between the line and these closest points.
There are a couple of points at the top that are pretty close to one another, and similarly at the
bottom of the graph. Shown below are the points that you need to consider. The rest of the points
are too far away. The bowler points to the right and the batsman points to the left.
Mathematically, you can calculate the distance among all of these points and minimize that
distance. Once you pick the support vectors, draw a dividing line, and then measure the distance
from each support vector to the line. The best line will always have the greatest margin or
distance between the support vectors.
For instance, if you consider the yellow line as a decision boundary, the player with the new data
point is the bowler. But, as the margins don’t appear to be maximum, you can come up with a
better line.
Use other support vectors, draw the decision boundary between those, and then calculate the
margin. Notice now that the unknown data point would be considered a batsman.
If you look at the green decision boundary, the line appears to have a maximum margin
compared to the other two. This the boundary of greatest margin and when you classify your
unknown data value, you can see that it clearly belongs to the batsman's class. The green line
divides the data perfectly because it has the maximum margin between the support vectors. At
this point, you can be confident with the classification — the new data point is indeed a batsman.
The hyperplane with the maximum distance from the support vectors is the one you want. D+ denotes the shortest distance from the hyperplane to the closest positive point, and D- the shortest distance to the closest negative point; the margin is D+ + D-, and the optimal hyperplane maximizes this margin.
This problem set is two-dimensional, and because the two classes can be separated by a straight line (a linear decision boundary), it is called a linear SVM.
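A minimal sketch of a linear SVM (using scikit-learn on a small synthetic two-class dataset as a stand-in for the batsman/bowler example; the blob centres and C value are illustrative):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=6)   # two well-separated classes in 2 dimensions

svm = SVC(kernel="linear", C=1.0).fit(X, y)
print(svm.support_vectors_)                     # the points that define the maximum margin
print(svm.predict([[0.0, 0.0]]))                # classify a new, unknown data point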
DATA MINING
Data mining is one of the most useful techniques that help entrepreneurs, researchers, and
individuals to extract valuable information from huge sets of data. Data mining is also
called Knowledge Discovery in Database (KDD). The knowledge discovery process includes
Data cleaning, Data integration, Data selection, Data transformation, Data mining, Pattern
evaluation, and Knowledge presentation.
Data Mining is a process used by organizations to extract specific data from huge databases to
solve business problems. It primarily turns raw data into useful information.
Relational Database
A relational database is a collection of multiple data sets formally organized by tables, records, and columns, from which data can be accessed in various ways without having to reorganize the database tables. Tables convey and share information, which facilitates data searchability, reporting, and organization.
Data warehouses:
A Data Warehouse is the technology that collects the data from various sources within the
organization to provide meaningful business insights. The huge amount of data comes from
multiple places such as Marketing and Finance. The extracted data is utilized for analytical
purposes and helps in decision- making for a business organization. The data warehouse is
designed for the analysis of data rather than transaction processing.
Data Repositories:
The Data Repository generally refers to a destination for data storage. However, many IT
professionals utilize the term more clearly to refer to a specific kind of setup within an IT
structure. For example, a group of databases, where an organization has kept various kinds of
information.
Object-Relational Database:
An object-relational database combines a relational database with an object-oriented database model, so that data can be stored in tables and also handled as objects.
Transactional Database:
A transactional database refers to a database management system (DBMS) that has the potential
to undo a database transaction if it is not performed appropriately. Even though this was a unique
capability a very long while back, today, most of the relational database systems support
transactional database activities.
o Different data mining instruments operate in distinct ways due to the different algorithms
used in their design. Therefore, the selection of the right data mining tools is a very
challenging task.
o The data mining techniques are not perfectly precise, so they may lead to serious consequences in certain conditions.
Data interpretation refers to the process of taking raw data and transforming it into useful
information. This involves analyzing the data to identify patterns, trends, and relationships, and
then presenting the results in a meaningful way. Data interpretation is an essential part of data
analysis, and it is used in a wide range of fields, including business, marketing, healthcare, and
many more.
Data interpretation is critical to making informed decisions and driving growth in today's data-
driven world. With the increasing availability of data, companies can now gain valuable insights
into their operations, customer behavior, and market trends. Data interpretation allows businesses
to make informed decisions, identify new opportunities, and improve overall efficiency.
Descriptive Statistics
Descriptive statistics involve summarizing and presenting data in a way that makes it easy to
understand. This can include calculating measures such as mean, median, mode, and standard
deviation.
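A minimal sketch of these summary measures in Python (using pandas on a small made-up set of values):

import pandas as pd

scores = pd.Series([23, 25, 25, 27, 30, 31, 31, 31, 35, 40])   # a small illustrative dataset
print(scores.mean())                            # arithmetic mean
print(scores.median())                          # middle value
print(scores.mode().tolist())                   # most frequent value(s)
print(scores.std())                             # standard deviation (spread around the mean)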
Inferential Statistics
Inferential statistics involves making inferences and predictions about a population based on a
sample of data. This type of data interpretation involves the use of statistical models and
algorithms to identify patterns and relationships in the data.
Visualization Techniques
Visualization techniques involve creating visual representations of data, such as graphs, charts,
and maps. These techniques are particularly useful for communicating complex data in an easy-
to-understand manner and identifying data patterns and trends.
Data Reduction: The method of data reduction may achieve a condensed description of the
original data which is much smaller in quantity but keeps the quality of the original data.
INTRODUCTION:
Data reduction is a technique used in data mining to reduce the size of a dataset while still
preserving the most important information. This can be beneficial in situations where the
dataset is too large to be processed efficiently, or where the dataset contains a large amount of
irrelevant or redundant information.
There are several different data reduction techniques that can be used in data mining,
including:
1. Data Sampling: This technique involves selecting a subset of the data to work with, rather
than using the entire dataset. This can be useful for reducing the size of a dataset while still
preserving the overall trends and patterns in the data.
2. Dimensionality Reduction: This technique involves reducing the number of features in the
dataset, either by removing features that are not relevant or by combining multiple features
into a single feature.
3. Data Compression: This technique involves using techniques such as lossy or lossless
compression to reduce the size of a dataset.
4. Data Discretization: This technique involves converting continuous data into discrete data
by partitioning the range of possible values into intervals or bins.
5. Feature Selection: This technique involves selecting a subset of features from the dataset
that are most relevant to the task at hand.
It's important to note that data reduction involves a trade-off between accuracy and the size of the data: the more the data is reduced, the less accurate and the less generalizable the resulting model may be.
In conclusion, data reduction is an important step in data mining, as it can help to improve the
efficiency and performance of machine learning algorithms by reducing the size of the dataset.
However, it is important to be aware of the trade-off between the size and accuracy of the data,
and carefully assess the risks and benefits before implementing it.
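As a concrete illustration of the dimensionality-reduction technique listed above, here is a minimal sketch using principal component analysis (PCA) in scikit-learn on its built-in iris dataset; keeping 2 components is an illustrative choice:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)               # 150 records with 4 original features
pca = PCA(n_components=2)                       # keep only 2 derived features
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                          # (150, 2): a much smaller representation
print(pca.explained_variance_ratio_)            # how much information each component preserves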
What is classification
The term "classification" is usually used when there are exactly two target classes called binary
classification. When more than two classes may be predicted, specifically in pattern recognition
problems, this is often referred to as multinomial classification. However, multinomial
classification is also used for categorical response data, where one wants to predict which
category amongst several categories has the instances with the highest probability.
Classification is one of the most important tasks in data mining. It refers to a process of assigning
pre-defined class labels to instances based on their attributes. Classification and clustering may look similar, but they are different: the major difference is that classification involves labeling items according to their membership in pre-defined groups. Let's understand this concept with the help of an example;
suppose you are using a self-organizing map neural network algorithm for image recognition
where there are 10 different kinds of objects. If you label each image with one of these 10
classes, the classification task is solved.
On the other hand, clustering does not involve any labeling. Assume that you are given an image
database of 10 objects and no class labels. Using a clustering algorithm to find groups of similar-
looking images will result in determining clusters without object labels.
These are given some of the important data mining classification methods:
The K-Nearest Neighbours method classifies a record by looking at the K most similar observations (its nearest neighbours) in the training data and assigning the class that is most common among them.
The Naive Bayes method scans the data set and, for each class, estimates how frequently the observed predictor values occur; it then applies Bayes' theorem, assuming the predictors are independent, to assign the most probable class to a new record.
Neural Networks resemble the structure of the brain, which is made up of neurons. The data passes through these networks and finally comes out as output. The method compares the different classifications; errors that occur in the classifications are rectified and fed back into the network, and this is a recurring process.
In the linear classification method, a linear function is built and used to predict the class of a variable from an observation with an unknown class.
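A minimal sketch of the K-Nearest Neighbours classifier (using scikit-learn and its built-in iris dataset; K = 5 is an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

knn = KNeighborsClassifier(n_neighbors=5)       # classify using the 5 nearest neighbours
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))                # share of test records classified correctly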
What is clustering
Clustering refers to a technique of grouping objects so that objects with the same functionalities
come together and objects with different functionalities go apart. In other words, we can say that
clustering is a process of partitioning a data set into a set of meaningful subclasses, known as clusters. Clustering is similar to classification in that data is grouped; however, unlike classification, the groups are not previously defined. Instead, the grouping is achieved by
determining similarities between data according to characteristics found in the real data. The
groups are called Clusters.
Methods of clustering
o Partitioning methods
o Hierarchical clustering
o Fuzzy Clustering
o Density-based clustering
o Model-based clustering
Classification:
o It uses algorithms to categorize new data according to the observations of the training set.
o In classification, there are labels for the training data.
o Its objective is to find which class a new object belongs to from the set of predefined classes.
Clustering:
o It uses statistical concepts in which the data set is divided into subsets with the same features.
o In clustering, there are no labels for the training data.
o Its objective is to group a set of objects and find whether there is any relationship between them.
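A minimal sketch of a partitioning clustering method, k-means (using scikit-learn on synthetic unlabeled data; the three clusters are an illustrative assumption):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=4)   # unlabeled data with 3 natural groups
kmeans = KMeans(n_clusters=3, n_init=10, random_state=4).fit(X)
print(kmeans.labels_[:10])                      # cluster assigned to the first ten objects
print(kmeans.cluster_centers_)                  # the centre of each cluster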
Association rule mining is a rule-based machine learning method for discovering interesting
relations between variables in large databases. It is intended to identify strong rules discovered in
databases using some measures of interestingness.
Association rule mining is a type of unsupervised machine learning that discovers interesting
relationships between variables in large datasets. It is a rule-based approach that finds association
rules, which are if-then statements that describe the relationship between two or more items.
Types Of Association Rules In Data Mining
There are typically four different types of association rules in data mining. They are
Multi-relational association rules
Generalized Association rule
Interval Information Association Rules
Quantitative Association Rules
Moving on to the next type of association rule, the generalized association rule is largely used
for getting a rough idea about the interesting patterns that often tend to stay hidden in data.
The quantitative association rule is actually one of the most unique of the four types. What sets it apart from the others is the presence of a numeric attribute in at least one attribute of the rule. This is in contrast to the generalized association rule, where the left and right sides consist of categorical attributes.
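A minimal sketch of the if-then idea behind association rules, computing the support and confidence of a hypothetical rule {bread} -> {butter} over a handful of made-up transactions:

# five illustrative shopping transactions
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

n = len(transactions)
support_rule = sum(1 for t in transactions if {"bread", "butter"} <= t) / n   # both items together
support_bread = sum(1 for t in transactions if "bread" in t) / n              # antecedent alone
confidence = support_rule / support_bread       # P(butter | bread)

print(support_rule)                             # 0.6 -> the itemset appears in 60% of transactions
print(confidence)                               # 0.75 -> 75% of bread buyers also buy butter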
Cause and effect modeling is done to uncover patterns in the data of business organizations.
There are many different types of causal patterns in the world. Below are six patterns that are
embedded in many concepts. Causality in the real world seldom falls into one neat pattern or
another. The patterns often work together or different parts of a system entail different patterns—
making the causality even more complex!
Linear Causality – Cause precedes effect; sequential pattern. Direct link between cause
and effect. Has a clear beginning and a clear ending. Effect can be traced back to one
cause. One cause and one effect; additional causes or effects turn this pattern into domino
causality
Domino Causality – Sequential unfolding of effects over time. An extended linear pattern
that results in direct and indirect effects. Typically has a clear beginning and a clear
ending. Can be branching where there is more than one effect of a cause (and these may
go on to have multiple effects and so on.). Branching forms can be traced back to “stem”
causes. Anticipating outcomes involves deciding how far to trace effects. Short-
sightedness can lead to unintended effects.
Cyclic Causality – One thing impacts another which in turn impacts the first thing (or
alternatively impacts something else which then impacts something else and so on, but
eventually impacts the first thing). Involves a repeating pattern. Involves feedback loops.
May be sequential or may be simultaneous. Typically no clear beginning or ending
(Sometimes you can look back in time to a beginning but often that results in the classic
‘which came first, the chicken or the egg’ problem.).
Spiraling Causality – One thing impacts another which in turn impacts the first thing (or alternatively impacts something else which then impacts something else and so on, but eventually impacts the first thing), with the effects amplifying or escalating as the cycle repeats.
A cause and effect analysis is an attempt to understand why things happen as they do. People in
many professions—accident investigators, scientists, historians, doctors, newspaper reporters,
automobile mechanics, educators, police detectives—spend considerable effort trying to
understand the causes and effects of human behavior and natural phenomena, so as to gain better control over events and over themselves. If we understand the causes of accidents, wars, and
natural disasters, perhaps we can avoid them in the future. If we understand the consequences of
our own behavior, perhaps we can modify our behavior in a way that will allow us to lead
happier, safer lives.
The next step is identifying the main causes of the problem, for example, the people, the procedures your business uses, and the materials or equipment involved. Here, you will need to do a lot of brainstorming to come up with as many possible causes as you can.
Cause and effect analysis uses brainstorming and critical analysis, by way of a visual representation, to enable problem-solving.
Data Simulation
Data simulation is the process of taking a large amount of data and using it to mimic real-world
scenarios or conditions. In technical terms, it could be described as the generation of random
numbers or data from a stochastic process which is stated as a distribution equation (e.g.,
Normal: X~N(μ, σ²)). It can be used to predict future events, determine the best course of action
or validate AI/ML models.
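A minimal sketch of this idea, assuming an arbitrary mean, standard deviation and sample size, is shown below in Python: it draws random numbers from the stated distribution X~N(μ, σ²) and checks that the simulated sample reproduces the chosen parameters.

# Minimal data-simulation sketch: generate random numbers from X ~ N(mu, sigma^2).
# The parameter values and sample size are assumptions chosen for illustration.
import numpy as np

rng = np.random.default_rng(seed=42)   # fixed seed so the run is reproducible
mu, sigma, n = 100.0, 15.0, 10_000     # mean, standard deviation, sample size

simulated = rng.normal(loc=mu, scale=sigma, size=n)

print(f"simulated mean  = {simulated.mean():.2f}")       # should be close to 100
print(f"simulated stdev = {simulated.std(ddof=1):.2f}")  # should be close to 15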
Software Development
A key part of developing any software is testing how it will perform under different conditions.
By creating data simulations that mimic real-world conditions, developers can put the software
through its paces and identify any potential problems. This process can be used to test everything
from the user interface to the backend algorithms.
Oil and Gas
Data simulation is increasingly being used in the oil and gas industry, too. By creating models of
reservoirs, geologists can better understand how oil and gas flow through rock and whether
they’re present in different geological strata. These models can be used to predict what will
happen when new wells are drilled, and they can help engineers design better production
facilities, too.
Companies and researchers also study the impact of environmental factors on the industry. By
simulating the effects of climate change, researchers gain better understanding of how rising
temperatures might affect the production of oil and gas.
Manufacturing
Data simulation is also being used to create “digital twins” which are virtual copies of physical
objects, such as a car or production factory. These models enable the study of real-world objects
and their operations without ever touching them. Manufacturers can easily identify the most
efficient and effective production process for a particular product, and avoid disruptions as they
transition to new methods.
Autonomous Vehicles
And of course, we can’t talk about data simulation without acknowledging its most high-profile
use case: the training of self-driving cars, drones and robots. Trying to test and train these
systems in the real world is slow, costly and dangerous. But with synthetic data, you can create
virtual training environments for improving these emerging technologies.
Industry use cases for a Monte Carlo simulation include the following:
Finance, such as risk assessment and long-term forecasting.
Project management, such as estimating the duration or cost of a project (see the sketch after this list).
Engineering and physics, such as analyzing weather patterns, traffic flow or energy
distribution.
Quality control and testing, such as estimating the reliability and failure rate of a product.
Healthcare and biomedicine, such as modeling the spread of diseases.
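The sketch below illustrates the project management use case under made-up assumptions (three sequential tasks whose durations follow triangular distributions); it is an example of the general idea, not a prescribed model. Because each run draws fresh random values, the output is a distribution of possible totals rather than a single-point estimate.

# Hedged Monte Carlo sketch for project duration; task names, distributions
# and parameters are assumptions made up for illustration.
import numpy as np

rng = np.random.default_rng(seed=7)
n_runs = 100_000

# Durations (in days) of three sequential tasks, each uncertain.
design  = rng.triangular(left=5,  mode=7,  right=12, size=n_runs)
build   = rng.triangular(left=10, mode=14, right=25, size=n_runs)
testing = rng.triangular(left=3,  mode=5,  right=10, size=n_runs)

total = design + build + testing

print(f"expected duration : {total.mean():.1f} days")
print(f"90th percentile   : {np.percentile(total, 90):.1f} days")
print(f"P(finish <= 30 d) : {(total <= 30).mean():.2%}")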
Discriminant event simulation is a type of simulation technique used in the field of operations
research and management science to model and analyze complex systems or processes that
involve both deterministic and random events. It is a combination of two techniques:
discriminant analysis and event simulation.
Discriminant analysis is a statistical technique used to classify observations into two or more
groups based on a set of predictor variables. In discriminant event simulation, the classification is
based on a set of decision rules or policies that determine the actions to be taken in response to
different events or scenarios.
Event simulation, on the other hand, is a technique used to model and simulate the behavior of a
system over time by representing its components as discrete events or processes. Events are
triggered by external or internal factors, and can lead to changes in the state of the system or the
occurrence of new events.
Carrying out a discriminant event simulation typically involves the following steps (a brief illustrative sketch follows the list):
Define the problem: Define the system or process to be modeled, and identify the decision rules or policies that will be used to classify observations and determine the actions to be taken.
Define the model: Represent the system as a set of components or processes, and define the
events that can occur and the rules that govern their occurrence.
Define the input data: Define the input data required for the simulation, including the
probabilities of different events and the values of the predictor variables.
Generate random samples: Generate a large number of random samples from the input data, and
simulate the behavior of the system over time for each sample.
Analyze the results: Analyze the results of the simulation to gain insights into the behavior of the
system and the performance of the decision rules or policies. This may involve statistical
analysis, visualization, or other techniques.
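The sketch below walks through these steps for a deliberately simple, made-up inventory example: a decision rule classifies each simulated day's state into an action, random demand events drive the system forward, and the outcomes are summarized at the end. All names, distributions and thresholds are assumptions chosen only for illustration, not a prescribed model.

# Illustrative sketch of the steps above; every name, distribution and
# threshold is a made-up assumption.
import numpy as np

rng = np.random.default_rng(seed=1)

# Steps 1-2: define the system and the decision rule (policy).
def decide(stock, expected_demand):
    """Classify the current state into an action to take."""
    return "reorder" if stock - expected_demand < 20 else "hold"

# Step 3: input data -- daily demand is random (Poisson with an assumed mean).
n_days, mean_demand, reorder_qty = 365, 12, 60

# Step 4: generate random samples and simulate the system over time.
stock, stockouts, reorders = 80, 0, 0
for demand in rng.poisson(lam=mean_demand, size=n_days):
    if decide(stock, mean_demand) == "reorder":
        stock += reorder_qty
        reorders += 1
    stock -= demand
    if stock < 0:            # demand exceeded stock: record a stockout
        stockouts += 1
        stock = 0

# Step 5: analyze the results of the simulation run.
print(f"reorders placed : {reorders}")
print(f"stockout days   : {stockouts}")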
Discriminant event simulation can be used in a wide range of applications, such as finance,
marketing, supply chain management, and health care, to model and analyze complex systems
that involve both deterministic and random events. For example, it can be used to model
customer behavior and evaluate marketing strategies, optimize inventory levels and supply chain
operations, or simulate the spread of infectious diseases and evaluate public health policies.
Advantages of discriminant event simulation include the following:
Quantitative analysis: Discriminant event simulation provides a quantitative way to analyze the behavior of a system or process and evaluate the performance of decision rules or policies.
Risk assessment: Discriminant event simulation can be used to assess the risks and uncertainties
associated with a system or process, and to identify the factors that contribute most to the
variability in the outputs.
Limitations of the technique include the following:
Input data requirements: Discriminant event simulation requires accurate and reliable input data to ensure that the results of the simulations are meaningful and relevant.