Steps of Analytics Process
[Figure labels: Objective of analytics; Process inputs; Build a relationship (Model); Utilize model result. Macro economic inputs: industry growth, inflation, unemployment rate.]
Correct Mismatch in Levels of Measurement: The data could be sourced
from various processes that are all measured at different levels. For
example, in a marketing Analytics study, sales data may be at regional
level while advertisement data is at the national level. Our preference
would be to model at the level of sales and hence, we will try to
disaggregate advertisement data into regional level. This will be the
usual approach rather than trying to summarize sales into national level.
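As a minimal sketch of the disaggregation step, the snippet below spreads a national advertisement figure across regions in proportion to an allocation key. The key used here (regional population), the function name `disaggregate` and all figures are illustrative assumptions; the text does not prescribe how the split should be made.

```python
def disaggregate(national_value, weights):
    """Split a national total across regions in proportion to `weights`."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return {
        region: national_value * weight / total_weight
        for region, weight in weights.items()
    }


# Hypothetical allocation key: regional population in millions.
population = {"North": 50, "South": 30, "East": 20}

# Hypothetical national advertisement spend for one period.
regional_ad_spend = disaggregate(1000.0, population)
print(regional_ad_spend)
```

The regional figures add back up to the national total, so the advertisement data can be joined to sales data at the regional level. In practice the choice of allocation key is a modelling judgement.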
Analytics Process 31
Summarization: Sometimes the data available will be from operational
systems that are too granular for analysis. Such data will be summarized at
weekly or monthly level before it is used. For example, the operational
system of a credit card business will be at transactional level, containing
many details about each transaction like amount, retailer, time, location,
etc. But for most of the analytics applications, what is required will be a
summary of the transactions at weekly or monthly level for each customer.
[Margin note: Usually data generated by operational systems are too granular for Analytics applications and would require summarization.]
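The summarization of credit card transactions to a monthly level per customer could be sketched as follows. The record layout (`customer_id`, `date`, `amount`, `retailer`) and the sample values are assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import date


def summarize_monthly(transactions):
    """Roll transaction-level records up to one row per customer per month."""
    summary = defaultdict(lambda: {"total_amount": 0.0, "n_transactions": 0})
    for txn in transactions:
        # Key each record by customer and calendar month (YYYY-MM).
        key = (txn["customer_id"], txn["date"].strftime("%Y-%m"))
        summary[key]["total_amount"] += txn["amount"]
        summary[key]["n_transactions"] += 1
    return dict(summary)


# Hypothetical transaction-level data from an operational system.
transactions = [
    {"customer_id": "C1", "date": date(2023, 1, 5), "amount": 20.0, "retailer": "A"},
    {"customer_id": "C1", "date": date(2023, 1, 19), "amount": 50.0, "retailer": "B"},
    {"customer_id": "C1", "date": date(2023, 2, 2), "amount": 10.0, "retailer": "A"},
    {"customer_id": "C2", "date": date(2023, 1, 7), "amount": 35.0, "retailer": "C"},
]

monthly = summarize_monthly(transactions)
for key in sorted(monthly):
    print(key, monthly[key])
```

A weekly summary would follow the same pattern with a week-based key, for example one built from `date.isocalendar()`.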
Exploratory Analysis
The objective of exploratory data analysis is to understand the nature of
the data that was collected for analysis. It usually results in an
intermediate presentation to the decision makers or the users of the results
(refer Myatt 2006 for detailed coverage). As the name indicates, it is for
developing an insight into the data and relationships before building the
final solution, usually a model. Input data was collected with certain
assumptions in mind and at this stage these are tested to examine if they
hold.
Exploratory analysis includes univariate and bivariate analyses. In the
case of univariate analysis, each variable is evaluated on its own. It will
cover measures of central tendency, variation and distribution. In the case
of bivariate analysis, each of the variables is analyzed against the
dependent variable (also pairs of independent variables chosen judiciously).
This analysis will help us to have an early indication of important
variables and the type of relationships (linear, different types of
non-linearity, etc.) that exist.
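A minimal sketch of the univariate and bivariate checks described above, using only the Python standard library. The variable names and values are hypothetical, and the Pearson correlation used for the bivariate step only captures linear association, so a plot would normally accompany it to spot non-linear patterns.

```python
import math
import statistics


def univariate_summary(values):
    """Central tendency and variation for a single variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical dataset: 'sales' is the dependent variable.
data = {
    "sales": [12.0, 15.0, 11.0, 18.0, 20.0, 17.0],
    "price": [9.0, 8.5, 9.5, 8.0, 7.5, 8.0],
    "ad_spend": [1.0, 1.5, 0.8, 2.0, 2.4, 1.9],
}
dependent = "sales"

# Univariate analysis: each variable on its own.
for name, values in data.items():
    print(name, univariate_summary(values))

# Bivariate analysis: each independent variable against the dependent one.
for name, values in data.items():
    if name != dependent:
        print(name, "vs", dependent, round(pearson(values, data[dependent]), 3))
```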
Most importantly, it results in a valuable dialogue with decision makers.
It will ensure that all variables are being considered and no variable is
omitted inadvertently. It helps to get feedback from them on certain
behaviour of the data which is not explained otherwise. For example,
while analysing sales for a snack item, analysts were surprised to notice a
dip in sales during holiday season. The brand manager clarified that a
sizeable proportion of the consumption of this brand was as snacks for
school students and hence, sales were impacted during holidays. Such
insights will be useful as the study moves to the next stage.
[Margin note: The objective of descriptive Analytics is to understand the nature of data and interrelationships.]
In general, the findings will make intuitive sense to decision makers as
they are living in this environment with intimate knowledge of the details.
Still, some of the findings will surprise them too. This stage sets the
background for the final solution and they are prepared on what to expect.
It also will convince them that the analysis is progressing well and the
final solution will meet the expectations.
Build a Relationship
The most critical step in the whole process is to build a relationship
between dependent and independent variables. This relationship could be
in the nature of a mathematical model built using statistical or operations
research methods. In most of the Analytics exercises, this stage will be
there, but there can be initiatives where it may not move to this stage.
For such studies, the insight generated at the exploratory stage could be
enough to answer the questions and take decisions.
[Margin note: Most of the Analytics initiatives involve building a mathematical model connecting dependent and independent variables.]
In the statistical approach, it could be a linear or non-linear model. This
is based on the nature of the dependent variable (continuous or
categorical) and the nature of the underlying relationship. If the approach
is operations research, it will be a linear or nonlinear optimization
model. This stage is critical as the quality of the final solution is based
on the outcome of this stage. Depending on the approach, this stage itself
will go through a number of well-defined steps.
In the statistical modelling approach it usually passes through the
following steps. We will consider each of the variables and make it ready
for modelling. If it is a quantitative variable, usually it doesn't require
any treatment. However, if the relationship is not meeting the fundamental
assumption of the technique, it may require transformation. For example,
when we apply linear regression to build models, the assumption is a
linear relationship between dependent and independent variable. In case
the relationship is not so, the independent variable will require a
treatment to make it linear (for example, take log). Similarly, if the
nature of the relationship is highly irregular, it doesn't make sense to
use it in the original form. In such cases, we may convert the variable
into categories. Figure 2.5 provides an overview of these steps.
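The two treatments for a quantitative variable mentioned above, a log transformation and conversion into categories, can be sketched as below. The data are synthetic and constructed so that the dependent variable is linear in log(x); the cut points and labels in `to_category` are arbitrary assumptions.

```python
import math
from bisect import bisect_right


def pearson(x, y):
    """Pearson correlation, used here as a rough check of linearity."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def to_category(value, cut_points, labels):
    """Map a numeric value to a category label using sorted cut points."""
    return labels[bisect_right(cut_points, value)]


# Synthetic example: y depends linearly on log(x), not on x itself.
x = [1.0, 10.0, 100.0, 1000.0, 10000.0]
y = [2.0 + 3.0 * math.log(v) for v in x]
log_x = [math.log(v) for v in x]

print("correlation with x:     ", round(pearson(x, y), 3))
print("correlation with log(x):", round(pearson(log_x, y), 3))

# When the relationship is too irregular, fall back to categories.
cut_points = [30, 60]                 # hypothetical boundaries
labels = ["low", "medium", "high"]    # one more label than cut points
print([to_category(v, cut_points, labels) for v in (10, 45, 75)])
```

The correlation with log(x) is essentially 1 here only because the data were built that way; with real data the improvement after transformation is usually partial.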
In case of qualitative variables, a conversion to quantitative form is
required before they can be used. A categorical variable will be converted
to binary variables if the number of categories is not too large. If there
are a large number of categories, as with zipcodes, it will require some
processing to reduce the number. For example, zipcodes can be mapped to
states or regions and then these can be made into binary variables.
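A minimal sketch of this conversion: zipcodes are first mapped to a coarser category (state) and the result is turned into binary variables. The lookup table `ZIP_TO_STATE` is a tiny illustrative sample, and the function name `to_binary_variables` is hypothetical.

```python
# Illustrative lookup only; a real mapping would come from a reference table.
ZIP_TO_STATE = {"10001": "NY", "94105": "CA", "90001": "CA"}


def to_binary_variables(values, prefix):
    """Create one 0/1 variable per distinct category in `values`."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]


zipcodes = ["10001", "94105", "90001", "10001"]

# Step 1: reduce the number of categories (zipcode -> state).
states = [ZIP_TO_STATE[z] for z in zipcodes]

# Step 2: convert the reduced categorical variable to binary variables.
binary_rows = to_binary_variables(states, "state")
for zipcode, row in zip(zipcodes, binary_rows):
    print(zipcode, row)
```

Whether one of the binary variables should be left out as a reference category depends on the modelling technique chosen later.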
It is quite usual to have a large number of variables at the modelling
stage and hence, the effort required at this stage is significant. By
choosing an appropriate modelling approach, this can be reduced. Usually it
falls under the umbrella of selection processes (forward, backward, etc.).
Depending on the modelling objective, a forward, backward or stepwise
approach may be chosen. It rarely happens that a final model is developed
in the first attempt. Usually, a number of models are developed and tested
before finalizing. For models that are focused on accuracy, the decision
about the best model can be made more easily as it involves picking the
model with the lowest error.
[Margin note: Models are validated to ensure that performance is not compromised when it is implemented in a different environment.]
Model Validation: The objective of validation is to ensure that the
developed model retains its predictive ability outside the environment
where it is developed. This concern is raised as the modelling approach can
be prone to a phenomenon called 'overfitting'. This refers to a situation
when the model fits the data which it is modelling 'too well' and
accommodates the 'quirks' of the data. Performance of the model on this
data will turn out to be good, while it fails miserably outside this data
when it is applied in practice. The following are some of the specifics
that could lead to this situation.
Fig. 2.5 Data Processing Steps
[Flowchart labels: Consider variables one by one; Is variable quantitative?; Quantitative variable: check and transform; Qualitative variable: create binary variables; Pool variables to make master set; Choose modelling approach; Finalize model]
• It is common to have a large number of variables under consideration.
  As more variables are added, the chance of Type I error (error whereby
  a variable is erroneously included in the model) increases. This also
  reduces the number of degrees of freedom and the chance of overfitting
  increases.
• It is especially critical when the data is non-stationary. That is, the
  underlying distribution of the data is changing over time. For example,
  consider the sales of a product in the growth mode. Sales when the model
  is going to be applied could be very different from when it was modelled.
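The first point can be illustrated with a small simulation: candidate variables that are pure noise are screened against a target, and the number that pass a 5 per cent significance test is counted. This is a simplified stand-in that tests each variable by its simple correlation with the target rather than fitting a regression. The critical t value of 1.984 (two-sided, 98 degrees of freedom) is an approximation, and all names are hypothetical.

```python
import math
import random


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def critical_r(t_critical, n_obs):
    """Correlation threshold equivalent to a t test with n_obs - 2 d.f."""
    df = n_obs - 2
    return t_critical / math.sqrt(t_critical ** 2 + df)


def count_spurious(n_obs, n_noise_vars, r_crit, seed=0):
    """Count noise variables that look 'significant' purely by chance."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    spurious = 0
    for _ in range(n_noise_vars):
        noise = [rng.gauss(0, 1) for _ in range(n_obs)]  # unrelated to target
        if abs(pearson(noise, target)) > r_crit:
            spurious += 1
    return spurious


N_OBS = 100
T_CRIT_5PCT = 1.984  # approximate two-sided 5% value for 98 degrees of freedom
r_crit = critical_r(T_CRIT_5PCT, N_OBS)

for n_vars in (10, 50, 200):
    print(n_vars, "candidate variables ->",
          count_spurious(N_OBS, n_vars, r_crit), "spurious inclusions")
```

Since roughly 5 per cent of unrelated variables are expected to pass the test, the count of erroneous inclusions tends to grow with the number of candidates considered.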
Although the ultimate test of validity is the performance of the model in
practice, we could try and control factors like overfitting. The usual
approach is to keep a part of the modelling dataset (titled hold out sample)
aside to test the model. Typical industry practice is to keep about 25 per
cent while ensuring that the sample size is enough to provide a reliable
estimate. Note that the size of the hold out sample will reduce the data
available for development. Figure 2.6 below shows the steps involved in
the process.
[Margin note: To validate a model, the dataset is divided into Training and Validation samples. The model is developed using Training sample and tested on validation sample.]
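A random split into development and hold out samples could be sketched as below. The 25 per cent fraction follows the industry practice mentioned in the text; the function name, the fixed seed and the placeholder records are assumptions for illustration.

```python
import random


def split_holdout(records, holdout_fraction=0.25, seed=42):
    """Randomly split records into (development, holdout) samples."""
    if not 0 < holdout_fraction < 1:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(records)               # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Hypothetical modelling sample of 100 records (identified here by index).
modelling_sample = list(range(100))
development, holdout = split_holdout(modelling_sample)

print(len(development), "development records")
print(len(holdout), "hold out records")
```

The model would then be built on `development` only, with `holdout` reserved for checking that performance carries over to data the model has not seen.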
The process starts with splitting the modelling sample into development
and hold out samples randomly. These datasets are also called 'Training'
and 'Test'. The model is developed using the development dataset and
Fig. 2.6 Hold Out Sample Validation Process Steps
[Flowchart labels: Total modelling sample (100%); Split data randomly into development and holdout; Yes: Proceed to next step]
Fig. 2.7 Out of Time Validation