
DataMining and Analytics

Unit V
Time Series
Stationary Time series
https://www.youtube.com/watch?v=OUiBqhvT_r0
Exploratory Data Analysis
• Exploratory data analysis (EDA) was originally developed by the American mathematician John Tukey in the 1970s.
• It helps determine how best to manipulate data sources to get the answers you need, making it easier for
data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.
• EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing
task and provides a better understanding of data set variables and the relationships between them.
• It can also help determine if the statistical techniques you are considering for data analysis are
appropriate.
• Purpose of EDA :
• To help look at data before making any assumptions.
• To help identify obvious errors, as well as better understand patterns within the data, detect outliers or
anomalous events, find interesting relations among the variables.
• Data scientists can use exploratory analysis to analyze and investigate data sets and summarize their
main characteristics, often employing data visualization methods.
• It also helps ensure that the results produced are valid and applicable to the desired business outcomes and goals.
• EDA helps stakeholders by confirming they are asking the right questions. It can help answer questions
about standard deviations, categorical variables, and confidence intervals.
• Once EDA is complete and insights are drawn, its features can then be used for more sophisticated data
analysis or modeling, including machine learning.
EDA tools

Specific statistical functions and techniques used:


• Clustering and dimension reduction techniques, which help create graphical displays of
high-dimensional data containing many variables.
• Univariate visualization of each field in the raw dataset, with summary statistics.
• Bivariate visualizations and summary statistics that allow you to assess the relationship
between each variable in the dataset and the target variable you’re looking at.
• Multivariate visualizations, for mapping and understanding interactions between different
fields in the data.
• K-means Clustering is commonly used in market segmentation, pattern recognition, and
image compression.
• Predictive models, such as linear regression, use statistics and data to predict outcomes.
Types of EDA
• Univariate non-graphical.
• This is the simplest form of data analysis, where the data being analyzed consists of just one variable.
• The purpose is to describe the data and find patterns that exist within it.
• Since it’s a single variable, it doesn’t deal with causes or relationships.
• Univariate graphical. Non-graphical methods don’t provide a full picture of the data. Graphical methods are therefore
required. Common types of univariate graphics include:
• Stem-and-leaf plots, which show all data values and the shape of the distribution.
• Histograms, a bar plot in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of
values.
• Box plots, which graphically depict the five-number summary of minimum, first quartile, median, third quartile, and maximum.
• Multivariate nongraphical:
• Multivariate data arises from more than one variable. Multivariate non-graphical EDA techniques generally show the relationship between two or more
variables of the data through cross-tabulation or statistics.
• Multivariate graphical:
• Multivariate graphical EDA uses graphics to display relationships between two or more sets of data. The most used graphic is a grouped bar plot or bar chart with
each group representing one level of one of the variables and each bar within a group representing the levels of the other variable.
• Scatter plot, which is used to plot data points on a horizontal and a vertical axis to show how much one variable is affected by another.
• Multivariate chart, which is a graphical representation of the relationships between factors and a response.
• Run chart, which is a line graph of data plotted over time.
• Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a two-dimensional plot.
• Heat map, which is a graphical representation of data where values are depicted by color.
Exploratory Data Analysis Tools

• Python:
• An interpreted, object-oriented programming language with dynamic semantics. Its
high-level, built-in data structures, combined with dynamic typing and dynamic
binding, make it very attractive for rapid application development, as well as for use
as a scripting or glue language to connect existing components together.
• Python and EDA can be used together to identify missing values in a data set, which
is important so you can decide how to handle missing values for machine learning.
• R:
• An open-source programming language and free software environment for statistical
computing and graphics supported by the R Foundation for Statistical Computing.
The R language is widely used among statisticians in data science in developing
statistical observations and data analysis.
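As a minimal illustration of the EDA workflow described above, here is a hedged Python sketch using pandas and matplotlib. The file name data.csv and the columns 'value' and 'group' are hypothetical placeholders, not part of the original material.

# Minimal EDA sketch (assumes a hypothetical data.csv with columns 'value' and 'group')
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")

# Univariate non-graphical EDA: summary statistics and missing-value counts
print(df.describe(include="all"))
print(df.isna().sum())            # how many missing values per column

# Univariate graphical EDA: histogram and box plot of a numeric field
df["value"].plot(kind="hist", bins=30, title="Histogram of value")
plt.show()
df.boxplot(column="value", by="group")   # five-number summary per group
plt.show()

# Bivariate EDA: correlation matrix of the numeric fields
print(df.select_dtypes("number").corr())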
Time Series Analysis

• Time series analysis amounts to forecasting a variable using only its past values
• This is called an autoregressive model
• We focus on the application rather than on the estimation calculations, because the
estimation is simply OLS
• Simetar estimates TS models easily with a menu
and provides forecasts of the time series model
Time Series Analysis

• Time series (TS) analysis is a comprehensive forecasting methodology for univariate distributions
• Best for variables unrelated to other variables in a
structural model
• Autoregressive process in the simplest form is a
regression model such as:
Ŷt = f(Yt-1, Yt-2, Yt-3, Yt-4, … )
Notice there are no structural variables, just lags of
the variable itself
Time Series Analysis

• Steps for estimating a time series model:
• Graph the data series to see what patterns are present
• trend, cycle, seasonal, and random
• Test the data for stationarity with a Dickey-Fuller (D-F) Test
• If the original series is not stationary, difference it until it is
• The number of differences (p) needed to make a series stationary is
determined using the D-F Test
• Use the stationary (differenced) data series to
determine the number of Lags that best forecasts the
historical period
• Given a stationary series, use the Schwarz Criteria (SIC) or
autocorrelation table to determine the best number of lags (q)
to include when estimating the model
• Estimate the AR(p,q) Model with OLS and make a
forecast using the Chain Rule
Two Time Series Analysis Assumptions

• Assume the series you are forecasting is really random
• No trend, seasonal, or cyclical pattern remains in the series
• The series is stationary or “white noise”
• To guarantee this assumption we must make the data stationary
• Assume the variability is constant, i.e., the same for the future as for the past; in other words
σ²T+i = σ²Historical
• Can test for constant variance by testing correlation over time
• This is a crucial assumption because if σ² changes over time, the forecast will explode over time
Test for Stationary in Timeseries Data
Unit Root Test
• A unit root test checks whether a time series is non-stationary because it contains a unit root. The
presence of a unit root is the null hypothesis, and the alternative hypothesis is that the series is
stationary.

• Mathematically, the series being tested can be written as
yt = Dt + zt + εt
• Where:
• Dt is the deterministic component.
• zt is the stochastic component.
• εt is the stationary error process.
• The unit root test’s basic concept is to determine whether the stochastic component zt contains
a unit root or not.
Unit Root Test

A simple AR(1) model can be represented as:

yt = ρ yt-1 + ut

where

yt is the variable of interest at time t
ρ is a coefficient that defines the unit root
ut is noise, or can be considered an error term.
If ρ = 1, a unit root is present in the time series, and the time series is non-stationary.

The same regression can be written in differenced form as

Δyt = δ yt-1 + ut

where Δ is the difference operator and δ = ρ − 1.
So here, if ρ = 1 (δ = 0), the first difference reduces to the error term (a random walk), and if the coefficient is smaller or larger
than one, the changes in the series depend on the past observation.
The Dickey-Fuller test
• Consider a stochastic process of the form
yi = φ yi-1 + εi
• where |φ| ≤ 1 and εi is white noise. If |φ| = 1, we have what is called a unit root. In particular, if φ = 1, we have a random
walk (without drift), which is not stationary. In fact, if |φ| = 1, the process is not stationary, while if |φ| < 1, the process is
stationary.
• Consider the case where |φ| > 1: in this case the process is called explosive and increases over time.
• This process is a first-order autoregressive process, AR(1).

• The Dickey-Fuller test is a way to determine whether the above process has a unit root. The approach used is quite
straightforward. First calculate the first difference, i.e.
Δyi = yi – yi-1 = (φ – 1) yi-1 + εi

• If we use the delta operator, defined by Δyi = yi – yi-1, and set β = φ – 1, then the equation becomes the linear regression
equation
Δyi = β yi-1 + εi

• where β ≤ 0, and so the test for φ is transformed into a test that the slope parameter β = 0. Thus, we have a one-tailed test
(since β can’t be positive) where
• H0: β = 0 (equivalent to φ = 1)
• H1: β < 0 (equivalent to φ < 1)
• Under the alternative hypothesis, if b is the ordinary least squares (OLS) estimate of β, so that φ̂ = 1 + b is the OLS
estimate of φ, then for large enough n the resulting test statistic follows the tau distribution described below.
The Dickey-Fuller test

• We can use the usual linear regression approach, except that when the null hypothesis holds the t coefficient
doesn’t follow a normal distribution and so we can’t use the usual t-test. Instead, this coefficient follows a tau
distribution, and so our test consists of determining whether the tau statistic τ (which is equivalent to the usual
t statistic) is less than τcrit based on a table of critical tau statistics values shown in Dickey-Fuller Table.
• If the calculated tau value is less than the critical value in the table of critical values, then we have a significant
result (we reject the null hypothesis); otherwise, we fail to reject the null hypothesis that there is a unit root and the time series is not stationary.

There are the following three versions of the Dickey-Fuller test:

Type 0 (no constant, no trend): Δyi = β1 yi-1 + εi
Type 1 (constant, no trend): Δyi = β0 + β1 yi-1 + εi
Type 2 (constant and trend): Δyi = β0 + β1 yi-1 + β2 i + εi
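A sketch of running the Dickey-Fuller test in Python with statsmodels follows. The random-walk `series` is simulated for illustration; `regression="c"` corresponds to the constant-only version, while "ct" adds a trend and "n" uses neither, and `maxlag=0` with `autolag=None` gives the simple (non-augmented) DF test.

# Dickey-Fuller test sketch using statsmodels
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=200))   # a random walk: should NOT look stationary

result = adfuller(series, maxlag=0, autolag=None, regression="c")
stat, pvalue, usedlag, nobs, crit = result[:5]
print("DF statistic:", stat)
print("p-value:", pvalue)
print("critical values:", crit)   # compare the statistic with, e.g., the 5% value (about -2.9)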
First Make the Data Series Stationary

• Take differences of the data to make it stationary


• First difference of raw data in Y to calculate D1,t
D1,t = Yt – Yt-1
• Calculate the Second Difference of Y using the first
difference (D1,t ) or
D2,t = D1,t – D1,t-1
• Calculate the Third Difference of Y using the
second difference (D2,t) or
D3,t = D2,t – D2,t-1
• Stop differencing data when series is stationary
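The differencing scheme above can be reproduced directly with pandas; a sketch, where the first few values of `y` are taken from the example table on the next slide and any further values would come from the real data set:

# Differencing sketch: build D1, D2, D3 from a raw series y
import pandas as pd

y = pd.Series([71.06, 71.47, 70.06, 70.31])   # values from the example below

d1 = y.diff()        # D1,t = Yt - Yt-1
d2 = d1.diff()       # D2,t = D1,t - D1,t-1
d3 = d2.diff()       # D3,t = D2,t - D2,t-1

print(pd.DataFrame({"Y": y, "D1": d1, "D2": d2, "D3": d3}))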
Make Data Series Stationary

• Example difference table for a time series data set
• D1t in period 1 is 0.41= 71.47 – 71.06
• D1t in period 2 is -1.41= 70.06 – 71.47
• D2t in period 1 is -1.82 = -1.41 – 0.41
• D2t in period 2 is 1.66 = 0.25 – (– 1.41)
• Etc.
Test for Stationarity

• Dickey-Fuller Test for stationarity
• First D-F test: is the original data series stationary?
• Estimate the regression Ŷ = a + b D1,t
• Calculate the t-ratio for b to see if a significant slope exists.
• The D-F test statistic is this t-statistic; if it is more negative than -2.90, we
declare the dependent series (Ŷ in this case) stationary at the alpha =
0.05 level (-2.90 is a one-tailed t-test critical value)
• Thus a D-F statistic of -3.0 is more negative than -2.90 so the
original series is stationary and differencing is not necessary
Next Level of Testing for Stationarity

• Second D-F test: is the D1,t series stationary?
• In other words, will the Y series be stationary after only one differencing?
• Estimate regression for
D1,t = a + b D2,t
• t-statistic on slope b is the second D-F test statistic
• Is the statistic more negative than -2.90?
Next Level of Testing for Stationarity

• Third D-F test: is the D2,t series stationary?
• In other words, will the Y series be stationary with two differences?
• Estimate regression for
D2,t = a + b D3,t
• t-statistic on slope b is the third D-F test statistic

• Continue with a fourth and fifth DF test if


necessary
Test for Stationarity
• How does D-F test work?
• A series is stationary if it oscillates about its mean and has a slope of zero, i.e., the differenced
series has mean zero: E(D1,t) = 0 for t = 1, 2, …, T
• If a one-period lag Yt-1 is used to predict Yt, they are good predictors of each other because they
have the same mean. The t-statistic on b will be significant for
D1,t = a + b D2,t
• Lagging the data one period causes the two series to be inversely related, so the slope is negative
in the OLS equation. That explains why the critical value of -2.90 is negative; -2.90 is the 5%
one-tailed test critical value.
Test for Stationarity

• Estimated regression for: Ŷt = a + b D1,t


• DF is -1.868. You can see it is the t statistic for the beta on D1
• See a trend in original data and no trend in the residuals
• Intercept not zero so mean of Y is not constant
Test for Stationarity

• Estimated regression for: D1,t = a + b D2,t


• DF is -12.948, which is the t ratio on the slope parameter “b”
• See the residuals oscillate about a mean of zero, no trend in either series
• Intercept is 0.121 or about zero, so the mean is more likely to be constant
Test for Stationarity

• Estimated regression for: D2,t = a + b D3,t


• DF is -24.967 the t ratio on the slope variable
• Intercept is about zero
DF Stationarity Test in Simetar

• Dickey-Fuller (DF) function in Simetar


=DF ( Data Series, Trend, No Lags, No. Diff to Test)
Where: Data series is the location of the data,
Trend is True or False for Augmented DF test
No. Lags is set to ZERO, always
No. Diff is the number of differences to test
Test for Stationarity

• The number of differences that makes the data series stationary is the one with a DF test
statistic more negative than -2.9
• Here it is one difference, whether or not a trend is included

Lecture 5
Forecasting Tool for Time Series Data
• Moving Average
• Exponential Smoothing
• Holt Exponential Smoothing
• Holt-Winters Exponential Smoothing

• Weighted Averages:
• A weighted average is simply an average of n numbers where each number is given a certain weight and the denominator is the sum of those n weights.
• The weights are often assigned according to some weighting function, such as logarithmic, linear, quadratic, cubic or exponential.
• Averaging as a time series forecasting technique has the property of smoothing out the variation in the historical values while calculating the forecast.
• By choosing a suitable weighting function, the forecaster determines which historical values should be given emphasis when calculating future values of the time series.

• Exponential Smoothing (ES)


• The technique forecasts the next value using a weighted average of all previous values where the weights decay exponentially from the most recent to the oldest
historical value.
• Crucial assumption made here is that recent values of the time series are much more important to you than older values.
• Shortcomings:
• It cannot be used when your data exhibits a trend
• It cannot be used when your data exhibits seasonal variation

• Holt Exponential Smoothing:


• Fixes one of the two shortcomings of the simple ES technique, i.e., it can be used to forecast time series data that has a trend.
• Holt ES fails in the presence of seasonal variations in the time series.

• Holt-Winters Exponential Smoothing:


• The Holt-Winters ES modifies the Holt ES technique so that it can be used in the presence of both trend and seasonality.
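A sketch of the three smoothing methods using statsmodels: the monthly series `y` is simulated for illustration, and the smoothing parameters are estimated automatically by `.fit()`.

# Smoothing-method sketch on a simulated monthly series
import numpy as np, pandas as pd
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, Holt, ExponentialSmoothing

rng = np.random.default_rng(1)
t = np.arange(120)
y = pd.Series(10 + 0.1 * t + 2 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.5, 120))

ses = SimpleExpSmoothing(y).fit()                     # level only: no trend, no seasonality
holt = Holt(y).fit()                                  # Holt: level + trend
hw = ExponentialSmoothing(y, trend="add", seasonal="add",
                          seasonal_periods=12).fit()  # Holt-Winters: level + trend + seasonality

print(ses.forecast(12).head())
print(holt.forecast(12).head())
print(hw.forecast(12).head())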
TREND
• A time series whose level changes in some sort of a pattern is said to have a trend.
• A time series whose level changes randomly around some mean value can be said to exhibit a random trend.
• Apart from knowing that the trend is random, the concept of trend is not so useful when it’s random, compared to one where
the trend can be modeled by some function.
• Commonly observed trends are linear, square, exponential, logarithmic, square root, inverse and 3rd degree or higher
polynomials which can be easily modeled using the corresponding mathematical function, namely, log(x), linear, x², exp(x)
etc.
• Highly non-linear trends require complex modeling techniques such as artificial neural networks to model them
successfully.
• Trend is a rate, or the velocity of the time series at a given level. This makes trend a vector that has a magnitude (rate of
change) and a direction (increasing or decreasing).
Seasonality
• Seasonality
• Many time series show periodic up and down movements around the
current level. This periodic up and down movement is called seasonality.
Here is an example of a time series demonstrating a seasonal pattern
• Noise
• Noise is simply the aspect of the time series data that you cannot (or do not
want to) explain.
• Level, Trend, Seasonality and Noise are considered to interact in an additive
or multiplicative manner to produce the final value of the time series that
you observe:
• Fully additive combination: yt = Level + Trend + Seasonality + Noise
• Multiplicative combination (with additive trend): yt = (Level + Trend) × Seasonality × Noise
Linear Time Series Model
• An ARMA model, or Autoregressive Moving Average model, is used to describe weakly stationary stochastic time
series in terms of two polynomials. The first of these polynomials is for autoregression, the second for the moving
average.
ARMA(p,q) model :

•p is the order of the autoregressive polynomial,


•q is the order of the moving average polynomial.
• The equation is given by:
yt = c + Σi=1..p (φi · yt-i) + Σj=1..q (θj · εt-j) + εt
• Where:
• φ = the autoregressive model’s parameters,
• θ = the moving average model’s parameters,
• c = a constant,
• Σ = summation notation,
• ε = error terms (white noise).
The AR and MA components are the familiar ones, combining a general autoregressive model AR(p) and a general moving
average model MA(q). AR(p) makes predictions using previous values of the dependent variable. MA(q) makes
predictions using the series mean and previous errors. What sets ARMA and ARIMA apart is differencing. An ARMA
model is a stationary model; If your model isn’t stationary, then you can achieve stationarity by taking a series of
differences. The “I” in the ARIMA model stands for integrated; It is a measure of how many non-seasonal differences
are needed to achieve stationarity. If no differencing is involved in the model, then it becomes simply an ARMA.
• An autoregressive integrated moving average, or ARIMA, is a
statistical analysis model that uses time series data to either better
understand the data set or to predict future trends.
• A statistical model is autoregressive if it predicts future values based
on past values. For example, an ARIMA model might seek to predict a
stock's future prices based on its past performance or forecast a
company's earnings based on past periods.
Linear Time Series Models
• ACF (Auto Correlation Function)
• The autocorrelation function takes into consideration all the past observations, irrespective of their effect on the present or future time
period.
• It calculates the correlation between the t and (t-k) time period. It includes all the lags or intervals between t and (t-k) time periods.
• Correlation is always calculated using the Pearson Correlation formula.
• PACF(Partial Correlation Function)
• The PACF determines the partial correlation between time period t and t-k.
• It doesn’t take into consideration all the time lags between t and t-k.
• For e.g. let's assume that today's stock price may be dependent on 3 days prior stock price but it might not take into consideration
yesterday's stock price closure.
• Hence we consider only the time lags having a direct impact on future time period by neglecting the insignificant time lags in
between the two-time slots t and t-k.
• Example: for a sweet stall, we would use the ACF to find the income generated in the future, but we would use the PACF to find out the
sweets sold in the next month.
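A sketch of how the ACF and PACF are typically inspected in Python; the `series` below is simulated for illustration, and the plotting and numeric helpers come from statsmodels.

# ACF / PACF sketch for a stationary series
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import acf, pacf

rng = np.random.default_rng(2)
series = rng.normal(size=300).cumsum()       # a random walk
stationary = np.diff(series)                 # difference once to make it stationary

print("first ACF values :", acf(stationary, nlags=5))
print("first PACF values:", pacf(stationary, nlags=5))

plot_acf(stationary, lags=24)    # correlation at every lag up to 24
plot_pacf(stationary, lags=24)   # "direct" correlation at each lag, intermediate lags removed
plt.show()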
Auto Regression Model
• The time period at t is impacted by the observation at various slots t-1, t-2, t-3, ….., t-k.
• The impact of previous time spots is decided by the coefficient factor at that particular period of time.
• The price of a share of any particular company X may depend on all the previous share prices in the time series.
• This kind of model calculates the regression of past time series and calculates the present or future values in the series.

Yt = β₁·Yt-1 + β₂·Yt-2 + β₃·Yt-3 + … + βₖ·Yt-k


The time period at t is impacted by the unexpected external factors at various slots t-1, t-2, t-3, ….., t-k.
These unexpected impacts are known as Errors or Residuals.

The impact of previous time spots is decided by the coefficient factor α at that particular period of time.

The price of a share of any particular company X may depend on some company merger that happened overnight or maybe the
company resulted in shutdown due to bankruptcy.

This kind of model calculates the residuals or errors of past time series and uses them to calculate the present or future values in the series; it is
known as the Moving Average (MA) model.

Yt = α₁·εt-1 + α₂·εt-2 + α₃·εt-3 + … + αₖ·εt-k


Auto Regressive Moving Average (ARMA) Model
• This is a model that is combined from the AR and MA models. In this
model, the impact of previous lags along with the residuals is
considered for forecasting the future values of the time series. Here β
represents the coefficients of the AR model and α represents the
coefficients of the MA model.
• Yt = β₁·Yt-1 + α₁·εt-1 + β₂·Yt-2 + α₂·εt-2 + β₃·Yt-3 + α₃·εt-3 + … + βₖ·Yt-k + αₖ·εt-k
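To make the ARMA equation concrete, here is a small simulation sketch using statsmodels' ArmaProcess; the AR and MA coefficient values (0.6 and 0.4) are arbitrary illustrative choices.

# Simulate an ARMA(1,1) process: Yt = 0.6*Yt-1 + et + 0.4*et-1
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess

# statsmodels uses lag-polynomial form: ar = [1, -phi1], ma = [1, theta1]
ar = np.array([1, -0.6])
ma = np.array([1, 0.4])

process = ArmaProcess(ar, ma)
print("stationary:", process.isstationary, "invertible:", process.isinvertible)

y = process.generate_sample(nsample=500)
print(y[:5])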
• To apply the various models
• convert the series into Stationary Time Series. To achieve the same, we apply the differencing or
Integrated method where we subtract the t-1 value from t values of time series.
• After applying the first differencing if we are still unable to get the Stationary time series then we
again apply the second-order differencing.
• The ARIMA model is quite similar to the ARMA model other than the fact that it includes one more
factor known as Integrated( I ) i.e. differencing which stands for I in the ARIMA model.
• So, in short, the ARIMA model is a combination of the number of differences applied to make the series
stationary, the number of previous lags, and the residual errors, used together to forecast future values.
ARMA Model
• The ARIMA model takes in three parameters:
1. p is the order of the AR term
2. q is the order of the MA term
3. d is the number of differences
• Autoregressive (AR) and Moving Average (MA)
• The AR model only depends on past values (lags) to estimate future values.
• The value “p” determines how many past values will be taken into
account for the prediction.
• The higher the order of the model, the more past values are taken into
account; for example, AR(1) uses only the single most recent value.

The MA model can simply be thought of as the linear combination of q past forecast errors.
ARIMA Model
Step-by-step general approach of implementing ARIMA:
• Step 1:
• Load the dataset and plot the source data. (Check if the data has any seasonal patterns, cyclic
patterns, general trends)
• Dealing with missing values: ARIMA models don’t work on data that have NAs.
• Plotting the data*
• Step 2:
• Apply the Augmented Dickey Fuller Test (to confirm the stationarity of data)
• Implementation: adfuller()
• If the data is stationary, proceed with ARMA or ARIMA. (It’s your choice!)
• If the data is not stationary, proceed with ARIMA.(Because, data needs to be differenced to make it
stationary: ‘I’ component of ARIMA does this)
• Step 3:
• Run ETS Decomposition on the data (to check the seasonality in the data)
• Implementation: seasonal_decompose()
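A sketch of this decomposition step; the monthly series `y` with a DatetimeIndex is simulated for illustration.

# ETS decomposition sketch: split the series into trend, seasonal and residual parts
import numpy as np, pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2015-01-01", periods=120, freq="MS")
rng = np.random.default_rng(3)
y = pd.Series(50 + 0.3 * np.arange(120) + 5 * np.sin(2 * np.pi * np.arange(120) / 12)
              + rng.normal(0, 1, 120), index=idx)

result = seasonal_decompose(y, model="additive")   # or model="multiplicative"
print(result.trend.dropna().head())
print(result.seasonal.head())
print(result.resid.dropna().head())
result.plot()        # plots the observed, trend, seasonal and residual panels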
Seasonal ARIMA

• SARIMA: Seasonal ARIMA model has an extra set of parameters (P, D, Q)


• P: order of seasonal AR terms
• D: order of differencing to attain stationarity
• Q: order of seasonal MA terms
• The implementation of SARIMA is similar to that of ARIMA. Except in Step 4, where we obtain 3
values: p,d,q for ARIMA, here, we obtain 6 values: p,d,q,P,D,Q for SARIMA. We apply
SARIMA() for model-fitting and proceed with the same procedure as stated for ARIMA.
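A SARIMA fitting sketch using statsmodels' SARIMAX class; the simulated monthly series and the orders (1,1,1)(1,1,1,12) are illustrative choices, not tuned values.

# SARIMA sketch: (p,d,q) x (P,D,Q,m) with m = 12 for monthly data
import numpy as np, pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

idx = pd.date_range("2015-01-01", periods=120, freq="MS")
rng = np.random.default_rng(4)
y = pd.Series(50 + 0.3 * np.arange(120)
              + 5 * np.sin(2 * np.pi * np.arange(120) / 12)
              + rng.normal(0, 1, 120), index=idx)

model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
fit = model.fit(disp=False)
print(fit.summary())
print(fit.forecast(steps=12))    # forecast one seasonal cycle ahead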
Auto Regressive Models

• https://blog.paperspace.com/time-series-forecasting-autoregressive-
models-smoothing-methods/
• https://towardsdatascience.com/time-series-models-d9266f8ac7b0
• https://towardsdatascience.com/time-series-forecasting-with-
autoregressive-processes-ba629717401
• https://online.stat.psu.edu/stat501/lesson/14/14.1
References

• https://www.machinelearningplus.com/time-series/time-series-
analysis-python/
• https://www.kaggle.com/code/kashnitsky/topic-9-part-1-time-series-
analysis-in-python/notebook
Parameter Estimation

• MLE is a commonly used method of estimating the parameters of a statistical distribution.
• The core principle behind MLE is maximizing a notion of likelihood, which is the product of the
probabilities of the observations and is usually maximized as the sum of their log probabilities.
Parameter Estimation

1.Method of Moments
• One of the easiest methods of parameter estimation is the method of moments (MOM).
• The basic idea is to find expressions for the sample moments and for the population moments and equate them:
(1/n) Σᵢ Xᵢʳ = E(Xʳ)
• The E(Xʳ) expression will be a function of one or more unknown parameters.
• If there are, say, 2 unknown parameters, we would set up MOM equations for r = 1, 2, and solve these 2 equations
simultaneously for the two unknown parameters.
• In the simplest case, if there is only 1 unknown parameter to estimate then we equate the sample mean to the true mean of the
process and solve for the unknown parameter.
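For example, a sketch of the one-parameter case using the exponential distribution, where E(X) = 1/λ, so equating the sample mean to the population mean gives the MOM estimate λ̂ = 1 / x̄ (the true parameter value and sample size below are illustrative):

# Method-of-moments sketch: estimate lambda of an exponential distribution
import numpy as np

rng = np.random.default_rng(5)
true_lambda = 2.0
sample = rng.exponential(scale=1 / true_lambda, size=10_000)

# Equate the first sample moment (the mean) to the population mean 1/lambda
lambda_mom = 1.0 / sample.mean()
print("MOM estimate of lambda:", lambda_mom)   # should be close to 2.0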

Auto Covariance and Auto Correlation


• If the {Xn} process is weakly stationary, the covariance of Xn and Xn+k depends only on the lag k. This leads to the following definition
of the “autocovariance” of the process: γ(k) = cov(Xn+k, Xn)
• The autocorrelation function, ρ(k), is defined by
ρ(k) = γ(k) /γ(0)
• This is simply the correlation between Xn and Xn+k. Another interpretation of ρ(k) is the optimal weight for scaling Xn into a prediction of
Xn+k, i.e. the weight a that minimizes E[(Xn+k − aXn)²].
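A small sketch computing γ(k) and ρ(k) = γ(k)/γ(0) directly from their definitions, on simulated white noise:

# Sample autocovariance gamma(k) and autocorrelation rho(k) = gamma(k)/gamma(0)
import numpy as np

def autocovariance(x, k):
    x = np.asarray(x, dtype=float)
    n, mean = len(x), x.mean()
    # average of (x[t+k]-mean)*(x[t]-mean); dividing by n is the usual biased estimator
    return np.sum((x[k:] - mean) * (x[:n - k] - mean)) / n

rng = np.random.default_rng(6)
x = rng.normal(size=1000)

gamma0 = autocovariance(x, 0)
for k in range(4):
    print(f"gamma({k}) = {autocovariance(x, k):+.4f}   rho({k}) = {autocovariance(x, k) / gamma0:+.4f}")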
Parameter Estimation – Yule-Walker Estimation
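The original slide shows the Yule-Walker equations as images; as a hedged illustration, statsmodels provides a direct Yule-Walker estimator for AR coefficients. The AR(2) coefficients 0.6 and -0.3 below are illustrative.

# Yule-Walker estimation sketch: recover AR(2) coefficients from simulated data
import numpy as np
from statsmodels.regression.linear_model import yule_walker
from statsmodels.tsa.arima_process import ArmaProcess

true_phi = [0.6, -0.3]
process = ArmaProcess(ar=np.r_[1, -np.array(true_phi)], ma=[1])
y = process.generate_sample(nsample=2000)

rho_hat, sigma_hat = yule_walker(y, order=2, method="mle")
print("estimated AR coefficients:", rho_hat)   # should be close to [0.6, -0.3]
print("estimated noise std:", sigma_hat)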
Maximum likelihood estimation
• Once the model order has been identified (i.e., the values of p, d and q), we need to estimate the parameters c, φ1,…,φp, θ1,…,θq.
• When R estimates the ARIMA model, it uses maximum likelihood estimation (MLE).
• This technique finds the values of the parameters which maximize the probability of obtaining the data that we have observed.
• For ARIMA models, MLE is similar to the least squares estimates that would be obtained by minimizing the sum of squared errors Σt εt².
• (In many cases MLE gives exactly the same parameter estimates as least squares estimation.)
• Note that ARIMA models are much more complicated to estimate than regression models, and different software will give slightly different
answers as they use different methods of estimation, and different optimization algorithms.
• In practice, R will report the value of the log likelihood of the data; that is, the logarithm of the probability of the observed data coming from
the estimated model. For given values of p, d and q, R will try to maximize the log likelihood when finding parameter estimates.
Information Criteria
• Akaike’s Information Criterion (AIC), which was useful in selecting predictors for regression, is also useful for determining the order of an
ARIMA model.
• It can be written as
AIC = −2 log(L) + 2(p + q + k + 1)
• where L is the likelihood of the data, k = 1 if c ≠ 0 and k = 0 if c = 0. Note that the last term in parentheses is the number of parameters in the
model (including σ², the variance of the residuals).

Good models are obtained by minimising the AIC, AICc or BIC. Our preference is to use the AICc.
It is important to note that these information criteria tend not to be good guides to selecting the
appropriate order of differencing (d) of a model, but only for selecting the values of p and q.
This is because the differencing changes the data on which the likelihood is computed, making
the AIC values between models with different orders of differencing not comparable.
So we need to use some other approach to choose d, and then we can use the AICc to
select p and q.
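A sketch of choosing p and q by minimising the AIC for a fixed d (here d = 1), looping over a small grid; the simulated series and the grid size are illustrative.

# Choose (p, q) for fixed d = 1 by minimising the AIC
import itertools
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(7)
y = np.cumsum(rng.normal(size=300))      # integrated series, so d = 1 is sensible

best = None
for p, q in itertools.product(range(3), range(3)):
    try:
        fit = ARIMA(y, order=(p, 1, q)).fit()
        if best is None or fit.aic < best[0]:
            best = (fit.aic, p, q)
    except Exception:
        continue                          # skip orders that fail to converge

print("best AIC %.2f with ARIMA(%d,1,%d)" % best)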
Maximum Likelihood Estimation
Principles of Forecasting

• If we are interested in forecasting a random variable yt+h based on the observations of x up to
time t (denoted by X), we can have different candidates, denoted by g(X).
• If our criterion in picking the best forecast is to minimize the mean squared error (MSE), then the
best forecast is the conditional expectation, g(X) = E(yt+h | X).
• we assume that the data generating process is known (so parameters
are known), so we can compute the conditional moments.
Prescriptive Analytics
• Prescriptive Analytics is one of the steps of business analytics, including descriptive and predictive analysis.
• It suggests decision options to take advantage of the results of descriptive and predictive analytics.
• It can be utilized to find a solution among various variants, using different simulation and optimization techniques to indicate the path
that should be taken.
• With Prescriptive Analytics, companies can get smart recommendations to optimize the next steps in their strategy.
• Along with predictive analytics, prescriptive analytics help to create a more effective data-based strategy.
• Both predictive and prescriptive analytics are critical to making business decisions based on data.
• A prescriptive analysis is based on:
• Operations investigation
• Predictive Analysis
• Mathematical techniques and statistics
• Its application seeks to determine each assumption’s limitations based on the study of data and applying mathematical algorithms and
probabilistic techniques.
• It can be said that it is a learning process that adapts to obtain the best possible result in all real situations that must be faced.

With prescriptive analysis, it is possible for companies to make future decisions, such as:
• Calculate past sales of a product to determine the number of replacements.
• Know the tendency of customers in certain products to launch marketing campaigns, according to users’ needs.
• Predict equipment failures, which provides for maintenance at the right time.
• Know customers’ purchasing habits and punctuality of payment to determine whether it is appropriate to grant credit.
Benefits of Prescriptive analysis:
• Optimization of processes, campaigns, and strategies.
• Minimizes maintenance needs and interconnects them for better conditions.
• Reduce costs without affecting performance.
• It increases the likelihood that companies will approach and plan for internal growth properly.
• Qualitative research method — know the characteristics that distinguish it.
• Production optimization.
• Efficient supply chain management.
• Improved customer service and experience.
Forecasting using ARIMA
• 1. Download the rainfall (or any other) time series CSV dataset
2. Install dependencies:
- pip install statsmodels OR conda install statsmodels
- pip install patsy OR conda install patsy
• Plot how the rainfall data varies with time
• Also check for any autocorrelation that may occur in the dataset.
• There seems to be slight correlation when the lag time is short (0–5 days) and when it is sufficiently long (20–
25 days), but not for the intermediate values.
Implementing the ARIMA model in Python
• First, we need to import the ARIMA class from statsmodels (in current statsmodels versions it lives in
statsmodels.tsa.arima.model; the old statsmodels.tsa.arima_model module is deprecated).
• from statsmodels.tsa.arima.model import ARIMA
• Then, we define the model with these initial hyperparameters for p, d, q (as defined earlier in the What is
ARIMA? section).
• # fit model
• p,d,q = 5,1,0
• model = ARIMA(first_30['rainfall'], order=(p,d,q))
• model_fit = model.fit()
• Let’s get the result summaries.
• print(model_fit.summary())
• The summary reports the AIC, BIC and HQIC metrics; the lower these values are, the better the fit of the model is.
You can perform further hyperparameter tuning or data preprocessing to achieve better results.
Forecasting using ARIMA
• Forecasting Using ARIMA
• Let’s first expand our dataset to include 365 days instead of 30.
• data = df[:365]['rainfall'].values
• We then split the data into a train set (66%) and a test set (34%).
• train_size = int(len(data) * 0.66)
• train, test = data[0:train_size], data[train_size:len(data)]
• 2. And initialize the historical and prediction values for comparison purposes
• history = [x for x in train]
• predictions = list()
• 3. Now we train the model and make the future forecasts stored against the test data
• for t in range(len(test)):
•     model = ARIMA(history, order=(5,1,0))
•     model_fit = model.fit()
•     pred = model_fit.forecast()
•     yhat = pred[0]
•     predictions.append(yhat)
•     # Append the test observation into the overall record
•     obs = test[t]
•     history.append(obs)
• 4. Let’s evaluate our performance
• from sklearn.metrics import mean_squared_error
• from math import sqrt
• rmse = sqrt(mean_squared_error(test, predictions))
• print('Test RMSE: %.3f' % rmse)
• >>> Test RMSE: 20.664
Analytics Techniques
1. Descriptive analysis
This type focuses on summarizing data from the past. It is widely used to track KPIs. Data is visualized in dashboards or reports
and updated continuously, daily, weekly or monthly. This is the easiest type of analysis. You extract data from a database and you
can start visualizing.
2. Diagnostic analysis
To dig deeper and to find out why things happened, diagnostic analysis comes in. This type of analysis takes the insights found
from descriptive analytics and drills down to find the causes of those outcomes. An example is a root cause analysis.
3. Predictive analysis
Predictive analysis wants to tell you something about the future and predicts what will likely happen. This is done using forecasting
or machine learning techniques.
4. Prescriptive analysis
Recommendations about ‘the best next thing to do’ falls under prescriptive analysis. Determining the course of action to take in the
current situation can be hard, but this is why prescriptive analysis has the potential to add most value to a business. It’s possible to
use AI or mathematical optimization here (besides other techniques).
• Google Self-driving cars, Waymo is a preferred example showing prescriptive analytics. It showcases millions of calculations on
every trip. The car makes its own decision to turn in whichever direction, to slow/speed up and even when and where to change
lanes- these acts are normal like any human being’s decision-making process while driving a car.
Mathematical Optimization
• Mathematical optimization falls in the prescriptive analysis section and this makes it a really valuable
technique
• It is widely used in areas like energy, healthcare, logistics, sports scheduling, finance, manufacturing
and retail. You can optimize the routing of packages, choose the most cost-effective way to deliver
electricity, create a working schedule, or divide tasks in a fair way.
• But what exactly is mathematical optimization? And how does it work? It all starts with a business
problem. Imagine you are part of a delivery company and you discover packages arrive too late at
customers.
• You receive complaints and you start analyzing. Something must be wrong with your delivery process.
You find out that every deliverer just grabs a random amount of packages and delivers those. After the
delivery of one package, the deliverer uses Google Maps to find out how to get to the next address.
Wow, so many optimization possibilities! You start thinking: what if the delivery vans are filled
completely, with packages that are near each other, and the deliverers follow the shortest route
possible? This would have a huge effect on the delivery process! The delivery time would improve, this
will result in less complaints and more happy customers! The deliverers can deliver more packages in a
shorter amount of time and the vans are using less fuel. Only wins here! 🎉
• Finding the optimal routes for the deliverers, choosing packages near each other and filling the vans are
all examples that can be solved using optimization. To solve these kind of problems, you should take
the following steps:
Mathematical Optimization
• Understanding the Problem
• This involves defining the problem, setting boundaries, talking to stakeholders and finding out what value you want to
minimize or maximize.
• The next (and often hardest) step is modeling the problem.
• You should translate everything you discovered during the first step to math. This means defining the variables, constraints
and objective. You can think of the variables as the values you can influence. For example, if you want to select an x amount
of packages near by each other, every package will receive a group number. The group numbers related to the packages could
be the variables. The constraints are the limits you want to use in your model. Let’s say a van can hold a maximum of 600
kilos of packages, this is an example of a constraint, the total weight of the selected packages for a van should not exceed 600
kilos. Last but not least, the objective is the formula you want to maximize or minimize. If we are talking about a routing
problem, you can imagine you want to minimize the total number of miles traveled. After you modeled your problem using
math, you can continue to the next step.
• Solving the problem
• This isn’t hard if you did the previous step correctly and know how to code. For the solving step you need a framework and a
solver. Some example frameworks are pyomo, ortools, pulp or scipy. Examples of free solvers are cbc, glpk and ipopt. There
are also commercial solvers available, they are a lot faster and if you want to solve problems with many variables and
constraints you should use a commercial solver. You code your problem using Python for example, you call the solver and
wait for the results. Now you can continue to the last step.
• Analyze the results
• Analyze the results the solver came up with to discover the improvement in performance. You can compare these results to the current processes
and see if it’s worth putting the model into production to optimize your process every once in a while.
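A toy sketch of the modeling-and-solving steps using PuLP (one of the free frameworks mentioned above): choose which packages to load into a van so that the delivered value is maximised without exceeding the hypothetical 600 kg weight limit. The package weights and values are made up for illustration.

# Toy van-loading model with PuLP: maximise delivered value subject to a 600 kg limit
import pulp

weights = {"A": 200, "B": 150, "C": 300, "D": 250}   # kg, illustrative
values  = {"A": 40,  "B": 30,  "C": 55,  "D": 45}    # delivery value, illustrative

prob = pulp.LpProblem("van_loading", pulp.LpMaximize)
load = {p: pulp.LpVariable(f"load_{p}", cat="Binary") for p in weights}

prob += pulp.lpSum(values[p] * load[p] for p in weights)              # objective
prob += pulp.lpSum(weights[p] * load[p] for p in weights) <= 600      # weight constraint

prob.solve(pulp.PULP_CBC_CMD(msg=False))                              # free CBC solver
chosen = [p for p in weights if load[p].value() == 1]
print("packages loaded:", chosen,
      "total weight:", sum(weights[p] for p in chosen))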
Mathematical Optimization
• Mathematical optimization is also one of many techniques that fall under the discipline of artificial
intelligence.
• Gartner defines AI as “a computer engineering discipline that uses a series of mathematically or logic-based
techniques that uncover and capture coding knowledge and leverage sophisticated and clever mechanisms to
solve problems to interpret events, support and automate decisions, and take actions.”
Network Modelling
• Artificial neural networks are forecasting methods that are
based on simple mathematical models of the brain. They
allow complex nonlinear relationships between the response
variable and its predictors.
• A neural network can be thought of as a network of
“neurons” which are organised in layers. The predictors (or
inputs) form the bottom layer, and the forecasts (or outputs)
form the top layer. There may also be intermediate layers
containing “hidden neurons”
Neural Network Auto Regression
• Neural network autoregression or NNAR model.
• With time series data, lagged values of the time series can be used as inputs to a neural network.
• Consider feed-forward networks with one hidden layer.
• NNAR(p,k) indicates there are p lagged inputs and k nodes in the hidden layer. For example, an
NNAR(9,5) model is a neural network with the last nine observations (yt−1, yt−2, …, yt−9) used as
inputs for forecasting the output yt, and with five neurons in the hidden layer.
• An NNAR(p,0) model is equivalent to an ARIMA(p,0,0) model, but without the restrictions on the
parameters to ensure stationarity.
• With seasonal data, it is useful to also add the last observed values from the same season as inputs.
• For example, an NNAR(3,1,2)12 model has inputs yt−1, yt−2, yt−3 and yt−12, and two neurons in the
hidden layer.
• In general, an NNAR(p,P,k)m model has inputs (yt−1, yt−2, …, yt−p, yt−m, yt−2m, …, yt−Pm) and k neurons
in the hidden layer.
• An NNAR(p,P,0)m model is equivalent to an ARIMA(p,0,0)(P,0,0)m model but without the
restrictions on the parameters that ensure stationarity.
• Example (sunspots): the surface of the sun contains magnetic regions that appear as dark spots. These
affect the propagation of radio waves, and so telecommunication companies like to predict sunspot
activity in order to plan for any future difficulties. Sunspots follow a cycle of length between 9 and 14
years. In the figure, forecasts from an NNAR(10,6) model are shown for the next 30 years; a Box-Cox
transformation with lambda=0 is used to ensure the forecasts stay positive.
NNAR

PI=FALSE is the default, so prediction intervals are not computed unless requested. The npaths argument in forecast() controls
how many simulations are done (default 1000). By default, the errors are drawn from a normal distribution. The bootstrap
argument allows the errors to be “bootstrapped” (i.e., randomly drawn from the historical errors).
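The NNAR model described above comes from R's forecast package (nnetar() and forecast()); there is no direct statsmodels equivalent, but the same idea — feed the last p observations into a single-hidden-layer network — can be sketched in Python with scikit-learn. This is an illustrative approximation on a simulated series, not the nnetar() algorithm itself.

# NNAR-style sketch: predict y_t from its last p lags with a one-hidden-layer network
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(8)
t = np.arange(400)
y = np.sin(2 * np.pi * t / 25) + 0.1 * rng.normal(size=400)

p, k = 9, 5                                   # roughly an NNAR(9,5): 9 lags, 5 hidden nodes
X = np.column_stack([y[i:len(y) - p + i] for i in range(p)])   # lagged inputs y_{t-p}..y_{t-1}
target = y[p:]

model = MLPRegressor(hidden_layer_sizes=(k,), max_iter=5000, random_state=0)
model.fit(X, target)

# One-step-ahead forecast from the last p observed values
print("next value forecast:", model.predict(y[-p:].reshape(1, -1))[0])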
Multi Objective Optimization

• Most real-world problems involve simultaneous optimization of several objective functions.
• Generally, these objective functions are measured in different units
and are often competing and conflicting.
• Multi-objective optimization having such conflicting objective
functions gives rise to a set of optimal solutions, instead of one optimal
solution because no solution can be considered to be better than any
other with respect to all objectives.
• These optimal solutions are known as Pareto-optimal solution
• Optimization problems with a number of objective functions to be
satisfied.
• The objective functions may be conflicting with one another.
• In order to simplify the solution process, additional objective
functions are usually handled as constraints.
• Multiple objective functions are handled at the same time.
Pareto Optimality -Pareto Improvement

• With a set of feasible solutions and different objective functions to pursue, a Pareto improvement is a
movement from one feasible solution to another that makes
• at least one objective function return a better value,
• with no other objective function becoming worse off.
• Pareto efficient or Pareto optimal:
• A set of feasible solutions is Pareto efficient or Pareto optimal when
no further Pareto improvements can be made.
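A small sketch of the Pareto-optimality idea: given candidate solutions scored on two objectives that are both minimised, keep only the non-dominated ones (the Pareto front). The (cost, delivery time) values are illustrative.

# Keep the non-dominated (Pareto-optimal) points when both objectives are minimised
def pareto_front(points):
    front = []
    for p in points:
        # p is dominated if some other point is at least as good in every objective
        dominated = any(all(q[i] <= p[i] for i in range(len(p))) and q != p
                        for q in points)
        if not dominated:
            front.append(p)
    return front

# (cost, delivery_time) pairs -- illustrative values
solutions = [(10, 5), (8, 7), (12, 4), (9, 6), (11, 6), (8, 8)]
print(pareto_front(solutions))   # (10, 5), (8, 7), (12, 4) and (9, 6) survive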
Evolutionary Algorithms
• Evolutionary algorithms are particularly suitable to solve multi-objective optimization problems as they deal
with a set of possible solutions simultaneously.
• Evolutionary algorithms are capable of finding several members of the Pareto-Optimal set in a single run of
the algorithm.
• Evolutionary algorithms are less susceptible to the shape or continuity of the Pareto Front.
• Evolutionary algorithm can deal with discontinuous or concave Pareto Fronts.
• There are both Non-Pareto and Pareto techniques available for MultiObjective Optimization using
Evolutionary Optimization techniques
Non-Pareto Techniques
• Approaches that do not incorporate directly the concept of Pareto optimum.
• Incapable of producing certain portions of the Pareto Front.
• Efficient and easy to implement, but appropriate to handle only a few objectives.
Pareto Techniques
• Suggested originally by Goldberg (1989) to solve multi-objective problems.
• Use of non-dominated ranking and selection to move the population towards the Pareto Front.
• Require a ranking procedure and the technique to maintain diversity in the population.
• Non-Pareto Techniques
• Aggregating approaches
• Vector evaluated genetic algorithm (VEGA)
• Lexicographic ordering
• The Epsilon(Ɛ –Constraint) Method
Pareto Techniques
• Multi-objective genetic algorithm (MOGA)
• Non-dominated sorting genetic algorithm-II (NSGA-II)
• Multi-objective particle swarm optimization (MOPSO)
• Pareto evolution archive strategy (PAES)
• Strength Pareto evolutionary algorithm (SPEA-II)
• Non-dominated Sorting Genetic Algorithm-II
• Non-dominated Sorting Genetic Algorithm-II is known as NSGA-II.
• Proposed by Deb et al. in 2000. Key features:
• Emphasizes non-dominated sorting
• Uses a diversity-preserving mechanism
• Does crowding comparison
• Uses the elitist principle: some of the parents go directly to the next
generation based on the above-mentioned conditions.
Stochastic Modelling

• https://towardsdatascience.com/stochastic-processes-analysis-f0a116999e4

1.Poisson processes: for dealing with waiting times and queues.


2.Random Walk and Brownian motion processes: used in algorithmic
trading.
3.Markov decision processes: commonly used in Computational Biology and
Reinforcement Learning.
4.Gaussian Processes: used in regression and optimisation problems (eg.
Hyper-Parameters tuning and Automated Machine Learning).
5.Auto-Regressive and Moving average processes: employed in time-series
analysis (eg. ARIMA models).
Stochastic Process

• Deterministic and stochastic processes
• In a deterministic process, if we know the initial condition (starting point) of a series
of events, we can then predict the next step in the series.
of events we can then predict the next step in the series.
• Instead, in stochastic processes, if we know the initial condition, we can’t determine
with full confidence what are going to be the next steps.
• That’s because there are many (or infinite!) different ways the process might evolve.
• In deterministic processes,
• all the subsequent steps are known with a probability of 1. On the other
hand, this is not the case with stochastic processes.
• In stochastic processes, each individual event is random, although hidden
patterns which connect each of these events can be identified
Stochastic Modelling

• Definition of stochastic processes in statistical terms


• Observation: the result of one trial.
• Population: all the possible observation that can be registered from a
trial.
• Sample: a set of results collected from separated independent trials.
• For example, the toss of a fair coin is a random process, but thanks to
The Law of the Large Numbers we know that given a large number of
trials we will get approximately the same number of heads and tails.
Stochastic Process
• Poisson processes
• Poisson Processes are used to model a series of discrete events in which we know the average
time between the occurrence of different events but we don’t know exactly when each of these
events might take place.
• A process can be considered to belong to the class of Poisson processes if it meets the
following criteria:
1. The events are independent of each other (if an event happens, this does not alter the probability
that another event can take place).
2. Two events can’t take place simultaneously.
3. The average rate between events occurrence is constant.
• Example:
• power-cuts. The electricity provider might advertise power cuts are likely to happen every 10 months on
average, but we can’t precisely tell when the next power cut is going to happen. For example, if a major
problem happens, the electricity might go off repeatedly for 2–3 days (eg. in case the company needs to make
some changes to the power source) and then after that stay on for the next 2 years.
Stochastic Process

• For this type of processes, we can be quite sure of the average time
between the events, but their occurrence is randomly spaced in time.
• From a Poisson Process, we can then derive a Poisson Distribution
which can be used to find the probability of the waiting time between
the occurrence of different events or the number of possible events
in a time period.
• A Poisson distribution can be modelled using the formula
P(X = k) = λᵏ e^(−λ) / k!
where λ is the expected number of events in the period and k is the number of events that
actually take place.
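A sketch using scipy for the power-cut example: with power cuts occurring at an average rate of, say, λ = 1.2 per year (an illustrative value), the Poisson distribution gives the probability of seeing k cuts in a year, and the exponential distribution gives the waiting time between cuts.

# Poisson process sketch: event counts per period and waiting times between events
from scipy.stats import poisson, expon

lam = 1.2                                   # average number of power cuts per year (illustrative)

for k in range(4):
    print(f"P(exactly {k} cuts in a year) = {poisson.pmf(k, lam):.3f}")

# Waiting time between events in a Poisson process is exponential with mean 1/lam
print("P(next cut within 6 months) =", expon.cdf(0.5, scale=1 / lam))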

Stochastic Process
• Random Walk and Brownian motion processes
• A Random Walk can be any sequence of discrete steps (of always the same
length) moving in random directions (Figure 3). Random Walks can take
place in any type of dimensional space (eg. 1D, 2D, nD).
• Imagine that we are in a park and we can see a dog looking for food. He is currently in
position zero on the number line and he has an equal probability of moving
left or right to find any food.
• A Random Walk is used to describe a discrete-time process. Instead, Brownian
Motion can be used to describe a continuous-time random walk.
• Some examples of random walks applications are: tracing the path taken by
molecules when moving through a gas during the diffusion process, sports
events predictions etc…
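A sketch of a simple 1-D random walk (the dog on the number line): each step is ±1 with equal probability; the number of steps is an arbitrary illustrative choice.

# 1-D random walk sketch: start at 0 and step left or right with equal probability
import numpy as np

rng = np.random.default_rng(9)
steps = rng.choice([-1, 1], size=1000)   # each step is -1 or +1
position = np.cumsum(steps)              # running position on the number line

print("final position:", position[-1])
print("furthest from start:", np.abs(position).max())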
HMM
• HMMs are probabilistic graphical models used to predict a sequence of hidden (unknown) states
from a set of observable states.
• This class of models follows the Markov processes assumption:
• “The future is independent of the past, given that we know the present”
• Therefore, when working with Hidden Markov Models, we just need to know our present state in
order to make a prediction about the next one (we don’t need any information about the previous
states).
• To make our predictions using HMMs we just need to calculate the joint probability of our hidden
states and then select the sequence which yields the highest probability (the most likely to happen).
In order to calculate the joint probability we need three main types of information:
• Initial condition: the initial probability we have to start our sequence in any of the hidden states.
• Transition probabilities: the probabilities of moving from one hidden state to another.
• Emission probabilities: the probabilities of moving from a hidden state to an observable state.
Hidden Markov Model
• One main problem when using Hidden Markov Models is that as the number of states
increases, the number of probabilities and possible scenarios increases exponentially. In
order to solve that, it is possible to use another algorithm called the Viterbi Algorithm.
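A compact sketch of the Viterbi algorithm for a two-state HMM; the states, observations and all probability values below are illustrative, not taken from the original material.

# Viterbi sketch: most likely hidden-state sequence for a tiny 2-state HMM
states = ["Rainy", "Sunny"]
observations = ["walk", "shop", "clean"]            # observed sequence

start_p = {"Rainy": 0.6, "Sunny": 0.4}              # initial condition
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},   # transition probabilities
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},   # emission probabilities
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

# V[t][s] = (probability of the best path ending in state s at time t, that path)
V = [{s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}]
for obs in observations[1:]:
    row = {}
    for s in states:
        prob, path = max((V[-1][prev][0] * trans_p[prev][s] * emit_p[s][obs],
                          V[-1][prev][1] + [s]) for prev in states)
        row[s] = (prob, path)
    V.append(row)

best_prob, best_path = max(V[-1].values())
print("most likely hidden states:", best_path, "with probability", best_prob)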
Gaussian Processes
• Gaussian Processes
• Gaussian Processes are a class of stationary, zero-mean stochastic
processes which are completely dependent on their autocovariance
functions. This class of models can be used for both regression and
classification tasks.
• One of the greatest advantages of Gaussian Processes is that they can
provide estimates about uncertainty, for example giving us an estimate
of how sure an algorithm is that an item belongs to a class or not.
• In order to deal with situations which embed a certain degree of uncertainty, we typically
make use of probability distributions.
• A simple example of a discrete probability distribution is the roll of a die.
• Imagine now one of your friends challenges you to play dice and you
make 50 throws. In the case of a fair die, we would expect each of
the 6 faces to have the same probability of appearing (1/6 each).
Gaussian Processes and Bayesian Inference

• This process is known as Bayesian Inference.


• Bayesian Inference is a process through which we update our beliefs about
the world based on the gathering of new evidence.
• We start with a prior belief and once we update it with brand new
information we construct a posterior belief. This same reasoning is valid for
discrete distributions as well as continuous distributions.
• Gaussian processes can, therefore, allow us to describe probability
distributions of which we can later update the distribution using Bayes Rule
(Figure 9) once we gather new training data.

• Figure 9: Bayes Rule [8]


Risk Analysis
• For a project manager, the Bayesian decision tree analysis [6] is used to mitigate the risk and cost of
a decision. It is useful when analyzing a process that is a combination of many decisions, as the
user is able to calculate the effect of each decision. The decision tree is a model with three types of
stages:
● A decision node (square)
● An event node (circle)
● A cost/consequence node

• At the decision node the project manager has to take an active choice to move on. The event node
is the effect that can happen from a choice taken. The last type of node is the cost/consequence
which is the end result of decisions and events occurring.
Figure 3 shows the basic decision problem situation. A medicinal company is investigating
whether they should install an extra power generator. If the power generator breaks down they would have
to shut down production until it is fixed. A shutdown could lead to massive expenses, and at the
same time investing in an extra power generator is also expensive. First, the decision node is
whether they should install the power generator or not. Both decisions lead to an event
where there is a chance of two possible outcomes.

• The theory behind the decision tree is to calculate the average outcome of each decision
taken, in short called the expected cost, which can be expressed as this simple equation:

Expected cost = Probability of risk * Impact of the risk
E = P * I
• To complete the decision tree, cost information for each stage and the probability of each
event occurring must be estimated. This is typically done in the identification and
assessment phase. For ease, some costs and probabilities have been assumed for this
case. The extra power generator costs $15000 to install and the after-effect of the generator
breaking down will cost $50000.
If no action is taken there is a 30% chance that the production is safe and a 70% chance it is not.
By installing the generator the production has a 99% chance of being safe. With extremely bad luck
there is a 1% chance that both generators break down. At the end of each branch the
expenses of each decision and the events are summarized. The expense will therefore add up
to $65000 for the path where both generators break down, as production will be
stalled as well. Based on this prior information the expected cost can now be calculated for
each branch, as seen with E1 and E2.

• The best decision would then be to take the minimum of the two expected costs. In this
situation it is the path with the extra generator. In that decision there is still the chance of
having to pay the full amount of the extra generator plus a production shutdown, but given the
probabilities of the other choice's outcomes, this decision delivers the best
opportunity.
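The expected-cost comparison for the generator example can be reproduced with a few lines; a sketch using the costs and probabilities given above:

# Expected cost E = P * I for the two decisions in the generator example
install_cost = 15_000          # cost of the extra generator
shutdown_cost = 50_000         # consequence of a production shutdown

# Decision 1: no extra generator -> 70% chance of a shutdown
e1 = 0.30 * 0 + 0.70 * shutdown_cost

# Decision 2: install the generator -> 99% safe, 1% both generators fail
e2 = 0.99 * install_cost + 0.01 * (install_cost + shutdown_cost)

print("Expected cost, no extra generator :", e1)   # 35000.0
print("Expected cost, install generator  :", e2)   # 15500.0
print("Best decision:", "install" if e2 < e1 else "do nothing")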
Decision Analysis - Posterior
• If additional information becomes available the decision tree model can be
updated and re-evaluated with the posterior analysis.

Given the same case the company would try to perform a repair on the
current power generator to improve the probability of success.

The project manager wants to hire an external company that could do an
investigation/repair on the generator for the price of $2500. The external
repair company promises that they can deliver a repair which have a 90%
chance of fixing the problem.

The decision tree now gets additional information, and the updated
probability can be calculated with Bayes' rule.
Decision Analysis
Decision analysis - Pre-posterior
• The decision maker normally has the possibility to buy extra information before making his decisions (in this case it was the repair company). The
information is worth buying if its cost is low compared to the value of the information. If different options to improve the decision are available, the
project manager must choose the option which yields the overall largest expected value. With pre-posterior analysis the decision maker is able to
evaluate whether the information is worth buying or not.

Limitations
• The decision tree offers many advantages compared to other decision-making tools, as it is easy to understand and simple to use. However, the
decision tree also has its disadvantages and limitations.
• The information in the decision tree relies on precise input to provide the user with a reliable outcome. A little change in the data can result in a
massive change in the outcome. Getting reliable data can be hard for the project manager: for example, how would you set the probability of a repair
being a success or a failure? The estimated cost could be way off if several events in a row have been estimated 10% wrong.
• Another fundamental flaw is that the decision tree is based on expectations of what will happen for each decision taken. The project manager's skill in making
predictions will, however, always be limited. There can always be unforeseen events arising from a decision taken which could change the outcome
of the situation.
• While the decision tree is easy to use, it can also become very complex and time-consuming. This is seen when using it on large problems. There
will be many branches and decisions which take a long time to create. With extra information added or removed the manager would probably have to
re-draw the decision tree.
• A large project can easily make the tool unwieldy, as it can be hard to present to colleagues if they have not been on the project from the start.
• Even though the decision tree seems to be easy, it requires skill and expertise to master. Without this it could easily go wrong and could come at a high
expense for the company if the outcome was not as expected. To ensure the expertise, the company would have to maintain their project managers'
skills, which could be expensive.
• Having to make a decision based on valuable information is good. However, having too much information can go both ways. The project manager
can hit the "paralysis of analysis" [7], where there is a massive challenge to process all the information, which will slow down the decision-making
capacity. Having too much information could therefore be a burden in both the cost and time of analysis.
