Datamining and Analytics Unit V
Time Series
Stationary Time Series
https://www.youtube.com/watch?v=OUiBqhvT_r0
Exploratory Data Analysis
• Exploratory data analysis (EDA) was originally developed by the American mathematician John Tukey in the 1970s.
• It helps determine how best to manipulate data sources to get the answers you need, making it easier for
data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.
• EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing
task and provides a better understanding of data set variables and the relationships between them.
• It can also help determine if the statistical techniques you are considering for data analysis are
appropriate.
• Purpose of EDA :
• To help look at data before making any assumptions.
• To help identify obvious errors, as well as better understand patterns within the data, detect outliers or
anomalous events, find interesting relations among the variables.
• Data scientists can use exploratory analysis to analyze and investigate data sets and summarize their
main characteristics, often employing data visualization methods.
• To ensure the results produced are valid and applicable to the desired business outcomes and goals.
• To help stakeholders confirm they are asking the right questions. EDA can help answer questions about standard deviations, categorical variables, and confidence intervals.
• Once EDA is complete and insights are drawn, its features can then be used for more sophisticated data
analysis or modeling, including machine learning.
EDA tools
• Python:
• An interpreted, object-oriented programming language with dynamic semantics. Its
high-level, built-in data structures, combined with dynamic typing and dynamic
binding, make it very attractive for rapid application development, as well as for use
as a scripting or glue language to connect existing components together.
• Python and EDA can be used together to identify missing values in a data set, which is important so you can decide how to handle them for machine learning (see the sketch after this list).
• R:
• An open-source programming language and free software environment for statistical
computing and graphics supported by the R Foundation for Statistical Computing.
The R language is widely used among statisticians in data science in developing
statistical observations and data analysis.
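A minimal sketch of the missing-value check mentioned under Python above; the file name sales.csv and its columns are hypothetical placeholders.

```python
# Minimal EDA sketch: locate and handle missing values with pandas.
import pandas as pd

df = pd.read_csv("sales.csv")                 # hypothetical dataset

# Count missing values per column.
print(df.isnull().sum())

# Share of rows that contain at least one missing value.
print(df.isnull().any(axis=1).mean())

# Typical follow-up choices: drop incomplete rows or impute numeric columns.
df_dropped = df.dropna()
df_filled = df.fillna(df.mean(numeric_only=True))
```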
Time Series Analysis
• This process is a first-order autoregressive process, AR(1):
  yi = φyi-1 + εi
• where |φ| ≤ 1 and εi is white noise. If |φ| = 1, we have what is called a unit root. In particular, if φ = 1, we have a random walk (without drift), which is not stationary. In fact, if |φ| = 1, the process is not stationary, while if |φ| < 1, the process is stationary.
• In the case where |φ| > 1, the process is called explosive and increases over time.
• The Dickey-Fuller test is a way to determine whether the above process has a unit root. The approach used is quite straightforward. First calculate the first difference, i.e.
  yi – yi-1 = (φ – 1)yi-1 + εi
• If we use the delta operator, defined by Δyi = yi – yi-1, and set β = φ – 1, then the equation becomes the linear regression equation
  Δyi = βyi-1 + εi
• where β ≤ 0, and so the test for φ is transformed into a test that the slope parameter β = 0. Thus, we have a one-tailed test (since β can't be positive) where
• H0: β = 0 (equivalent to φ = 1)
• H1: β < 0 (equivalent to φ < 1)
• Under the alternative hypothesis, if b is the ordinary least squares (OLS) estimate of β, and so φ̄ = 1 + b is the OLS estimate of φ, then for large enough n, φ̄ is approximately normally distributed: φ̄ ∼ N(φ, (1 – φ²)/n).
The Dickey-Fuller test
• We can use the usual linear regression approach, except that when the null hypothesis holds the t coefficient
doesn’t follow a normal distribution and so we can’t use the usual t-test. Instead, this coefficient follows a tau
distribution, and so our test consists of determining whether the tau statistic τ (which is equivalent to the usual
t statistic) is less than τcrit based on a table of critical tau statistics values shown in Dickey-Fuller Table.
• If the calculated tau value is less than the critical value in the table of critical values, then we have a significant result (reject the null hypothesis); otherwise, we cannot reject the null hypothesis that there is a unit root and the time series is not stationary.
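A minimal sketch of the unit-root idea above: simulate a stationary AR(1) (φ = 0.5) and a random walk (φ = 1), then apply the augmented Dickey-Fuller test from statsmodels to each; the simulated data are purely illustrative.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
eps = rng.normal(size=500)                       # white noise

def ar1(phi, eps):
    # y[i] = phi * y[i-1] + eps[i]
    y = np.zeros(len(eps))
    for i in range(1, len(eps)):
        y[i] = phi * y[i - 1] + eps[i]
    return y

stationary = ar1(0.5, eps)    # |phi| < 1 -> stationary
random_walk = ar1(1.0, eps)   # phi = 1  -> unit root, not stationary

for name, series in [("AR(1), phi=0.5", stationary), ("random walk", random_walk)]:
    stat, pvalue, *_ = adfuller(series)
    print(f"{name}: test statistic = {stat:.2f}, p-value = {pvalue:.3f}")
# Expect a small p-value (reject the unit root) for phi = 0.5 and a large
# p-value (fail to reject) for the random walk.
```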
Lecture 5
Forecasting Tools for Time Series Data
Moving Average
Exponential Smoothing
Holt Exponential Smoothing
Holt-Winters Exponential Smoothing
• Weighted Averages:
• A weighted average is simply an average of n numbers where each number is given a certain weight and the denominator is the sum of those n weights.
• The weights are often assigned according to some weighting function, such as logarithmic, linear, quadratic, cubic or exponential.
• Averaging as a time series forecasting technique has the property of smoothing out the variation in the historical values while calculating the forecast.
• By choosing a suitable weighting function, the forecaster determines which historical values should be emphasized when calculating future values of the time series.
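A minimal sketch of a weighted average forecast; the history values and the linear weighting used here are illustrative, not a prescribed choice.

```python
import numpy as np

history = np.array([112.0, 118.0, 132.0, 129.0, 121.0])  # hypothetical recent observations
weights = np.array([1, 2, 3, 4, 5], dtype=float)          # linear weighting: newer values weigh more

# Weighted average of the last n points; the denominator is the sum of the weights.
forecast = np.sum(weights * history) / np.sum(weights)
print(round(forecast, 2))
```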
The impact of previous time steps is determined by the smoothing coefficient α at that particular period of time.
The price of a share of a company X may depend on a company merger that happened overnight, or the company may have shut down due to bankruptcy.
A model that uses the residuals (errors) of past forecasts to calculate the present or future values of the series is known as a Moving Average (MA) model.
The MA model can simply be thought of as the linear combination of q past forecast errors.
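A minimal sketch, on hypothetical simulated data, of fitting an MA(2) model and a simple exponential smoothing model with statsmodels (Holt and Holt-Winters variants live in the same holtwinters module).

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

rng = np.random.default_rng(1)
y = rng.normal(loc=100, scale=5, size=200)    # hypothetical series

# Pure MA(q): order=(p, d, q) with p = 0 and d = 0, here q = 2.
ma2 = ARIMA(y, order=(0, 0, 2)).fit()
print(ma2.summary())

# Simple exponential smoothing with a fixed smoothing coefficient alpha = 0.3.
ses = SimpleExpSmoothing(y, initialization_method="estimated").fit(
    smoothing_level=0.3, optimized=False)
print(ses.forecast(5))                        # next 5 smoothed forecasts
```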
ARIMA Model
Step-by-step general approach of implementing ARIMA:
• Step 1:
• Load the dataset and plot the source data. (Check if the data has any seasonal patterns, cyclic
patterns, general trends)
• Dealing with missing values: ARIMA models don’t work on data that have NAs.
• Plot the data.
• Step 2:
• Apply the Augmented Dickey Fuller Test (to confirm the stationarity of data)
• Implementation: adfuller()
• If the data is stationary, proceed with ARMA or ARIMA. (It’s your choice!)
• If the data is not stationary, proceed with ARIMA.(Because, data needs to be differenced to make it
stationary: ‘I’ component of ARIMA does this)
• Step 3:
• Run ETS Decomposition on the data (to check the seasonality in the data)
• Implementation: seasonal_decompose()
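A minimal sketch of Steps 1-3 plus a model fit using statsmodels; the file airline.csv, its column name, the monthly period of 12, and the ARIMA order (1, 1, 1) are hypothetical placeholders.

```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA

# Step 1: load, handle missing values, plot.
series = pd.read_csv("airline.csv", index_col=0, parse_dates=True)["passengers"]
series = series.interpolate()                 # ARIMA models don't work on data with NAs
series.plot(title="Source data")

# Step 2: Augmented Dickey-Fuller test for stationarity.
stat, pvalue, *_ = adfuller(series)
print("ADF p-value:", pvalue)                 # small p-value -> stationary; otherwise difference (the 'I' part)

# Step 3: ETS decomposition to inspect trend and seasonality (period=12 assumes monthly data).
seasonal_decompose(series, model="additive", period=12).plot()

# Fit a (hypothetical) ARIMA(1, 1, 1) and forecast 12 steps ahead.
model = ARIMA(series, order=(1, 1, 1)).fit()
print(model.forecast(steps=12))
```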
Seasonal ARIMA
• https://blog.paperspace.com/time-series-forecasting-autoregressive-models-smoothing-methods/
• https://towardsdatascience.com/time-series-models-d9266f8ac7b0
• https://towardsdatascience.com/time-series-forecasting-with-autoregressive-processes-ba629717401
• https://online.stat.psu.edu/stat501/lesson/14/14.1
Parameter Estimation
1.Method of Moments
• One of the easiest methods of parameter estimation is the method of moments (MOM).
• The basic idea is to find expressions for the sample moments and for the population moments and equate them:
  (1/n) Σ xi^r = E(X^r), for r = 1, 2, …
• The E(X^r) expression will be a function of one or more unknown parameters.
• If there are, say, 2 unknown parameters, we would set up MOM equations for r = 1, 2, and solve these 2 equations
simultaneously for the two unknown parameters.
• In the simplest case, if there is only 1 unknown parameter to estimate then we equate the sample mean to the true mean of the
process and solve for the unknown parameter.
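A minimal worked sketch of the one-parameter case, assuming (hypothetically) an exponential distribution with unknown rate λ: since E(X) = 1/λ, equating the sample mean to the population mean gives the MOM estimate λ̂ = 1/x̄.

```python
import numpy as np

rng = np.random.default_rng(2)
true_lambda = 2.0
x = rng.exponential(scale=1 / true_lambda, size=10_000)  # simulated sample

# Method of moments: set sample mean equal to E(X) = 1/lambda and solve.
lambda_hat = 1 / x.mean()
print(round(lambda_hat, 3))   # should be close to 2.0
```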
In R's forecast package, PI=FALSE is the default, so prediction intervals are not computed unless requested. The npaths argument in forecast() controls how many simulations are done (default 1000). By default, the errors are drawn from a normal distribution. The bootstrap argument allows the errors to be "bootstrapped" (i.e., randomly drawn from the historical errors).
Stochastic Processes - Poisson Process
• https://towardsdatascience.com/stochastic-processes-analysis-f0a116999e4
• For this type of process, we can be quite sure of the average time between events, but their occurrence is randomly spaced in time.
• From a Poisson Process, we can then derive a Poisson Distribution
which can be used to find the probability of the waiting time between
the occurrence of different events or the number of possible events
in a time period.
• A Poisson Distribution can be modelled using the following formula, where k is the number of events that take place in a period and λ is the expected number of events in that period:
  P(X = k) = (λ^k e^–λ) / k!
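A minimal sketch of the Poisson formula above using scipy.stats; the rate of λ = 4 events per period is a hypothetical choice.

```python
from scipy.stats import poisson

lam = 4                                   # expected number of events per period (hypothetical)
for k in range(8):
    print(k, round(poisson.pmf(k, lam), 4))   # probability of exactly k events

# Probability of at most 2 events in the period.
print(round(poisson.cdf(2, lam), 4))
```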
Stochastic Process
• Random Walk and Brownian motion processes
• A Random Walk can be any sequence of discrete steps (of always the same
length) moving in random directions (Figure 3). Random Walks can take
place in any type of dimensional space (eg. 1D, 2D, nD).
• Imagine that we are in a park and we can see a dog looking for food. He is currently at position zero on the number line and has an equal probability of moving left or right to find any food.
• A Random Walk is used to describe a discrete-time process. Brownian Motion, instead, can be used to describe a continuous-time random walk.
• Some examples of random walks applications are: tracing the path taken by
molecules when moving through a gas during the diffusion process, sports
events predictions etc…
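A minimal sketch of the 1D random walk described above: at each step move +1 or -1 with equal probability, starting from position zero (the dog on the number line).

```python
import numpy as np

rng = np.random.default_rng(3)
steps = rng.choice([-1, 1], size=1000)            # equal probability left/right
positions = np.concatenate(([0], np.cumsum(steps)))

print(positions[:10])    # first few positions of the walk
print(positions[-1])     # final position after 1000 steps
```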
HMM
• HMMs are probabilistic graphical models used to predict a sequence of hidden (unknown) states
from a set of observable states.
• This class of models follows the Markov processes assumption:
• “The future is independent of the past, given that we know the present”
• Therefore, when working with Hidden Markov Models, we just need to know our present state in
order to make a prediction about the next one (we don’t need any information about the previous
states).
• To make our predictions using HMMs we just need to calculate the joint probability of our hidden
states and then select the sequence which yields the highest probability (the most likely to happen).
In order to calculate the joint probability we need three main types of information:
• Initial condition: the initial probability we have to start our sequence in any of the hidden states.
• Transition probabilities: the probabilities of moving from one hidden state to another.
• Emission probabilities: the probabilities of moving from a hidden state to an observable state.
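A minimal sketch, with hypothetical initial, transition and emission probabilities, of the joint probability of one hidden-state path and an observation sequence: the product of the initial probability and the transition and emission probabilities along the path.

```python
# Hypothetical two-state weather model.
initial = {"Rainy": 0.6, "Sunny": 0.4}
transition = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
              "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emission = {"Rainy": {"walk": 0.1, "shop": 0.9},
            "Sunny": {"walk": 0.8, "shop": 0.2}}

hidden_path = ["Rainy", "Rainy", "Sunny"]
observations = ["shop", "shop", "walk"]

# Joint probability = initial * emission, then transition * emission at each later step.
prob = initial[hidden_path[0]] * emission[hidden_path[0]][observations[0]]
for t in range(1, len(hidden_path)):
    prob *= transition[hidden_path[t - 1]][hidden_path[t]]
    prob *= emission[hidden_path[t]][observations[t]]
print(prob)  # comparing this value across candidate paths picks the most likely one
```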
Hidden Markov Model
• One main problem when using Hidden Markov Models is that as the number of states increases, the number of probabilities and possible scenarios increases exponentially. In order to solve that, it is possible to use another algorithm called the Viterbi Algorithm.
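A minimal sketch of the Viterbi Algorithm on a hypothetical two-state model: rather than enumerating every possible path, it keeps only the best probability of reaching each hidden state at each time step, then backtracks.

```python
def viterbi(observations, states, initial, transition, emission):
    # best[t][s] = highest probability of any state path ending in s at time t
    best = [{s: initial[s] * emission[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, p = max(((r, best[t - 1][r] * transition[r][s]) for r in states),
                          key=lambda pair: pair[1])
            best[t][s] = p * emission[s][observations[t]]
            back[t][s] = prev
    last = max(best[-1], key=best[-1].get)          # most probable final state
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, back[t][path[0]])            # follow the back-pointers
    return path, best[-1][last]

# Hypothetical two-state weather model.
states = ["Rainy", "Sunny"]
initial = {"Rainy": 0.6, "Sunny": 0.4}
transition = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
              "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emission = {"Rainy": {"walk": 0.1, "shop": 0.9},
            "Sunny": {"walk": 0.8, "shop": 0.2}}
print(viterbi(["shop", "shop", "walk"], states, initial, transition, emission))
```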
Gaussian Processes
• Gaussian Processes
• Gaussian Processes are a class of stationary, zero-mean stochastic processes which are completely determined by their autocovariance functions. This class of models can be used for both regression and classification tasks.
• One of the greatest advantages of Gaussian Processes is that they can
provide estimates about uncertainty, for example giving us an estimate
of how sure an algorithm is that an item belongs to a class or not.
• In order to deal with situations which embed a certain degree of uncertainty, we typically make use of probability distributions.
• A simple example of a discrete probability distribution is the roll of a die.
• Imagine now that one of your friends challenges you to a game of dice and you make 50 throws. In the case of a fair die, we would expect each of the 6 faces to have the same probability of appearing (1/6 each).
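A minimal sketch of Gaussian Process regression with scikit-learn, showing the uncertainty estimates mentioned above; the training points are hypothetical.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.array([[1.0], [3.0], [5.0], [6.0], [8.0]])   # hypothetical training inputs
y = np.sin(X).ravel()                                # hypothetical training targets

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-6)
gp.fit(X, y)

X_new = np.linspace(0, 10, 5).reshape(-1, 1)
mean, std = gp.predict(X_new, return_std=True)       # prediction plus uncertainty estimate
for x, m, s in zip(X_new.ravel(), mean, std):
    print(f"x={x:.1f}  mean={m:.2f}  std={s:.2f}")    # std grows away from the training data
```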
Decision Analysis - Decision Trees
• At the decision node the project manager has to make an active choice to move on. The event node is the effect that can happen from a choice taken. The last type of node is the cost/consequence node, which is the end result of the decisions and events that occur.
• Consider a basic decision problem: a medicinal company is investigating whether they should install an extra power generator. If the current power generator breaks down, they would have to shut down production until it is fixed. A shutdown could lead to massive expenses, and at the same time investing in an extra power generator is also expensive. The first decision node is whether they should install the power generator or not. Both decisions lead to an event where there is a chance of two possible outcomes.
• The theory behind the decision tree is to calculate the average outcome of each decision taken, in short called the expected cost, which can be expressed as this simple equation:
  Expected cost = Σ (probability of outcome × cost of outcome)
• The best decision is then the one with the minimum of the expected costs. In this situation it is the path with the extra generator. With that decision there is still a chance of having to pay the full cost of the extra generator plus a production shutdown, but weighed against the probabilities of the other choice's outcomes, it offers the best opportunity.
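A minimal sketch of the expected-cost comparison described above; the probabilities and costs are hypothetical, not the figures from the original example.

```python
def expected_cost(outcomes):
    # outcomes: list of (probability, cost) pairs for one decision branch
    return sum(p * c for p, c in outcomes)

# Decision A: install the extra generator (hypothetical numbers).
with_generator = [(0.02, 100_000 + 1_000_000),   # backup also fails -> shutdown on top of the purchase
                  (0.98, 100_000)]               # backup covers the failure
# Decision B: do not install (hypothetical numbers).
without_generator = [(0.30, 1_000_000),          # breakdown -> production shutdown
                     (0.70, 0)]                  # no breakdown

costs = {"install": expected_cost(with_generator),
         "do not install": expected_cost(without_generator)}
print(costs)
print("best decision:", min(costs, key=costs.get))   # the minimum expected cost
```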
Decision Analysis - Posterior
• If additional information becomes available, the decision tree model can be updated and re-evaluated with a posterior analysis.
• Given the same case, the company could try to perform a repair on the current power generator to improve the probability of success.
• The project manager wants to hire an external company that could do an investigation/repair on the generator for the price of $2500. The external repair company promises that they can deliver a repair which has a 90% chance of fixing the problem.
• The decision tree now incorporates this additional information, and the updated probability can be calculated with Bayes' rule.
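A minimal sketch of a Bayes' rule update of a branch probability; the prior failure probability and the reliability figures of the inspection/repair report are hypothetical, not the case numbers above.

```python
def posterior(prior, likelihood, likelihood_given_not):
    # P(fail | report) = P(report | fail) * P(fail) /
    #                    (P(report | fail) * P(fail) + P(report | ok) * P(ok))
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence

p_fail = 0.30                 # prior probability of generator failure (hypothetical)
p_report_given_fail = 0.90    # report flags a failing generator 90% of the time (hypothetical)
p_report_given_ok = 0.10      # false-alarm rate (hypothetical)

print(round(posterior(p_fail, p_report_given_fail, p_report_given_ok), 3))
# The updated probability replaces the prior on the corresponding branch and the
# tree is re-evaluated, after adding the $2500 cost of the extra information.
```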
Decision Analysis - Pre-posterior
• The decision maker normally has the possibility to buy extra information before making his decisions (in this case it was the repair company). The information is worth buying if the cost is low compared to the value of the information. If different options to improve the decision are available, the project manager must choose the option which yields the overall largest expected value. With pre-posterior analysis the decision maker is able to evaluate whether the information is worth buying or not.
Limitations
• The decision tree offers many advantages compared to other decision-making tools, as it is easy to understand and simple to use. However, the decision tree also has its disadvantages and limitations.
• The information in the decision tree relies on precise input to provide the user with a reliable outcome. A small change in the data can result in a massive change in the outcome. Getting reliable data can be hard for the project manager: for example, how would you set the probability of a repair being a success or a failure? The estimated cost could be way off if several events in a row have been estimated 10% wrong.
• Another fundamental flaw is that the decision tree is based on expectations of what will happen for each decision taken. The project manager's ability to make predictions will, however, always be limited. There can always be unforeseen events following a decision which change the outcome of the situation.
• While the decision tree is easy to use, it can also be very complex and time consuming when applied to large problems. There will be many branches and decisions, which take a long time to create, and with extra information added or removed the manager would probably have to redraw the decision tree.
• A large project can easily make the tool unwieldy, as it can be hard to present to colleagues if they have not been on the project from the start.
• Even though the decision tree seems easy, it requires skill and expertise to master. Without this it could easily go wrong and come at a high expense for the company if the outcome was not as expected. To ensure this expertise, the company would have to maintain their project managers' skills, which could be expensive.
• Making a decision based on valuable information is good; however, too much information can cut both ways. The project manager can hit "paralysis of analysis"[7], facing a massive challenge to process all the information, which slows down decision-making capacity. Too much information could therefore be a burden in both the cost and the time of analysis.