ISYE 6402 Lecture Transcripts
This document contains the text and slides of Dr. Nicoleta Serban’s video lectures for
the ISyE 6402 Time Series Analysis course. The content was obtained by downloading
and combining text from the .txt text files, and images from the .ppt Powerpoint slides
located on the edX platform (SP19: Time Series Analysis).
The objective of this effort is to gather all of the course’s video lectures into a single
searchable document.
Unfortunately, the written transcription of the actual lecture audio contains numerous
errors. Our intention during the Spring 2019 semester is to correct the transcripts in red
so the revisions can be approved by Dr. Serban at the end of the course, and then
spliced back into the .txt and .srt files on edX for the benefit of future classes. The
concepts and material for this class are already hard enough without the
misunderstandings created by transcription errors!
If you would like to help correct the transcripts, please keep the following guidelines in
mind:
1. Since we need to know where the changes are to update the source files, please
highlight all changes in red using the Text Color cell in the top-of-page ribbon.
2. If words are being removed from the transcript to match the audio, please use Format
> Text > Strikethrough (Alt+Shift+5).
3. If the audio itself appears to be incorrect, or is missing a word or two, please highlight
the correction in [bold red] surrounded by square brackets.
4. If you would like to amplify a point, or provide a reference link, or start a discussion,
please use Insert > Comment to place your thoughts in the margin. Otherwise, please
don’t add content to the text body that isn’t in the video lecture.
5. Slides were originally inserted after the accompanying text. It makes better reading if
the slides precede the text. Some slides have been rearranged accordingly, but it is a
work in progress. Please feel free to assist in this effort; completing a lesson at a time
would be good so that at least there is that level of internal consistency.
Thanks again for all of your help!
Unit 1: Basic Time Series Analysis
Moments of Distribution
What are the moments of a distribution for a random variable x with density function f(x)?
The l-th moment of the distribution of x is the expectation of x to the l-th power,
or the integral of x to the l-th power times the density function. Similarly, we can define the l-th central moment,
which is the expectation of x minus the mean of x, raised to the power l. Thus for the central moment we
center the random variable around its mean.
Two classic examples of moments are the expectation and the variance. The expectation is
the first moment and the variance is the second central moment. Other examples include the
skewness, which is the third central moment divided by the standard deviation to the
power of 3, measuring how symmetric the distribution of x is. (See image below.)
Another example is the kurtosis, which is the fourth central moment divided by the standard
deviation to the power of 4, measuring how fat the tails of the distribution are.
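For reference, the definitions just described can be written compactly in standard notation (a reconstruction, not copied from the slide):

\mu'_\ell = E[X^\ell] = \int x^\ell f(x)\,dx  (the \ell-th moment),
\mu_\ell = E[(X - \mu)^\ell]  (the \ell-th central moment),
skewness = \mu_3 / \sigma^3,  kurtosis = \mu_4 / \sigma^4,

where \mu = E[X] and \sigma is the standard deviation of X.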
Note that on this slide, I described the moments assuming x has a continuous distribution. We
can also define moments for discrete distributions. In this class, we primarily focus on
continuous distributions.
Statistical estimation refers to identifying a function of the random data, also called a statistic,
to be used in obtaining approximations or estimates for one or more parameters or statistical
summaries of a distribution. Thus, first we start with realizations or observations denoted with
small letters x1, x2, ..., xn. These are realizations of random data denoted in
statistics with capital letters X1, X2 up to Xn.
The random data, or random variables, the X's, are assumed here to have the same
distribution f(x; theta) with parameter theta. That is, the distribution is known up to
the parameter, or vector of parameters, theta. Because theta is unknown, we use statistical
estimation to obtain approximations or estimates for theta that are accurate
and/or precise with respect to the true parameter.
Specifically, we develop a function of the data to estimate theta that has good statistical
properties such as small bias and/or small variability. Two common approaches in statistics to
obtain estimates are the method of moments and the maximum likelihood approach.
Some classic estimators commonly studied in basic statistics courses are the sample mean
and the sample variance. Specifically, we begin with data x1 to xn and estimate the mean
by the average of the x's, denoted x bar. This is called the sample mean.
For the sample variance, we first center the x's by the estimated mean, or sample mean.
Then we take the sum of the squared centered data, all divided by n minus 1.
We can also construct similar estimators for the skewness and kurtosis using the sample
skewness and sample kurtosis as provided on the slide.
For both, we first need to obtain estimators for the mean and the variance parameters, given by the
sample mean and the sample variance. Then we center the x's by the sample mean, take the
power 3 for the sample skewness and the power 4 for the sample kurtosis, and sum them up.
We divide those sums by n-1 and by the sample standard deviation to the power of 3 for the
skewness and the power of 4 for the kurtosis.
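As an illustration (not part of the lecture), a short R sketch of these sample estimators; the data vector x is simulated only so the code is self-contained:

## Sample mean, variance, skewness and kurtosis for a numeric vector x
set.seed(1)
x = rnorm(100, mean = 5, sd = 2)             # illustrative data
n = length(x)
xbar = mean(x)                               # sample mean
s2 = sum((x - xbar)^2) / (n - 1)             # sample variance
s = sqrt(s2)                                 # sample standard deviation
skew = sum((x - xbar)^3) / ((n - 1) * s^3)   # sample skewness
kurt = sum((x - xbar)^4) / ((n - 1) * s^4)   # sample kurtosis
c(mean = xbar, var = s2, skewness = skew, kurtosis = kurt)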
The mean and the variance are usually parameters of a distribution, whereas the skewness and
kurtosis are statistical summaries of a distribution.
Similar estimators can be constructed for other central moments. I will note here that since we
are using empirical data to obtain estimates of such parameters or statistical summaries,
we obtain realizations from the data to plug into these formulas to get the corresponding estimated
values for one set of realizations. Because every time we observe from the data we have
different realizations, different observations, we do not get the true values for the statistical
summaries (mean, variance, skewness, or kurtosis), but approximate values, estimates, which
will change with each different set of observed data from the random data.
Because the estimators of a parameter or statistical summaries are functions of the random
data they are also random variables. For example, the estimator of the mean is big X bar which
reflects all possible realizations of the sample mean. If the data are normally distributed then the
distribution of X bar is also normal with mean mu, the true parameter, and variance
sigma squared divided by n, where sigma squared is the true variance of the data. Moreover, even if
the data are not normal, for large sample sizes the distribution of X bar is approximately normal
according to the central limit theorem.
The distribution of the estimator for the variance sigma squared is provided on this slide.
S-squared is the sample estimator for the variance. If multiplied by n-1 and divided by the true
variance, the sampling distribution is a chi-square distribution with n-1 degrees of freedom.
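In standard notation, the two sampling distributions referred to above are (a reconstruction of the slide content):

\bar{X} \sim N(\mu, \sigma^2 / n)  (exactly under normality, approximately for large n),
(n-1) S^2 / \sigma^2 \sim \chi^2_{n-1}  (under normality).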
Based on the sampling distribution of an estimator, we can derive important statistical properties
for the estimator. For example, two important statistical properties are unbiasedness and
consistency. Unbiasedness refers to the property that the expectation of the estimator is exactly
equal to the true parameter, whereas consistency means that for large sample sizes
the estimator is close to the true parameter, where closeness is in a probabilistic sense.
Next, I'll introduce the two methods used to estimate parameters in distributions. The first one is
method of moments estimation. In this method we equate distribution or population moments
to the sample moments to obtain equations in the parameters to be estimated, which will be
solved for the parameters of interest.
To be more specific, we equate the population pth moment to the pth sample moment to obtain
this equation. This equation will be a function of the parameters for specific power values of p.
We usually consider equations for small powers of p, for example, p equal to 1, p equal to 2, and
rarely higher. Here are some examples.
If we have data from a normal distribution and we want to estimate the mean mu, we equate the first
moment, the expectation, to the sample mean. Note that the first moment is actually equal to the
parameter mu. Solving the equation will give us an estimator mu hat, which is the average of the xi's.
If we also want to get an estimator of sigma squared, we can use the second moment,
which is equated to the second sample moment. Using the first equation, where we derived the
sample mean mu hat, we can now plug it into this second equation to get the estimator for sigma
squared, which is provided on this slide. Note that the estimator for sigma squared obtained
using the method of moments is slightly different from the sample variance because here we have n
instead of n minus 1.
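Written out (a reconstruction of the slide's derivation), equating the first two population moments to the sample moments gives

E[X] = \mu = \frac{1}{n}\sum_{i=1}^n x_i  \Rightarrow  \hat{\mu} = \bar{x},
E[X^2] = \sigma^2 + \mu^2 = \frac{1}{n}\sum_{i=1}^n x_i^2  \Rightarrow  \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2,

which uses n rather than n - 1 in the denominator.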
This is the second approach used for estimating parameters. Maximum likelihood estimation,
or abbreviated MLE, is the most popular method for estimating parameters for a distribution or a
model. The idea is to maximize the likelihood of observing the realizations given the
parameters.
Thus if we want to estimate the parameter theta of the distribution of the data, we
need to first establish the likelihood function, which in the case of independence would be the
product of the density functions for each individual xi in the data.
More generally, the likelihood function is the joint density function of the random data x1 to xn. The
likelihood function is a function of theta, given the data, and thus, when we maximize the likelihood in MLE, we
maximize with respect to theta.
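As a small illustration of the idea (not from the lecture), a minimal R sketch that obtains the MLE for a normal sample by numerically maximizing the log-likelihood; the data are simulated so the example is self-contained:

## MLE for a normal sample via numerical maximization of the log-likelihood
set.seed(1)
x = rnorm(200, mean = 5, sd = 2)
negloglik = function(par, x) {
  mu = par[1]
  sigma = exp(par[2])                 # log-scale parameterization keeps sigma > 0
  -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}
fit = optim(c(0, 0), negloglik, x = x)
c(mu.hat = fit$par[1], sigma.hat = exp(fit$par[2]))
## For the normal, the MLEs equal the sample mean and sqrt(mean((x - mean(x))^2))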
Since I mentioned joint distributions, let's discuss multivariate distributions. It's important to
understand the difference between joint, conditional, and marginal distributions.
For two random variables x and y we have a bivariate distribution f(x, y) called the joint
distribution. We can decompose this joint distribution into a product of conditional distribution of
x given y times the marginal distribution of y. We write the conditional density as f(x|y).
This is also equal to the product of the conditional distribution y given x times the marginal of x.
It's important to note that if x and y are independent then the conditional distribution f(x|y) is the
marginal distribution of x. And the conditional distribution of y given x is the marginal of y. Thus,
when we have independence, the joint distribution is the product of the marginal
distributions.
We can expand this to three variables in a similar way. The joint distribution is a trivariate
distribution which can be decomposed into three parts: the product of the conditional
distribution of x given y and z, times the conditional of y given z, times the marginal of z.
We can expand this even to n random variables. Similarly to the case with three variables, we will have
n-1 conditionals and one marginal. Under independence of the n random variables, this again becomes
the product of the marginal distributions.
For example, if the conditional distribution of xi given xi-1 to x1 is a normal distribution with
mean mu_i and variance sigma_i squared, and if we ignore the marginal of x1, then the joint distribution
f(x1, ..., xn) is as in this slide. This particular derivation will come up in some of the
models in this course.
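The general decomposition referred to above is the chain rule of probability (a reconstruction in standard notation):

f(x_1, \ldots, x_n) = f(x_n \mid x_{n-1}, \ldots, x_1) \, f(x_{n-1} \mid x_{n-2}, \ldots, x_1) \cdots f(x_2 \mid x_1) \, f(x_1),

which reduces to the product of the marginals under independence.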
One of the most important aspects in statistical analysis is statistical inference, for example,
hypothesis testing. An example of a hypothesis test is when we're interested in making an
inference on a particular parameter, theta. Here for example, we may want to test the null
hypothesis that theta = theta0 versus the alternative that theta is not equal to theta0. Here theta0 is
a known null value.
We can also have hypothesis testing for distributions. For example, we may be interested to test
whether the data come from a normal distribution versus not. In all such hypothesis testing
procedures, we always have a null and alternative hypothesis.
We often make decisions based on a p-value rather than rejection rules, and in this class we're
going to focus on making decisions based on p-values. A p-value is a measure of the
plausibility of the null hypothesis. The p-value can also be interpreted as the smallest
significance level at which we reject the null hypothesis, where the significance level is the
probability of a type I error.
Another classic statistical inference is through confidence intervals, which involves finding a
range of values for the parameter theta given the confidence level one minus alpha. We get this
range by finding the range such that the probability of theta being in the range is equal to the confidence level.
While this is how we get the confidence interval, its interpretation is more subtle.
For example, take a 1-alpha confidence interval, where 1-alpha could refer to a 95%
confidence level. If we obtain 100 different sets of realizations from the data and we estimate
the confidence intervals for all one hundred sets of realizations, we'll obtain one
hundred confidence intervals. If the confidence intervals are 95% confidence intervals, we
interpret that 95 out of the 100 confidence intervals constructed, with the 100 different sets of
realizations, include the true parameter. That is, five of the confidence intervals will not
include the true parameter, at least on average.
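A small R simulation (not from the lecture) that illustrates this interpretation for a 95% t-interval for a normal mean:

## Coverage of a 95% confidence interval for the mean over 100 repeated samples
set.seed(1)
mu = 5
covers = replicate(100, {
  x = rnorm(30, mean = mu, sd = 2)                # one set of realizations
  ci = t.test(x, conf.level = 0.95)$conf.int      # one confidence interval
  ci[1] <= mu & mu <= ci[2]                       # does it contain the true mean?
})
sum(covers)    # typically close to 95 of the 100 intervals cover mu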
1.1.2 Basic Definition and Examples (of Time Series)
This lecture is on the basics of time series analysis, and for this lesson, I'll focus on basic
definitions of time series, and I'll illustrate these concepts with a few examples.
What is a time series? It is a sequence or collection of random variables with some similarity
in terms of the probability distribution, called a stochastic process. For a time series, the
sequence of random variables is indexed by time.
We commonly refer to a time series as both the stochastic process from which we observe, and
the realizations or observations from the stochastic process. It's important to highlight that the
time series data are observations in practice, and thus come with uncertainty.
There are many examples in which data are correlated in time. These examples on the slide are
just a very small subset of examples. Time series data arise in any field, from economic data to
finance to healthcare to climate, among many others. The level of time granularity from yearly to
minutes depends on the objective of the problem at hand. The time granularity also depends on
resources needed to observe the time process of interest, and on the smoothness of the time
process as it changes over time.
Our first example on this slide is the US yearly GDP. The second example is the monthly sales
of wine. A third example is the monthly accidental death rate in the US. A fourth example is the
monthly interest rate. Often, yearly and monthly data are observed over several years, tens of
years. A fifth example is the daily average temperature. A sixth example is a daily stock price of
a company such as IBM.
Ideally, we'll observe daily data, as it allows analysis at multiple other resolutions,
since daily data can be aggregated into weekly, monthly, or yearly data, depending on the
objective of the analysis. But sometimes we have data at an even more granular time scale, for example,
by the minute or even the second, such as 1-minute intraday S&P 500 returns.
More generally, the level of time granularity in the time series model to be considered in drawing
inferences on a time series process, depends on the characteristics of the time series.
● One very important characteristic is the trend. The trend can be a long-term increase or
decrease in the data over time, or it can fluctuate smoothly or not smoothly.
● Seasonality is influenced by seasonal factors, quarter of the year, month, or day of the
week. When seasonality repeats exactly at the same time intervals and with exactly the
same regular pattern, we have periodicity. But there could be other cyclical trends that
do not repeat over similar or the same periods of time.
● Cyclical trends are when data exhibits rises and falls that are not of a fixed period.
● For some time series, we can observe heteroskedasticity, which means the variability
in the data changes with time. For example, financial indicators of companies that have
been around for decades will be less variable in early years but more variable in the
recent years.
● Last (dependence), the correlation with time can be positive, in other words,
successive observations are similar. Or negative, in other words, successive
observations are dissimilar.
Let's look at a few examples of time series. Here is the plot of the gross domestic product for
the US. The trend is clearly increasing monotonically. There is no seasonality, periodicity, or
cyclical trends. The variability of the observations does not change over time.
Here is a plot of the close price of the stock for IBM, where IBM is a company standing for
International Business Machines. The trend is overall increasing monotonically, with fluctuations
in the later decades. There's no clear seasonality that could be evaluated visually. This is not
uncommon when looking at daily data. However, what is clear is the heteroskedasticity of the
time series. The variability in the stock price clearly shows a big change, from being significantly
smaller in the early years versus much larger in the later years.
Data at a much finer granularity than day, week, or month is common in some fields, including
high frequency financial data, or patient monitoring, such as cardiac monitoring using EKG
technology. Here is an example: the S&P 500 intraday stock return. We can see that the
intraday data is very different from one day to another, with much more variability on some days
than others.
Time Series: Objectives
In time series analysis provided in this course, we'll focus on several objectives.
Description: First, we'll perform a descriptive analysis, for example, plot the data, and obtain
simple descriptive measures of the main properties of the time series.
Explanation: Second, we'll explain the time series, particularly the dependence in a time series
through finding a model that will describe appropriately the dependence in the data. This
objective relies heavily on the first step, the exploratory data analysis, since this first step can
provide insight on the type of dependence in the data.
Forecasting: Modeling the time series by capturing the properties of the time series is
important for the third objective, forecasting or prediction of future realizations of the time series
data.
Control/Tuning: Last, it's often the case that the modeling and forecasting will provide
additional insights about the behavior of the time series, suggesting some tuning of the model.
I'll illustrate all these objectives through several data examples in this course.
In the time domain approach, we assume that the correlation between adjacent points in time
can be explained through dependence of the current value on past values.
In the other approach, the frequency domain approach, the characteristics of interest relate to
periodic or systematic variations in the data, which are often caused by biological, physical,
or environmental phenomena.
All the models introduced in this course are in the time domain. But it's important to remember
that the area of time series modeling is much broader than that provided in this course.
1.1.3 Decomposition: Trend Estimation
● Presentation of the basic decomposition of a time series.
● Two approaches for estimating one component, the trend.
For the simplest time series analysis, the response data are Yt, the time-varying observations
where t is the time index, for example, day, month, week. In order to account for trend and
seasonality or periodicity, we decompose a time series into three components, mt, st and Xt,
where mt is the trend, st is the seasonality, and Xt is the time process after accounting for trend
and seasonality.
Xt is often assumed to be a stationary component, in other words, its probability distribution does
not change when shifted in time. Most classic time series models assume that the time process
is stationary. And thus we need to first estimate the trend and the seasonality components to
subtract them from Yt, then further model Xt. Note that Xt is not an error term.
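In symbols, the decomposition just described is

Y_t = m_t + s_t + X_t,

where m_t is the trend, s_t the seasonal component, and X_t the remaining (ideally stationary) process.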
In this lesson, I will describe common approaches for dealing with or estimating the trend. Since
most time series have a trend, in the basic time series model, we first will remove the trend, then
model the residual stationary process if there is no seasonality.
I generally prefer the first approach since estimating a trend allows us to gain more insights
about the behavior of the time series. However, if we do not have a good model for estimating
the trend, then the residual process after removing the trend may not be fully trend-free and
thus not stationary.
The three approaches commonly employed to estimate the trend in a time-varying process or
time series are:
● The moving average approach
● parametric regression
● non-parametric regression.
The selection of the width in the moving average is important since it's the calibration parameter
controlling the bias-variance trade-off. If it is too large, then the estimated trend is very smooth,
that is, it has a low variability but high bias. If it's too small, the estimated trend is not smooth,
that is, it has high variability but low bias. For moving average trend estimation, this is the most
challenging part.
On another note, for those who have a broader understanding of statistical modeling: the
moving average is a particular case of kernel regression when we use a kernel with equal
weights.
We can also use standard linear parametric regression to estimate the trend. We can
define the trend as a polynomial of order p, where the betas will be the regression coefficients.
The most common polynomial trends are with p = 1 or p = 2. For higher-order polynomials, it is
better to choose non-parametric regression, particularly if p is not known in advance.
Generally, in this approach we estimate the betas using the linear regression approach you
have learned in the regression analysis course. For this, the predicting variables are t, t
squared, t cubed, and so on up to t to the power of p.
If p is large, one could also employ model selection to select which of the terms in the
polynomial should be included for a good fit. However, this needs to be considered with caution
since the terms in t or the predicting variables in the regression model are highly correlated.
Alternatively, if we do not know p, or we do not know which terms to keep in the polynomial, we can use
non-parametric regression, which does not make any parametric assumptions. The simplest
approach is kernel regression, which is a linear smoother of the form shown on the slide.
The function li(t) depends on the kernel function in kernel regression and can be viewed as
a set of weights in a local moving average. Recall that I mentioned earlier that the moving average is a
particular case of kernel regression.
Another approach is local polynomial regression which involves fitting a polynomial within a
moving window at each time point. Kernel regression is a particular case of this approach.
Last, there are many other approaches, for example, non-parametric regression using splines,
wavelets, or other orthogonal basis functions.
But which one to choose among so many approaches? Local polynomial smoothing usually
performs better than kernel regression on the boundaries and is not biased by the selected
design of the time points. Other methods can be selected depending on the smoothness level of
the function to be estimated. For example, wavelets are adaptive to less smoothness. But
generally, for estimating trends in time series, local polynomial smoothing or splines regression
will perform well.
1.1.4 Trend Estimation: Data Example
● Estimation of trend using a series of data examples.
I'll illustrate trend estimation with a time series of the average monthly temperature in Atlanta,
Georgia, where Georgia Tech is located. The data are available from iWeatherNet.com. The
Weather Bureau, now the National Weather Service, has kept weather records for Atlanta
since October 1878, for more than 138 years. The data for this example run from January 1879
until December 2016, with 138 complete years of data. The temperature values are provided in
degrees Fahrenheit.
The question we'd like to address is: did the temperature increase over the past 138 years? If
yes, by how much? To begin the analysis, we'll first need to read the data file into R. We'll use the
R command read.table, where the input is the data file called AvTempAtlanta.txt. I also specified
that the data file has a row with the names of the columns, and thus header is equal to T, or TRUE.
Now, the data object in R is a data frame. In order to learn about the columns in the data, we
can use the command names(data), where the output is here. According to this, the first column
provides the year, the next is the average temperature for January in each of the years, the
following column is for February, and so on. We also have a column with the annual average
temperature at the end.
To get all temperature values in one vector, I use the command as.vector, which takes a matrix
and stacks the columns of the matrix into one vector. However, we want to stack the rows,
not the columns, in order to keep the temporal order of the data. For that, I took the transpose of
the data matrix using the t() command in R. In the transpose, the rows of the initial matrix
are now columns, and thus we can use the as.vector command.
Note that I discarded the first and the last columns in the data matrix since those correspond to
the year and annual columns. Last, I converted the vector of temperature values into a time series
using the ts R command, with the specification that the first year is 1879 and that these are
monthly observations, thus frequency equal to 12.
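A sketch of the commands just described; the column selection assumes the layout described above (a Year column, twelve monthly columns, and an Annual column):

## Read the Atlanta average monthly temperature data
data = read.table("AvTempAtlanta.txt", header = T)
names(data)                              # Year, Jan, ..., Dec, Annual

## Drop the Year and Annual columns, transpose so rows stack in time order
temp = as.vector(t(data[, -c(1, ncol(data))]))

## Convert to a monthly time series starting in January 1879
temp = ts(temp, start = 1879, frequency = 12)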
The plot of this time series can be displayed using the ts.plot R command. The plot is here.
As expected, there is seasonality. We will explore this in the next two lessons. The range of
temperature values goes from as low as 30 degrees Fahrenheit to as high as 85 degrees. Note
that these are average monthly temperatures and do not capture some extreme temperature
values.
Looking at this plot alone, it is difficult to identify a trend over time. We'll employ the trend
estimation approaches discussed in the previous lesson.
We will begin with estimation of the trend using the moving average. I first define the vector of time
points. Because the time points are equally spaced, since this is monthly data, I can simply
consider the time domain between 0 and 1, with the time points equally spaced in this domain,
as the R code on the slide provides.
Next, I use the ksmooth R command, which is used for kernel regression. Note that I mentioned
in the previous lesson that the moving average is a particular case of kernel regression where we
use a constant kernel within a moving window, which in this case is called the box kernel. Thus,
specifying kernel equal to box in this R command will give us a moving average estimator of the
trend.
With the ts() command, I transformed the fitted trend into a time series, then plotted it overlaid on
top of the observed time series using the lines command. I also added a constant line, using the
abline command, to compare the fitted trend against. The plot is here; the
purple line is the fitted moving average trend and the blue line is the constant line.
From this plot we see a slight increase of the trend compared to the constant line, although
it is hard to notice the difference since the variability of the data is large compared to that of
the estimated trend.
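A sketch of the moving average fit described above, assuming the time series temp from the earlier code; the bandwidth and colors are illustrative choices:

## Equally spaced time points on [0, 1]
time.pts = seq(0, 1, length.out = length(temp))

## Moving average trend: kernel regression with a box kernel
mav.fit = ksmooth(time.pts, temp, kernel = "box", bandwidth = 0.1)
temp.mav = ts(mav.fit$y, start = 1879, frequency = 12)

## Overlay the fitted trend (purple) and a constant mean line (blue)
ts.plot(temp, ylab = "Temperature")
lines(temp.mav, lwd = 2, col = "purple")
abline(h = mean(temp), lwd = 2, col = "blue")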
For fitting a parametric regression model, we can use the lm command in R, used for fitting a
linear regression model. The predicting variables are the linear and quadratic time points. In the
lm command, we specify the time series on the left, which is the response data in fitting the
linear regression model, and on the right we have the two predicting variables.
With this we fit a quadratic trend.
The summary R command provides information about the model fit. Here is only a portion of the
output of the model fit. This shows the estimated coefficients along with inferences on the
statistical significance of the coefficients. From this output, each of the coefficients corresponding
to the two predicting variables is not statistically significant because the p-values are large.
Similar to the moving average trend, I use the ts command to transform the fitted values based
on this model into a time series. I used a similar approach to plot the trend against the observed
data. The plot is here. The green line is the fitted trend and the blue line is the constant line.
From this plot, we also see a slight increase of the trend as compared to the constant line.
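A sketch of the quadratic trend fit, assuming temp and time.pts from the earlier code:

## Parametric quadratic trend: regress the series on time and time squared
x1 = time.pts
x2 = time.pts^2
lm.fit = lm(temp ~ x1 + x2)
summary(lm.fit)

## Overlay the fitted trend (green) and the constant mean line (blue)
temp.lm = ts(fitted(lm.fit), start = 1879, frequency = 12)
ts.plot(temp, ylab = "Temperature")
lines(temp.lm, lwd = 2, col = "green")
abline(h = mean(temp), lwd = 2, col = "blue")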
On this slide, I'm providing the R commands that can be used to fit a trend using
local polynomial regression and splines regression. The common function used to fit a local
polynomial regression is loess. The input consists of the time series on the left and the time
points on the right. Once that's fitted, we can transform the fitted values into a time series using the ts
command.
For splines regression, there are many R libraries and commands that can be used. I generally
prefer using the gam command from the mgcv library, because this R command is
flexible in terms of the type of model to be fitted. Using this command, we'll need to transform
the time points using s() of the time points, to specify that I want a splines fit. If we did not specify
this, gam would fit a linear model in the time points. Similarly, I transformed the fitted values of
this model using the ts command. Next, I overlaid the fitted trends from the two models on the
observed time series. The plot is here. In this plot it is at last clear whether there is an
increasing trend over time.
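A sketch of the two non-parametric fits, assuming temp and time.pts from the earlier code:

## Local polynomial (loess) trend
loc.fit = loess(temp ~ time.pts)
temp.loc = ts(fitted(loc.fit), start = 1879, frequency = 12)

## Splines trend via a generalized additive model
library(mgcv)
gam.fit = gam(temp ~ s(time.pts))
temp.gam = ts(fitted(gam.fit), start = 1879, frequency = 12)

## Overlay both fitted trends on the observed series
ts.plot(temp, ylab = "Temperature")
lines(temp.loc, lwd = 2, col = "brown")
lines(temp.gam, lwd = 2, col = "red")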
Last, I'm comparing the estimated trends based on the four approaches: moving average,
parametric quadratic regression, local polynomial regression, and splines regression.
This time, I only plotted the trends to see clearly the difference in the fitted trends. The first two
command lines provide the scale of the plots such that all the trends will be plotted on the
same scale. Then I plotted the trends along with the legend command, which allows
identification of the lines with the corresponding trend estimation method. The
plot is here.
From this plot we see that the moving average trend is quite weak, slightly
capturing some of the seasonality, thus not a good estimate. The estimated trend using the
parametric model is comparable to that fitted using the splines regression, except that the
fitted model is slightly quadratic, but generally both capture a similar increasing trend over
time.
The fit using local polynomial regression shows more complexity in the trend, with some smooth
ups and downs, although generally it also shows an increase in the temperature over time.
Generally, from all four approaches we see an increase in temperature of about 2 degrees over
the last 138 years.
1.2 Basic Concepts: Trend, Seasonality and Stationarity
This slide provides a snapshot of the basic modelling of time series. For the simplest time series
analysis, the response data are Yt, the time-varying observations, where t is the time index,
for example, day or month. In order to account for trend and seasonality or periodicity, we
decompose a time series into three components mt, st, and Xt, where mt is the trend, st is the
seasonality, and Xt is the time process after accounting for trend and seasonality.
Xt is often assumed to be a stationary component. Most classic time series models
assume that the time series process is stationary, and thus we need to first estimate the
trend and the seasonality components, subtract them from the time series Yt, then
model Xt. In this lesson, I will describe common approaches for dealing with the seasonal
component.
Not all time series have seasonality, but for those which do, we'll need to correctly account for
it. In the basic time series model discussed in the previous slides, seasonality is an
additive component and can be estimated from the detrended data if there is
also a trend component.
We can remove the seasonality using two approaches: 1) estimate the
seasonality and remove it, or 2) difference the data, which directly removes the
seasonality.
To estimate the seasonality, we can use seasonal averaging or use parametric regression in
two ways. First, we can fit a regression model where each of the seasonal groups, for example,
month of the year, is specified as a dummy variable in the model. If the periodicity of the
seasonality has many seasonal groups, we fit many regression coefficients using this approach.
Second, we can use a cosine-sine model, where the predicting
variables are the cosine and sine functions for different frequencies.
The first method is very simple. It takes all time series values corresponding to each seasonal
group k from 1 to d and averages them to get wk. Note that here, d is the periodicity, or the
number of seasonal groups.
For example, if the seasonality is monthly, then d is 12 and one seasonal group corresponds to
one month. Because we have that the sum of the seasonal effects st is equal to 0, the
estimated seasonal effect is wk minus the average of the wk's across all seasonal groups.
One way to perform this simple averaging is through a regression model where each seasonal
group's mean is estimated via an ANOVA type model as presented in the next slide.
A linear regression model approach to estimate the seasonal effects st is to consider an
ANOVA type of model where we group the data by seasonal group. If the periodicity is d, then
we have d seasonal groups or effects. Thus, we treat seasonality as a categorical variable.
To fit such a model, we need to set up indicator variables, or dummy variables, that indicate the
group to which each of the data points belongs. When fitting categorical variables with d
groups, we only include d-1 dummy variables if the model has an intercept.
Thus, this approach reduces to estimating a linear regression model. This can be convenient
since, if we have a trend, we can estimate the trend and the seasonality in one regression model.
But this regression model can have many predicting variables, particularly if we have multiple
seasonalities, for example, month and week of the year, and/or if the periodicity d is large. For
week of the year, we will need 52 different dummy variables, for example.
An alternative approach to using linear regression, without having many regression coefficients
to estimate, is to use a cosine-sine model. In this approach, we assume a cosine curve for the
seasonality specified by three parameters: the amplitude, which is how high or low the curve
will go; the frequency f, which is the inverse of the periodicity of the seasonality; and
the phase, which sets the arbitrary origin of the time axis. The only known parameter in this
formulation is f, since it is provided by the periodicity of the seasonality.
The cosine curve is thus inconvenient for estimation because the amplitude and phase
parameters do not enter the expression linearly. Fortunately, a trigonometric identity is available
that allows reparameterization of the cosine curve as provided on the slide.
Now we have a linear parameterization where beta1 and beta2 are the regression coefficients,
with the predicting variables given by the cosine and sine of 2 pi f t. Conveniently, if the seasonality
has different frequencies, for example, month and week of the year, we can use the
same formulation for each of the frequencies, introducing only the regression coefficients for
each frequency.
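The reparameterization referred to above is the standard trigonometric identity (a reconstruction, not copied from the slide):

A \cos(2\pi f t + \phi) = \beta_1 \cos(2\pi f t) + \beta_2 \sin(2\pi f t), \quad \beta_1 = A\cos\phi, \; \beta_2 = -A\sin\phi,

so the seasonality becomes linear in the unknown coefficients \beta_1 and \beta_2.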
Since we have both trend and seasonality, we can use the approaches discussed so
far for estimation of the trend and seasonality to remove them, in order to get to a stationary
process.
To do so, at the first step we can obtain an initial estimate of the trend and remove it from the time
series process; at the second step, estimate the seasonality; lastly, remove the seasonality
from the time series process and obtain an updated estimate of the trend. The method used in
step two can be any of the methods used to estimate seasonality, not
only the averaging approach provided on the slide.
Since we use regression to estimate both seasonality and the trend, we can just do so
simultaneously. For seasonality, we only have linear terms in the regression model, regardless of
whether we use the seasonal means model or the cosine-sine model.
However, the trend can be linear in the predicting variables, if assuming a polynomial with a known
order, or it can be non-parametric. To combine both into one joint model, we simply use linear
regression if both the seasonality and the trend are specified by linear terms. Or we can use a
semi-parametric model in which the trend is non-parametric, but the seasonality is added to the
model linearly.
Last, we can also use differencing to remove the trend and seasonality jointly. If we have
seasonality with period d, the differencing operator applied to the time series Yt means that we
take the difference of Yt and Yt-d. If we write out the decomposition for this difference, we also have
mt minus mt-d, which could also remove the trend if the trend is slowly changing. Generally, this
method is recommended when the time series is observed over a long period of time, to allow for
differencing over long periodicities/seasonalities.
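A one-line sketch of lag-d differencing in R, using the monthly temperature series from the earlier example (d = 12):

## Lag-12 (seasonal) differencing removes monthly seasonality and a slowly changing trend
temp.diff = diff(temp, lag = 12)
ts.plot(temp.diff, ylab = "Lag-12 differenced temperature")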
1.2.2(6) Decomposition Examples: Seasonality Estimation
The lecture is on Basics of Time Series Analysis, and for this lesson, I will focus on a data
example that will illustrate estimation of seasonality.
In this example, we'll return to the data example where we studied the average monthly
temperature for Atlanta, Georgia. In this example, we have data for 138 years and the question
we would like to address is, is there seasonality and trend in temperature in Atlanta?
A first model we discussed in the previous lesson is the seasonal means model. In this model,
we fit a linear regression model with dummy variables, with each one specifying the seasonal
effect of one month. This is equivalent to an ANOVA model, where the data are categorized
into seasonal groups.
For this approach, as well as for the second estimation approach presented in the previous
lesson, we'll need the R commands in the library TSA. To use the R functions in this library,
we'll first need to load it using the library(TSA) command. I'll point out that if this is the
first time you use this library, you'll have to install it. We'll use the season function in
this library, which will create the 12 dummy variables corresponding to the 12 months.
We can fit a model with an intercept as in model1, where we specify the time series on the left and
the seasonal dummy variables on the right, where those dummy variables represent the
month; this model omits the January coefficient. In this case, the February coefficient is
interpreted as the difference between February and January average temperatures. The March
coefficient is interpreted as the difference between March and January
average temperatures, and so on.
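A sketch of the two seasonal means fits described here; model2 (all twelve dummy variables and no intercept) is the version whose output is discussed below:

## Seasonal means models using the TSA season() function
library(TSA)
month = season(temp)            # factor identifying the month of each observation

model1 = lm(temp ~ month)       # with intercept: January is the baseline level
summary(model1)

model2 = lm(temp ~ month - 1)   # without intercept: one mean per month
summary(model2)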
Seasonality: Season Models (cont.)
## Estimate seasonality using the cos-sin (harmonic) model
har=harmonic(temp,1)       # cos/sin pair at frequency 1/12 (one cycle per year)
model3=lm(temp~har)
summary(model3)
har2=harmonic(temp,2)      # cos/sin pairs at frequencies 1/12 and 2/12
model4=lm(temp~har2)
summary(model4)
Next is the implementation of the cos-sin model, to create the two predicting variables, the cosine
and sine of 2 pi f t for the frequency 1 over 12, or one cycle per year. We can use the harmonic function in R,
where the input is the time series and 1 stands for the frequency. Additional cosine functions at
other frequencies can also be used to model cyclical trends.
For monthly series there are harmonic frequencies such as 2 over 12 or 3 over 12 that are
especially pertinent and will sometimes improve the fit, at the expense of
additional parameters. For example, we can use the model with 2 over 12 as provided in
model4.
One way to evaluate whether we need additional cosine curves is by evaluating the
significance of the model parameters or using a model selection criterion.
This is a portion of the output of model2, the model with all dummy variables and without the
intercept. The estimated regression coefficients provide the means for each individual seasonal
group. For example, the average of all temperature values in January is 43.02, and the
average of all temperature values in July is 79.
In order to get the seasonal effects, we need to subtract from each of the means the average
across all of the estimated means. According to this model, July and August are the hottest
months of the year whereas January is the coldest.
In the output, we also have the R-squared, or so-called coefficient of determination, which is
an indication of the variability explained by the predicting variables in the linear model.
For this model, the R-squared is very large, 99.7%, indicating that
seasonality explains most of the variability in the monthly average temperature over the years.
This is the output for the two models using the cos-sin decomposition of seasonality. The
difference between the two models is that in the second, we're adding one more set of cos-sin
harmonics. We can see from the output that the estimated coefficients for the first set of harmonics
are similar for the two models. Moreover, the second set of harmonics, with
frequency 2 over 12, has statistically significant coefficients, provided that the
harmonics with frequency 1 over 12 are in the model.
The R-squared is slightly lower for these two models as compared with the R-squared for the
seasonal means model. Moreover, the adjusted R-squared increases with the addition of the
second set of harmonics, an indication that perhaps including higher frequency
harmonics improves the fit of the seasonality.
Here we compare the fitted seasonality for the two models to see whether there is a significant
difference in the fit. For the first model, this is provided by the regression coefficients,
expressed in the R code by the function coef() of the fitted model, whereas for the second
model, this is provided by the fitted values of the regression.
We can compare the fit of the seasonality by overlaying all the fitted seasonalities in one
plot. The plot is here as provided by the R code; the seasonality fitted using the
seasonal means model is in black and the sine-cosine model is in brown. As seen in this
plot, both are very similar in fitting the seasonality. Because the fit is almost the same, we would
prefer the model with fewer regression coefficients, particularly the sine-cosine model.
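A sketch of the comparison described above, assuming model2 and model4 from the earlier code; the first twelve fitted values of model4 give one seasonal cycle:

## Compare the fitted seasonality from the two models over one year
st1 = coef(model2)                   # seasonal means: one coefficient per month
st2 = fitted(model4)[1:12]           # cos-sin model: fitted values for one cycle
plot(1:12, st1, type = "l", xlab = "Month", ylab = "Seasonality")
lines(1:12, st2, lty = 2, col = "brown")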
In a different lesson, we found that there is a slightly increasing trend in temperature over time.
Thus, we would like to estimate both seasonality and trend jointly.
We'll begin with the simplest parametric model, where we consider a quadratic trend
and seasonality modeled using the cosine-sine model. For this, we specify in the lm R
command the response variable, the time series of the average monthly temperature values,
and the predicting variables for both the trend and seasonality together.
In order to evaluate the statistical significance of the trend and seasonality, you can further analyze
this model. For example, one could test whether the trend adds any explanatory power for the
model, given that seasonality is already in the model. Such evaluation methods
for linear regression have been introduced in the regression analysis course.
Here, we'll only plot the residuals obtained after removing the trend and seasonality,
specifically, subtracting from the time series the fitted values of the linear model.
In a different lecture, we'll study ways to evaluate whether such a process is stationary, in other
words, whether the trend and seasonality components have been removed.
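A sketch of the joint parametric fit and its residual process, assuming x1, x2 (time and time squared) and har2 (the harmonics) from the earlier code:

## Quadratic trend plus cos-sin seasonality in one linear model
lm.joint = lm(temp ~ x1 + x2 + har2)
summary(lm.joint)
resid.lm = ts(temp - fitted(lm.joint), start = 1879, frequency = 12)
ts.plot(resid.lm, ylab = "Residual process")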
We can also fit a non-parametric trend along with the seasonality given by the cosine-sine
model. The gam function in R allows fitting non-parametric and linear components jointly. To do
so, we specify that the trend in time is non-parametric while not imposing any transformation on
the seasonality, thus adding the harmonics to the non-parametric trend.
Next, we compare the residual processes for both methods. Here is the plot comparing the
residual processes, where brown is the residual process for the parametric model and blue is the residual
process from the non-parametric model. The differences between the two approaches are small.
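A sketch of the semi-parametric fit and the residual comparison, assuming time.pts, har2 and resid.lm from the earlier code:

## Non-parametric (splines) trend plus cos-sin seasonality
library(mgcv)
gam.joint = gam(temp ~ s(time.pts) + har2)
resid.gam = ts(temp - fitted(gam.joint), start = 1879, frequency = 12)

## Compare the two residual processes
ts.plot(resid.lm, col = "brown", ylab = "Residual process")
lines(resid.gam, col = "blue")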
1.2.3(7) The Concept of Stationarity
The lecture is on Basics of Time Series Analysis and for this lesson, I will focus on the Concept
of Stationarity.
Let's first define the autocovariance function. The autocovariance function is the covariance of
any two variables of the stochastic process. For example, if we consider the time points r and
s, the autocovariance of Xr and Xs is the expectation of the product of the mean-centered
variables Xr and Xs, as provided on the slide.
Further, we define the time series Xt to be weakly stationary if: it has constant mean for all
time points, t; it has a finite variance, or more specifically, has a finite second moment; and the
covariance function does not change when shifted in time. That is, the dependence between Xr
and Xs is the same as for the shifted Xr + t and Xs + t.
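In symbols (a reconstruction of the definitions on the slide), the autocovariance function is

\gamma_X(r, s) = E\left[(X_r - E X_r)(X_s - E X_s)\right],

and X_t is weakly stationary if E[X_t] = \mu for all t, E[X_t^2] < \infty, and \gamma_X(r, s) = \gamma_X(r + t, s + t) for all r, s, t.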
Two classic examples of stationary processes are the white noise process and the IID process.
A white noise process Xt has constant variance sigma squared, uncorrelated observations,
and constant mean equal to 0. The IID process has constant variance, independence, and constant
mean equal to 0. Both are stationary, since conditions 1, 2, and 3 hold, assuming that sigma squared is
finite.
As the two processes are defined on the slide, it may seem like they are actually the same.
What is the difference?
The white noise process assumes uncorrelated data. That means Xt are uncorrelated, whereas
the IID process assumes independence. Independence implies uncorrelated variables, but not
the opposite. Thus, it is possible to have uncorrelated variables, but not independent.
How do we sample such processes in R? We can use the random generator for the normal
distribution, rnorm, for example, if we're interested in random normal data. For other
distributions, we can use rpois for the Poisson distribution, rbinom for the binomial
distribution, and so on.
This [the chart shown on the slide] is the time series plot obtained for white noise with mean 0
and variance 1, obtained using the R function rnorm.
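A minimal R sketch of this simulation (the sample size is an illustrative choice):

## Simulate and plot Gaussian white noise with mean 0 and variance 1
set.seed(1)
z = rnorm(250, mean = 0, sd = 1)
ts.plot(ts(z), ylab = "White noise")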
A classic example of a non-stationary process is the so-called random walk. For a random
walk, we start with IID normally distributed random variables, Xt, with mean 0 and variance
sigma squared.
Our random walk process St is the sum of the Xt's up to time t. Here is the plot of one random walk.
As you can note from this plot, there is no stationarity in this process, since it goes up and down
differently across the time span and thus has a different distribution as the process shifts
over time.
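A minimal R sketch of a random walk (the sample size is an illustrative choice):

## Random walk: cumulative sum of IID normal steps, S_t = X_1 + ... + X_t
set.seed(1)
x = rnorm(250, mean = 0, sd = 1)
s = cumsum(x)
ts.plot(ts(s), ylab = "Random walk")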
Another example is the moving average of order 1, which I'll introduce in more detail in a
different lecture. [Correction: the MA(1) process is stationary for all finite values of theta.]
One example is in this plot. Compared to the previous examples, stationarity is not
so clearly identified visually in this example.
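A minimal R sketch of an MA(1) simulation; the value of theta is an illustrative choice:

## Simulate an MA(1) process X_t = Z_t + theta * Z_{t-1}, here with theta = 0.6
set.seed(1)
x.ma = arima.sim(model = list(ma = 0.6), n = 250)
ts.plot(x.ma, ylab = "MA(1) process")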
For a stationary process, the autocovariance function does not change as time shifts. Thus, we
can express the autocovariance function in terms of the lag rather than time. Specifically, for a
stationary time series, the autocovariance function, as a function of the lag h, is the covariance
of Xt and Xt+h.
The autocovariance function has the following properties. The autocovariance function at lag 0,
which is the variance of Xt, is positive; this is because the variance is positive. The
absolute value of the autocovariance function at lag h is bounded above by the variance, or the
autocovariance function at 0. Last, the autocovariance function at lag h is equal to the
autocovariance function at lag -h, hence its symmetry. This last property comes from the fact
that the distribution does not change if shifted by lag h before or after a given time
point.
We can further define the autocorrelation function of a stationary process Xt as the scaled
autocovariance function: we divide the autocovariance function at lag h by the
autocovariance function at lag 0. Because of this rescaling, the autocorrelation function
has the additional property that, evaluated at lag 0, it is equal to 1.
The autocovariance and autocorrelation functions are functions of the stochastic process Xt,
and thus are unknown. We can obtain estimates of these functions given realizations or
observations of the time series. Specifically, given the observations of the time series X1 to Xn,
we estimate them using the sample autocovariance and sample autocorrelation functions. The
sample autocovariance function at lag h is the sum of the products between any two mean-centered
observations of the time series observed within a lag h.
The formula is provided on this slide. The sample autocorrelation is simply the sample
autocovariance divided by the sample autocovariance at 0.
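In standard notation (a reconstruction of the slide's formulas), the sample autocovariance and sample autocorrelation functions are

\hat{\gamma}(h) = \frac{1}{n} \sum_{t=1}^{n-|h|} (X_{t+|h|} - \bar{X})(X_t - \bar{X}), \qquad \hat{\rho}(h) = \frac{\hat{\gamma}(h)}{\hat{\gamma}(0)}.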
1.2.4(8) The Concept of Stationarity Data Example
The lecture is on Basics of Time Series Analysis, and in this lecture, I'll focus on stationarity with a
data example. We'll return once more to the data example where we're interested in modeling
the monthly average temperature in Atlanta.
We have seen so far how to estimate the trend and the seasonality. We'll take a closer look at the
stationarity of the residual process after removing those two components. The general approach
for a time series analysis is as follows. We first begin with an exploratory analysis for insights on
whether the time series is already stationary or whether there is a trend and/or seasonality to
account for in the modeling of the time series. We also look for other aspects, such as outliers or
unusual patterns. If trend and/or seasonality are present, then we can estimate them
and remove them from the observed time series.
Next, the residual process is checked for stationarity, and if stationary, a time series model is
further employed. So far, I illustrated steps one and two for the analysis of the
monthly average temperature.
Next, we'll study the stationarity of the residual process. I'm showing you the
residual process again. This is comparing the residuals from removing both the trend and
seasonality using the two approaches, the linear regression model and the
non-parametric regression model.
The question that we'd like to answer is: is this residual process stationary? A stationary process
is described by the behavior of its autocovariance or autocorrelation function, so let's look at the
sample autocorrelation functions for the residual processes from the two approaches.
The R command is acf, which allows estimation of both the sample autocovariance and the sample
autocorrelation functions; the default of this command is the autocorrelation. We're plotting here the ACF
for both the temperature time series and the residual processes obtained after removing the
trend and seasonality using the two approaches discussed before.
The ACF plot of the temperature process is here. The x-axis shows the lag h at which we
estimate the autocorrelation function and the y-axis is for the sample autocorrelation. The ACF
plot shows a clear seasonality pattern. The pattern repeats for each block of lags specified at 1,
2, 3 and 4 on the x-axis. The trend cannot be observed from this plot, since it is slowly changing
over the years, as we have seen in the lesson where we learned about trend estimation for this
particular time series.
The ACF plot of the residuals from the parametric approach for estimating the trend and
seasonality is here. While the seasonality observed in the ACF of the temperature
time series is not present anymore, we still observe some sort of cyclical pattern that
corresponds to the periodicity of the time series. Moreover, most of the ACF values are positive,
showing some bias in the estimation of the trend and/or cyclical pattern.
Last is the ACF plot for the residuals where the trend is estimated using non-parametric
regression. The ACF plot for this residual process does not have the cyclical pattern of the ACF plot of
the previous residual process, and it shows less bias in the removal of the trend and/or
cyclical pattern. However, none of these observations about residual plots guarantee
stationarity. We'll expand more on this aspect in a different lecture.
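A sketch of the ACF plots described in this lesson, assuming temp, resid.lm and resid.gam from the earlier code; the number of lags shown is an illustrative choice:

## Sample ACF of the temperature series and of the two residual processes
acf(temp, lag.max = 12 * 4, main = "Temperature")
acf(resid.lm, lag.max = 12 * 4, main = "Residuals, parametric fit")
acf(resid.gam, lag.max = 12 * 4, main = "Residuals, non-parametric fit")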
1.2.5(9) Linear Processes & Prediction
This lecture is on basics of time series analysis. And for this lesson, we'll learn about the
definition of linear process and prediction.
A time series is a linear process if it can be represented as a linear combination of a white noise
process, where the coefficients in the linear combination have the property that the sum of
their absolute values is finite. A special case is the moving average process, in which all the
coefficients with a negative index are zero.
Most models discussed in this course are linear processes or they can be expressed as linear
processes under specific conditions. In our models, however, the processes are finite sums.
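In symbols (a reconstruction in standard notation): X_t is a linear process if

X_t = \sum_{j=-\infty}^{\infty} \psi_j Z_{t-j}, \qquad \sum_{j=-\infty}^{\infty} |\psi_j| < \infty,

where Z_t is white noise with variance \sigma^2; the moving average case corresponds to \psi_j = 0 for j < 0.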
An example of a linear process is the autoregression, in short the AR process, of order one,
expressed as Xt equal to phi times Xt-1 plus a white noise process Zt.
Comparing this formula with that from the previous slide, there's not a clear mapping
between them. But it can be shown using some algebra that the AR process of order 1 is a linear
process if the coefficient phi is smaller than 1 in absolute value, and the white noise
process Zt is uncorrelated with the process X's before the time point t. An example of an AR
process of order one with phi equal to 0.8 is provided here [shown in the graph on the slide
above].
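A minimal R sketch of this simulation:

## Simulate an AR(1) process X_t = 0.8 * X_{t-1} + Z_t
set.seed(1)
x.ar = arima.sim(model = list(ar = 0.8), n = 250)
ts.plot(x.ar, ylab = "AR(1) process")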
For linear processes as provided in this course, and more generally for stationary processes,
we'll particularly focus on one type of prediction: linear predictors, which are linear combinations
of past data.
We define the prediction at lag h, given the past n observations, as the linear combination of the
n past data points, with different weights for the more recent data versus the older data.
Here, the coefficient a1 is for the most recent data point of the time series, Xn; a2 is for the next most
recent data point, Xn-1, and so on. In order to get the coefficients a1, a2, up to an, we minimize
the mean squared error with respect to these coefficients, which will define the best linear
predictor. In prediction of stationary processes, our goal is to obtain best linear predictors.
This slide provides the notation that we'll be using in the prediction based on models
introduced in this course. Again, we begin with a stationary process with mean, mu. The best
linear predictor is expressed in terms of the linear combination of the past data centered around
the mean, mu, if mu is not zero.
This formulation is equivalent with the one provided on the previous slide, except that we now
express the constant a0 in the previous slide in terms of mu. Here, I define alpha n as the vector of coefficients in the linear combination of the past data used for the prediction. The coefficients in alpha n are directly related to the autocovariance function of the
time series through this equation.
In this equation, Gamma n (Γ𝑛) is a matrix where the i,j elements in this matrix are the
autocovariance functions at i minus j. Gamma n (Γ𝑛) times the vector of coefficients is equal to
the vector gamma n (γ𝑛) at h. The prediction error which is a measure of accuracy of the
prediction can be further obtained once we have the coefficients in the alpha n vector.
53
Let's consider a very simple one step prediction, that is predict Xn+1 from the past data of the
time series X1 to Xn. If the matrix gamma n is non-singular, then we can obtain the coefficients in
the alpha n vector by multiplying both sides with the inverse of gamma n. Once we obtain alpha
n, we can get the prediction and the prediction error.
But the matrix Gamma n can be too large to invert when n is large. There are two common recursive
algorithms that can be used instead of inverting this large matrix, and those recursive algorithms
will give us directly the predictions.
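As an editorial illustration (not part of the lecture's R code), here is a minimal R sketch of the one-step best linear predictor, assuming an AR(1) process with known autocovariance function; the coefficients alpha n are obtained by solving Gamma n times alpha n equals gamma n. The past observations used here are hypothetical values.
# Sketch: one-step best linear predictor for a stationary AR(1) with phi = 0.8,
# unit noise variance, and n = 5 past observations.
phi <- 0.8; sigma2 <- 1; n <- 5
acvf <- function(h) sigma2 * phi^abs(h) / (1 - phi^2)          # AR(1) autocovariance gamma(h)
Gamma.n <- outer(1:n, 1:n, function(i, j) acvf(i - j))          # matrix with (i, j) element gamma(i - j)
gamma.n <- acvf(1:n)                                            # vector gamma(1), ..., gamma(n)
alpha.n <- solve(Gamma.n, gamma.n)                              # coefficients of the best linear predictor
x.past  <- c(0.3, -0.1, 0.5, 0.9, 1.2)                          # hypothetical observations X_{n-4}, ..., X_n
x.pred  <- sum(alpha.n * rev(x.past))                           # prediction of X_{n+1}
pred.mse <- acvf(0) - sum(alpha.n * gamma.n)                    # mean squared prediction error
For an AR(1) this recovers alpha = (phi, 0, ..., 0) and a prediction error equal to the noise variance, as expected.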
54
The first algorithm is the Durbin-Levinson Algorithm. In this algorithm, we're getting the
coefficients in the alpha n vector to be used to obtain the prediction and the prediction error.
This algorithm does not require the inversion of the matrix Gamma n, but it requires getting the coefficients in alpha 1, alpha 2, and so on, all the vectors up to alpha n, recursively.
55
In the second algorithm, also called the Innovations Algorithm, we do not obtain the coefficients in the alpha n vector, but we get the predictions directly. In this algorithm, we first compute the elements of the K matrix, in which the (i, j) element is the expectation of Xi times Xj. This is a one-time computation. Then we get the one-step prediction using a formula involving the weighted sum of the differences between past data, Xn+1-j, and their predictions; those differences are also called innovations, hence the name of the algorithm. The weights are the theta nj, which are functions of the elements in the K matrix.
56
1.3 Data Example: Emergency Department Volume
Reports from different countries, including the United States, the United Kingdom, and China, have shown an increase in demand for emergency department care, resulting in frequently overcrowded EDs, lengthy waiting times for assistance, and an overall perception by patients of poor healthcare.
Prolonged waiting times are described as a major factor for dissatisfaction with ED care, and
patients are more likely to leave without being seen as waiting time increases. Common practices of using data on the daily patient volume for better planning of personnel resources might increase ED efficiency as well as improve ED patients' care quality.
Time series models can be used to forecast future ED patient visit volume based on the estimated effect of predictor variables. Such forecasts can be useful for proactive bed and staff management and for facilitating patient flow.
A number of factors can influence daily ED visits, and the patient visits forecasting model should include those factors. Previous studies have shown that ED visits
present cyclical variations according to calendar variables such as day of the week, time of the
year. The objective of this analysis is to develop models for identifying temporal patterns and for
predicting the volume of patients in an Emergency Department.
The data consist of the daily number of patients seeking ED care in a hospital in Georgia in the United States. The ED volume was observed over a period of more than five years, from 2010 until about mid-2015. In the study we'll only consider temporal factors, although other factors can be used: for example, external factors such as temperature, rainfall, major holidays, and school season, among others, along with hospital data, for example the percentage of patients seeking care in the ED who have public insurance versus commercial insurance, among others.
57
We'll begin with one important aspect in the time series analysis, processing time or dates of the
observations. The first step is to read the data in R. The command used here is read.csv, used
for reading CSV data files. The input consists of the data file and a specification of whether the columns in the data file have names. Since this is the case for the data file we will read in R, we specify header=T, for true. The data columns include year, ranging from 2010 to 2015, month, taking values 1 to 12, and day of the month, along with the column volume, providing the number of patients seeking care in the ED for the corresponding day.
The first three columns provide the complete information on the date the observations were made. But because the information is provided across the three columns, it needs to be converted into one column of dates to be used in visualizing the data. To do so, we pull the three columns into one matrix.
I define here a function paste.dates which takes as input a vector of three variables: day, month and year. The function returns a date using the information from these three date factors. I apply this function to each row in the date matrix to output a date for each row in this matrix.
Last, I convert the dates output by this R command into dates that R can identify as such, using the as.Date command. This vector of dates will be the input into the visual analytics displays to display time.
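A minimal R sketch of these date-processing steps is below; the file name EDvolume.csv and the exact column names are assumptions, not necessarily those used in the course files.
# Sketch of reading the data and building a vector of dates (file/column names assumed).
ed <- read.csv("EDvolume.csv", header = TRUE)        # columns: year, month, day, volume
datemat <- cbind(ed$day, ed$month, ed$year)          # pull the three date columns into one matrix
paste.dates <- function(d) {
  # d = c(day, month, year); return one date string "day/month/year"
  paste(d[1], d[2], d[3], sep = "/")
}
dates <- apply(datemat, 1, paste.dates)              # one date string per row
dates <- as.Date(dates, format = "%d/%m/%Y")         # convert to R Date objects
plot(dates, ed$volume, type = "l", ylab = "Daily ED volume")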
58
Our first plot of the time series is the plot of time expressed as dates, provided in the previous
slide, and the values of the daily ED volume. A few observations to point out are as follows. There is a generally increasing trend in the ED volume. There is some cyclical pattern, although not necessarily a seasonality, since it does not repeat at exact periods of time. And there are some clear outliers, for example at the beginning of 2011 and in 2014.
59
I will note that count data within a time unit, such as the number of patients seeking care per day, have a Poisson distribution. But the standard linear regression model employed to estimate trend or seasonality assumes normality with mean varying over time, but constant variance.
One of the limitations of using the standard linear regression model under normality to model
count data, is that the variance of the error terms is non constant, and thus, a departure from
the model assumptions.
In order to use this model, we thus need to transform the data using a variance stabilizing transformation. That is, we transform the count data, or Poisson distributed data, such that the data have approximately constant variance. The transformation also has the role of normalizing the count data. The classic transformation used for count data is the square root of the counts + 3/8. Thus, we apply this transformation to the count data and perform the time series analysis on this transformed data.
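A minimal R sketch of the transformation and histogram comparison, continuing from the sketch above (the object ed is assumed from that sketch):
# Variance-stabilizing (square-root) transformation for the daily counts.
volume.tr <- sqrt(ed$volume + 3/8)
par(mfrow = c(1, 2))
hist(ed$volume,  main = "ED volume",             xlab = "counts")
hist(volume.tr,  main = "Transformed ED volume", xlab = "sqrt(counts + 3/8)")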
We next compare the histograms of the ED Volume and the Transformed ED Volume
to evaluate the distribution without and with transformation. These are the histograms, the first
one is for untransformed and the second one is for the transformed ED Volume data. The
transformation has also the role of normalization, since with this transformation the distribution is
approximately symmetric.
60
This slide compares the time series without transformation in the upper plot versus with transformation in the lower plot. The pattern is generally similar, with a trend, a cyclical pattern, and outliers at similar time points; the main difference is that there is a reduction in the variability.
61
1.3.2(11) Trend and Seasonality Estimation
This lecture is on basics of time series analysis. In this lesson, I will introduce trend and seasonality estimation with the data example on modelling ED volume.
Let's begin with the estimation of the trend. I applied here only non-parametric estimation approaches, the local polynomial and the splines regression methods. Similarly to the Atlanta temperature data example, we need to first define the time points, which are simply equally spaced values between zero and one. Then I apply the R commands loess() for local polynomial and gam() for splines regression.
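A minimal R sketch of the two trend fits, continuing from the previous sketches; the smoothing parameters and object names are assumptions, not necessarily those used in the lecture's code.
# Non-parametric trend estimation: local polynomial (loess) and splines regression (gam).
library(mgcv)
par(mfrow = c(1, 1))
n <- length(volume.tr)
time.pts <- c(1:n) / n                                   # equally spaced time points in (0, 1]
loc.fit  <- loess(volume.tr ~ time.pts)                  # local polynomial trend
gam.fit  <- gam(volume.tr ~ s(time.pts))                 # splines regression trend
plot(dates, volume.tr, type = "l", col = "grey", ylab = "Transformed ED volume")
lines(dates, fitted(loc.fit), col = "brown", lwd = 2)    # local polynomial trend in brown
lines(dates, fitted(gam.fit), col = "red",   lwd = 2)    # splines regression trend in red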
I am overlaying here the estimated trends from the two models for the transformed ED volume data. In brown is the local polynomial trend, and in red is the splines regression trend. The plot is here; the trends are similar. They both show an increasing pattern over time, with some smooth variations over time. The splines regression trend shows more variation, capturing some of the cyclical pattern.
62
This is the output of the summary of the fit using splines regression. First, the p value for the statistical significance of the smooth term of the trend signals that it is statistically significant, indicating a statistically non-constant trend. Second, the adjusted R squared is 0.296, implying that about 30% of the variability in the ED volume is explained by the trend alone.
63
This is the resulting plot. We can see that the fitted values are a step function. This is because I
fitted the seasonal means model. The fitted line in red does capture the trend. However, the seasonality shows differences from the observed data in some places for some periods of time, indicating that there may be more of a cyclical pattern than a monthly seasonality.
Moreover, the monthly seasonality may be more related to other factors than the month. For example, school season or flu season may be more predictive of the cyclical pattern.
64
Next, I'm adding another layer of seasonality, due to the day of the week. It may be expected to see a higher volume in the ED on Sundays, since physician offices are not open, and because people may have more activities that could lead to injuries and hence ED visits.
For this, we need to first identify which of the dates correspond to Mondays, which correspond to Tuesdays, and so on. A simple way is to use the weekdays command in R, where the input is an object derived using the as.Date command. I'm converting this into a factor to specify that this is a categorical variable. Starting with the model where I fitted the trend and monthly seasonality alone, I'm adding an additional linear term: the categorical variable corresponding to day of the week seasonality.
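A minimal R sketch of adding the two layers of seasonality, continuing from the previous sketches (variable names are assumptions):
# Monthly and day-of-the-week seasonality as categorical variables.
month <- as.factor(ed$month)                       # monthly seasonality
week  <- as.factor(weekdays(dates))                # day-of-the-week seasonality
fit.month      <- gam(volume.tr ~ s(time.pts) + month)         # trend + monthly seasonality
fit.month.week <- gam(volume.tr ~ s(time.pts) + month + week)  # trend + both layers of seasonality
summary(fit.month.week)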
Last, I compare the two fitted models by overlaying the fitted values from the two models, where the red line corresponds to the simpler model with monthly seasonality only. Here is the plot. The fitted line including both monthly and day of the week seasonality clearly shows the added seasonality due to the day of the week. Altogether, estimated simultaneously, the monthly seasonality is picked up somewhat better.
65
This is a portion of the summary output of the model including the non-parametric trend along with the two layers of seasonality. The first portion of the output shows the estimated coefficients for the monthly seasonality. In the first column we have the estimated coefficients, in the second column we have the standard errors, and in the last column we have the p values for statistical significance. From this output we find that most of the regression coefficients corresponding to the monthly effect are statistically significant given the day of the week seasonality and the non-parametric trend in the model.
The second portion of the model output shows estimated coefficients along with the p values for statistical inference for the day of the week seasonality. From this output we find that some of the regression coefficients corresponding to the day of the week effect are statistically significant given that monthly seasonality and the non-parametric trend are in the model. The output for the trend also shows the statistical significance of the smooth component given the seasonality in the model.
Last, the R squared is 0.627, or 62.7% of the variability in the time series is explained by the trend and the two layers of seasonality. This is to be compared to 0.296, the R squared for the model where only the trend was fitted.
66
From the previous output, we found that for both layers of seasonality, there is statistical
significance for some of the regression coefficients.
The question that we will address next is whether the day of the week seasonality adds predictive power to the model. For this, we will compare the models without and with this layer of seasonality.
I thus fitted a linear regression model with both layers of seasonality, and the model with only monthly seasonality, and compare them using the partial F test through the command anova in R. The input in this command consists of the two models. The output of the anova command is here. The p value of the test is very small, indicating that we reject the null hypothesis that the simpler model with monthly seasonality only is as good as the one with both layers of seasonality, suggesting that both monthly seasonality and day of the week seasonality are predictive.
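A minimal R sketch of this model comparison, continuing from the previous sketches; the lm formulas are illustrative and may differ in detail from the lecture's code.
# Partial F-test comparing the two seasonality-only regressions.
lm.month      <- lm(volume.tr ~ month)          # monthly seasonality only
lm.month.week <- lm(volume.tr ~ month + week)   # monthly + day-of-the-week seasonality
anova(lm.month, lm.month.week)                  # small p-value favors the larger model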
67
I'm comparing the fitted regressions without and with trend to assess whether the trend indeed adds any additional predictive power. This is the plot: in black is the fitted regression for the full model, the one in blue is for the seasonality with the two layers excluding trend, and the one in red is the fitted trend alone. The trend does affect the fit significantly; thus the full model provides the best representation of the nonstationary components of the time series represented by the ED volume.
68
69
1.3.3(12) Stationarity
The lecture is on basics of time series analysis, and for this lesson I will continue with the data example for modeling the ED volume of patients. We will continue with analyzing the stationarity of the residual process after removing trend and seasonality.
Here we'll compare three residual processes. First is the residual process after removing the trend alone. Second is the time series after removing seasonality, including monthly and day of the week. Third is the residual time series after removing both trend and seasonality.
The first R commands are to obtain these three residual time series. Next, I define the minimum and maximum values across all three processes in order to plot those time series on the same scale. The next set of R commands plots all three residual time series, overlaying them into one plot.
70
And this is the resulting plot. In black is the time series after removing the trend alone, and in blue is the time series after removing the seasonality. The time series in black clearly has some non-stationary patterns, while all three have outliers.
To better evaluate the stationarity of a time series, we can instead evaluate the autocorrelation plot of the time series. Here I'm providing the R commands for the sample autocorrelation of all three time series.
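A minimal R sketch of the three residual processes and their sample ACF plots, using the objects assumed in the earlier sketches:
# Residuals after removing trend only, seasonality only, and both; then their ACFs.
resid.trend  <- volume.tr - fitted(gam.fit)          # trend removed
resid.season <- residuals(lm.month.week)             # both layers of seasonality removed
resid.both   <- residuals(fit.month.week)            # trend and seasonality removed
par(mfrow = c(3, 1))
acf(resid.trend,  lag.max = 60, main = "Trend removed")
acf(resid.season, lag.max = 60, main = "Seasonality removed")
acf(resid.both,   lag.max = 60, main = "Trend and seasonality removed")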
71
The first one is for the residual time series after removing only the trend. An indication of non-stationarity is that the sample autocorrelation is large, or outside of the significance bands, for many large lags. This is the case here. We see that even for a lag of 45 the autocorrelation barely drops within the band. Moreover, we can see a pattern of ups and downs, an indication of seasonality.
The second one is for the residual time series after removing the seasonality. While we do not see the cyclical, or seasonal, pattern in the ACF values, the sample autocorrelation decreases slowly, an indication of the presence of a trend in this process.
The last ACF plot is for the residual time series after removing the seasonality and the trend. The sample ACF values clearly decrease faster than for the previous residual time series, with small values within the significance band starting around lag 15. This is an indication of possibly a stationary process. We'll expand more on stationarity based on the ACF plot, and on more rigorous ways to evaluate it, in a different lecture.
72
Now I will conclude with some of the findings from this time series analysis. Some of the findings from the study are as follows. There is a significant increasing trend in the Emergency Department patient volume over the past five years. Seasonality is more complex: both monthly and day of the week seasonality are statistically significant. There are cyclical patterns that may not be fully captured by seasonality. Other cyclical factors, such as flu season or school season, may explain the cyclical pattern.
73
Unit 2: Auto-Regressive and Moving Average
Model
This lecture is on ARMA modeling. In this lesson, I'll introduce one of the most useful time
series models, the ARMA model, which is the basis of all the models introduced in this course.
We commonly refer to a time series as both the stochastic process from which we observe, and
the realizations or observations from the stochastic process. It's important to highlight that the
time series data we commonly observe in practice are realizations of stochastic processes and
thus come with uncertainty.
74
The ARMA Model and its elements
The model introduced in this lecture is the Autoregressive and Moving-Average (ARMA) Model.
A time series is an ARMA process if 1. it is stationary and 2. for every time t, it is modeled as on
this slide.
The model consists of two parts. The first, the AR or Auto Regressive part, models the
relationship between Xt and past or lagged observations, the lagged variables Xt-1, Xt-2, up to Xt-p. For example, if we observe a high GDP this quarter,
we would expect that the GDP in the next few quarters are high as well. In this AR formulation,
p is called the auto regressive order of the model. This first part is an intuitive way to model a
time series based on past data.
The second is the so-called MA or Moving Average part of the model. Using the MA, we can model that the time series at time t is not only affected by the shock at time t, but also by the shocks that have taken place before time t.
For example, if we observe a negative shock to the economy, say a catastrophic earthquake or
a storm, like we had in the past few weeks, then we would expect that its negative effect affects
the economy not only for the time it takes place, but also for the near future. The moving
average is a linear combination of white noise.
75
In this formulation, q is called the moving average order of the model. An ARMA model can
include both or just one of the two components. It can simply be an AR process or an MA
process. Practically, if the AR order p is 0 then we only have an MA process; conversely, if the MA order q is 0 then we only have an AR process.
Last, the ARMA model defined on this slide is for a process or a time series with mean 0. If the
76
Let's establish important notation used throughout this lecture. We often write the ARMA model
in a more compact form as provided on this slide. That is, we multiply Xt with phi of the operator
B, representing the left portion of the model equation. And multiply Zt with the theta of the
operator B, representing the right portion of the model equation. The functions phi and theta are
defined further, as on this slide.
Phi is a polynomial of order p with coefficients given by the AR portion of the model. Theta is a
polynomial of order q with coefficients given by the MA portion of the model. The polynomials
are called autoregressive and moving average polynomials. The operator B is a lag operator.
When applied to Xt we move the index back one time unit giving Xt-1. When applied multiple
times to Xt, for example, k times, we move the index back k units, giving Xt-k.
77
ARMA models are at the core of modeling time series for a series of reasons. First, recall that
we defined the autocovariance function for a stationary process as depending only on the shift, or lag, h. We'll continue to use this definition for an ARMA process.
The condition here that the covariance function decreases as we increase the lag h means that
the dependence in a time series becomes smaller and smaller as we increase the lag between
two time points of the time series.
If this condition holds, then it is possible to find an ARMA process with the same autocovariance
function. That is, there exists an ARMA process for any autocovariance function with this
property.
A second important property of an ARMA process is that the linear structure for ARMA models
makes prediction easy to carry out.
78
2.1.2 Basic Concepts: ARMA Simulation
In this lesson, I will illustrate ARMA models and the estimated autocovariance function using a simulation study.
Let's begin with the simplest ARMA process, the white noise process, which consists of a sequence of uncorrelated random variables with equal mean and equal variance. Note that the distribution of the random variables does not need to be normal. In fact, I simulated white noise in this example where the random variables have a normal distribution and an exponential distribution.
79
Here are the plots, the upper plots are the time series for normal on the left and exponential on
the right. We do see a difference on how the observations of the two times series vary over
time. For normal distribution, they vary around zero quite symmetrically, while for exponential all
values are positive lacking symmetry. These are the characteristics of the two distributions.
More generally, however, the patterns are very choppy and patternless.
The bottom plots are the sample autocorrelation functions. An autocorrelation function plot is a
bar plot, where each bar corresponds to one lag h starting from 0 to the maximum lag. The
height of each bar corresponds to the value of the sample autocorrelation for the corresponding
lag. We can also provide similar plots for the auto-covariance function, for the sample
auto-covariance function where the bars will be the values of the auto-covariance function.
As highlighted in these two plots, we see that the sample autocorrelation is large for lag equal to
0. In fact, it's equal to 1. This means that for these processes the sample auto-covariance is
small, approximately 0, for all lags, except for the 0 lag. This aligns with the property of the
auto-covariance function of a white noise, which is nonzero at lag 0 and is equal to the variance,
but is 0 otherwise.
This ACF plot for the two simulated processes in this example will also be seen for all white noise processes regardless of the data-generating process. In fact, we'll use these
characteristics to evaluate whether the process is a white noise or not, as we'll see in many
examples in this course.
80
R: Simulating and Plotting A Moving Average White Noise Function
In the next simulation, I simulated again white noise from a normal and an exponential
distribution, but this time, the white noise will be filtered to obtain a moving average process.
The two processes we'll simulate are shown on the box on the right. The first process has
coefficients -0.5 and 0.2, whereas for the second process I changed the first coefficient from
-0.5 to 0.5, to see how this change will be reflected in the autocorrelation function.
Next, I use the filter command in R to generate a moving average process given the white noise
process. The input in the function consists of the white noise and the vector of coefficients specified in filter = a, and sides = 1, which means that we are generating an MA process. Last, I will only consider the process starting with the third value, since the first two values are lost: we can generate x3 given Z1 and Z2, but there is no past white noise to generate x1 and x2. We apply this procedure to both sets of coefficients and for the normal and exponential distributions.
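A minimal R sketch of this MA simulation; the leading 1 in the filter vector, the seed, and the series length are assumptions, not necessarily those in the lecture's code.
# Simulate an MA(2) process by filtering white noise.
set.seed(1)
n <- 500
z <- rnorm(n)                                   # Gaussian white noise (use rexp(n) - 1 for an exponential version)
a <- c(1, -0.5, 0.2)                            # X_t = Z_t - 0.5 Z_{t-1} + 0.2 Z_{t-2}
x <- filter(z, filter = a, method = "convolution", sides = 1)
x <- x[3:n]                                     # drop the first two values, which are NA
acf(x)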
81
The ACF plots are here. The upper plots are for the normal and the bottom are for the
exponential distribution.
Across all ACF plots, we see one important common characteristic. All of them have the first two
values of the ACF large, outside of the confidence band. Whereas for other lags the sample
autocorrelation is small close to 0. We'll see that this is a characteristic of an MA process.
Last, there is a difference between the ACF plots for the two different sets of coefficients. The sample ACF is positive at lag equal to 1 for the processes with coefficients 0.5 and 0.2, versus negative at lag equal to 1 for the processes with coefficients -0.5 and 0.2. From these examples, and other examples, we see that although Xt is modeled as a moving average model, thus a linear combination of white noise, Xt and Xt-1 still depend on the same sequence of white noise. Thus, they are correlated. This correlation is reflected in the ACF plots.
82
In this example, we'll take a closer look at an MA process with a non-stationary noise Zt.
We begin by generating the white noise; then we filter a noise that is not stationary.
Specifically, the MA process has coefficients -0.2, 0.8, and 1.2. But the noise is white noise
multiplied by 2 times t + 0.5, thus the noise is non-stationary.
The time series plot for this simulated MA process is here. From this plot, we see clearly that
there is no stationarity in the process. Its distribution changes as time shifts. We also note a
large variability at the later time points compared to the earlier time points. The ACF plot is
provided here.
The ACF plot does not indicate, however, non-stationarity. In fact, we see that the sample ACF
is large for lag 0, 1, 2, and 3, but not for higher lags, which is an indication of an MA process.
Thus, the sample ACF plot indicates an MA process of order three even though the noise Zt is
non-stationary. This may not be the case for other processes with non-stationary noise.
83
Last, let's take a closer look at AR processes. Here, I simulated AR processes with orders 1 and 2. To do so, I simulated the white noise first, then applied the filter R command with the coefficients specified by filter = a2 or a1, and method='recursive' to specify that we simulate an AR process instead of an MA process. The coefficients of the first process are 0.8 and 0.2.
Whereas, for the second process the coefficient is 0.5. The AR(2) process is non-stationary
whereas the AR(1) process is stationary.
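A minimal R sketch of these AR simulations with the recursive filter; the seed and series length are assumptions.
# Simulate an AR(2) and an AR(1) process from the same white noise.
set.seed(1)
n  <- 500
z  <- rnorm(n)
a2 <- c(0.8, 0.2)                                    # AR(2): X_t = 0.8 X_{t-1} + 0.2 X_{t-2} + Z_t (non-stationary)
a1 <- 0.5                                            # AR(1): X_t = 0.5 X_{t-1} + Z_t (stationary)
x2 <- filter(z, filter = a2, method = "recursive")
x1 <- filter(z, filter = a1, method = "recursive")
par(mfrow = c(2, 2))
plot.ts(x2); plot.ts(x1); acf(x2); acf(x1)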
Here are the time series plots for the AR(2) and AR(1), along with their ACF plots. The time
series of the non-stationary AR(2) process shows clear non-stationarity, since the distribution of
the process changes as time shifts. The ACF plot of the non-stationary process also indicates
non-stationarity. When the ACF values decrease slowly as for this AR(2) process this is an
indication of non-stationarity.
On the other hand, for the AR(1) process the sample ACF is large for lags 0, 1, 2, and 3 only.
Note that we cannot use ACF plot to identify the lag of the AR process as we did for the MA
process. We'll get back to this aspect in a different lesson in this lecture.
84
2.1.3 Causal and Invertible Processes
Let's return to the notation of the ARMA process. We'll write the ARMA model in a more
compact form as provided on the slide. That is, we multiply Xt with phi of the operator B,
representing the left portion of the model equation, and multiply Zt with theta of the operator B,
representing the right portion of the model equation.
The function phi and theta are defined further on this slide. Phi is a polynomial of order p with
coefficients given by the AR portion of the model. Theta is a polynomial of order q with
coefficients given by the MA portion of the model.
The polynomials are called autoregressive and moving average polynomials, respectively.
The operator B is a lag operator. When applied to the time series Xt, we move the index back
one time unit, giving Xt-1. When applied multiple times to Xt, for example k times, we move the
index back k units, giving Xt-k. [Yes, this slide was already covered back in lecture 2.1.1.]
85
Here, we establish the necessary and sufficient condition for an ARMA model to have stationary
solutions.
The question we'll address now is when Xt is stationary given its representation as an ARMA
process, or when the ARMA model generates stationary processes. As provided on this slide,
the condition only involves the AR polynomial phi. And it says that this polynomial does not have
solutions on the unit circle. Or, that the solutions of the polynomial phi(z) are not equal to 1 in
absolute value. Thus, for checking whether an ARMA process is stationary, we simply need to get the solutions of the phi polynomial and check whether they are on the unit circle.
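As an illustration, a short R sketch of this check using polyroot; the AR coefficients here are chosen for illustration only.
# Check stationarity via the roots of the AR polynomial phi(z) = 1 - phi_1 z - ... - phi_p z^p.
phi <- c(1.2, -0.5)                          # AR coefficients phi_1, phi_2
roots <- polyroot(c(1, -phi))                # roots of 1 - 1.2 z + 0.5 z^2
Mod(roots)                                   # no modulus equal to 1 => stationary; all > 1 => causal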
86
Given a time series probability model, usually, we can find multiple ways to represent it. Which
representation to choose depends on the context of our problem. However, not all ARMA
processes can be transformed from one representation to another. In this lesson, we'll consider
under what condition we can invert an AR model to an MA model and invert an MA model to an
AR model.
The concept of causal process means that we invert an AR or ARMA process into an MA
process, where the order can be up to infinity. An ARMA process Xt is causal if we can express
it as a linear process in Zt, which is a white noise such that the sum of absolute values of the
coefficients of the linear process is finite. The simplest example is an MA process of order q.
When is an ARMA process causal? Similarly to the property of stationarity, not all ARMA processes are causal. First, we assume that the two polynomials phi and theta do not have any common zeroes. That means the solutions to the equation phi(z) = 0 are not equal to the solutions of the equation theta(z) = 0. Then a necessary and sufficient condition for the process Xt to be causal is that the zeroes of the AR polynomial are outside of the unit circle. In other words, for any z which is a solution to the equation phi(z) = 0, that z is larger than 1 in absolute value.
Note that this condition implies that if a process is causal, it's also stationary but not the reverse.
That is, there are stationary ARMA processes which are not causal. Given the two polynomials
of an ARMA process, we can then obtain the coefficient Psi of the linear process representation
using this relationship which is an invertibility property of the AR polynomial. This relationship
will come in handy in the estimation of the ARMA process and in prediction.
87
One simple example of a causal process is the AR of order one. The AR polynomial for such a process is 1 - phi z, which equals zero when z is equal to 1 over phi. Thus, Xt is stationary if and only if the solution is not equal to 1 in absolute value, or equivalently, phi is not equal to 1 in absolute value. Moreover, Xt is causal if and only if the solution 1 over phi is larger than 1 in absolute value, or equivalently, phi is smaller than 1 in absolute value.
Thus, for an AR(1) process, it is straightforward to check whether the process is stationary or causal.
Here, I simulated AR(1) processes with phi = 0.1, -0.1, 0.9 and -0.9. The simulation is similar to the one provided in the previous lesson. We first simulate the white noise. Then we apply the filter command with method='recursive'.
88
The time series plots are here. The upper plots are for the small phi = 0.1 or -0.1, and the lower plots are for phi close to 1 in absolute value, 0.9 or -0.9.
The upper time series plots show clear stationarity, as there is not a pattern in these time series. On the other hand, the plots on the bottom show some signs of non-stationarity. The departure from the mean 0 is very prolonged. This is because the coefficient phi is 0.9 in absolute value, which is close to 1. As pointed out on the previous slide, an AR(1) is not stationary when the coefficient is 1 or -1. Thus, as the coefficient phi approaches 1 in absolute value, the process will also show signs of non-stationarity.
89
We can also go from an MA, or Moving Average process, or a more general ARMA process to
an AR process, where the order can be infinity, meaning that the ARMA process is invertible.
More specifically, for a process to be invertible, there exist constants pi j, where the sum of the absolute values of the constants is finite, such that the linear combination of the past Xt-j weighted by pi j is the white noise Zt.
A necessary and sufficient condition for an ARMA process to be invertible is that the zeroes of the MA polynomial theta(z) are outside of the unit circle. That is, the solutions to the equation theta(z) = 0 are all larger than 1 in absolute value.
Given the two polynomials of an ARMA process, we can then obtain the coefficients, pi of the
linear representation using this relationship, which is an invertibility property of the MA
polynomial.
We prefer the invertible ARMA processes in practice because if we can invert an MA process to
an AR process, we can find the value of Zt which is not observable, based on all past values of
Xt which are observable. If a process is non-invertible, then in order to find the value of Zt,
we have to know all future values of Xt.
90
Here is why the property of causality is important. Given an ARMA process with AR and MA polynomials phi and theta, if it is stationary, that is, the roots of the phi(z) polynomial are not on the unit circle, then there exist polynomials phi tilde and theta tilde and a white noise Zt such that the ARMA process with these polynomials as AR and MA polynomials is a causal ARMA process. Moreover, if the roots of the MA polynomial theta(z) are not on the unit circle, then theta tilde can be chosen such that the ARMA process with theta tilde as the MA polynomial is invertible.
What can we learn from this property? We can learn that any stationary ARMA process can be
transformed into a causal process. This property is important, since when we fit an ARMA
process to a time series, we can assume fitting a causal process.
Again, as we'll see in the next lesson, we can derive the autocovariance function for an ARMA process using the invertibility property of the AR polynomial for a causal process. Hence the utility of this property.
91
2.1.4 Autocovariance and Partial Autocorrelation Function
This lecture is on ARMA modeling. And in this lesson, I'll introduce the estimation of the
autocovariance function. But also another dependence measure important for ARMA models,
which is the partial autocorrelation function. I remind you once more the notation of an ARMA
process with the AR and MA polynomials describing such a process.
We have learned about the general definition of the autocovariance function. For stationary
process, the autocovariance function is very useful in describing the process. Here we'll learn
about the property of the ACF for a causal ARMA process. Specifically, if Xt is a causal process with a representation as a linear process, with psi j the coefficients of the linear process, the autocovariance function is equal to sigma squared times the sum of all products between psi j and its lagged coefficient psi j+h. That is, it is the sum of the products of all combinations of coefficients that are h lags away from each other.
However, in order to get the autocovariance function using this formula, we'll need to first derive the
coefficients of the linear process, the psi coefficients.
92
How do we derive those coefficients? Recall that the relationship between the psi polynomial and the ARMA polynomials of the causal ARMA process is as on the slide.
We can start with this relationship, which can be rewritten as: the product between psi(z) and phi(z) is equal to theta(z). Expanding out the polynomials on both sides and equating coefficients of z to the power of m, for m equal to 0, 1, 2, 3, and so on, we get a system of linear equations where the psi's are the unknowns.
The system of equations is provided on this slide. Thus deriving the psi coefficients reduces to
simply solving a system of linear equations.
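As an illustration, the psi coefficients of the causal (MA-infinity) representation can be obtained in R with the built-in ARMAtoMA function, which solves the same system of equations; the coefficients below are illustrative.
# psi coefficients of the linear-process representation.
ARMAtoMA(ar = 0.8, lag.max = 10)                                   # AR(1) with phi = 0.8: psi_j = 0.8^j
ARMAtoMA(ar = c(0.88, -0.49), ma = c(-0.23, 0.25), lag.max = 10)   # an ARMA(2,2) example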
For estimation of the ACF we can use the same approach as for any other stationary process.
Specifically, we can use a sample autocovariance function. We estimate the sample
autocorrelation by dividing the sample autocovariance by the estimated variance, or the
estimated covariance at lag zero.
93
Correlation between two observations in a time series at two time points can result from a
mutual linear dependence on observations at other time points called confounding.
Another important measure of the dependence for a time series that accounts for this
confounding is the partial autocorrelation function, which gives the partial correlation of a time
series with its own lagged values controlling for, or conditioning on, the value of the time series
at all shorter lags. In contrast, the autocorrelation function does not control for other lags.
94
Here is the formal definition of the partial autocorrelation function or abbreviated PACF.
It is a function of the lag h. It is 1 at lag 0. For lags larger than zero, it is the last element of the
vector alpha h obtained from multiplying the inverse of the matrix gamma h with the vector
gamma h. Where the matrix gamma h is a matrix of the autocovariance function at i- j, for all
combinations of i and j. And the gamma h vector is a vector of the autocovariances at the lags
1, 2, up to h.
You probably recall to have seen the matrix gamma h and the vector gamma h before. They
came up when I introduced the prediction of linear processes.
In fact, there is a direct relationship between PACF and prediction of a stationary time series.
Assuming that the dependence reduces as we increase the lag, then the 1 lag linear prediction
is a linear combination of the past observations of the time series. If we select the coefficients in
this linear prediction by minimizing the mean squared error, then we obtain the so-called best linear unbiased predictor, or BLUP. We defined this prediction for linear processes in a
previous lecture. Given this definition of BLUP, the partial autocorrelation is the coefficient ah in
the best linear predictor, as defined here on the slide.
95
In order to estimate the PACF, we can first estimate the autocovariance using the sample autocovariance function, and plug the estimated values into the gamma h matrix and gamma h vector to obtain their estimates. Then we can use the formula for the PACF, but this time using the estimates, as provided on the slide.
The sample partial autocorrelation function and the sample autocorrelation function are
important in identifying a good model for a given realization of a time series, as I will illustrate
next.
We consider now Xt to be an MA model of order q. That is, Xt is equal to theta(B) times the white noise Zt. An MA process is a causal process with a finite number of non-zero coefficients.
Recall also the formula of the autocovariance function for a causal process provided on the
slide.
96
Now replacing the coefficients of the causal process with the coefficients of the MA process, it
follows that the autocovariance function is 0 for all lags larger than q. Thus, we can identify the
lag of the MA process using the ACF plot.
For an AR process, it can be shown that the PACF has the property that it is 0 for lags larger than the order p of the AR process. Thus, we can identify the order of the AR process using the PACF plot.
97
In summary, in this lesson, I introduced the derivation and estimation of two dependence measures that are important in ARMA modeling, particularly the autocovariance and partial autocorrelation functions.
98
2.1.5 ACF and PACF: AR & MA Simulation
In this lesson, I will continue the introduction of the concepts of the ACF and PACF, the two autocorrelation functions, with a simulation study.
This is a simulation from a previous lesson, where I simulated MA processes as shown in the box on the right. The first process has coefficients 0.5 and 0.2, whereas for the second process, I've changed the first coefficient from 0.5 to -0.5.
The ACF plots are here, the upper plots are for the normal distribution, and the bottom plots are
for the exponential distribution.
99
Across all ACF plots, we see one important common characteristic. All of them have the first two
values of the ACF plot large, outside of the confidence bands that are shown in blue, whereas for other lags the sample autocorrelation is small, close to 0. Thus, these plots say that the sample autocorrelation is large for lags equal to 1 and 2, but approximately 0 for lags larger than 2.
Note that we do not have the exact autocorrelation function, but an estimate of this function.
Moreover, its estimate is not exact, but it's a random variable with a distribution as I will explain
in more detail in a next lesson. Because of this, we do not expect the sample autocorrelation to
be exactly 0 for lags larger than the order of the MA process, but to be approximately 0. Thus,
the ACF plot indicates that the order of the MA process is 2, which is the order of the process we actually simulated.
This is another example I used for illustration of ARMA processes. For this example, while I
simulated an MA process, the noise used to simulate the process is non-stationary. The process simulated in this example is in the box on the right.
This is a time series plot and the ACF plot for this time series. The order of the simulated MA
process is 3.
100
From the ACF plot, we see that the sample autocorrelation is large outside of the confidence
band for the lags 1, 2 and 3, but close to 0 for lags larger than 3, indicating once more that the order of the MA process is equal to 3.
Note that the property of the autocorrelation function holds only for stationary processes, while in this case the simulated process is not stationary. However, even in this case, we find that the property of the autocorrelation function is maintained.
101
This is an example of AR processes. For this example, I simulated AR processes with order two.
The coefficients of the first process are 0.8 and 0.2, whereas for the second process, the
coefficients are 1.8 and -0.9. The simulated AR processes are in the boxes on the right.
The AR polynomials of the simulated AR(2) processes are quadratic polynomials. In order to
evaluate whether the processes are stationary, we need to solve the equation of the AR
polynomial equated to 0. For the first process, one solution to this equation is equal to 1, thus,
this process is not stationary. For the second AR(2) process, none of the zeros of the AR polynomial are on the unit circle, thus, it is stationary.
In this example, we are not going to look at the ACF plot, but the PACF plot, because of the
property of the PACF for an AR process.
The upper plots are the time series plots. From the time series plots, the first simulated AR(2)
process is clearly non-stationary, as we note an increasing trend over time. For the second
simulated AR(2) process, there is not a clear trend, however, we do distinguish some sort of
pattern.
102
The PACF plots are on the bottom. PACF plots are similar to ACF plots in that they are
bar plots with each bar corresponding to a lag. One difference is that the PACF plot begins with
lag 1, not 0. The PACF plot of the non-stationary AR(2) process shows a large value only for lag
1, where all others are small, close to 0.
If we were to use the PACF plot to identify the order of the AR process, we would identify an AR process of order one. However, the property of the PACF for an AR process does not hold for
non-stationary processes. And thus, we should not use the PACF plot to identify the order of this
process.
On the other hand, for the second AR(2) process, the PACF plot clearly indicates an order 2 AR
process since the values of PACF are large for lags one and two.
In order to simulate an ARMA process, I used a different R command, arima.sim, which allows
simulation of ARMA processes. To use this command, if the white noise is Gaussian, then you do not need to specify it. Otherwise, you need to specify it through an option of the function called rand.gen. The important input of this command is the length of the process (here it is 500) and
the AR and MA coefficients.
103
The ARMA process generated here is in the box below. For this process, the orders are p = 2
and q = 2, thus, an ARMA(2,2) process. The AR coefficients are 0.88 and -0.49, and the MA coefficients are -0.23 and 0.25. I also specified a standard deviation for the white noise.
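A minimal R sketch of this simulation with arima.sim; the seed and the standard deviation value 0.2 are assumptions, since the lecture does not state them here.
# Simulate an ARMA(2,2) process and plot its sample ACF and PACF.
set.seed(1)
x.arma <- arima.sim(model = list(ar = c(0.88, -0.49), ma = c(-0.23, 0.25)),
                    n = 500, sd = 0.2)
par(mfrow = c(1, 3))
plot.ts(x.arma); acf(x.arma); pacf(x.arma)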
104
The time series plot is here, along with the ACF and PACF plots on the bottom.
The simulated ARMA process shows large volatility around the zero line with some pattern but
quite choppy.
The ACF plot is on the left and the PACF plot is on the right in the bottom. The sample ACF is
large for lags up to four, and the sample PACF is large for lags up to three.
If we were to use this plot to identify the lags for the AR and MA, in the simulated ARMA
process, we would identify an ARMA process of orders p = 3 and q = 4. However, the simulated
ARMA has orders p = 2 and q = 2, thus not correct. As I pointed out in the previous lesson, we
cannot use the ACF and PACF plots to identify p and q in an ARMA process.
In summary, in this lesson, I illustrated the estimation of the autocovariance function and partial
autocorrelation function, using a simulation study of the AR and MA processes.
105
2.2: Model Estimation
Commonly, we estimate the mean parameter mu first, and subtract it from the process, leaving a zero mean process. Thus, we only estimate the AR and MA coefficients and the variance parameter. Importantly, we estimate the parameters assuming fixed orders p and q. But when we fit an ARMA process, it's uncommon to know exactly the orders, and thus we need to employ a rigorous approach to identify the orders p and q also. We'll discuss this
aspect in the next lesson.
106
Let's begin with the simplest approach to estimate an AR process. The AR process describes
the relationship between Xt and the lagged observations of the time series Xt-1 to Xt-p in a linear fashion.
One important limitation of this approach is that statistical inference can be performed only by
assuming normality of the white noise.
107
Alternatively, we can use the so-called Yule-Walker equations to estimate the AR coefficients, which is a method of moments estimation approach.
To derive the Yule-Walker equations, we multiply both sides of the AR model equation by Xt-j, for j taking values from 0 to p. For each j, we obtain one equation as provided on the slide. We
thus have p +1 equations to be used to estimate the AR coefficients and the variance of the
white noise, a total of p + 1 parameters. The system of equations is linear and thus
straightforward to solve.
We can alternatively write the system of equations in a matrix format as provided on this slide.
Capital gamma p is a matrix where the ij element is the auto-covariance at (i- j). The gamma p
of 1 is a vector of the auto-covariance at lags 1 up to p. We estimate the vector of coefficients
phi from this equation. Furthermore, once we estimate the AR coefficients we can use the first
equation from the previous slide to estimate the variance.
108
The Yule-Walker equations are expressed as a function of the autocovariance function which is
unknown and thus needs to be estimated. Thus, in the equations provided in the previous slides, we replace the true unknown autocovariances at lags 0 to p with the sample autocovariances. Thus, we can now estimate the AR coefficients from the system of linear equations using the sample autocovariance function.
When p is large, instead of inverting the Gamma p matrix to solve for the phi's, we can use the so-called Durbin-Levinson recursive algorithm, which requires little computational effort.
With the estimated AR coefficients, the fitted model is the AR model, replacing the true unknown
coefficients with the estimated ones.
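As an editorial illustration, R's built-in ar function implements the Yule-Walker approach; a minimal sketch, assuming a simulated AR(2) series (this is not necessarily the lecture's code).
# Yule-Walker fit of an AR(2) model with the built-in ar() function.
set.seed(1)
x <- arima.sim(model = list(ar = c(1.2, -0.5)), n = 500)
fit.yw <- ar(x, method = "yule-walker", order.max = 2, aic = FALSE)
fit.yw$ar          # estimated AR coefficients
fit.yw$var.pred    # estimated white noise variance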
109
It's also important to establish the statistical properties of the estimators for statistical inference
and for better understanding when the approach will provide good estimators based on the bias
and variance of the estimators.
It has been shown that the estimators for the AR coefficients using the Yule-Walker equations
have an asymptotic normal distribution. Where asymptotic here refers to large sample size or
large length of the time series. Importantly, the estimators are also asymptotically unbiased,
meaning that the expectation of the estimator is equal to the true parameters, that the true
parameter estimated under large sample size.
Moreover, because the variance of the estimator depends on 1 over n, then it can be shown that
the estimators for the AR coefficients are also consistent, meaning that as we increase the sample size, the length of the time series, the estimators approach the true parameters from a probabilistic standpoint. The estimator for the variance parameter, sigma squared, is also consistent.
Last, we can also show that the partial autocorrelation has an asymptotic normal distribution with mean 0 and variance 1/n for all lags larger than p, the AR order.
We can use these asymptotic properties, particularly the properties of the sample PACF to test
for the AR order, as illustrated in the previous lesson. And we can calculate approximate
confidence intervals for the parameters to make inference on the AR coefficients, for example whether they are statistically significant.
110
But the Yule-Walker approach does not work for an MA or, more generally, for an ARMA process.
It can lead to non-unique solutions and/or can provide high variance estimators.
111
The properties of the estimators for the innovations algorithm are more complicated to derive. But
loosely speaking, for fixed order q, it can be shown that the estimated MA coefficients are
asymptotically getting closer to the coefficients of the linear process, from which the MA process
is derived. Thus, some sort of consistency of the estimated coefficients.
We can expand the idea of the innovations algorithm to estimate the coefficients of the more general causal ARMA process. This is because the causal ARMA process can be represented as a linear process. And thus, the innovations algorithm, which applies to linear processes, also applies to causal ARMA processes. Note that the innovations algorithm does not apply to an ARMA process if it is not causal, hence the relevance of the property for an ARMA model to be causal.
112
2.2.2(7) Parameter Estimation: Simulation Example
In this lesson, I will illustrate the estimation approaches provided in the previous lesson, with a
simulation study.
We'll first simulate an AR process of order two. The simulation in R is implemented similarly to previous lessons, using the filter R command. The AR process has coefficients 1.2 and -0.5, and it's stationary. Further, we define the predicting variables and the response in the data2 matrix, where x1 corresponds to xt-2, x2 corresponds to xt-1, and y is the response.
We apply the lm R command to fit a linear regression model, with the input of the response on
the left of tilde, and of the predictive variables on the right of tilde. In order to obtain a summary
of the estimated coefficients along with statistical inference and their statistical significance, we
can use the summary command with the input of the fitted model.
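A minimal R sketch of this linear regression fit; the data2 object is reconstructed here from a simulated series, so the seed and length are assumptions and the numerical output will differ from the slides.
# Fit the AR(2) model as a linear regression of X_t on X_{t-1} and X_{t-2}.
set.seed(1)
n <- 500
z <- rnorm(n)
x <- as.numeric(filter(z, filter = c(1.2, -0.5), method = "recursive"))  # simulated AR(2) series
data2 <- data.frame(y  = x[3:n],        # response X_t
                    x1 = x[1:(n - 2)],  # lag-2 predictor X_{t-2}
                    x2 = x[2:(n - 1)])  # lag-1 predictor X_{t-1}
fit.lm <- lm(y ~ x1 + x2, data = data2)
summary(fit.lm)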
113
A portion of the output of the model is here.
In this output, we have information about the estimated coefficients for x1 and x2 on the second
and third row. The estimated coefficients for x1, corresponding to xt-2, and for x2, corresponding
to xt-1, are similar to those from the simulating model, with slight variations. -0.48 instead of
-0.5, and 1.17 instead of 1.2.
Note that the estimates are not expected to be exactly the same as the true parameters, since
there is randomness in the data. But if the model appropriately captures the variations and
patterns in the data, they will be close to the true parameters. We'll look at this aspect at the end
of this lesson.
From this output, we also find the p values for x1 and x2 provided in the last column, and those
p values are very small, indicating that the regression coefficients corresponding to those
predictors are statistically significant.
We can further use the t-test to evaluate whether the coefficients are plausibly equal to
the true parameters. That is, for example, if the regression coefficient for x1 is plausibly equal to
-0.5. For this, we recompute the t value of the t test as the difference between the estimated
coefficient minus the null value, -0.5, divided by the standard error of the estimated coefficient
available in the R output. Then we can compute the p value of the test as 2 times (1 minus the standard normal cumulative probability evaluated at the absolute value of the t statistic).
The p value from this test is large, indicating that it's plausible for the coefficient to be equal to
the true value -0.5. Such statistical inference can be learned from the regression analysis
course, also available online.
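A minimal R sketch of this test, using the fit.lm object from the sketch above:
# Test whether the x1 coefficient is plausibly equal to the true value -0.5.
est <- summary(fit.lm)$coefficients["x1", "Estimate"]
se  <- summary(fit.lm)$coefficients["x1", "Std. Error"]
t.value <- (est - (-0.5)) / se
p.value <- 2 * (1 - pnorm(abs(t.value)))   # large p-value: -0.5 is a plausible value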
114
The upper plots are the time series plots of the simulated AR from the previous slide, along with
the PACF plot. The time series plot shows some non-random pattern, and the PACF plot shows
the PACF is large for the first two lags, indicating an AR(2) model. The lower plots are the time
series plot for the residuals from the linear regression model from the previous slide, along with
the PACF plot of the residuals.
The time series plot of the residuals looks patternless. And the PACF plot has all of the bars
within the confidence band. Indicating that the PACF is plausibly close to 0 for all lags, an
indication of an AR(0) or a white noise process. Thus, a linear regression provides a good fit for
the AR model.
In this slide, I simulated an AR(1) model, but fitted a linear regression model with the predicting
variables corresponding to xt-1 and xt-2, thus fitting an AR(2) model.
115
The output of this R fit for the linear regression model using the two predicting variables is here.
We contrast the estimated coefficients with a simulation model, which is an AR(1) model with
AR coefficient equal to 0.5. The estimated coefficient corresponding to x2 is 0.504, which is
close to the value of the true parameter.
On the other hand, the estimated coefficient corresponding to xt-2 is -0.02. And the p value for
the statistical significance for this coefficient is large, indicating that this coefficient is plausibly 0
given that xt-1 is in the model. Thus, the model fit suggests an AR(1) model.
Let's also explore how we can fit an AR model using the Yule-Walker equations approach. The AR(2) model has been simulated in a previous slide in this lesson, and it has coefficients 1.2
and -0.5. It is the same AR model used to illustrate estimation using the linear regression
approach.
116
Here, I'll fit this AR(2) model by solving the Yule-Walker equations while actually assuming an AR(3) instead of an AR(2) model. To do so, I first need the sample autocovariance function obtained using the acf R command and specifying that the type is covariance, since otherwise it outputs the autocorrelation.
Next, given the vector of the sample autocovariance values, I am providing the Gamma matrix for order three, since I'm fitting an AR(3) model. The R commands in the for loop fill in the Gamma matrix (called Gammamatrix in the R code) from the vector of sample autocovariance values.
Next, I'm providing the command line for Gamma1, which is the vector of the values of the sample autocovariance for lags h equal to 1, 2, and 3. Then I obtain the estimated AR coefficients by inverting the Gamma matrix and multiplying it by Gamma1, as I provided in the lesson on the Yule-Walker estimation approach.
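A minimal sketch of this procedure, assuming the simulated AR(2) series is stored in a vector called x (the object names are illustrative, not necessarily the lecture's code):

gamma.hat <- acf(x, lag.max = 3, type = "covariance", plot = FALSE)$acf  # gamma(0) ... gamma(3)
Gammamatrix <- matrix(0, 3, 3)
for (i in 1:3) {
  for (j in 1:3) {
    Gammamatrix[i, j] <- gamma.hat[abs(i - j) + 1]   # entry (i, j) is gamma(|i - j|)
  }
}
Gamma1 <- gamma.hat[2:4]                             # gamma(1), gamma(2), gamma(3)
phi.hat <- solve(Gammamatrix) %*% Gamma1             # Yule-Walker estimates for the AR(3)
phi.hat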
117
The Gamma matrix (Gammamatrix) is here. It's a 3 by 3 matrix since I'm fitting an AR model of order 3. The Gamma1 vector is below, along with the estimated AR coefficients. The estimated coefficients are not very different from the true ones of the simulated AR(2) model.
The estimated phi1 is 1.205, versus the true 1.2, and the estimated phi2 is -0.48 versus the true -0.5. Do note also that the estimated coefficient for xt-3, which is not present in the model, is -0.02. That's very small.
In this slide, we'll evaluate how good the estimates are as obtained using the linear regression
and Yule-Walker estimation approaches.
For this, I simulated the AR(2) process provided in the box 500 times. And I estimated the AR coefficients using the two approaches for each of the 500 simulated AR(2) time series. Here I'm showing the code for this procedure only for the simulation using the Yule-Walker estimation approach. A similar approach is implemented for the linear regression estimation in the R code available with this lecture.
In the R code provided on this slide, I iterate over the index i from 1 to 500. For each i, I simulate the white noise, apply the AR(2) filter, create the Gamma matrix (Gammamatrix) and the vector Gamma1, and estimate the phi coefficients, which are then placed in a matrix, the output of this simulation. The matrix consists of 500 rows and 2 columns, where each row corresponds to one simulation and each column corresponds to one phi.
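A sketch of this simulation study, under the assumption of an AR(2) with coefficients 1.2 and -0.5, series length 200, and a short burn-in (these choices and the object names are assumptions, not the lecture's exact code):

set.seed(1)
n <- 200
phi.est <- matrix(NA, 500, 2)                        # one row per simulation, one column per phi
for (i in 1:500) {
  z <- rnorm(n + 100)                                # white noise
  x <- filter(z, filter = c(1.2, -0.5), method = "recursive")
  x <- x[101:(n + 100)]                              # drop the burn-in values
  g <- acf(x, lag.max = 2, type = "covariance", plot = FALSE)$acf
  Gammamatrix <- matrix(c(g[1], g[2], g[2], g[1]), 2, 2)  # gamma(|i - j|) for i, j = 1, 2
  Gamma1 <- g[2:3]
  phi.est[i, ] <- solve(Gammamatrix) %*% Gamma1      # Yule-Walker estimates
}
apply(phi.est, 2, summary)                           # distribution of the estimates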
118
Here are the side by side box plots comparing the distributions of the estimates from the two methods, for the first coefficient, equal to 1.2, in the upper plot, and for the second coefficient, equal to -0.5, in the bottom plot.
The two methods perform similarly, with no clear winner in terms of bias and variability in the
estimates.
119
In summary, in this lesson, I illustrated estimation approaches based on the method of moments for the ARMA model with a simulation study.
120
2.2.3(8) Parameter Estimation: Maximum Likelihood Estimation
This lecture is on ARMA modeling, and in this lesson, I will introduce another estimation
approach which is based on maximum likelihood estimation.
Let's review once again the parameters in the ARMA model. The AR coefficients, denoted phi 1 to phi p, and the MA coefficients, denoted theta 1 to theta q, are unknown parameters. If the process has non-zero mean, then mu, the mean of the process, is also a parameter, along with the variance of the white noise Zt.
Commonly, we estimate the mean parameter mu first, then subtract it from the process, yielding a zero-mean process. Thus, we only estimate the AR and MA coefficients and the variance parameter. Importantly, we estimate the parameters assuming fixed orders p and q. But when we fit an ARMA process, it is uncommon to know the orders exactly, and thus we need to employ an approach to identify the orders p and q also. We'll discuss this aspect in the next lesson.
2.2.3 continued
121
An important aspect in maximum likelihood estimation, abbreviated MLE, is that we need to
assume a distribution for the time series process. Most often in time series, we'll assume that Xt
has a Gaussian distribution, a normal distribution.
More specifically, if the joint distribution of the stochastic process generating the time series is normal, it is a Gaussian process, meaning that for any combination of time points i1, i2, up to in, the time series variables corresponding to these time points have a multivariate normal distribution.
122
The assumption of a Gaussian process is useful in the derivation of the MLEs, but also because of the following property: if the Zts are independent and identically distributed with mean 0 and variance sigma squared, then any causal invertible ARMA process with Zt as the white noise is a Gaussian time series.
123
This slide provides the likelihood function of the Gaussian time series. We first denote the vector
of n random variables in the time series x1 to xn. The x hat vector consists of the one lag
predictions for x1 to xn.
The likelihood function of the vector of time series variables is as on the slide. And it depends on the Gamma matrix of the autocovariances at lag i - j, for all i and j from 1 to n. This likelihood function can be directly computed using the innovation algorithm, without computing the determinant or the inverse of the Gamma matrix, which could be computationally expensive.
2.2.3 continued
124
This becomes apparent in the equivalent formula for the likelihood function on this slide, which depends on the x's and on the x hats, the one-lag predictions, which can be derived using the innovation algorithm. We have seen a similar formula when I described the approach for obtaining best linear unbiased predictions using the innovation algorithm.
To estimate the AR and MA coefficients, we maximize this likelihood function or, more precisely, the log of the likelihood function.
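For reference, the standard form of the Gaussian likelihood and its innovations-based rewriting are given below; the slide's notation may differ slightly, and this is included only as a reminder of the general shape for the zero-mean case:

$$L(\Gamma_n) = (2\pi)^{-n/2}\,(\det\Gamma_n)^{-1/2}\exp\!\Big(-\tfrac12\,\mathbf{X}_n'\,\Gamma_n^{-1}\,\mathbf{X}_n\Big),\qquad
L = (2\pi)^{-n/2}\,(r_0 r_1\cdots r_{n-1})^{-1/2}\exp\!\Big(-\tfrac12\sum_{j=1}^{n}\frac{(X_j-\hat X_j)^2}{r_{j-1}}\Big),$$

where $\hat X_j$ is the one-lag prediction of $X_j$ and $r_{j-1}=E(X_j-\hat X_j)^2$ is its mean squared prediction error.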
2.2.3 continued
One of the most straightforward implementations is for an AR model: the linear regression approach described in one of the previous lessons in this lecture provides MLEs, maximum likelihood estimators, as long as we assume that the error terms Zt are IID with constant variance from a normal distribution. For ARMA models, the estimation is more complicated.
125
So far, we have learned about several estimation approaches. You may wonder which one to
use. Here's a summary of the advantages and disadvantages of the MLE approach.
Generally, the MLE provides more efficient estimates, or lower variance estimates, than the method of moments, which is the Yule-Walker estimation approach for AR, or the innovation algorithm for the more general ARMA. Often, for many time series, the Gaussian assumption is reasonable. But even if we were to assume a different distribution, the asymptotic properties of the MLE hold regardless of whether we assume a Gaussian or another distribution.
One important disadvantage is that MLE estimation for correlated data, such as time series data, involves solving a complex optimization problem that is challenging to optimize. The estimation is done only numerically for ARMA models. Another limitation is that because MLE is done using numerical algorithms, we need to choose a good starting point. Often we can use other estimators for this step.
2.2.3 continued
126
We can choose preliminary parameter estimates using the method of moments approaches. For AR models, we can use the Yule-Walker estimation approach. For an MA process, we can use the innovation algorithm. Or for the more general ARMA process, we can use the innovation algorithm, or other algorithms such as the Hannan-Rissanen algorithm, which begins with estimating an AR model of higher order, which is used to estimate, or impute, the unobserved noise Zt, called here Zt hat.
2.2.3 continued
Last, we can use the properties of the MLEs, the maximum likelihood estimators, in particular the property of being asymptotically normally distributed, regardless of the assumption on the noise Zt. The MLEs are asymptotically unbiased, that is, for large sample size, the
expectation of the MLEs is close to the true parameter values. The covariance matrix of the
estimator is directly dependent on the covariance of the time series. This asymptotic distribution
can be used for statistical inference on the AR and MA coefficients, as well as for the asymptotic
distribution of the ACF and PACF.
127
128
2.2.4(9) Order Selection & Residual Analysis
This lecture is on ARMA modeling, and in this lesson, I'll continue the introduction of the estimation of ARMA models with order selection and residual analysis.
The first approach uses a modified version of the Akaike Information Criterion, AIC. In the regression analysis online course, you will find an entire lecture on model selection.
The AIC approach is one of the most common criteria for selecting among multiple models. It applies in particular to order selection in ARMA modeling, since AIC is a likelihood based criterion: it is equal to minus two times the log likelihood function plus a complexity penalty.
For ARMA models, the complexity penalty is different from the penalty commonly used for the
classic AIC approach, and it's provided here. This modification of AIC has been shown to
improve on the order selection over the classic penalty.
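For reference, a commonly used form of this corrected criterion is the Brockwell-Davis AICC; the slide may use slightly different notation:

$$\mathrm{AICC} = -2\ln L(\hat\phi,\hat\theta) + \frac{2(p+q+1)\,n}{\,n-p-q-2\,},$$

where $n$ is the length of the series and $p$ and $q$ are the candidate AR and MA orders.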
The approach to order selection is as follows. Fit ARMA models with varying AR orders 0, 1, up to p, and varying MA orders 0, 1, up to q, jointly. Then compute AICC for each combination of orders, and select the orders with the smallest AICC value. If the maximum orders p and q are large, this can be quite computationally expensive.
129
A much simpler and less computationally expensive approach is the so-called Extended
AutoCorrelation Function, or EACF, which was introduced by Tsay and Tiao in 1984.
The EACF derivation is a two step approach. In the first step, we first fit an AR model of order k, where k is the maximum AR order. Based on this fit, we obtain the model residuals Rt, then regress Rt on the lagged Xt variables, Xt-1, Xt-2, up to Xt-k, and on the lagged variables of the residuals, Rt-1, Rt-2, up to Rt-j, for j from 1 up to a large j value. From this regression, we obtain initial estimates for the coefficients of the AR, of order k, and of the MA, of order j.
In the second step of the EACF, we improve on the estimation of the coefficients by obtaining the residuals, extracting from Xt the AR part with the estimated coefficients from the previous step.
130
The sample ACF of these residuals is referred to as the extended sample ACF. Tsay and Tiao summarize the information in the EACF table. The element in the kth row and jth column is an “x” if the lag j+1 sample autocorrelation of these residuals is significantly different from 0, where statistical significance is established using the confidence interval of the ACF.
This is the EACF for an ARMA(1,1) model. Again, “x” corresponds to statistical significance, and 0 corresponds to lack of statistical significance, or that the ACF is plausibly 0. In this example, we see the first 0 at AR order 1 and MA order 1, as expected for an ARMA(1,1) model. Then we see zeroes through all the MA orders corresponding to AR order 1, indicating no improvement in the MA order beyond order 1. Similarly, we see zeroes for all the AR orders when the MA order is equal to 1. The sample EACF will never be this clear-cut in real practice.
131
In order to assess the goodness of fit, we can assess the model's assumptions using the residuals, in a similar manner as residual analysis for linear regression models.
Because the variance of the residuals is not constant over time, it is important to standardize the
residuals. Moreover, it may be useful to rescale the residuals to identify outliers.
For example, how can we use the residuals to assess goodness of fit for an ARMA model?
The properties of the residuals should reflect the properties of the underlying process Zt.
In particular, the residuals should be approximately uncorrelated if Zt is white noise,
independent if Zt is independent, or normal if Zt is normal.
132
If we want to establish whether the residuals are uncorrelated, then we can use the sample ACF
and PACF plots along with hypothesis testing procedures. One hypothesis testing procedure is
the portmanteau test based on the sample autocorrelation of the residuals as on the slide.
Also, the Ljung-Box test is a slight variation of the Portmanteau test, while the McLeod-Li test is
based on the sample autocorrelation of the squared residuals.
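A minimal R sketch of these tests, assuming the fitted ARMA object is called mod and that we test the first 10 lags (both are assumptions, not the lecture's exact code):

resids <- resid(mod)                                   # model residuals
Box.test(resids, lag = 10, type = "Box-Pierce")        # portmanteau test
Box.test(resids, lag = 10, type = "Ljung-Box")         # Ljung-Box variation
Box.test(resids^2, lag = 10, type = "Ljung-Box")       # McLeod-Li idea: squared residuals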
133
In order to evaluate the goodness of fit with respect to a distribution, for example the normal, we can use the quantile-quantile plot, which contrasts the quantiles of that distribution, for example the normal, versus the quantiles of the process of interest, in this case the residuals.
I usually recommend accompanying the Q-Q plot with a histogram of the residuals, for a more detailed assessment of the distribution of the residuals.
In summary, in this lesson, I introduced methods for order selection and for performing residual
analysis in ARMA modeling.
134
2.2.5(10) ARMA Modeling: Data Example
This lecture is on ARMA modeling. And in this lesson, I will illustrate ARMA estimation and order
selection with a data example.
In this lesson I will return to the emergency care data example used to illustrate basic time series modeling. In this example, time series models can be used to forecast future ED patient volume, based on the estimated effect of predicting variables. Such forecasts can be used for proactive bed and staff management, and for facilitating patient flow.
A number of factors can influence daily ED visits, and the patient visits forecasting model should include those factors. Previous studies have shown that ED visits present cyclical variations according to calendar variables such as day of the week, time of the year, and the occurrence of public holidays. The objective of this analysis is to develop models for identifying temporal patterns, and for predicting the volume of patients in an emergency department.
The data consist of daily number of patients seeking ED care in a hospital in Georgia, in the
United States. The ED volume was observed over a period of more than five years from 2010
until about mid 2015. In this study we'll consider temporal factors, although other factors could be used, for example external factors such as temperature, rainfall, major holidays, and school season, among others, along with hospital data, for example the percentage of patients seeking care in the ED who have public insurance.
135
In the previous lecture I illustrated how to estimate the trend and seasonality for this particular data example. Note that we applied the modeling on the transformed time series, since the observations in the time series are counts, and thus have a Poisson distribution, potentially indicating non-constant variance over time.
The model I selected here for removing the trend and seasonality uses non-parametric trend estimation with splines smoothing, along with modelling monthly and weekly seasonality. Specifically, I'm using the gam R command available in the mgcv library to fit the trend and seasonality. For this, I defined the vector of equally spaced time points along with qualitative variables specifying the month and day of the week for each observation, all to be integrated into the estimation of the trend and seasonality simultaneously using the gam function.
Last, we take the difference between the transformed ED volume counts and the estimated trend and seasonality to obtain the residual process, to be further analyzed using ARMA modeling.
Note that here, we estimate the trend and seasonality in order to reduce the time series to a stationary process. We want to subtract/eliminate the trend and seasonality, and we'll analyze the residual process using ARMA modeling.
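A sketch of this decomposition step, assuming the transformed daily counts are stored in ed.trans with month and day-of-week labels month and wday (all object names here are assumptions):

library(mgcv)
time.pts <- seq(0, 1, length.out = length(ed.trans))   # equally spaced time points
fit.gam <- gam(ed.trans ~ s(time.pts) + as.factor(month) + as.factor(wday))
resid.process <- ed.trans - fitted(fit.gam)            # trend and seasonality removed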
136
Let's take a closer look at the ACF and PACF plots to assess stationarity.
The ACF plot is on the upper plot, and the PACF is on the lower plot. For both, we see that a relatively small number of lags have large sample ACF and PACF values, indicating that the trend has been removed. For evaluating the performance of the model for removing the seasonality, we need to plot the sample ACF for a larger number of lags, to capture two or more seasonality periods.
Next we'll fit an AR model only. The R function is ar, with input consisting of the time series, the residual process in this case, and the maximum order for which to fit the AR model. Here the maximum order is set to 20. In order to view the content of the output for this model fit, you can use the summary command. The order selected for the best fit according to AIC is given by mod$order, and it is equal to six for this model.
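A minimal sketch of this step, where resid.process is assumed to hold the detrended, de-seasonalized series:

mod <- ar(resid.process, order.max = 20)   # AR fit with order chosen by AIC
mod$order                                  # selected order; equal to six in the lecture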
137
I also plotted the AIC values, but on the log scale, to better see where the minimum of AIC is reached. The AIC plot is here. Again, the selected order is six, as also shown by this graph; the AIC is minimum for this order.
Let's consider the property of stationarity, and whether the fitted model is a causal process. For
that, we need to extract the roots of the AR polynomial, and check whether they are not on the
unit circle, indicating stationarity, and whether the roots are all outside of the unit circle indicating
a causal process.
NOTE: The following section differs from the current text and code of the video lecture.
So, the following is subject to change, when we figure out what the latest level is. See
Piazza @73.
138
From this plot, none of the roots of the AR polynomial are on the unit circle, thus the fitted AR process is stationary. However, all the roots are within the unit circle, thus the process is not causal.
Further, we can evaluate the residuals for stationarity and other properties. For this, we extract the residuals from the fitted model, and then plot the residuals, the ACF and the PACF of the model residuals, and the quantile-quantile normality plot.
139
[Section End - 6:02 in video lecture]
Here are the residual plots. The residual plot does not show a pattern. The variance of the residuals is also constant. The ACF plot has only the first lag equal to one, while all other values at the other lags are small, within the confidence band of the sample ACF. The same for the sample PACF: the values are all within the confidence band. The Q-Q norm plot shows a tail on the left.
Finally, we'll fit an ARMA model, not only an AR model. The R command to fit an ARMA model is arima, where we specify the time series, in this case the residual time series after removing the trend and seasonality. We also need to specify the orders for the AR and MA components of the model. For this fit, I'm using order 6 for AR, as specified by the best AR model fitted previously, and order 1 for the MA part.
140
Last, we can specify the estimation method; if we specify ML, this means that we fit using maximum likelihood assuming normality.
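A minimal sketch of this fit (object names are assumptions):

mod.arma <- arima(resid.process, order = c(6, 0, 1), method = "ML")  # AR(6), MA(1), maximum likelihood
mod.arma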
Now, let's look at the residual plots of this fitted model. Here are the plots. The residual plot does not show a pattern. The variance of the residuals is also constant. The ACF plot has only the first lag equal to one, while all others are small, within the confidence bands of the sample ACF. The same for the sample PACF: the values are within the confidence band. The Q-Q norm plot shows a tail on the left. These are similar observations to those from fitting the AR model alone. And that's not surprising, since we only added an MA(1) part to the AR(6) model.
141
Note that the orders used in fitting the model in the previous slide are not necessarily better in terms of model fit than other orders. In fact, the question we'd like to address is: what are the orders that provide the best fit? We can apply the two methods discussed in the previous lesson.
The first method is the simple EACF approach, which can be implemented in R using the eacf function available in the TSA library. The input of this function consists of the time series along with the maximum AR and MA lags for which to obtain the EACF.
The EACF for the residual process is here. The EACF is provided for up to six lags for both the AR and MA components, as specified in the implementation of EACF. The EACF does not provide a clear cut order selection. The first zero is for AR order 1 and MA order 1. But there are other MA orders for AR order 1 for which there are non-zero EACF values. And there are other AR orders for MA order 1 for which there are non-zero EACF values. After AR order 1 and MA order 4, we see that all other EACF values are 0, except for two cases. There is not a conclusive order selection using EACF.
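A minimal sketch of the EACF step, assuming the TSA package is installed and resid.process holds the residual series:

library(TSA)
eacf(resid.process, ar.max = 6, ma.max = 6)   # table of x's and 0's for AR/MA orders up to 6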
We can also select the orders using the minimum AIC approach. For this, we need to fit the ARMA model for all combinations of AR and MA orders, up to a maximum order for both. Here, I set the maximum order to be 5, and consider orders from 0 to 5 for both the AR and MA components. Note that I specified here norder equal to 6, but that means I consider the orders from 0 to 5, as specified in the definition of p and q. I looped over all combinations of the orders, fit the ARMA model, and then saved the AIC values in a matrix.
142
Next, we obtain the p and q orders for which we have the minimum AIC value. The R code provided here first labels each AIC value with its corresponding p and q orders. Then the which command finds the index in the vector of AIC values where the minimum is reached, and the orders for this minimum are then extracted from the p order and q order vectors. The selected orders are p equal to five and q equal to five, the largest orders allowed. Last, we fit the model for the selected orders, called here the final model.
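A sketch of this order search, not necessarily the lecture's exact code; object names are illustrative, and the search can be slow or fail to converge for some order combinations:

norder <- 6
p <- 0:(norder - 1); q <- 0:(norder - 1)               # candidate orders 0 to 5
aic <- matrix(NA, norder, norder)
for (i in 1:norder) {
  for (j in 1:norder) {
    fit <- arima(resid.process, order = c(p[i], 0, q[j]), method = "ML")
    aic[i, j] <- fit$aic
  }
}
ind <- which(aic == min(aic), arr.ind = TRUE)[1, ]     # position of the smallest AIC
porder <- p[ind[1]]; qorder <- q[ind[2]]
final.model <- arima(resid.process, order = c(porder, 0, qorder), method = "ML")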
These are the residual plots for the fitted ARMA(5,5). The residuals do not improve or change significantly from the smaller models, AR(6) or ARMA(6,1), suggesting that a smaller model than the one selected using the AIC approach may perform similarly in terms of goodness of fit, and thus be preferred over the selected ARMA(5,5). Less complex models, in other words those with smaller AR and MA orders, are preferred to avoid over-fitting, or to obtain better predictions.
I applied the tests for uncorrelated residuals to both the ARMA(5,5) final model, as well as to the smaller ARMA(6,1) model. This is the output for this implementation, with a summary of each of the tests. The p values are all large, indicating that the null hypothesis of uncorrelated residuals is plausible. Thus, according to these tests, both models perform well in modeling the temporal correlation in the time series process.
144
In summary, in this lesson, I illustrated ARMA modeling estimation and order selection with the
example where we're interested in modeling the time series of the emergency department
patient volume.
145
2.3: Other Models
To review, the ARMA model consists of two components, joint AR and MA models. We'll expand this model further, as I will show on the next slides.
A generalization of the ARMA model is the autoregressive integrated moving average, abbreviated ARIMA. The ARIMA model can be used to model non-stationary processes.
The non-stationarity is eliminated by a differencing step corresponding to the integrated part of the model. The differencing can be applied more than once, until the resulting time series after differencing is stationary.
146
How can we express this formally? Given a time series Xt, we apply the differencing polynomial 1 - B to the power of d. This additional polynomial applies one-lag differencing d times. The resulting process after differencing Xt is further modeled using an ARMA process. If the process Yt after differencing is a causal process, then Xt is said to be an ARIMA process of orders p, d, q.
If we multiply the AR polynomial phi(B) with the differencing polynomial 1 - B to the power of d, we obtain the polynomial phi*(B). Thus, the ARIMA model can be written using a similar structure as the ARMA process, except that the phi*(B) polynomial of ARIMA has d zeroes on the unit circle.
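In symbols, with $\phi(B)$ and $\theta(B)$ the AR and MA polynomials and $Z_t$ the white noise, the ARIMA(p, d, q) model can be written as

$$\phi^*(B)X_t \;=\; \phi(B)\,(1-B)^d X_t \;=\; \theta(B)\,Z_t,$$

so that $Y_t = (1-B)^d X_t$ follows the causal ARMA(p, q) model $\phi(B)Y_t = \theta(B)Z_t$.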
One way to identify whether a process is non-stationary, or ARIMA, is that an ARIMA model usually exhibits slow decay in the autocorrelation function.
147
In order to obtain the best unbiased linear prediction for an ARMA process, we assume that the ARMA process is a causal process. For an ARIMA process, this assumption does not hold, and thus we cannot apply the prediction methodology for ARMA directly. Importantly, the expectation and the covariance of the ARIMA process are not uniquely determined.
In order to obtain the best linear predictor for Xn+h, given the past observations of the time series, we need to first obtain the best linear predictor for Yn+h of the Yt process obtained after differencing. The prediction formula for Xn+h is provided here. It is the difference between the best linear predictor of Yn+h and the weighted sum of the lower lag predictions, Xn+h-j. Thus, to obtain the prediction at lag h, we need to solve this recursively.
148
For this prediction of the ARIMA model, the variance can be estimated approximately as the sum of the squared coefficients psi times the variance of the white noise process Zt, where the coefficients psi are derived similarly to the coefficients of a causal process.
The ARIMA model described so far can be used when there's trend in the time series. The trend
can be eliminated through differencing.
149
Another extension to the ARMA model is the seasonal ARMA, which can be used when there is seasonality. In order to model seasonal time series, we use different lags in the AR and MA components. For example, for a seasonal MA process of order capital Q, we do not use the lags Zt-1, Zt-2 up to Zt-Q, but we use Zt-s, Zt-2s up to Zt-Qs, where s is the period.
Similarly, for a seasonal AR process, instead of using the lags of Xt, Xt-1, up to Xt-P, we use the lags Xt-s, Xt-2s, up to Xt-Ps. Combining these two components gives us the seasonal ARMA process.
150
In this slide, I'm providing a more formal representation of the seasonal ARMA process, with the specification of the ARMA orders of the stationary process, small p and small q, along with the orders used to model the seasonality as in the previous slide, capital P and capital Q.
To express this in the model equations, we define the four polynomials as provided on this slide
with the first two defining the autoregressive models and the second two defining the moving
average models.
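Putting the four polynomials together, the seasonal ARMA model with orders (p, q), seasonal orders (P, Q), and period s takes the standard form

$$\phi(B)\,\Phi(B^s)\,X_t \;=\; \theta(B)\,\Theta(B^s)\,Z_t,$$

where $\phi$ and $\theta$ have degrees $p$ and $q$, and $\Phi$ and $\Theta$ have degrees $P$ and $Q$.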
We can now combine the two models, ARIMA with seasonal ARMA to obtain the seasonal
ARIMA. Let's take a closer look with one example.
151
Consider the basic decomposition discussed in a previous lecture, that is, decompose a time series into a trend component Mt, a seasonality component, and the residual process Zt. Assume further a model for the trend and seasonality where the trend only depends on the previous lag, and the period of the seasonality is s. If we apply the differencing 1 - B for the trend, and 1 - B to the power of s for the seasonality, then the residual process is stationary with autocorrelation at lags 0, 1, s, and s+1, which agrees with the seasonal ARMA model with order (0,1) for the stationary component and (0,1) for the seasonal component, with seasonal period s.
This is one particular example of a seasonal ARIMA. More generally, we can consider the 1 - B to the power of d differencing polynomial for the trend, and 1 - B to the power of s, all to the power of capital D, for seasonality. Because there are many orders to determine, those of the ARIMA, p, d, q, along with those of the seasonal ARIMA, capital P, capital D, capital Q, such a model is difficult to implement.
In summary, in this lesson I introduced two extensions of the ARMA model: the ARIMA, which is an extension of the ARMA for non-stationarity due to trend, and the seasonal ARMA, which is an extension accounting for seasonality.
152
153
2.3.2(12) ARIMA Modeling: Data Example
This lecture is on the concept of ARMA models. And in this lesson, I will conclude this lecture with an example illustrating ARIMA modeling. In this lesson, I'll return to the Emergency Care data example used to illustrate basic time series modeling. In this example, time series models can be used to forecast future ED patient volume.
So far, we estimated the trend and seasonality of this time series, and then subtracted those two components from the time series to obtain an approximately stationary process to be modeled with ARMA.
In this lesson, we'll consider the differencing approach. First, let's consider differencing to account for seasonality. Note that we've identified both monthly and day-of-the-week seasonality. To difference for monthly seasonality, we need to instead difference annually, since the ED volume data are recorded daily. For the day-of-the-week seasonality, we can difference with frequency equal to seven, which will apply the difference between consecutive Mondays, Tuesdays, and so on. Note that in order to apply the diff R command, I transformed the time series using the ts command with frequency equal to 365.25, indicating that this is daily data.
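A minimal sketch of this differencing step, assuming the transformed daily ED volume is stored in a vector ed.trans (the names, and using lag 365 for the annual differencing, are assumptions):

ed.ts <- ts(ed.trans, frequency = 365.25)     # daily data
diff.week <- diff(ed.ts, lag = 7)             # day-of-the-week (weekly) differencing
diff.year <- diff(ed.ts, lag = 365)           # approximately annual differencing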
154
We next compare the time series plot for the original time series versus the differenced time series plots, along with their ACF plots. These are the time series plots.
The first one is for the transformed ED volume. The next is for the day-of-the-week differencing, and the last is for the monthly seasonality, which is in fact a yearly differencing of the time series. While we see clear periodic patterns in the original time series, those patterns are not evident in the time series plots of the differenced processes.
These are the ACF plots for the three time series. The upper plots are the ACF plots for the original time series, whereas the bottom ones are the ACF plots for the differenced processes. The ACF plot of the original time series clearly shows a periodic pattern while the PACF does not, as expected. The ACF plots for the weekly differencing and for the annual differencing processes show an improvement in terms of removing the seasonality, although there is still some potential seasonality or some other residual source of non-stationarity.
155
Next, I fitted a seasonal ARIMA. Because we have both trend and seasonality, we need to fit an ARIMA to account for the trend and a seasonal ARMA to account for the seasonality. The R command is the same as for the ARMA fit, arima, except that now we are fitting multiple orders, both for the ARIMA and for the seasonal ARMA.
The input orders for the ARIMA are five, one and five, meaning that we have an AR order of five, a differencing order equal to one, and an MA order equal to five. We also use a seasonal order of (1,0,1), which means that we're fitting a seasonal ARMA with AR order equal to 1 and MA order equal to 1. Note that one limitation of the seasonal ARMA is that it can only model one type of seasonality. In this data example, we established both weekly and monthly seasonality, and thus it would be more appropriate to fit both types of seasonality.
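A sketch of this fit, assuming a weekly seasonal period of 7 and that the transformed series is in ed.trans (both are assumptions):

fit.sarima <- arima(ed.trans, order = c(5, 1, 5),
                    seasonal = list(order = c(1, 0, 1), period = 7))
fit.sarima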
156
These are the residual plots. The first one displays the standardized residuals. From this plot we do not identify a pattern in the residuals. Instead, we see relatively constant variance over time.
The upper plot on the right is the ACF plot of the residuals. Note that for this, I specify a large lag to identify any other potential seasonality. The residual autocorrelations are all small, most of them within the confidence band, an indication that the residuals may be white noise.
The bottom plots investigate the assumption of normality, which is assumed in the maximum likelihood fit. The residuals do show skewness, although it is not very severe.
157
Last, let's see how we can forecast with ARIMA.
For this, I left out two weeks of the data, the last two weeks in the time series. Then I fitted the
same ARIMA model as provided in the previous slide. But this time, to the time series without
the last two weeks of data. Last, I applied the prediction command in R that allows us to predict
given the fitted model as input and given the number of lags ahead to obtain the prediction, in
this case, two weeks.
In this example, we compare the predictions for the two weeks with the observed time series.
We first form the confidence band of the prediction with the upper band provided by ubound and
the lower band provided by lbound. Then we plot the time series data. We do not plot the entire
history of the time series but only the last four weeks of data. We overlay the predicted values in
red along with the confidence band in blue.
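A sketch of this two-week-ahead forecast and plot, under the same assumed object names and seasonal period as above:

n <- length(ed.trans)
train <- ed.trans[1:(n - 14)]                          # hold out the last two weeks
fit <- arima(train, order = c(5, 1, 5),
             seasonal = list(order = c(1, 0, 1), period = 7))
pred <- predict(fit, n.ahead = 14)
ubound <- pred$pred + 1.96 * pred$se                   # upper confidence band
lbound <- pred$pred - 1.96 * pred$se                   # lower confidence band
idx <- (n - 27):n                                      # last four weeks of data
plot(idx, ed.trans[idx], type = "l", xlab = "Day", ylab = "Transformed ED volume",
     ylim = range(c(ed.trans[idx], ubound, lbound)))
points((n - 13):n, pred$pred, col = "red")             # predicted values
lines((n - 13):n, ubound, lty = 3, col = "blue")       # confidence band
lines((n - 13):n, lbound, lty = 3, col = "blue")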
158
The corresponding plot is here. The black line is the observed time series, the red points are the
predicted values, along with the confidence band in blue.
All the observed values are within the confidence band, and some of the predicted values are close to the observed values, except that they do not capture the volatility in the observed time series.
In summary, in this lesson, I concluded the lecture on ARMA modeling, with an illustrative
example of fitting seasonal ARIMA.
159
2.4: Data Examples
In this lesson, we'll focus on IBM, standing for International Business Machines. The company
was initiated in 1911 as the Computing-Tabulating-Recording Company, which was a merger of
four companies. Later in 1933 it became IBM. Since the company has been around for more
than 100 years, it has contributed to many innovations and has experienced many disruptive
events.
For example, in 1964 IBM announced the first computer system family, a breakthrough at the
time. In 1974, IBM developed the Universal Product Code, which changed the retail industry
and other industries since it allows tracking trade items. In 1981, along with the World Bank, it
developed financial swaps, a widely used financial product. In 1993, it experienced the highest
loss in its history to date, a loss of $8 billion. In 2005, it sold its personal computing business,
saying it was not sustainable. And in 2014 it sold its x86 server division. In 2015 and 2016, it
has started investing heavily in healthcare solutions and healthcare analytics.
160
What is a stock price? In financial terms, it is viewed as the perceived worth of the company, since multiplied by the number of shares it gives the company's total worth. It is generally affected by a number of things, including volatility in the market, current economic conditions, popularity of the company, and events such as the ones I mentioned in the previous slide.
The objective of the analysis in this study is to develop a model to predict IBM's stock price given that no major events are to be released, that is, a business-as-usual kind of model. The model presented here is general and applies to the stock price of other companies, although its performance will be different. The data consist of the daily stock price from January 2nd, 1960 until April 18th, 2017.
The daily stock price is available as the open price, the close price, the adjusted close price, and the high and low prices. We'll consider here the adjusted close, which is the common price analyzed when daily stock price predictions are sought.
Let's visualize the time series to draw insights on the model to be considered. We'll begin with reading the data file in R. The R command is read.table, since the data file is an ASCII file. The input is the name of the file, IBM daily.txt, and header equal to T, since the data columns have headers.
Next I use the as.Date command in R to convert the dates provided in the data into a date object in R. For this command, the input is the vector of dates converted into character, with the specification of how the dates are recorded, for example month, day, and year, in this order, separated by the slash sign. It is important to get this correct; otherwise you will read the dates incorrectly. Then I add this column to the data matrix and use the attach command with the data matrix so that all columns in the data are read in R as individual vectors.
Next, I define one of the events, called Truven, to be the date of acquisition of the Truven company in 2016. The other events are defined similarly. Then I use the ggplot command with input the date and the adjusted close price to plot the time series, along with vertical lines indicating the event of the acquisition of Truven in 2016, and also other events as will be provided in the plot.
The vertical lines for all other events are added similarly as for Truven. I do not provide the entire R code here. You can find the complete R code with this lecture.
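A sketch of this data preparation and plot; the column names, the exact date format, and the Truven date below are assumptions based on the narration, not the lecture's exact code:

library(ggplot2)
ibm <- read.table("IBM daily.txt", header = TRUE)
ibm$date <- as.Date(as.character(ibm$DATE), format = "%m/%d/%Y")  # month/day/year
truven <- as.Date("2016-02-18")                        # Truven acquisition announcement (assumed date)
ggplot(ibm, aes(x = date, y = Adj.Close)) +
  geom_line() +
  geom_vline(xintercept = as.numeric(truven), colour = "red")     # other events added similarly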
161
Here is a GG plot of the daily stock price, specifically the adjusted close price along with the
vertical lines indicating when all the events mentioned in the previous slide happened.
There are several characteristics to point out. The trend is overall monotone increasing with
fluctuations in the later decades. There is no clear seasonality that can be evaluated visually.
However, what is clear is the heteroscedasticity of the time series. The variability in the stock
price clearly shows a big change from being significantly smaller in the early years versus much
larger in later years.
We also note that the events did not clearly trigger significant changes, at least as seen on this scale. For earlier events it is hard to distinguish any differences because of the difference in scale. The price is low, with low volatility, until around 1995, with more distinguishable changes thereafter. In fact, in order to better distinguish patterns throughout the entire time period, it is a good idea to use transformations that stabilize the variance and amplify local patterns, such as the log transformation.
Here (in the lower plot above) I use this transformation on the adjusted close price, but
everything else is the same as in the previous GG plot command including the indication of the
events with the vertical lines. This is the GG plot for the log-transformed stock price on the bottom.
On the log scale, the patterns in the earlier years are much easier to distinguish. For example,
we can see the increase in the stock price after the first family computer was launched in 1964.
And we also see a significant drop in the price when the company had a very large loss in 1993.
The mergers and acquisitions in 2015 and 2016 are followed by an increase in the stock price.
162
Note that these observed increases and decreases of the stock price are associated with specific events, but are not necessarily caused by the events; they can also be triggered by other macroeconomic events or company performance. In both plots, the increasing trend is the most prevalent pattern.
One way to remove a trend is to consider the differenced process. Here I redefine the stock price as a time series with a frequency specifying that this is daily data starting with January 2nd, 1960. Then I use the diff command to take the difference, with the default of a first order difference. I also plotted the ACF and PACF to see whether further differencing of the data is needed and/or whether the resulting process is stationary.
Here are the ACF and the PACF, or partial autocorrelation function, plots. Interestingly, the two plots look just like those for the white noise we simulated in a previous lesson. Do we still need, then, to apply an ARMA model to model the differenced process, or is it enough to take the difference, since the differenced process seems to look like white noise? We'll address this question in this lesson and the next lessons.
163
Next, I applied ARIMA modelling. I used ARIMA to take into account the trend through differencing. I assume only one order of differencing, and thus set the differencing order fixed to one. Here I only select the orders of the AR and MA polynomials. We obtain here the p and q orders for the model with the minimum AIC value. For this, we need to fit the ARMA model for all combinations of AR and MA orders up to the maximum order for both polynomials.
Here I set the maximum order to be five, and consider orders from zero to five for both the AR and MA components. Note that I specify here norder equal to six, but that means I consider the orders from zero to five, as specified in the definition of p and q.
I loop over all combinations of the orders, fit the ARMA model for all those combinations, and save the AIC values in a matrix for all the combinations. The selected orders are p equal to two and q equal to zero. Last, we fit the model for the selected orders, called here the final model.
164
Here are the residual plots for the fitted ARIMA model. The R code for obtaining these plots is provided in the complete R code for this data example, available with this lecture.
The residual plot does not show a pattern. The variance of the residuals is also constant.
The ACF plot has only the first lag equal to one, while all other values of the sample ACF are small, within the confidence band of the sample ACF. The same for the sample PACF: the values are all within the confidence band. The Q-Q normal plot shows both a left and a right tail, an indication that the residuals may have more of a t-distribution than a normal distribution, although they are otherwise quite symmetric.
165
Last, we also apply the hypothesis testing procedures for independence, or serial correlation, using the Box-Pierce and Ljung-Box tests. The R commands are on this slide.
The R output of this implementation is here. The R output provides the test values along with the p value of each test. Note that the null hypothesis in these tests is that the time series process (here it is the residual process) consists of uncorrelated variables. Thus, this is one rare case where we want large p values, so that we do not reject the null hypothesis. The p values for both tests are large, indicating that it is plausible for the residuals to be uncorrelated.
In summary, in this lesson I illustrated model fitting with ARIMA modeling, with the data example where we're interested in predicting and modeling the IBM stock price.
166
2.4.2 IBM Stock Price: Forecasting
This lecture is on illustrating ARMA models with a particular data example. In this lesson, I will focus on forecasting with the ARMA model for the IBM stock price.
I'll remind you that the goal of the analysis in this study is to develop a model to predict IBM's stock price. We'll only consider here the adjusted close price, which is the common price analyzed when daily stock price predictions are sought.
167
We'll first implement and evaluate a 10-days-ahead prediction. For this, I set apart the last 10 days of the time series, fit the model based on the time series discarding the last 10 days, and then predict the last 10 days to compare with the observed values of the time series for these 10 days.
The predict command in R takes as an input the ARMA object along with the number of lags
ahead for the prediction. The prediction provides the predicted time series values, along with
point-wise confidence intervals for the prediction. Then we plot the time series data along with
predicted values and the confidence band. We do not plot the entire history of the time series,
but only the last 50 days of the time series. We overlay the predicted values in red along with the confidence band in blue.
A few noteworthy points are as follows. First, the time series is recorded from the most recent days back to the 1960s; thus, I use the rev command to reverse the time series. Second, for both the observed data and the predicted time series values, I transform them back using the exponential function, since here we predicted the log of the stock price, and not the stock price itself. Thus, for evaluating the predictions of the stock price, we need to transform back with the inverse of the log function, the exponential. The same exponential transformation is applied to the lower and the upper bounds.
168
The corresponding plot is here. The black line is the observed time series. The red points are the predicted values, along with the confidence bands in blue. All the observed values are within the confidence band. The predicted values are somewhat close to the observed values, except that they do not capture the decreasing trend which begins during these 10 days of forecasting. The confidence band is quite wide, indicating that there is large volatility, or uncertainty, in the prediction.
But how good are these predictions? We can compare the predictions with the observed time series values. Note that in the real world, we do not have the observed time series in the future at the time of prediction, and thus we cannot evaluate the prediction accuracy of the model. In this case, however, we first pretend we do not have the observed time series for the last 10 days, and predict the time series for these days to compare with the observed time series.
Generally, the question “How good is a prediction?” is comprised of two separate aspects.
First, measuring the predictive accuracy, per se. Second, comparing various forecasting models.
169
Here we simply evaluate predictive accuracy. The most common measures of predictive accuracy are the Mean Squared Prediction Error, abbreviated MSPE; the Mean Absolute Prediction Error, abbreviated MAE; percentage measures, such as the Mean Absolute Percentage Error, abbreviated MAPE; and the Precision Measure.
The four measures can be computed in R as provided on the slide (see the sketch after this list):
● MSPE is the sum of squared differences between predicted and observed.
● MAE is the sum of absolute values of the differences.
● MAPE is the sum of the absolute values of the differences, scaled by observed
responses.
● The precision error is the ratio between the mean square prediction error and the sum of
the square differences between the responses and the mean of the responses.
● MSPE is appropriate for evaluating prediction accuracy for models using the best linear
prediction approach. But it depends on the scale, and it's sensitive to outliers.
● MAE is not appropriate to evaluate prediction accuracy of the best linear prediction, and
it depends on the scale, but it is robust to outliers.
● MAPE is not appropriate, again, to evaluate prediction accuracy of the best linear
prediction, but it does not depend on the scale, and it is robust to outliers.
● PM, or the precision measure, is appropriate for evaluating
prediction accuracy of the best linear prediction, and it does not depend on the scale.
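A sketch of these measures, assuming obs holds the 10 observed values and pred the corresponding predictions, both on the price scale; the lecture code may use sums rather than means, so treat this as illustrative:

MSPE <- mean((pred - obs)^2)                             # mean squared prediction error
MAE  <- mean(abs(pred - obs))                            # mean absolute prediction error
MAPE <- mean(abs(pred - obs) / obs)                      # mean absolute percentage error
PM   <- sum((pred - obs)^2) / sum((obs - mean(obs))^2)   # precision measure
c(MSPE = MSPE, MAE = MAE, MAPE = MAPE, PM = PM)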
170
The PM measure is reminiscent of the regression R squared used in the linear regression.
It can be interpreted as the proportion between the variability in the prediction and the variability
in the new data. While MAE and MAPE are commonly used to evaluate prediction accuracy, I
recommend using the precision measure.
Another approach, a fifth one, for evaluating a prediction is by checking whether the observed values fall within the prediction intervals, as provided in the last command.
For this data example, the accuracy measures are provided in this R output.
The precision measure is 5.38, which means that the ratio between the variability in the prediction and the variability in the new data is 5.38. That is, the variability in the prediction is significantly larger than the variability in the data. The closer this is to zero, the better the prediction is. Here, however, the precision measure is quite large, indicating poor performance in predicting the 10 days.
Last, we also note that all of the observed responses fall within the prediction intervals. However, as I pointed out before, the prediction bands are quite wide, indicating significant uncertainty in the predictions.
171
Let's also consider prediction of the 10 days, but this time on a rolling basis. That is, for each day, we fit the model with the entire time series up to that day, and predict only one day ahead. We apply this for each of the 10 days. To do this, I used a loop command for i from 1 to 10, where each i corresponds to one day. We then save not only the prediction but also the prediction intervals. Last, I plot the predictions based on this approach along with the observations of the time series. The complete R code is not provided on this slide, but it's available with this lecture.
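A sketch of the rolling one-day-ahead scheme, assuming logprice holds the full log-price series in time order and using the ARIMA(2,1,0) orders selected earlier (names and details are assumptions):

n <- length(logprice)
pred.roll <- ub.roll <- lb.roll <- rep(NA, 10)
for (i in 1:10) {
  train <- logprice[1:(n - 10 + i - 1)]                # all data up to the day before day i
  fit <- arima(train, order = c(2, 1, 0), method = "ML")
  p1 <- predict(fit, n.ahead = 1)                      # one-day-ahead prediction
  pred.roll[i] <- p1$pred
  ub.roll[i] <- p1$pred + 1.96 * p1$se
  lb.roll[i] <- p1$pred - 1.96 * p1$se
}
exp(pred.roll)                                         # back-transform to the price scale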
The plot here compares the predictions from this approach, where the predictions for each of the 10 days are obtained on a rolling basis, to the predictions obtained from the previous approach, where the predictions for all 10 days are obtained all at once. As expected, the predictions using the one-day-ahead approach, shown in green, are much closer to the observed time series than the predictions from the 10-days-ahead approach, shown in red. Moreover, the confidence bands from the rolling-basis predictions, shown in purple, are much tighter than for the 10-days-ahead predictions. This is again expected, since there is less uncertainty from one day to the next than when looking 10 days ahead.
172
While the one-day-ahead approach on a rolling basis is preferable, since the predictions are significantly better, we cannot always use such an approach, since there are situations when we want to predict 10 days ahead in the future rather than one day ahead. Generally, different models may be used for predictions at different lags ahead.
Similar to the previous approach, we can also evaluate the accuracy of the prediction using
accuracy measures. The R code is the same as before, except that it is now applied to the
predictions based on the one day ahead rolling basis approach.
173
The R output is here. I'll focus here on the interpretation of the precision measure, the PM measure. Compared to the 10-days-ahead prediction, the PM measure is 0.4, as opposed to 5.38, which was the measure for the 10-days-ahead predictions. As mentioned previously, we seek a precision measure that is close to zero, as it is for this prediction.
Last, I also note that the observed responses fall within the prediction intervals. The confidence intervals also show less uncertainty.
In summary, in this lesson, I illustrated the data example on predicting the stock price for IBM.
174
2.4.3 Alaska Moose Population: ARMA Modeling
This lecture is on ARMA modeling, illustrated with data examples. And in this lesson, I'll introduce another example, which is related to the prediction of the Alaska moose population. I will apply ARMA models and forecasting to this particular data example.
The population of wild moose in areas such as Alaska often fluctuates over several years, due
to both human influence and natural factors. While the fluctuations are rarely drastic across a
given year, they do exhibit a general rise-and-fall trend over time. The endpoint of this study is
to predict the Alaska moose population. In this lesson and the one following, we'll ignore other
factors which may impact the rise or fall of the moose population. We'll simply predict using the
total yearly moose population alone. In a different lecture, we'll consider other factors and
compare the predictions from this lecture.
Why is it important to predict the Alaska moose population? For many Alaskans, moose meat is
the largest source of food income for a family, as a typical moose weighs approximately 1,200
to 1,500 pounds and provides a typical family with 500 to 600 pounds of edible food. Thus, it is
of great importance to the Alaskan citizens to manage and control the moose population to
ensure the sustainability of this population. Not only harvested for its food source, the moose is
harvested for clothing purposes as well.
Moose also serve as an important tourist attraction, which helps to bring several thousand people to this great state every summer. The findings established in such a study can
be of particular use to the Alaska Division of Wildlife Conservation, which tracks populations of
species across several years for scientific and conservation purposes. Currently, extensive
surveys of the moose population are required every few years to approximate this population
and set the appropriate hunting quota.
175
The data provide relevant factors related to the moose population, including the population of Fairbanks, which is close to the location, unit 20A, from which we have the data on the moose population, the Alaska wolf population, average snowfall, and moose harvested. Again, we will ignore these factors at this point, but will return to this study by accounting for these factors in a different lecture.
The years of the data are from 1965 to 2006, and the data are observed yearly. The data come
from multiple sources, but in particular, the Alaska Department of Fish and Game.
We begin by reading the data file in R and converting the columns into time series. Because the
data are observed yearly, the frequency for each time series is equal to 1.
176
The time series plots of the factors in the dataset are here. We'll focus on the total moose
population in the study.
We can see that many of the factors provided have some overlapping patterns with the time
series of the total moose population overall. The total moose population had a fall around 1975,
but it has increased steadily since then. There is a little volatility also. It seems that a trend
predominates the behavior of this time series.
Since the trend is the predominant source of variation in the time series across time, we'll first explore the first order differenced time series, by plotting the differenced time series along with the ACF and PACF of the differenced process.
177
The plots are here. In the upper left plot, you'll find the original time series, to compare it to the differenced time series in the upper-right plot. While we see a clear trend in the original time series, one differencing is enough to remove this trend. However, we still see some patterns in the differenced time series, indicating that one order of differencing may not completely remove the trend or other patterns that could lead to non-stationarity of the time series.
The ACF and PACF plots do not indicate non-stationarity. They both have only one lag outside of the confidence band, which indicates stationarity.
We'll continue with the first approach by removing the trend and then applying ARMA to the
detrended time series. To remove the trend, we'll use a splines regression approach introduced
in the first lecture of this course. I first define the time points, which are simply equally spaced
values between zero and one. Then I apply the R command gam, for splines regression, where
the gam is from the library mgcv. Note that in order to specify a non-parametric trend, we need
to transform the time points' vector using the s option in gam. There are other splines regression
models implemented in the gam command. Make sure to read the help manual of this R
command for other options.
I then extract the fitted values of the model to obtain the estimated trend, and transform the vector of fitted values into a time series with the same time specifications as the original time series.
178
Here we plot the time series along with the trend; we thus overlay the fitted trend on the observed time series values. I also plotted here the ACF and PACF plots of the residual process, after subtracting the fitted trend.
Here are the plots. In the first plot, we have the observed time series versus the fitted trend.
We can see that the black line, which is the observed time series, is very similar to the blue line,
which is the fitted line. The upper right plot represents the detrended time series. The detrended
time series shows that the trend has been removed. The ACF and PACF plots show also that it
is plausible that the detrended time series is stationary.
179
Next we apply ARMA modeling to the detrended time series, called here the residual process. Similarly to other examples used in this lecture to illustrate ARMA, we'll first select the AR and MA orders using the AIC approach. We thus first fit the model for orders zero through five for both the AR and MA orders, and then select the orders for the AR and MA polynomials such that we have the minimum AIC value.
The selected orders here are 2 for the AR polynomial and 3 for the MA polynomial. Next, I fitted the final model with the selected orders.
Here is the output for the tests of uncorrelated data applied to the residuals of the final ARMA model. The p values for these tests are small, indicating that we reject the null hypothesis that the residuals from the ARMA fit are uncorrelated.
180
Let's also look closer at the residual plots next. These are residual plots we have seen before, and the R commands used to produce such plots are thus not provided here.
The residuals versus time are patternless, with lower volatility in the middle. Both the ACF and PACF plots indicate the plausibility of white noise. And from the Q-Q norm plot, the quantiles of the residuals line up very closely with the quantiles of the normal distribution. Thus, while the test of uncorrelated variables suggests potential correlation in the residuals, the residual plots do not, except that the variability of the residuals may not be constant over time.
In summary, in this lesson, I illustrated ARMA modeling with yet another data example, which
shows the wide applicability of the ARMA modeling.
181
2.4.4 Alaska Moose Population: Forecasting
This lecture is illustrating ARMA models with data examples. In this lesson, I will continue the data example of modeling the Alaska moose population, with a focus on forecasting. While we have data on factors influencing the population of moose in Alaska, in this lecture we only focus, again, on forecasting the moose population without knowledge of other factors, to compare later with predictions obtained using the other factors as well.
I illustrate here how to obtain predictions using the approach where we decompose the process into a trend and a process modeled using ARMA. Here, we'll forecast the total moose population for four years, not on a rolling basis, but all four years ahead. We set aside the data for the four years and fit a trend on the remaining years, that is, years 1965 to 2002, and fit the trend based on these data.
We predict the trend using the predict command. To apply the predict command, we need to first define the new data for which we seek to predict future values, those being the four years ahead. Note that when defining the new data, we need to define it as a data frame, and we need to specify that the new data is for the predictor used in the fitted model, here defined as x. If you do not correctly define the new data as a data frame, or do not specify correctly that the predictor is the one used in the fitted model, you'll get an error message.
The residual process is now defined based on this fitted trend, since we assume we have not yet observed the 2003 to 2006 moose population. The ARMA model is fitted on this residual process. We use the same orders as identified initially, although we should have applied the order selection approach once more using the data from 1965 to 2002. Next, we apply the predict command, but for the ARMA fitted model. Last, we sum up the prediction of the trend and the prediction of the ARMA model to get the final prediction based on this approach.
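A sketch of this forecasting approach is below, under the same assumed object name moose.ts and the orders reported above.
## Trend + ARMA forecast for 2003-2006 (sketch; moose.ts is an assumed object name)
library(mgcv)
n.all <- length(moose.ts)                              # yearly data, 1965 to 2006
n.train <- n.all - 4                                   # hold out the last four years
moose.train <- moose.ts[1:n.train]
time.pts <- seq(0, 1, length.out = n.all)
x <- time.pts[1:n.train]
gam.fit <- gam(moose.train ~ s(x))                     # trend fitted on 1965-2002 only
new.x <- data.frame(x = time.pts[(n.train + 1):n.all]) # new data must be a data frame
trend.pred <- predict(gam.fit, newdata = new.x)        # predicted trend for 2003-2006
resid.train <- moose.train - fitted(gam.fit)           # residual process from the fitted trend
arma.fit <- arima(resid.train, order = c(2, 0, 3))     # same orders as identified initially
arma.pred <- predict(arma.fit, n.ahead = 4)$pred
final.pred <- trend.pred + arma.pred                   # final prediction: trend + ARMA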
182
We'll next apply ARIMA modeling, which takes into account both the non-stationarity due to the trend and the ARMA jointly in one model. We'll specify only one order of differencing to account for the trend, and then select the orders of the AR and MA components of the ARMA using a similar approach as before. The final model is one where the AR polynomial order is equal to 1, the MA polynomial order is 0, and the differencing is of order 1.
This is the output for the two tests for uncorrelated data. Similar to the previous fit on the detrended time series, the p-values of the two tests are small. However, they're not smaller than 0.05, indicating that we do not reject the null hypothesis at the significance level of 0.05. But because they're smaller than 0.1, we reject the null hypothesis at the higher significance level.
We conclude that there may be some evidence to reject the null hypothesis of uncorrelated residuals. However, we need more data to conclude on the rejection of the null hypothesis.
183
These are the residual plots from the ARIMA fit. The residual plot is patternless, similar to the residual plot from the previous approach, with low variability in the middle indicating potentially non-constant variance over time. Both the ACF and PACF plots indicate the possibility of white noise. The QQ Normal plot, however, has a tail on the left, indicating that the normality assumption may not hold.
We performed residual analysis to assess goodness of fit. If there is a serious departure from the model assumptions, the results will indicate what those departures are, and thus how to revise or adjust the model.
Here the departures are non-constant variance and possibly non-normality. Both could be fixed using a transformation of the time series. We'll also learn about models for modeling non-constant variance in a different lecture.
184
Here we continue with forecasting the total moose population using the ARIMA model, and compare it to the predictions based on the previous approach, where we first fitted the trend and applied an ARMA model on the detrended time series. We first refit the ARIMA model based on the data from 1965 to 2002, thus discarding the last four years. Then we use the predict command to predict the four years 2003 to 2006 all at once.
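A minimal sketch of this step, again assuming the series is stored in moose.ts:
## Refit ARIMA(1,1,0) on 1965-2002 and forecast 2003-2006 (sketch)
moose.train.ts <- window(moose.ts, end = 2002)
arima.fit <- arima(moose.train.ts, order = c(1, 1, 0))
arima.pred <- predict(arima.fit, n.ahead = 4)   # all four years at once
arima.pred$pred                                 # point forecasts for 2003-2006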
We'll next compare the predictions based on the first approach, using trend estimation and ARMA, to the second approach, using the ARIMA model. We first have to specify the minimum and maximum values of the y-axis through ymin and ymax, as on this slide, to make sure that we plot the data and the predictions on the same scale. Then we plot the time series of the observed moose population for the last 20 years of the data, overlaying the predictions from the two approaches.
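A sketch of this comparison plot, assuming the objects created in the two forecasting sketches above (final.pred and arima.pred):
## Compare the two sets of forecasts on the same scale (sketch)
ymin <- min(moose.ts, final.pred, arima.pred$pred)
ymax <- max(moose.ts, final.pred, arima.pred$pred)
plot(window(moose.ts, start = 1987), ylim = c(ymin, ymax),
     xlab = "Year", ylab = "Total moose population")
lines(ts(final.pred, start = 2003), col = "red")        # trend + ARMA forecast
lines(ts(arima.pred$pred, start = 2003), col = "blue")  # ARIMA forecast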
185
Here is the comparative plot. The black line shows the observed time series for years 1987 to 2006. The red line is the prediction of the last four years using the first approach, and the blue line is for the second approach.
While neither prediction captured the decreasing trend in the last four years, the first approach is much closer to the observed time series than the second approach. In fact, ARIMA simply assumes an increasing trend, capturing the trend from 2000 to 2003. This is because the ARMA part of ARIMA adds little to the prediction, since it is an ARMA model of order one, while the prediction is primarily driven by the value of the last observation of the time series.
186
So far, I illustrated the applicability of ARMA with three examples:
1. The first example was for modelling and prediction of the volume of patients in an
emergency department, an application in healthcare of great relevance for hospitals.
2. A second application was the forecasting of the stock price of IBM, a common prediction
exercise for many financial investment companies.
3. A third example was a prediction of the moose population which can be used by
environmental agencies to control the population of moose.
And applications from the fields of climatology, for example forecasting temperature, and from the field of economics, for example projecting GDP, among many others, are subjects of modeling and forecasting using ARIMA modeling, but also other time series models, as we'll learn in the next lectures.
Overall, with the three examples used to illustrate ARIMA modeling and forecasting, I intended to show you some of the limitations of ARIMA. ARIMA modeling is easy to implement, but it captures the non-stationarity in trend assuming similarity among prior observations, and thus it can mispredict if there are large changes in the trend.
On the other hand, the trend plus ARMA estimation approach is more difficult to implement, particularly if we are also interested in obtaining confidence bands, but it can capture long-memory trends. Prediction using ARIMA can perform well if only short prediction periods are considered.
This lesson concludes the ARMA modelling lecture, with an illustration of ARMA modelling with one more example and an overview of the importance of ARMA modelling and its limitations.
187
Homework 2 Questions
https://docs.google.com/document/d/1ObgMnZ5xAYRuDY778o85FhyhrWcx80ozKd-HtIaOUdw/
edit?usp=sharing
For reference only. Do not add answers. These have been listed in a separate document to
allow for their complete removal from the lecture transcripts (even from revision history) at the
end of the semester.
188
Unit 3: Multivariate Time Series Modeling
What is a univariate time series? A univariate time series is a sequence of random variables
with some similarity in terms of the probability distribution, called a stochastic process. For a
time series, a sequence of random variables is indexed by time. In previous lectures, again, I
introduced statistical properties of such univariate time series.
We can expand the concept of univariate time series to multivariate time series, which means that instead of having one time series, we have a collection of n time series, possibly related or correlated in time.
189
Throughout this lecture, we will use multiple notations to refer to a multivariate time series. We can write it as boldface Y, which is the collection of all n time series together. But we can also refer to it as Yt, which is the collection of the time series data at time t only. Note that here, Yt is a vector of n variables, each being the time series value of one of the n time series.
Why do we need to model multivariate time series? Multivariate time series analysis is used
when one wants to model and explain the interactions and co-movements among a group of
time series variables. It is possible to improve the prediction of one time series by taking into
account other factors, as I will illustrate with three examples in this lesson. Simply taking into
account the historical data of the time series itself may not provide as good predictions as when
we also account for historical data of other factors.
Generally, multivariate time series analysis has been applied to econometrics and financial data. Examples are the simultaneous modelling of forward and spot exchange rates, or of interest rates, money growth, income, and inflation, or of consumption and income. But as I will illustrate in this lecture, there are many possible applications in other fields, for example in environmetrics and climatology, among others.
I'll begin illustrating the concept of multivariate time series using a classic econometrics example: the relationship between interest rates and unemployment, a topic of interest in the past three years in the United States, as the Federal Reserve, commonly referred to as the Fed, has started to slowly increase the interest rates. The Fed meets eight times each year to review economic and financial conditions and decide on monetary policies.
Monetary policy refers to the actions taken that affect the availability and cost of money and credit. At these meetings, short-term interest rate targets are determined. Before starting the increase of the interest rates, the Fed had established that it would start increasing the interest rates if the unemployment rate gets below a specific threshold.
But what is the relationship between interest rates and unemployment? There's a more direct correlation than most people understand. The relationship is driven by so-called wage pressure. When you hear the Fed saying that the economy is heating up and we have to raise interest rates, what they really mean is: labor is asking for and getting higher wages; we cannot have that, so let's make it harder to get money. Thus, we should expect that the interest rate is more accurately predicted if we take into account not only historical data for the interest rates, but also the unemployment rate. To do so, we will need to employ multivariate time series modeling.
190
These are the time series of the interest and unemployment rates observed monthly. We see that, over a period of 60+ years, there were periods of time where the two series show similar patterns. However, in the past years, since around 2009, the interest rates were at 0, while the unemployment went up and then down. Does the unemployment rate lead to changes in the interest rate? After how many lags in months does the interest rate respond to the reduction in the unemployment rate? We'll address this question in this lecture.
For the second example, we'll return to the case study of forecasting the Alaskan moose population. This case study was used to illustrate the application of the ARMA model in the previous lecture. The population of wild moose in areas such as Alaska often fluctuates over several years due to both human influence and natural factors. The endpoint of this study is to predict the Alaskan moose population. In this lecture, we will consider other factors which may impact the rise or fall of the moose population.
Why is it important to predict the Alaskan moose population? For many Alaskans, moose meat is the largest source of food income for a family. Thus, it is of great importance to Alaska's citizens to manage and control the moose population, to ensure the sustainability of this population. Moose also serve as an important tourist attraction that helps to bring several thousand people to this great state every summer.
The findings established in such a study can be of particular use to the Alaska Division of Wildlife Conservation, which tracks populations of species across several years for scientific and conservation purposes. The data provide relevant factors related to the moose population, including the population of Fairbanks, which is close to location unit 20A for which we have the data on the moose population, the Alaska wolf population, average snowfall, and moose harvested. The years of the data are 1965 to 2006, and the data are yearly. The data come from multiple sources, but we particularly acknowledge the Alaska Department of Fish and Game.
191
The time series plots of the factors in this data set are here. We will focus on the total moose population in this study. We can see that many of the factors provided have some overlapping patterns with the time series of the total moose population. The question we'll address in this lecture is whether we can forecast the total moose population with the aid of other factors. Does the wolf population impact the moose population? Does the moose population tend to decrease with heavy snow? Again, such questions can be addressed with a multivariate time series analysis.
From econometrics to wildlife animal populations, now I will demonstrate an application from environmetrics. We will study the river flow of the Chattahoochee. The Chattahoochee River originates in the Blue Ridge Mountains of the Appalachian Highlands in northeast Georgia. The upper Chattahoochee River watershed is the most important, if not unique, source of water supply for Atlanta. Thus, it is important to accurately predict the river flow or drainage. Our understanding of such environmental systems can be substantially improved by using multivariate time series analysis and modelling techniques. Two main time series, namely precipitation and ground air temperature, are used along with river drainage. The data are observed monthly over 60 years, since 1956.
192
These are the three time series. The first time series is the river drainage or flow. The time series presents some cyclical patterns, but there is not a clear trend over time. The next time series is for ground temperature; as expected, we see a clear seasonality. The third time series is for precipitation or rainfall. The seasonal variances in rainfall are not significant, even though there are visibly higher amounts of rain during winters. We expect a causal relation of river flow with respect to these time series during winter months.
When the potential for evapotranspiration is lowest in the year, the stream flows are mainly determined by the precipitation. During summer months, when the potential for evapotranspiration is higher than the amount of water falling within the watershed, the stream flows are consistently low, despite the variations in precipitation. Thus, it is possible for the seasonality in rainfall and temperature to also improve the prediction accuracy of the river flow.
How do we deal with seasonality in the individual time series, especially when seasonality in temperature and rainfall may impact the river flow? We'll address this, again, using multivariate time series.
In summary, in this lesson, I introduced three examples that I will use to illustrate multivariate time series.
193
3.1.2 Basic Concepts
In this lesson, I’ll briefly introduce the concept of multivariate time series.
To review, a multivariate time series is a collection of n time series, possibly related or correlated in time. Throughout this lecture, we'll use multiple notations to refer to a multivariate time series. We can write it as bold-faced Y, which is the collection of all n time series together, but we can also refer to it as Yt, which is the collection of the time series data at time t only. Note that here, Yt is a vector of n variables, each being the time series value of one of the n time series.
Why do we want to model multivariate time series? Multivariate time series analysis is used when one wants to model and explain the interactions and co-movements among a group of time series. It is possible to improve the prediction of one time series by taking into account other factors, as illustrated in the examples in the previous lesson. Simply taking into account the historical data of the time series itself may not provide as good a prediction as when we would also account for the historical data of other factors. As we'll learn in this lesson, dependencies captured in multivariate time series are within and between time series. Thus, when forecasting one of the n time series, we take into account both types of dependencies.
194
Similar to univariate time series, we also have the concept of stationarity in multivariate time series. A multivariate time series is stationary if the mean of each individual time series is constant over time, which means the expectation of the vector time series is a vector of means, mu, not varying in time. Moreover, the variance-covariance matrix of the vector time series Yt is a matrix, defined here as Gamma(0), which does not depend on t (on time). Note that Gamma(0), the variance-covariance matrix, captures the so-called contemporaneous dependence in the multivariate time series. I'll introduce the covariance matrix for lag dependence on a different slide in this lesson.
195
We can estimate the mean, the contemporaneous variance-covariance, and the correlations for a multivariate time series using the sample data. For the mean vector, we estimate each mean in the vector by the average of the corresponding time series. For the contemporaneous variance-covariance matrix, we can use the sample covariance estimator as provided on the slide.
196
For any multivariate stationary time series, we define the lag dependency using the autocovariance or autocorrelation functions. For a stationary multivariate time series, we have the autocovariance or autocorrelation for the individual time series. But we also have the so-called cross lead-lag covariances and correlations between all possible pairs of time series.
How are these defined? Consider a pair of time series from the multivariate time series with indices i and j. The lead-lag covariance for this pair is defined here on this slide by gamma ij at lag k, which is the covariance between the ith time series at time t and the jth time series at time t minus k. The lead-lag correlation is simply the scaled lead-lag covariance.
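In notation, a sketch of these two definitions (written here in LaTeX, consistent with the description above) is:
\gamma_{ij}(k) = \mathrm{Cov}\left(Y_{i,t},\, Y_{j,t-k}\right), \qquad
\rho_{ij}(k) = \frac{\gamma_{ij}(k)}{\sqrt{\gamma_{ii}(0)\,\gamma_{jj}(0)}}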
It's important to remember that the lead-lag covariances and correlations are not symmetric. For example, gamma ij at lag k is not equal to gamma ji at lag k. This means that the dependence between time series i at time t and time series j at time t minus k is different from the dependence between time series i at time t minus k and time series j at time t.
The lead-lag covariance captures the lead and lag relationships between time series in the multivariate time series. Yjt is said to lead Yit if the covariance between Yi at time t and Yj at time t minus k is different from 0. It is possible that Yit leads Yjt and vice versa. In this case, we say that there is feedback between the two time series.
Similar to the contemporaneous covariance and correlation, we can estimate the cross-covariance and cross-correlation using the sample data. Instead of estimating based on the contemporaneous products of pairs of time series, we now use the lead-lag products of pairs of time series. That is, we take the sum of the products between Yt and Yt minus k. Again, note that the matrices Gamma(k) and R(k) will not be symmetric. I'll illustrate this with an example in the next lesson.
197
3.1.3 Data Examples
In this lesson I will illustrate the concept of autocovariance and cross-covariance functions for
multivariate time series using a data example. Specifically, I will illustrate the estimation of the
dependence of multivariate time series using the data example in which we're interested in
capturing the relationship between interest rate and unemployment rate.
The two time series are in two different data files with different structures, as you will see when you open the files. The interest rate is read from the InterestRate.csv file, which consists of two columns, one for the date and one for the interest rate value. Thus, we can directly read the data column with the interest rate into a time series. The data file for unemployment is in the form of a matrix, where each row corresponds to a year and each column corresponds to a month. Thus, we'll convert the matrix of unemployment data values by using the as.vector command, which stacks the columns of a matrix into one vector. Since we want to stack the rows to keep the time order of the data, we first take the transpose of the matrix and then apply the as.vector command. We discard the first column of the data since it provides information on the years.
After this data processing step, we can convert the vector of unemployment rates into a time series. To combine the two time series into a multivariate time series in R, we can use the ts.union R command. This step is needed for estimating the autocorrelation and cross-correlation using the acf R command.
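A sketch of this processing step is below; the unemployment file name and the omitted start dates are assumptions and should be adjusted to the actual files.
## Read and combine the two time series (sketch; file names and start dates are assumptions)
intrate.data <- read.csv("InterestRate.csv")
intrate.ts <- ts(intrate.data[, 2], frequency = 12)    # monthly; set start = c(year, month) to match the file
unrate.data <- read.csv("UnemploymentRate.csv")
unrate.vec <- as.vector(t(unrate.data[, -1]))          # drop the year column, transpose, then stack rows
unrate.ts <- ts(unrate.vec, frequency = 12)
rates <- ts.union(intrate.ts, unrate.ts)
acf(rates)                                             # autocorrelation and cross-correlation plots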
198
3.1.3 (continued)
The correlation plots are here. On the diagonal, we have the autocorrelation plots for the two time series. Off-diagonal, we have the cross-correlation plots. The autocorrelation plots indicate that neither of the two time series is stationary, since the autocorrelation decreases slowly with the increase in the lag. The cross-correlation plots are different. Recall that in the previous lesson, I mentioned that the lead-lag correlation between any pair of time series is not symmetric. That is, it depends on which time series in the pair is the lead and which one is the lag time series.
In these plots, the first time series in the title of the sample cross-correlation plots is the lag time series. For example, in the cross-correlation plot in the lower left panel, the interest rate is the lead and the unemployment is the lag time series; that is, this plot presents the estimated cross-correlation for the interest rate at time t-k and the unemployment at time t. Note that the lags are provided in reverse order. The last bar in this plot corresponds to the contemporaneous correlation, that is, the correlation of the interest rate and the unemployment rate, both at the same time t. The next to the last bar corresponds to the correlation between the interest rate at time t-1 and the unemployment at time t, and so on.
How can we interpret the lead-lag correlation from this plot? The contemporaneous correlation
is very small but there is a lead-lag relationship, as the cross-correlation values are large, in fact
199
very slowly increasing as the lag increases, which is an indication that unemployment indeed
lags interest rate but over a long period of time.
In contrast, if we look at the estimated cross-correlation where the unemployment is the lead time series, we see that the cross-correlation is small, suggesting that the unemployment rate does not lead the interest rate, which runs counter to the argument of the economic theory of setting the interest rate based on the level of unemployment.
If you would like to get the correlation values, we can use the cor() command in R. For example, the contemporaneous correlation is 0.16, while the lead-lag correlations at lag 1 are 0.17 and 0.15.
However, our interpretation of the cross-correlations from the previous slide is for nonstationary time series. It is more appropriate to provide such interpretations for stationary time series. For this, we can take the first order difference of the two time series, and assess the relationship between the differences in the two time series. That is, does the difference in the unemployment rate impact the difference in the interest rate from one month to another?
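A brief sketch of this step, continuing from the processing sketch above (object names are the same assumptions):
## Autocorrelation and cross-correlation of the differenced time series (sketch)
rates.diff <- ts.union(diff(intrate.ts), diff(unrate.ts))
acf(rates.diff)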
3.1.3 (continued)
200
These are the autocorrelation and cross-correlation plots for the differenced time series. The autocorrelation plots show that the first order difference improves the stationarity of the two time series. The autocorrelation does not show a slow decrease over many lags. The cross-correlation plots provide a different interpretation than in the previous slide.
In the cross-correlation plot with unemployment as the lag time series and interest rate as the lead time series, in the lower left plot, the correlation estimates are small, suggesting plausibly no or a small lead relationship of the change in the interest rate to the change in the unemployment rate.
However, from the cross-correlation plot in the upper right panel, we see that there may be a lead relationship of the change in the unemployment rate versus the change in the interest rate, which aligns with the economic theory. We can compare the lead-lag correlations using the two original time series and the differenced time series. We can see that the cross-correlation increases for the differenced time series.
3.1.3 (continued)
201
In the next two examples, I will simulate multivariate time series where we know how the autocorrelation and cross-correlation plots should look, for a better understanding of the interpretation of the plots.
First, I simulated white noise, which is the first time series, where the second time series is the same simulated white noise, but lagged by 3 lags. We'll look at the autocorrelation and cross-correlation plots.
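A sketch of this simulation is below; the series length and seed are assumptions, and the direction of the 3-lag shift determines in which cross-correlation panel the spike at lag 3 appears.
## Simulate white noise and a copy of it shifted by 3 lags (sketch)
set.seed(1)
n <- 200
e1 <- rnorm(n)                      # first time series: white noise
e2 <- c(rnorm(3), e1[1:(n - 3)])    # second time series: the same white noise, shifted by 3 lags
acf(ts.union(ts(e1), ts(e2)))       # autocorrelation and cross-correlation plots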
3.1.3 (continued)
202
These are the plots. As expected, the autocorrelation plots indicate that the time series are white noise. For the cross-correlation plots, only the one for which the second time series e2 is the lead and the first time series e1 is the lag time series shows one large correlation value at lag 3. This means that there is a large correlation between e1 at time t and e2 at time t-3: from our simulation, e2 at t-3 is exactly e1 at time t, and hence the large correlation at lag 3. In contrast, the other cross-correlation plot has no large correlations. This is because e1 and e2 are white noise processes.
203
Next, I will provide a simulation example in which we simulate two correlated time series. The first time series is white noise, whereas the second is an AR(1) process that is led by the first time series with two lags.
The R code for the simulation is here. We first simulate the white noise for the first time series, defined by x, and the white noise for the second time series, defined by the vector e. To simulate the second time series, y, we loop over all k in 1 to 500, and for each index k, the time series y is defined as in the equation above, with a coefficient of 0.8 for y at t-1, plus the white noise e, plus the coefficient 0.6 times x at t-1 and the coefficient 0.3 times x at t-2.
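Since the original code is not reproduced in the transcript, here is a sketch consistent with that description; the seed is an assumption.
## Simulate the bivariate example: x is white noise, y is AR(1) led by x (sketch)
set.seed(1)
n <- 500
x <- rnorm(n)                 # first time series: white noise
e <- rnorm(n)                 # white noise for the second time series
y <- numeric(n)
for (k in 3:n) {
  y[k] <- 0.8 * y[k - 1] + e[k] + 0.6 * x[k - 1] + 0.3 * x[k - 2]
}
acf(ts.union(ts(x), ts(y)))   # autocorrelation and cross-correlation plots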
204
The ACF plots of this multivariate time series are here. The autocorrelation of x indicates white noise, while that of y clearly shows that there is autocorrelation. In order to detect the order of the AR for y, we would need to look at the PACF plot. The cross-correlation plots show that x leads y, but y does not lead x, as we see that the cross-correlations for lags 1, 2, and 3 are outside of the confidence band. Indeed, from our simulation model, y lags x with two lags, since y is a function of x at t-1 and x at t-2.
In summary, in this lesson, I illustrated the lead-lag relationships between time series with simulated examples.
205
3.1.4 VAR Model
In this lesson, I'll introduce the VAR model, the most common model used for multivariate time series. The VAR model, or the Vector Autoregressive model, made famous in Chris Sims' paper, Macroeconomics and Reality, published in Econometrica in 1980, is one of the most applied models in empirical economics. Chris Sims received the Nobel Prize in Economics in 2011. Before VAR, the relationship between time series was captured in a more exogenous manner using standard regression analysis models.
Sims argued that such conventional macro models were "incredible", meaning they were based on non-credible identifying assumptions. While the VAR model was highlighted in Sims' 1980 paper, it was adopted slowly. But now, it is one of the most commonly used models for applied macroeconomic analysis and for forecasting in central banks. This is a model you need to put on your list of analytic skills if you are working or will be working in these fields.
The Vector Autoregressive model, or the VAR model, is also easy to understand and to implement in the analysis of multivariate time series. It's a natural extension of the univariate autoregressive model to dynamic multivariate time series analysis. It often provides superior forecasts to those from univariate time series models and elaborate theory-based simultaneous equations models. Forecasts from VAR models are quite flexible, because they can be made conditional on the potential future path of specific variables in the model. In addition to data
206
description and forecasting, VAR models are also used for structural inference and policy analysis. I'll discuss and illustrate these properties of the VAR model throughout this lecture.
The VAR model with lag p for an n-variate time series Yt is defined similarly to the AR model: Yt depends on Yt-1 up to Yt-p linearly. Now, the coefficients are matrices, Pi 1, Pi 2, up to Pi p, which are n by n matrices of coefficients. The error term, epsilon, is assumed to be white noise with the covariance matrix Sigma.
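In notation, a sketch of the model just described (written in LaTeX; an intercept term is often added as well) is:
Y_t = \Pi_1 Y_{t-1} + \Pi_2 Y_{t-2} + \cdots + \Pi_p Y_{t-p} + \varepsilon_t,
\qquad \varepsilon_t \sim WN(0, \Sigma)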
For illustration, consider the bivariate VAR(2) model, meaning that we have two time series and the autoregression order is two. The time series is a vector of two time series, Y1t and Y2t. The matrices Pi 1 and Pi 2 are two by two matrices of coefficients, indicating that Y1t depends on the lags Y1t-1 and Y1t-2, and also depends on the lags Y2t-1 and Y2t-2. The error term is also a vector of two error terms.
We denote the contemporaneous covariance of the two error terms by sigma 12, whereas the lagged covariance of the error terms is 0. That is, the model captures the lagged relationship between the two time series through the AR part, and the contemporaneous relationship through the error term.
Do note that the VAR model, also called the unrestricted VAR model, has the same set of regressors for both time series, y1t and y2t. This assumption is useful in the estimation of the coefficients, as we'll learn in the next lessons.
207
A stable VAR is the extension of the concept of a causal AR model. First, the VAR model presented on the previous slide can be rewritten as Pi(L) times the vector of time series, where L is a multivariate lag operator, an extension of the B operator we learned for univariate time series.
A VAR(p) is stable if the roots of the VAR equation, as provided on this slide, are greater than 1 in modulus. Note that this is a similar definition to the definition of a causal AR model. We can further translate this definition into the condition that the eigenvalues of the companion matrix F have modulus less than 1.
I will note here the difference between the two equivalent definitions. For the latter, the eigenvalues are smaller than 1 in modulus. This is important to remember, because the R function used to evaluate the stability of a VAR provides the moduli of the eigenvalues of F, not the roots of the VAR equation. We will stick to values smaller than 1 in modulus in order to assess that the VAR process is stable.
Importantly, a stable VAR process of order p is stationary, with time-invariant means, variances, and autocovariances. This is similar to the causality property of the ARMA model. If you recall, causality implies stationarity. The same holds here: stability implies stationarity.
208
For the simple bivariate VAR(1) model, the stability condition translates into an equation depending on the coefficients pi 11, pi 22, pi 12, and pi 21, as provided on the slide. If the cross terms are 0, then the stability property reduces to the univariate causality conditions for each time series.
209
In a different lesson, we'll learn about a generalization of the VAR model in which we can add to the VAR components a deterministic component, often referring to trend and/or seasonality, and exogenous relationships to other time series. This model is particularly useful when the time series present trends and/or seasonality. It can also be useful when we believe that one time series predicts or leads another, but not the other way around.
In summary, I introduced the concept of stability of a VAR model, within a more general framework of presenting the Vector Autoregressive model.
210
3.1.5 VAR Model Simulation and Data Example
In this lesson, I will illustrate VAR modeling with a simulation study. In this example, I'm simulating a VAR model. Because I'm using the multivariate normal function in R, I need to first load the MASS library. This command can be used to simulate multivariate normal data.
Here, I developed an R function that can be used to simulate a VAR model of order 1. You can also use this example as an illustration of how you can write your own functions in R. The name of the function is VAR.sim, and the inputs to the function are phi, sigma, Y0, and time, where phi is a matrix of coefficients, sigma is the covariance matrix of the error terms, Y0 is a starting vector of values, or seed values for the time series, and time is the length of time.
First, n is the dimensionality of the time series, or the number of time series in the vector time series. This is the vector of the multivariate time series, which starts with Y0. Then, it builds each time point by iterating over i, and simulating the multivariate time series for each time point or index i. The function returns this multivariate time series. To apply this function, we need to first create the inputs to the function, consisting of phi, sigma, and Y0, as provided on the slide. Phi, as defined here, is a diagonal matrix, thus there are no lead-lag relationships between the time series.
211
I will note that the multivariate time series has a dimensionality of n equal to 3, since phi is a 3 by 3 matrix. The covariance matrix of the error terms is not diagonal, thus we have contemporaneous correlation between the three time series. Last, we apply the VAR.sim function with this input.
In this simulation I consider only three time series, but we can extend this to more than three time series, since we can add more rows to the coefficient matrix and to the covariance matrix.
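Since the function itself is not reproduced in the transcript, here is a sketch consistent with the description; the coefficient and covariance values are illustrative assumptions.
## Simulate a VAR(1) model (sketch; parameter values are illustrative)
library(MASS)                        # for mvrnorm
VAR.sim <- function(phi, sigma, Y0, time) {
  n <- length(Y0)                    # number of time series
  Y <- matrix(NA, nrow = time, ncol = n)
  Y[1, ] <- Y0
  for (i in 2:time) {
    Y[i, ] <- phi %*% Y[i - 1, ] + mvrnorm(1, mu = rep(0, n), Sigma = sigma)
  }
  colnames(Y) <- paste0("Series", 1:n)
  ts(Y)
}
phi <- diag(c(0.5, 0.1, 0.97))                 # diagonal: no lead-lag relationships
sigma <- matrix(0.5, 3, 3); diag(sigma) <- 1   # non-diagonal: contemporaneous correlation
simulated.Data <- VAR.sim(phi, sigma, Y0 = rep(0, 3), time = 500)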
Next we look at the time series plots along with the ACF and PACF plots for all three time series. Instead of writing the three commands for plotting the three plots for each time series, I iterate here over i from 1 to 3 with the same R commands, but for each i, I change the time series to be plotted and the titles of the plots.
212
This is the output of the R code for the plots for the three time series. The first two time series seem stationary, but not the last one. This is because an AR(1) process is stationary only if the coefficient is smaller than 1 in absolute value, and in this case, the coefficient is very close to 1. We observe significant autocorrelation at lag 1 for series 1 and 3 in the PACF plots, suggesting an AR(1) model. The ACF plot for the second time series shows no large autocorrelation, while for the first time series, there may be autocorrelation, as the sample ACF values for lags 1 and 2 are outside of the confidence band. For the third time series, the ACF points to non-stationarity.
213
How about cross correlations? We can use the ACF and the PACF R functions with the union of
the three time series.
## Check the cross-autocorrelation
acf(simulated.Data)
pacf(simulated.Data)
These are the ACF plots. Again, on the diagonal we have the autocorrelation plots, and off-diagonal we have the cross-correlation plots.
When interpreting the cross-correlation plots, you will need to be careful that some of the plots have an increasing lag, and others have a decreasing lag. The cross-correlation plots show little lead-lag correlation. The correlation at k = 0, the so-called contemporaneous correlation, appears to be outside of the confidence bands for most of the cross-correlation plots. Note that we indeed simulated contemporaneous correlation, since the covariance matrix of the error terms is not a diagonal matrix.
214
Last, we can establish whether this is a stable VAR process by first fitting the VAR model and looking at the summary of the fit, which provides the eigenvalues of the companion matrix F.
The output of interest is provided by the roots of the characteristic polynomial, which are the eigenvalues of the matrix defined in the previous lesson.
Note that these are not the roots of the VAR equation. Thus, to check stability, all these roots will need to be smaller than 1, which is the case in this example, since all eigenvalues are indeed smaller than 1.
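A minimal sketch of this step, using the simulated data from above:
## Fit the VAR(1) model and check stability
library(vars)
model <- VAR(simulated.Data, p = 1, type = "const")
summary(model)     # includes the roots of the characteristic polynomial
roots(model)       # moduli of the eigenvalues of the companion matrix F; all < 1 means stable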
In summary, in this lesson I illustrated the VAR model with a simulation example.
215
3.2: VAR Model
Let's return to the definition of the VAR model. In this formulation, Yi is the vector of the ith time series. In other words, if we observe the time series over T time points, it's a T by 1 vector of observations over time. We have n such time series, Yi, where i indexes from 1 to n.
We can write the VAR of order p model in a matrix format for each time series Yi, where pi is the vector of coefficients corresponding to the ith time series, of dimension n times p plus 1, since we have n time series, the order of the model is p, and we have an additional coefficient for the intercept or the constant. X is the matrix of the lagged time series variables. The matrix X is of dimensionality T by (n times p plus 1), with the tth row given by the lagged multivariate time series Y at times t-1, t-2, up to t-p. Last, the error term is also a T by 1 vector. We'll assume that the errors are independent, but the variance depends on i.
216
Let's illustrate a VAR model with the simplest example, a bivariate VAR(1) model. For this model, we have two model equations, one for each time series. Since the order is p equal to 1, the model is only with respect to Y1t-1 and Y2t-1. Thus, each regression equation, for Y1t and for Y2t, has the same regressors, the same lagged values. This is called the unrestricted VAR model, because the model corresponds to two regressions with two different dependent variables, but identical explanatory variables. We can estimate this model using the ordinary least squares estimation approach separately for each equation.
217
We can write the two regressions in matrix form as provided on this slide: Y1 is equal to X times pi 1 plus the error term, and Y2 is equal to X times pi 2 plus the error term. The matrix X is a T by 5 matrix, with each row corresponding to one time point. Given these regression equations, if we apply the ordinary least squares approach used in standard regression, as you learned in the regression analysis course, the estimated coefficient matrices are as provided on the slide. For example, pi1 hat is equal to the inverse of (X transpose X) times X transpose times Y1, and similarly for pi2 hat.
218
We can extend this estimation approach to a general unrestricted VAR model of order p with n time series. For such a model we have n regression models, one regression model for each time series. For the first regression model, we regress Y1t on the regressors Y1t-1 up to Y1t-p,
Y2t-1 up to Y2t-p and so on Ynt-1 up to Ynt-p. For the second equation, we regress Y2t onto
exactly the same regressors, and so on.
219
An important property of the estimation of the VAR(p) coefficients using ordinary least squares is that the sampling distributions of the estimated coefficients in pi1 hat, pi2 hat, up to pin hat are asymptotically normal, where asymptotic means that this applies under a large sample size. If you do not observe the time series over a large number of time points, then the normality of the estimated coefficients will hold only if the data are normally distributed.
These statistical properties are particularly important when testing for the statistical significance of a subset of coefficients. The classic testing procedure is the so-called Wald test. There are two common settings in which we would apply this test. The first is when we are interested in testing whether all the coefficients corresponding to the kth time series, Yk, are zero in the lth regression, or for the lth time series. The null hypothesis is that all these coefficients are 0, while the alternative is that at least one is not 0. I will illustrate this test in a different lesson, when we will discuss the concept of Granger causality.
A second setting is to test the statistical significance of all the coefficients corresponding to the pth order of the time series. The null hypothesis is that all the coefficients corresponding to the pth order are 0, versus the alternative that at least one is not 0. This test can be used in deciding whether to reduce the VAR model to a lower order VAR model. Generally, for such tests, if the p-value of the test is small, we reject the null hypothesis, and thus at least one coefficient is statistically significant.
220
The testing procedures discussed on the previous slide can be used to reduce the unrestricted VAR model to a model with selected coefficients, in order to fit a smaller model. This results in what is called the restricted VAR model. Thus, a restricted VAR might include some variables in one equation and other variables in another equation, where an equation corresponds to one regression or one time series. Old-fashioned macroeconomic models, the so-called simultaneous equations models of the 1950s, 1960s, and 1970s, were essentially restricted VAR models. The restrictions and specifications were derived from simplistic macro theory, for example, the Keynesian consumption functions, investment equations, and so on. Sims was the first to be the proponent of the unrestricted VAR model in his paper published in 1980.
221
However, when there is contemporaneous correlation, it may be more efficient to estimate the regression equations jointly, rather than to estimate each one separately using the least squares approach. The appropriate joint estimation technique is the so-called generalized least squares, abbreviated GLS.
It can be shown that there are two conditions under which ordinary least squares (OLS) applied to each equation separately is identical to GLS, the generalized least squares approach. One is when there is no contemporaneous correlation among the n time series.
The second is when the equations of the n regressions have the same regressors; in other words, X is the same for all regressions, which is equivalent to the unrestricted VAR model.
This motivates using the unrestricted model, because it's less computationally expensive, whereas generalized least squares is more computationally expensive. Thus, the unrestricted VAR can simply be estimated using OLS even when there is contemporaneous correlation, while the restricted VAR will need to be estimated using GLS.
222
Sims, who was the first to provide more grounding to the unrestricted VAR model, argued that old-fashioned macroeconomic models were essentially using restricted VARs. The restrictions in such models have no substantive justification. Sims argued that economists should instead use unrestricted models, and he proposed a set of tools for the evaluation of VARs in practice.
But the unrestricted VAR model doesn't have only advantages; it also has some limitations. Some of the advantages are that it is easy to estimate, and it has good forecasting capabilities if
223
there is a relationship between the time series. Also, there is no need to specify which variables are endogenous and which are exogenous; all are assumed endogenous. I'll introduce the concept of endogeneity in a different lesson, when I'll discuss Granger causality. In short, it means that the unrestricted VAR assumes that each time series will have a lagged relationship with all of the other n-1 time series.
But the unrestricted VAR also has some limitations. It has many parameters: if there are K equations, one for each of the K time series, and p lags in each equation, then there are K + pK squared parameters to be estimated. There may be cases when we would actually have fewer observations than the number of coefficients that we need to estimate, and thus a smaller value of p will need to be considered. VAR models are also a-theoretical, since they use little economic theory. Thus, VAR models cannot be used to obtain economic policy interventions.
In summary, in this lesson, I contrasted the unrestricted VAR model with the restricted VAR
model, and I pointed out properties of the unrestricted VAR model.
224
3.2.2(7) Estimation: Simulation Example
In this lesson, I will illustrate the estimation of the VAR model using a simulation study. We'll
return to the simulation study introduced in a previous lesson.
In the simulation, I simulated three time series following a VAR model of order 1. We would like to see how close the estimated coefficients are to the true ones provided by phi.
225
The most common R command used to fit a VAR model is the VAR command from the vars library. The input to the VAR function is a matrix of time series, where each column corresponds to one time series and each row to one time point. We also need to specify the order; here, I used p=1 since we know that the data were simulated from a VAR(1) model.
The output of this R command is quite complex. I'm showing you here only a portion of the output.
Here, we see the estimation results for the first time series. The output looks very similar to that from fitting a linear regression model, since the VAR function does just that: it applies ordinary least squares to each individual time series. The output provided here gives the estimated coefficients, the estimates of the pi1 matrix as in the model notation in the previous lesson. We also have the standard errors of these estimates and the p-value of the t-test of statistical significance in the last column.
What can we learn from this output? From the previous slide, the coefficients from the simulated model, hence the true parameters, are 0.5, 0, and 0, as provided by the first column in the matrix phi. In contrast, the estimated coefficients in the first column of the output are 0.526, 0.007, and -0.0028.
The estimated coefficients are close to the true ones, but the question is, are they close enough? To address this question, we could perform hypothesis testing, for example, to test whether the coefficients corresponding to lag 1 of time series two and time series three are statistically significant. To do so, we can look at the p-values for the two coefficients, which
226
are 0.82 and 0.06. Thus, at the significance level 0.05, both coefficients are not statistically significant, and thus plausibly zero.
The output for the other two time series is here. The one on the left is for time series two, and the one on the right is for time series three. Similar to the output for time series one, they provide information on the estimated coefficients and the p-values of the t-tests for their statistical significance. For time series two, the true parameters are 0, 0.1, and 0, as provided in the second column of the phi matrix in the simulation; the estimated coefficients are 0.02, -0.01, and 0.01. The p-values of all three coefficients are large, indicating that they are not statistically significant, or that they are plausibly equal to 0. In this case, we thus fail to accurately estimate pi 22.
For time series three, the true parameters are 0, 0, and 0.97, as provided in the third column of the phi matrix. The estimated coefficients are 0.02, 0.008, and 0.96, where the first two coefficients are not statistically significant since their p-values are large, and thus they are plausibly equal to zero, as in the simulation model.
227
The output also provides information on the covariance matrix of the residuals, which is an estimate of the covariance of the error terms, sigma. We can compare this output with the covariance of the simulated model provided below. The estimated covariance accurately captures the contemporaneous dependence among the three time series, as provided by the comparison of the values of the two matrices: the true covariance and the estimated covariance.
228
Next, we'll take a closer look at the residuals of the VAR model fit. For that, we can use the residuals R command to extract the residuals, which are approximations of the error terms of the three regressions. The residuals of the three regressions are collected together, one set per time series.
We can assess normality using a hypothesis testing procedure, where the null hypothesis is that the data, the residuals in this case, are normally distributed. For this, we can use the normality.test R command. We can also evaluate whether the residuals have constant variance by applying the arch.test R command, which is a hypothesis testing procedure where the null hypothesis is that the residuals have constant variance. For both, we seek large p-values; that is, we do not want to reject the null hypothesis.
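A minimal sketch of these diagnostics, continuing from the fitted model object above:
## Residual diagnostics for the VAR fit
model.residuals <- residuals(model)   # residuals, one column per time series
normality.test(model)                 # H0: residuals are normally distributed
arch.test(model)                      # H0: constant variance (no ARCH effects)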
This is the R output from these two tests. The normality test provides inference on multiple aspects of normality, including skewness and kurtosis, where the former is a measure of departure from symmetry and the latter is a measure of departure from normal tails.
The p-values for the normality test are all larger than 0.05, indicating that we do not reject the null hypothesis of normality.
Similarly, for the arch.test, the p-value is large, indicating that we do not reject the null hypothesis of constant variance.
229
## Residual plots: time series plot and normal Q-Q plot for each series
par(mfrow = c(3, 2))
for (i in 1:3) {
  plot(model.residuals[, i], main = paste('Residuals for series', i))
  qqnorm(model.residuals[, i], main = paste('QQ norm for series', i))
  qqline(model.residuals[, i])
}
We can also plot the residuals. Here I'm looping over i from 1 to 3, and plotting the residual plot and a normal probability plot for the residuals of each of the three regressions, that is, of each of the three time series. Here are the plots; there is no departure from constant variance or normality, as also indicated by the testing procedures on the previous slide.
230
## Residual Plots: White Noise Assumption
acf(model.residuals)
pacf(model.residuals)
## Testing for Serial Correlation (Multivariate Portmanteau Test)
serial.test(model)
The null hypothesis of this test is that the residuals are uncorrelated. This is the output of the test. We discussed this test before for univariate time series. The p-value is large, indicating that the residuals are uncorrelated. Hence, it is plausible that we have modeled the serial correlation in the three time series appropriately.
231
These are the autocorrelation and cross-correlation plots. The autocorrelation plots indicate that the residuals from each of the three time series look like white noise. Interestingly, the cross-correlation plots show that there may be some contemporaneous correlation among the residuals, since the bar at lag zero is outside of the confidence band.
In summary, in this lesson I illustrated the estimation of the VAR model using a simulation study.
232
3.2.3(8) VAR Model: Order Selection
In this lesson, we'll continue the estimation of VAR models with an approach for the order
selection.
In the previous lessons, we learned about the estimation approach for the unrestricted VAR model, along with the advantages of using unrestricted over restricted VAR models. However, I also pointed out that one limitation is that the unrestricted VAR has many parameters. If the order is p, then we'll need p parameters for each time series in each regression; thus, each regression will have n times p plus 1 parameters which need to be estimated. For example, for a bivariate time series, the number of coefficients in each equation is 1+2p, and the total number is 2+4p. When we have n time series, the total number of coefficients in each equation is 1+np, and the total number is n+n squared times p. This can be very large when n and/or p are large. If the selected p, or the assumed lag order, is unnecessarily large, the forecast precision of the corresponding VAR(p) model will be reduced.
Therefore, it's useful to have procedures or criteria for choosing the order p. Ideally, we would select the order to be as small as possible. We do not want to use an arbitrary p which is too large. But how do we select the order p?
233
Order selection falls under the more general framework of model selection, where one has several models to consider and wants to select the one with the best performance, with performance defined by some measure. Thus, the general approach for order selection is to fit the VAR model with orders p from 0 to a maximum order, and choose the value of p which minimizes some model selection criterion. The criterion is a sum of two components: the first is a measure of goodness of fit, and the second is a measure of model complexity. The criterion selects a model which balances goodness of fit and complexity.
234
Classic examples of such criteria are the Akaike information criterion, or AIC, the Bayesian information criterion, abbreviated BIC, and the Hannan-Quinn criterion, among others. In all these criteria, the goodness of fit measure is based on the sum of squared residuals. The only difference across these criteria is the specification of the complexity penalty. AIC is the most generous, in the sense that it penalizes the least for complexity, while BIC and then HQ penalize the most for complexity. Thus, BIC and HQ are indicated for selecting the order when we're interested in prediction and forecasting.
Let's return to the simulation study where I simulated a VAR(1) model for a multivariate time series with three time series. In order to select the order, we can use the VARselect function in R, where the input is simply the multivariate time series. Here, I'm also plotting the AIC values for different orders of the VAR model. This is the plot, and according to this plot, we can see that the selected order is equal to 1, which is the order that we simulated in the VAR(1) model for the multivariate time series.
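A minimal sketch of this order selection step, using the simulated data from before; the maximum lag of 10 is an assumption.
## Order selection for the VAR model
library(vars)
var.select <- VARselect(simulated.Data, lag.max = 10, type = "const")
var.select$selection                        # orders chosen by AIC, HQ, SC (BIC), and FPE
plot(var.select$criteria["AIC(n)", ], type = "b", xlab = "Order p", ylab = "AIC")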
To summarize, in this lesson, I briefly introduced the approach that is commonly used to select
the order in the VAR model.
235
3.2.4(9) Order Selection: Data Example
In this lesson, I will illustrate the estimation and order selection of a VAR model. We'll return to the example in which we're interested in modeling the bivariate time series consisting of the unemployment rate and the interest rate in the US, motivated by the monetary policies in 2017 towards increasing the interest rate.
Because the time series are not stationary, since both of them have a trend, we'll analyze their first order differences, not the original time series. We'll first apply a simple AR model to each individual time series, and select the order of the AR for each of the two time series. Setting 20 to be the maximum order, the selected AR order for the interest rate is 20, and for the unemployment rate it is 12.
But we're interested in capturing the temporal relationship between the interest rate and the unemployment rate. Hence, we apply the VAR model in R. The command is VAR. Its inputs are the data matrix (consisting of the first order differences of the two time series in this example), the maximum lag for the model fit, and a specification of whether to fit a trend, a constant, neither, or both (to account for the mean). Note that the VAR command also selects the order, hence the specification of the maximum lag to be considered. These are variations of implementations of the VAR R function with different types of deterministic trends, along with an information criterion for model selection, here being AIC. For all types of deterministic components, the selected order is 12.
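A minimal sketch of this fitting step in R is below, assuming the differenced series are stored in a two-column matrix dmat with columns named dintrate and dunrate (these names are assumptions for illustration, not the exact lecture code).

library(vars)
# dmat: first order differences of the interest rate and unemployment rate series (assumed object)
VARselect(dmat, lag.max = 20, type = "const")$selection   # order chosen by AIC, HQ, SC and FPE
mod <- VAR(dmat, p = 12, type = "const")                  # fit VAR(12) with a constant for the mean
summary(mod)                                              # coefficient estimates and significance
# Alternatively, let VAR choose the order internally by an information criterion:
mod.aic <- VAR(dmat, lag.max = 20, ic = "AIC", type = "const")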
236
We can also compare the selected order using other criteria. SC here stands for BIC; HQ and SC both selected an order equal to 2. Note that these two methods, BIC and Hannan-Quinn, usually penalize more for complexity, and hence they select a much smaller order, p equal to 2. This is a very big difference from the order selected with AIC and FPE.
In the next slide I will show you the output from fitting the VAR model with order equal to 12.
This is the output for the coefficients corresponding to the unemployment rate. In this output, the factors beginning with dintrate correspond to lags 1 to 12 of the differenced interest rate, and all factors beginning with dunrate correspond to lags 1 to 12 of the differenced unemployment rate time series. From the output, we see that the largest lag order for which we find a temporal relationship of the change in interest rate onto the change in unemployment rate is order two, or up to two months. But for the serial correlation with unemployment, the coefficients are statistically significant up to order 12, that is, up to a one-year lag serial correlation. This could be due to the seasonality in the unemployment rate.
237
The coefficients for modeling the interest rate are on this slide. From this output, we find that the only statistically significant coefficients are up to order equal to 2. An order equal to 2 would be sufficient to model the interest rate.
238
Since very few coefficients are statistically significant for an order larger than 2, we'll test here whether the model of order 2 for each of the time series is plausibly as good a fit as the one of order 12. That is, we test the null hypothesis that all coefficients for order 3 and higher are 0, versus the alternative hypothesis that at least one of the coefficients is not 0.
For this we employ the Wald test. In the R code provided on this slide, we first extract the coefficients corresponding to order 3 and higher. Second, we extract the covariance matrix of these coefficients from the covariance matrix of all the coefficients in the model, provided by the vcov() R command with input the model fit, mod. Last, I apply the wald.test function in R, where I need to specify the coefficients and their covariance matrix. The option Terms in this function can be used when we're interested in testing only a subset of coefficients; here I'm using all of them, since I already extracted the order 3 to 12 coefficients from the entire vector of coefficients. The p-value for this implementation of the Wald test is 0.044 for the differenced interest rate time series, and it's approximately 0 for the differenced unemployment rate. Thus, at the significance level of 0.05, we reject the null hypothesis for both time series, concluding that an order larger than two would fit the data better than a model of order two.
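A rough sketch of this test is below, assuming mod is the fitted VAR(12) object and that the coefficients of the differenced interest rate equation are ordered by lag with the constant last; the index positions are assumptions for illustration.

library(aod)
# Coefficients and their covariance for the dintrate equation (an lm object inside the VAR fit)
b.all <- coef(mod$varresult$dintrate)   # 24 lag coefficients followed by the constant (assumed layout)
V.all <- vcov(mod$varresult$dintrate)
keep  <- 5:24                           # assumed positions of the order-3 to order-12 lag coefficients
wald.test(Sigma = V.all[keep, keep], b = b.all[keep], Terms = seq_along(keep))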
Next, let's perform a residual analysis to evaluate whether the residuals have constant variance, are normally distributed, and are uncorrelated. The ARCH test is used to evaluate whether the residuals have constant variance. Since the p-value is approximately 0, we reject the null hypothesis of constant variance. We'll discuss modeling of non-constant variance in a different lecture.
239
The normality test is used to evaluate whether the residuals are normally distributed. Since the p-value is approximately 0 again, we reject the null hypothesis of normality. Next, the serial test is used to evaluate whether the residuals are uncorrelated. Since the p-value is large, we do not reject the null hypothesis of uncorrelated residuals. Last, we assess the property of stability. All eigenvalues in this case are smaller than 1, and thus the fitted VAR is stable.
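These diagnostics are available in the vars package roughly as follows (mod again denotes the fitted VAR object; the lag settings are illustrative).

library(vars)
arch.test(mod, lags.multi = 12)   # null hypothesis: constant conditional variance (no ARCH effects)
normality.test(mod)               # Jarque-Bera type test; null hypothesis: normally distributed residuals
serial.test(mod, lags.pt = 16)    # Portmanteau test; null hypothesis: uncorrelated residuals
roots(mod)                        # moduli of the roots; all smaller than 1 means a stable VAR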
Next, let's look at the residual plots to explore further the non-constant variance and the
non-normality of the residuals.
These are the residual plots for the time series corresponding to the first order difference of the interest rate. The upper left plot is the residual plot, clearly showing non-constant variance. The upper right plot is the histogram of the residuals. From this plot we see that the distribution of the residuals is somewhat symmetric but with heavy tails, hence the rejection of normality in the previous slide.
The rest of the plots are the autocorrelation plots of the residuals and of the squared residuals. These plots indicate that both the residuals and the squared residuals are uncorrelated.
240
These are the residual plots for the time series corresponding to the first order difference of the unemployment rate. From the residual plot we find that the variance of the residuals is constant. Moreover, the histogram of the residuals for this time series is closer to normality than for the previous time series, since the tails are closer to those of a normal distribution. Both the residuals and the squared residuals seem to be uncorrelated according to the autocorrelation function plots.
241
3.3 Granger Causality & Prediction
In this lecture we continue with multivariate time series analysis, and in this lesson I will introduce a concept that is very important in this analysis, Granger causality. I'll also introduce prediction for the VAR model. The structure of the VAR model provides information about the relationships among the time series, along with the forecasting ability of one time series for another. The following intuitive notion of a variable's, or a time series', forecasting ability is due to Granger, who introduced the concept in 1969. If a time series, y1, is found to be helpful for predicting another time series, y2, then y1 is said to Granger-cause y2; otherwise, it is said to fail to Granger-cause y2. Formally, we say that y1 fails to Granger-cause y2 if the mean squared error of the model without lagged observations of the time series y1 is the same as the mean squared error of the model with lagged observations of the time series y1. Clearly, the notion of Granger causality does not imply true causality; it only implies forecasting ability.
242
3.3.1(continued) Granger Causality & Prediction
In a bivariate VAR model with order p for two time series Y1 and Y2, we have that Y2 fails to Granger-cause Y1 if all of the p VAR coefficient matrices Pi1 to Pip are lower triangular; that is, as shown in the slide, the coefficients in the (1,2) position of each of the matrices Pi1 to Pip are 0. If Y2 fails to Granger-cause Y1 and Y1 fails to Granger-cause Y2, then the VAR coefficient matrices Pi1 to Pip are diagonal. This means that there is no lagged temporal relationship between the two time series.
243
More generally, for an n-variate multivariate time series, we identify Granger causality in the same way. For example, if we have a trivariate multivariate time series, we need to evaluate causality among all pairs of the three time series. If we consider Granger causality on Y1, we need to consider it with respect to both Y2 and Y3. We say that Y2 fails to Granger-cause Y1 if all of the coefficients on lagged values of Y2 are zero in the equation for Y1, the first time series, and that Y3 does not Granger-cause Y1 if all of the coefficients on lagged values of the third time series, Y3, are also 0 in the equation for the first time series, Y1. It is possible that only one of the two time series, Y1 or Y2, Granger-causes Y3, or both, or neither.
How do we evaluate Granger causality? We can use a hypothesis testing procedure where the null hypothesis is that the coefficients corresponding to the Granger causality are 0. For example, if we are interested in identifying whether Y2 Granger-causes Y1, the null hypothesis is that the coefficients on lagged values of Y2 are 0 in the equation for the time series Y1. More generally, for a pair of time series Yk and Yl among the n time series, we test the null hypothesis that all the coefficients on lagged values of Yk are 0 in the equation for Yl. Rejecting the null hypothesis means that Yk does Granger-cause Yl.
244
Last, I will only briefly discuss forecasting of a time series in a VAR model. Since the VAR models n time series simultaneously, we will need to forecast all n time series, or only those of interest. Since VAR is a multivariate AR model, we can use the chain rule of forecasting.
The best linear predictor in terms of minimum mean squared error (MSE) of Yt+1, or the 1-step ahead forecast based on past data at time t, is obtained from the linear equation of the model by plugging in the last values of the time series, as provided in the slide. Forecasts for longer horizons h (h-step forecasts) may be obtained using the chain rule of forecasting, by plugging in predicted lagged values of the time series.
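In symbols, a standard way to write this (with c denoting the constant term, if present) is:

\hat{Y}_{t+1|t} = c + \Pi_1 Y_t + \Pi_2 Y_{t-1} + \cdots + \Pi_p Y_{t-p+1},

and, by the chain rule, for h > 1,

\hat{Y}_{t+h|t} = c + \Pi_1 \hat{Y}_{t+h-1|t} + \cdots + \Pi_p \hat{Y}_{t+h-p|t},

where any \hat{Y}_{t+j|t} with j \le 0 is replaced by the observed value Y_{t+j}.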
245
Again, this is the output for the coefficients corresponding to the differenced interest rate time series. As noted before, the coefficients corresponding to the differenced unemployment rate are statistically significant for orders one and two. Thus, we do identify some lagged relationship of the change in unemployment onto the change in the interest rate, as expected from how interest rates are established by the Fed.
Let's look at the Granger causality more formally. That is, we perform a statistical inference to address the following question: is the change in interest rate led by the change in unemployment rate, statistically speaking? To do so, we can apply the Wald test. For this test, we test whether the coefficients corresponding to the lagged values of the differenced unemployment rate are zero, with the alternative hypothesis that at least one coefficient is non-zero, hence Granger causality. Just as for testing for order selection, we use the wald.test function in R. The vector of coefficients for the differenced interest rate consists of all estimated coefficients for the interest rate equation except for the constant, or intercept; thus, we remove the last coefficient. The covariance matrix of the coefficients is defined here as var.dintrate.
We apply here the Wald test, but not on all the coefficients, only on those corresponding to the lagged values of the unemployment time series; hence the specification of the option Terms equal to the indices of the coefficients corresponding to the unemployment time series. Based on this test, the p-value is approximately zero, which means that at least one coefficient is not zero, indicating that there is a statistically significant lagged
246
relationship of the change in the unemployment rate onto the change in the interest rate. Thus, the change in unemployment rate Granger-causes the change in interest rate.
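A sketch of this Granger-causality test is below; the object mod, the equation name dintrate and the index positions of the unemployment coefficients are assumptions about the fitted model, not the exact lecture code.

library(aod)
# Coefficients of the dintrate equation, dropping the constant (assumed to be the last coefficient)
b.dintrate   <- coef(mod$varresult$dintrate)
b.dintrate   <- b.dintrate[-length(b.dintrate)]
var.dintrate <- vcov(mod$varresult$dintrate)
var.dintrate <- var.dintrate[-nrow(var.dintrate), -ncol(var.dintrate)]
# Coefficients on lagged unemployment are assumed to sit in the even positions (dunrate.l1, dunrate.l2, ...)
unrate.idx <- seq(2, length(b.dintrate), by = 2)
wald.test(Sigma = var.dintrate, b = b.dintrate, Terms = unrate.idx)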
Let's also see whether the VAR model improves the forecasting of the interest rate when accounting for both autocorrelation with lagged values of the interest rate and cross-correlation with lagged values of the unemployment rate. We'll forecast the months of January, February and March of 2017 to see whether our model forecasts the change in interest rate better using the VAR model versus a univariate AR model. We'll use the first 67 years of data: the training data consists of the 67 years before 2017 and is used to train the model, while the last three observations, the months of January, February and March of 2017, correspond to the testing data.
For this, I first apply the AR model to the change in interest rate using the ar R command, and predict the last three observations using the predict R command, with the specification of the model fit and the number of lags ahead to be predicted. Next, I predict based on the VAR model. Similarly, I use the training data to fit the model and the testing data to predict. The prediction function again is predict, just like for the AR model.
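A rough sketch of this comparison, assuming dmat holds the differenced bivariate series with a column named dintrate and that the last three rows are the test months (all names and the split are illustrative):

library(vars)
n.total <- nrow(dmat)
train   <- dmat[1:(n.total - 3), ]
# Univariate AR forecast of the change in interest rate
ar.fit  <- ar(train[, "dintrate"])
ar.pred <- predict(ar.fit, n.ahead = 3)$pred
# VAR forecast of the same three months
var.fit  <- VAR(train, p = 12, type = "const")
var.pred <- predict(var.fit, n.ahead = 3)$fcst$dintrate[, "fcst"]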
247
This plot shows the change in interest rate from September 2016 until March 2017. We can see two small changes, in November 2016 and February 2017, corresponding to the first two increases in the interest rate after many years without any change. The red and green circles correspond to the forecasts based on the two models. They both predict that there will be no change in the interest rate. This is as expected, since the interest rate did not change for many years and any model will predict no change, regardless of whether we use information from other economic indicators such as unemployment.
Thus, while unemployment does Granger-cause interest rate, this is not reflected in an improvement in forecasting accuracy. This may be only because of the three months chosen in this example; the forecasting may improve if we consider a different time period.
In summary, in this lesson I illustrated Granger causality with a data example.
248
3.3.3(12) Generalizing the VAR Model
I will next illustrate extensions and generalizations of the VAR model. Specifically, I'll discuss extensions of the VAR model for non-stationary time series, including two components: a deterministic component specifying seasonality and trend, and a component reflecting exogenous variables.
The basic VAR model may be too restrictive to represent the main characteristics of the data. In particular, other deterministic terms, such as a linear time trend or seasonality, may need to be incorporated in the model to represent the data properly. Additionally, stochastic exogenous variables may be required as well.
The general form of the VAR model, with deterministic terms and exogenous variables, is given by the model formulation on this slide. In this formulation, Dt is a matrix of deterministic components, and Xt is a matrix of exogenous variables. I'll introduce those two components in the next two slides.
249
The deterministic component can specify both trend and seasonality whenever those are present in a time series. Using the R statistical software, we can specify the deterministic term as none, only a constant, a linear trend with respect to time, or both a constant and a trend. If we'd like to represent more complex trends, then we will need to write our own function to modify the VAR command.
For modeling seasonality, the VAR command in R allows a deterministic component specified through seasonal dummy variables. Specifically, the VAR function in R only allows specifying seasonality using dummy variables, through the season option; it does not allow modeling seasonality using harmonic variables, for example. Thus, if the seasonality has a long seasonal periodicity, we'll add many more parameters to be estimated.
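For instance, with the vars package these choices are controlled by the type and season arguments (the series y here is a placeholder):

library(vars)
# type = "none", "const", "trend" or "both"; season = 12 adds monthly (centered) seasonal dummy variables
fit <- VAR(y, p = 2, type = "both", season = 12)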
250
What are exogenous variables? Assume we have a time series Yt, or a multivariate time series Yt, modeled as on the slide. Xt is exogenous if its lagged values, and possibly its contemporaneous value, aid in predicting Yt; however, Yt does not aid in predicting Xt. As on the slide, the model for Xt can depend only on the lagged values of Xt, but not on the lags of Yt. This also means that Xt Granger-causes Yt, but Yt fails to Granger-cause Xt. Thus, the VARX model allows us to fit smaller models for Xt, while accounting for the temporal relationship of Xt onto Yt.
251
Another extension of the VAR model is the so-called structural VAR model, which extends the VAR model by considering a linear transformation of Yt, through the term A times Yt on the left, and/or a linear transformation of the error term, through the term B times epsilon t. The structural VAR model not only models lagged temporal relationships, but also contemporaneous dependencies between the time series, through the transformation matrices A and B.
We can further write the model in the reduced form by multiplying both sides, the left and the right, with the inverse of the matrix A, assuming it is invertible. The reduced form has only a transformation of the error terms, thus accounting for the contemporaneous relationships through the error term alone. It is a VAR model with correlation in the error terms specified by the matrix gamma.
This is an example of a bivariate structural VAR model of order equal to 1, as provided in this example. Y1t depends on the contemporaneous Y2t, since in the equation for Y1t the coefficient a12 is not 0. The same holds for Y2t, which depends on the contemporaneous Y1t through the coefficient a21. The error terms are assumed to be independent, with possibly different variances. We can rewrite this model in matrix form as on this slide, where the vector time series (Y1t, Y2t) is multiplied by a 2 by 2 matrix with first row (1, a12) and second row (a21, 1). We can further write the model in the reduced form as provided on this slide.
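Written out, a bivariate structural VAR(1) consistent with this example has the form:

\begin{pmatrix} 1 & a_{12} \\ a_{21} & 1 \end{pmatrix}
\begin{pmatrix} Y_{1t} \\ Y_{2t} \end{pmatrix}
= \Pi_1^{*} \begin{pmatrix} Y_{1,t-1} \\ Y_{2,t-1} \end{pmatrix}
+ \begin{pmatrix} \epsilon_{1t} \\ \epsilon_{2t} \end{pmatrix},

and multiplying through by A^{-1} gives the reduced form Y_t = A^{-1}\Pi_1^{*} Y_{t-1} + A^{-1}\epsilon_t = \Pi_1 Y_{t-1} + e_t, whose errors e_t are correlated even when the structural errors are not.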
252
It's important to remember that the structural VAR needs to impose some constraints on the matrices A and B. This is because of the so-called identification issue: the parameters in the structural VAR are not identifiable, because they cannot be estimated uniquely. That is, given values of the reduced-form matrices Pi1, Pi2 to Pip and the covariance matrix of the error terms, Sigma, we cannot uniquely estimate the parameters of the structural VAR model, given by A, B, Pi1 star, Pi2 star, to Pip star.
If we take the example provided on the previous slide, the reduced form has 9 parameters, whereas the structural VAR has 10 parameters. That is, we would try to estimate 10 parameters using a system of 9 equations. From basic linear algebra, we know that such a system cannot provide a unique solution. In order to estimate the corresponding structural VAR model, we would need to add at least one constraint on the parameters, so that there are no more parameters to estimate than equations. Different restrictions can be imposed on A or on B: by imposing B to be an identity matrix, we have an A model; by assuming A is an identity matrix, we have a B model; or we impose restrictions on both A and B, which gives an AB model.
253
Estimation of the structural VAR model means estimation of the reduced-form model and then mapping back to the structural VAR model. In order to estimate the reduced form, we generally consider a two-step procedure. First, apply the estimation approach just as for a simpler VAR model to obtain the estimates of the coefficient matrices Pi1 to Pip. Then take the residuals of the model and estimate their covariance using the sample covariance.
Here is a simple example. Assume a bivariate time series, which we fit with a structural VAR model of order 1 with the restriction that a12 is equal to 0. We estimate the reduced model to get the estimated coefficient matrices and the estimated covariance matrix. We can then estimate the only unknown coefficient in the A matrix from the estimated covariance, as provided on the slide. As in this case, if we impose a sufficient number of constraints, we have a one-to-one correspondence between the reduced model and the structural VAR model.
In summary, in this lesson I introduced two extensions of the VAR model, specifically the VARX model and the structural VAR model.
254
3.3.4(13) Co-Integration
In this lesson, I will introduce another concept that applies to an extension of the VAR model, specifically cointegration. In the previous lesson we generalized the VAR model using the VARX model, which allows for seasonality and trend, and the structural VAR model, which allows for contemporaneous relationships. Now, the generalization allows for a nonstationary VAR model, which can be dealt with using a differencing approach.
In a different lecture, we learned about an approach for dealing with nonstationary time series, specifically differencing the time series. This concept applies generally to both univariate and multivariate time series and is formally defined as integration. To be more specific, if the d-order difference of a time series Yt is stationary, we say that Yt is integrated of order d, or more concisely, write Yt ~ I(d); I(d) is the notation for an integrated time series of order d. If d is equal to 0, then the time series Yt is stationary, and we do not need to take any difference to make it stationary. When d is equal to 1, a first order difference is sufficient to obtain a stationary time series.
While integration means taking a d order difference of the time series, cointegration means
taking a linear transformation, beta, of the vector time series Yt, where Yt is a multivariate time
series such that the resulting linear combination is a stationary time series. The linear
combination beta transpose Yt, is often motivated by economic theory and referred to as a
long-run equilibrium relationship.
255
It's important to note that the vector beta is not unique in making the linear combination of the multivariate time series stationary. We can multiply this vector by any constant, which will result in a rescaled time series; this will not change the stationarity of the linear combination beta transpose Yt. Thus, we need to impose a constraint on the elements of the vector beta in order to have a unique linear combination. This is called normalization.
A typical normalization is to consider a vector of coefficients in which the first value in the vector is 1, as for the vector beta provided on the slide. Using this vector beta, if we multiply beta transpose with Yt, we'll get Y1t minus beta2 Y2t up to betan Ynt equal to ut, or the equation provided on the slide, where Y1t is equal to a linear combination of the contemporaneous values of the other time series plus ut.
Here ut is the cointegrating residual; ut is a stationary time series.
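With this normalization, the relationship described on the slide can be written as:

\beta = (1, -\beta_2, \ldots, -\beta_n)^{\top}, \qquad
\beta^{\top} Y_t = Y_{1t} - \beta_2 Y_{2t} - \cdots - \beta_n Y_{nt} = u_t,

or equivalently Y_{1t} = \beta_2 Y_{2t} + \cdots + \beta_n Y_{nt} + u_t, where u_t is the stationary cointegrating residual.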
256
For an n-dimensional multivariate time series, we can identify up to n minus 1 such
cointegration relationships. For example, if we consider a trivariate time series with n equal to 3,
there may be two different cointegrating vectors beta 1 and beta 2 such that the linear
combinations using these vectors result in stationary time series. These two vectors are linearly
independent, otherwise, they will lead to the same cointegrated relationship. The matrix B with
each row corresponding to one cointegrating vector is called the cointegrating basis.
Some important properties are as follows. One is that both vectors need to be normalized to be unique, as discussed in the previous slide.
A second property is that any linear combination of the two vectors leads to another cointegrating vector. This is why B is called the cointegrating basis, since the two vectors form the basis for any other cointegrating vector.
257
Let's now consider a multivariate time series with n time series which is integrated of order 1, meaning that a first order difference is sufficient to transform the multivariate time series into a stationary time series. Similarly to the case n equal to 3, we can identify up to n-1 different unique cointegrating vectors, which form the cointegrating basis as in the previous slide. Cointegration hypothesis tests cover two situations: one is that there is at most one cointegrating vector; the other is that there are multiple cointegrating relationships. These two approaches have been introduced in the references provided on the slide.
258
Let's take a closer look at these approaches. First, let's define the hypotheses. In testing for cointegration, the null hypothesis is that the multivariate time series is cointegrated, assuming the time series is integrated of order 1, and the alternative is that the multivariate time series is stationary. Note that this is not a test of stationarity; it is a test of cointegration. The general approach proposed by Engle and Granger is a two-step method: first, form the cointegrating residual using the cointegrating vector, then test whether the residual time series is stationary, thus employing a test for stationarity.
There are two situations here. One is when the cointegrating vector is known; the other is when it is not known. When the cointegrating vector is not known, it can be estimated from the data, assuming the series is cointegrated. Note that when the basis is not known, some normalization assumption must be made to uniquely identify it.
259
We can expand the idea of cointegration to VAR models. Let's consider the classic VAR model with a deterministic component. The VAR is stable if and only if there are no roots of the VAR polynomial inside or on the unit circle. Moreover, stability implies stationarity, but not the opposite; the stationarity condition is less restrictive. A VAR is stationary if there are no roots with modulus exactly on the unit circle. If there are roots on the unit circle, then some or all of the time series in Yt are integrated of order 1, and they may also be cointegrated. If Yt is cointegrated, then the VAR representation is not the most suitable representation for analysis, because the cointegrating relationships are not explicitly apparent.
260
The cointegrating relationships become apparent if the VAR is transformed into the Vector Error Correction Model, abbreviated VECM. The VECM is particularly of interest for interpreting the relationships between the time series, by introducing concepts such as the long-term relationship between variables and the associated concept of error correction, whereby one studies how deviations from the long run are corrected.
The VECM is described on this slide. In this definition, the matrix Pi is built from the identity matrix minus the sum of the Pi i matrices in the original VAR model. The first component, Pi times Yt-1, is the only part that can be integrated of order 1; all other components are differences of the lags t-1 to t-p+1, and hence are stationary. To ensure that the model is identifiable, we need to assume conditions on the Gamma k and Pi matrices, as provided on the slide. I will not expand further on this model, since it requires more advanced modeling. Its application is more relevant to understanding relationships between time series rather than forecasting.
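In symbols, the VECM has the general form:

\Delta Y_t = \Pi\, Y_{t-1} + \Gamma_1 \Delta Y_{t-1} + \cdots + \Gamma_{p-1} \Delta Y_{t-p+1} + \epsilon_t,

where \Pi is built from the identity matrix and the sum \Pi_1 + \cdots + \Pi_p of the original VAR coefficient matrices (the sign convention for \Pi differs across references), and the \Gamma_k matrices are functions of the \Pi_i.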
In summary, in this last lesson, I introduced an extension of the VAR model which allows us to model non-stationary multivariate time series.
261
3.4: Case Study 1: Alaska Moose Population
The time series plots of the factors in this data set are shown here. We'll focus on the total moose population in this study. We can see that many of the factors provided have some overlapping patterns with this time series.
The questions we'll address in this lecture are: can we forecast the total moose population with the aid of other factors? Does the wolf population impact the moose population? Does the moose population tend to decrease with heavy snow? Again, such questions can be addressed with a multivariate time series analysis.
262
3.4.1 (continued)
In a previous lecture I developed a forecast of the moose population based on two methods for modeling univariate time series: the ARIMA model, and the model in which we first estimated the trend using nonparametric regression and applied ARMA to the detrended time series. The forecasts are provided here along with the observed data for comparison. The model in which we account for nonlinear trend, shown in red, performs better than ARIMA, shown in blue.
As pointed out before, it is possible for other factors to impact the moose population. For example, the population of moose predators, wolves for instance, can be a key indicator of the subsequent year's moose population. In general, a higher predator population in a given year will not only influence the following year's moose population, but is also expected to have a longer-lasting effect.
Additionally, natural factors such as weather and environmental conditions play an influential role. Average snow depth for a given year will be used as a measure of how harsh a particular winter season is. The population of the city of Fairbanks, the largest city in the region, as well as the number of animals harvested by hunters each year, can reveal the effect of humans on the population of surrounding moose. Can we improve on the forecasting of the Alaskan moose population by accounting for such factors?
263
3.4.2 Model Fitting
In this lesson I'll illustrate the model fitting for the Alaska moose population.
Because the objective of this analysis of the data on the total moose population in Alaska is to predict the moose population, we'll divide the data into training and testing, where the training data is used for fitting the model and the testing data is used for evaluating the forecasting performance of the fitted models. Next we analyze only the training data. In the previous lesson, we learned that the data in the study are not stationary. Thus, we'll next assess whether differencing the time series will improve stationarity.
264
3.4.2 (continued)
Here we apply the first order difference first. These are the time series plots for the first order differences. The first order differenced time series for the total moose population still has a seemingly increasing trend. For the other time series, the first order difference is sufficient to make the differenced time series approximately stationary.
265
Considering higher order differences, as provided on this slide, removes the trend in the total moose population. However, we still identify some possible nonstationarity for the other factors. It also makes interpretation of the results and forecasting more challenging. We'll continue analyzing the first order differenced data.
We'll continue with applying the VAR modeling to the first order differenced data. We first select the order using the VARselect R command, with its inputs being the multivariate time series
266
and the maximum order allowed, here equal to 20. The selected order for this particular data is 4. In order to fit a general unrestricted VAR model with no trend, we can use the VAR command, specifying the selected order. Because we didn't specify any option for the type, the default is to fit only a constant for the mean with no trend. Furthermore, if we would like to fit a restricted model obtained via some form of model selection, we can use the restrict command. In the next slide, I will present the model output for the total moose population from both models.
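A sketch of these two fits with the vars package is below; the data matrix name dmoose and the maximum order are assumptions for illustration.

library(vars)
# dmoose: first order differences of the five series (assumed object)
VARselect(dmoose, lag.max = 20, type = "const")$selection   # selects 4 for this data
mod.full <- VAR(dmoose, p = 4, type = "const")               # unrestricted VAR(4), constant only
mod.res  <- restrict(mod.full, method = "ser", thresh = 2.0) # drops terms whose |t| falls below the threshold
summary(mod.full)
summary(mod.res)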
3.4.2 (continued)
This is the output for the total moose population from the unrestricted VAR model. The output provides the estimates of the coefficients along with statistical inference on the statistical significance of the coefficients. The order of the coefficients is as follows: the first five are for the lag one relationships, for the five time series; the next five are for lag two, and so on. The last coefficient is for the constant, or the mean of the time series. For example, the estimated coefficient for the relationship of the lag (t-1) total population in Fairbanks to the total moose population is 0.004. Since the p-value is large for this coefficient, we conclude that it's not statistically significant. The next coefficient is for the relationship with the lag (t-1) time series of moose harvested, and its estimate is -1.41. Again, since the p-value is large, this coefficient is also not statistically significant. In fact, none of the coefficients is statistically significant, since the p-values are all large.
267
This lack of statistical significance is also reflected in the F-test for the overall regression. For this test, the null hypothesis is that all coefficients are 0. Since the p-value is large, we conclude that it's plausible that all coefficients are 0.
3.4.2 (continued)
This is the output for the restricted VAR. The restricted VAR only includes the lag (t-1) time series for the total moose population, and the coefficient for this relationship is statistically significant. Moreover, by including this relationship alone, we explain 42% of the variability in the total moose population. Since none of the lagged relationships with respect to the other time series have been selected, we may conclude that the other time series do not Granger-cause the total moose population. We'll test this formally.
268
We apply here the Wald test, where the null hypothesis is that all coefficients corresponding to each of the four lagged time series, namely the total population in Fairbanks, moose harvested, average snow and total wolf population, are 0, meaning that they do not Granger-cause the total moose population. The test is applied separately for each of the four time series.
From this result, we find that there is no Granger causality for the Fairbanks population and the wolf population, but there is a statistically significant Granger causality with respect to the other two time series. The results from this Granger causality analysis and the restricted VAR model are not consistent. One reason for this inconsistency is that the model selection approach implemented in the R command used to obtain the restricted VAR decides which variables to include in the model based on the t-values, rather than performing a rigorous model selection approach. Thus, the hypothesis testing procedure provided on this slide will be more reliable in identifying Granger causality.
269
3.4.2 (continued)
Since we found that there is Granger causality with respect to only the moose harvested and average snow time series, next we will explore a model with a reduced set of time series, including these two time series and the time series corresponding to the change in the moose population. A similar analysis as before is performed. Based on this model, we now find that there is not a statistically significant Granger causality for the change in moose harvested onto the change
270
of the total moose population. But there is statistically significant Granger causality for the
change in average snow.
3.4.2 (continued)
Once more comparing these results with those provided by the model including all factors, we again see an inconsistency with respect to the Granger causality for the change in moose harvested: while for the model including all factors there was a statistically significant Granger causality, for the model with the reduced number of factors there is not. Thus we conclude that there is a significant Granger causality with respect to the average snow, but we do not have evidence for Granger causality with respect to the other three factors.
Last, we consider the model with the reduced number of time series to predict the total moose population for the last four observations of the time series.
271
Here is the plot of the observed time series for the total moose population for the last 20 years of data, along with the forecasts from the VAR model, the ARIMA model shown in blue, and the model in which the trend is fitted using a nonparametric model and the stationary process is fitted using ARMA, shown in red.
From this plot, the VAR model performs slightly better than the ARIMA model, although they both misfit the trend. The method in which the trend is fitted using a nonparametric regression performs significantly better, clearly capturing the decreasing trend, although it underpredicts for all four years.
In summary, in this lesson I illustrated the model fit and comparison of univariate and multivariate time series models.
272
the Blue Ridge Mountains of the Appalachian Highlands in northeast Georgia. It flows southeasterly for 120 miles and then south along the Georgia-Alabama border for about 200 miles.
The upper Chattahoochee River watershed is the most important, if not unique, source of water supply for the Atlanta area. It also plays a key role in shaping the hydrological and ecological system of the whole basin to which it belongs. Our understanding of such environmental systems can be substantially improved by using multivariate time series analysis and modelling techniques.
The main hydroclimatic time series, namely precipitation, ground/air temperature and stream flow, are used in this study, where the primary objective is to forecast the stream flow. For this study, temperature and rainfall data were acquired from the National Weather Service Forecast office, and the river flow, measured by the discharge in cubic feet per second for the downstream site of Buford Dam, was acquired from the US Geological Survey Water Resources database.
The data available from the two sources have different time frames: 1956 to 2017 for river flow, and 1950 to 2017 for temperature and rainfall. Thus, we will need to first process the data in order to model all three time series within the same time frame.
We first read the data from three different files with somewhat different structures. Make sure to open the files to learn about the structure of the three data files. First, we're missing one observation in the rainfall data, for an October, which we'll replace with the average of the neighboring observations, the average of September and November. Next, we remove the first six years from the temperature and rainfall data. We also
273
have data through July 2017 for temperature and rainfall, but river flow data only until March 2017. Thus, I remove the last four values in the temperature and rainfall time series as well.
Next, I convert the vectors of observations into time series using the ts R command, specifying frequency equal to 12, since the data are observed monthly.
Last, I plot the time series and their auto- and cross-correlation plots.
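A rough sketch of this preparation step is below; the file names, column names and positions used here are assumptions for illustration, not the actual course files.

# Read the three series (hypothetical file and column names)
rain <- read.csv("rainfall.csv")$rain
temp <- read.csv("temperature.csv")$temp
flow <- read.csv("riverflow.csv")$discharge
# Replace the single missing rainfall value by the average of its neighbors
i.na <- which(is.na(rain))
rain[i.na] <- mean(rain[c(i.na - 1, i.na + 1)])
# Drop the first 6 years (72 months) and the last 4 months from temperature and rainfall
rain <- head(rain[-(1:72)], -4)
temp <- head(temp[-(1:72)], -4)
# Convert to monthly time series and plot
river.ts <- ts(flow, start = c(1956, 1), frequency = 12)
rain.ts  <- ts(rain, start = c(1956, 1), frequency = 12)
temp.ts  <- ts(temp, start = c(1956, 1), frequency = 12)
plot(river.ts); acf(river.ts); ccf(river.ts, rain.ts)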
274
These are the time series plots for the three time series. The temperature time series clearly shows its annual seasonality. The seasonal changes in rainfall are not as pronounced, even though there are visibly higher amounts of rainfall during winter months. More importantly, the relationship of temperature and rainfall to the river flow is the one we are interested in. The air temperature shows that the potential evapotranspiration during summer months is nearly four times higher than that during winter months. Thus, the stream flow time series exhibits seasonal patterns related to these variations. During winter months, when the potential evapotranspiration is lowest in the year, the stream flows are mainly determined by the precipitation, while during summer months, when the potential evapotranspiration is higher than the amount of water falling within the watershed, the stream flows are consistently low despite the variations in precipitation.
The ACF plot shows a high degree of seasonality for temperature, and the ACF plot for rainfall shows large values at lags 1 and 12. However, there is not a significant lead-lag relationship between river flow and the two other time series, as shown by the cross-correlation plots.
275
Next, we test the hypothesis of stationarity for all three time series. Here I'm applying the R command adf.test, where the input is river.ts, or any of the three time series. Note that I also specify here alternative = "stationary", which means that the alternative in this hypothesis testing procedure is that the time series is stationary. We see from this test that the p-values are all small, indicating that we reject the null hypothesis; that is, the alternative hypothesis of stationarity is supported. Hence, we have stationarity.
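For reference, the call looks roughly like this (the null hypothesis of adf.test is non-stationarity):

library(tseries)
adf.test(river.ts, alternative = "stationary")
adf.test(temp.ts,  alternative = "stationary")
adf.test(rain.ts,  alternative = "stationary")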
276
But if we consider differencing the time series, the order selected for differencing is 0 for all three time series. While for the river flow time series this may be the case, we have learned from the previous plots that a difference of order 12 may be appropriate for the seasonality in the temperature and the cyclical pattern in the rainfall. The reason why the ndiffs command does not identify an order 12 difference is because it only considers lower order differences.
Next, I consider a log transformation of the time series. This is because the variability of the river flow time series varies with time, suggesting heteroskedasticity, or non-constant variance. One approach to deal with this is to use a transformation; for example, here we are using the log transformation.
Last, in order to account for the seasonality in the temperature and the seasonal pattern in the rainfall, I apply the order 12 difference to all three time series.
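A minimal sketch of these transformations (object names as defined above):

library(forecast)
ndiffs(river.ts)                        # returns 0: no low-order difference suggested
river.log <- log(river.ts)              # log transform to stabilize the variance
river.d12 <- diff(river.log, lag = 12)  # order-12 difference
temp.d12  <- diff(temp.ts,   lag = 12)  # removes the annual seasonality in temperature
rain.d12  <- diff(rain.ts,   lag = 12)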
277
These are the log transformed time series. The non-constant variance for the river flow time series is stabilized.
These are the plots of the differenced time series. The seasonality in the temperature and the cyclical pattern in the rainfall are not present anymore.
278
This is also evident from the ACF plots. We see some cyclical pattern left in the rainfall, indicating that removing the seasonality is not sufficient to account for the cyclical pattern. In summary, in this lesson I introduced yet another data example that I will use for illustrating multivariate time series analysis, and I provided an exploratory data analysis for this example.
279
3.5.2 Univariate Model Fitting & Prediction
In this lesson, I will continue the example on the prediction of river flow, but this time we'll focus on modeling using univariate time series approaches. Again, in this study, we'll forecast the flow of the Chattahoochee River, which is part of an important hydrological system in the southeast of the United States; the Apalachicola-Chattahoochee river watershed is the most important, if not unique, source of water supply for the Atlanta area.
I'll begin with the modelling of the river flow using univariate time series models. We'll apply the modelling to the log transformed time series. I also split the data into training and testing, where the training data include all the years from 1956 through 2015, and the testing data include the year 2016 and the first three months of 2017. I'm not showing the code for model selection here; it has already been discussed for other case studies in previous lessons. The selected orders for the river flow time series are one and four: the AR order is equal to 1 and the MA order is equal to 4, with no differencing identified, that is, d is equal to 0, and the difference order for the seasonal component is selected to be equal to 2. With these selected orders, we fit the seasonal ARIMA model as provided on this slide.
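A sketch of this fit, assuming the training portion of the log river flow is stored in river.train and that the seasonal part uses only the order-2 seasonal difference (the seasonal AR and MA orders are not restated in the transcript, so they are set to 0 here as an assumption):

# Seasonal ARIMA: ARMA(1,4), d = 0, seasonal difference of order 2 with period 12
sarima.fit <- arima(river.train,
                    order    = c(1, 0, 4),
                    seasonal = list(order = c(0, 2, 0), period = 12))
sarima.fit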
280
These are the residual plots for the fitted seasonal ARIMA model. The residuals versus time plot does not show any particular pattern. However, the ACF plot points to large autocorrelations at lags 1 and 12. The histogram and the normal probability plot show that the residuals have an approximately normal distribution. From these plots, the only possible concern is that there is some autocorrelation in the residuals.
281
Let's apply a test for uncorrelated data using the Box.test R command. The important inputs in this R command are fitdf and lag. The input for fitdf is the number of coefficients in the ARMA model; this is used because we are applying the Box test for uncorrelated data to the residuals. Since we use an ARMA model with orders one and four, fitdf is equal to five. The lag input needs to be higher than fitdf; here, we set it to six. We also have the option to use different test statistics, specified by the type option. The p-value for the testing procedure for uncorrelated data is large, which means that we do not reject the null hypothesis of uncorrelated residuals. Hence, it's plausible that the residuals are uncorrelated.
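The call looks roughly like this (sarima.fit denotes the fitted seasonal ARIMA model):

# Ljung-Box test on the residuals; fitdf = 1 + 4 = 5 ARMA coefficients, and lag must exceed fitdf
Box.test(resid(sarima.fit), lag = 6, type = "Ljung-Box", fitdf = 5)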
The forecast R function in the library forecast can be used to derive the predictions for the river flow time series based on the seasonal ARIMA model. To plot the forecasts, we apply the forecast function with inputs the fitted model and the number of lags ahead for prediction, in this case h equal to 15, and then apply the plot function to the forecast object.
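For example:

library(forecast)
sarima.fc <- forecast(sarima.fit, h = 15)   # 15 months ahead: 2016 plus January-March 2017
plot(sarima.fc)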
282
The plot is provided here. The black line shows the observed time series, and at the end we have the predictions along with the prediction intervals. But it's hard to assess how close the forecasts are to the observed time series values for the test data. We can see the predictions have high uncertainty, since the prediction intervals are wide.
283
Next, we'll extend the ARIMA model to account for exogenous factors, such as temperature and rainfall. Analogous to the VARX model, this is called ARIMAX. The arima R command allows specification of exogenous factors through the xreg option. As provided here, both the temperature and rainfall time series are considered exogenous factors in modeling and predicting river flow.
We apply again the model selection. This time, the selected orders for ARMA are 1 and 4, whereas the difference order for the seasonal component is 0, indicating no differencing for the seasonal component when accounting for the exogenous factors.
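A sketch of the ARIMAX fit, assuming training and testing versions of the exogenous series aligned with river.train (the object names are illustrative):

# ARMA(1,4) with temperature and rainfall as exogenous regressors; no seasonal differencing
xreg.train <- cbind(temp = temp.train, rain = rain.train)
arimax.fit <- arima(river.train,
                    order = c(1, 0, 4),
                    xreg  = xreg.train)
# Forecasting requires future values of the exogenous regressors:
arimax.pred <- predict(arimax.fit, n.ahead = 15,
                       newxreg = cbind(temp = temp.test, rain = rain.test))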
284
The residuals are shown here. The residuals versus time plot does not present any pattern. The ACF plot of the residuals resembles that of white noise, indicating possibly uncorrelated residuals. The residuals also have an approximately symmetric and approximately normal distribution, as shown by the histogram and the normal probability plot.
285
Here, we apply once more the test for uncorrelated data. Similarly to the ARIMA model without exogenous factors, we do not reject the null hypothesis of uncorrelated residuals when applying this testing procedure, because the p-values are large.
We can obtain the forecasts for this ARIMAX model similarly as before. The R code used for extracting the point forecasts and the lower and upper prediction intervals is in the R code files accompanying this lecture. On this slide, I'm comparing the predictions using ARIMA without exogenous factors, in the upper figure, and with exogenous factors, in the lower figure. I specify the same scale for both plots for comparison.
The predictions using ARIMA without exogenous factors are not better than when accounting for exogenous factors; the point predictions for ARIMAX do not have as high uncertainty as when using the simpler ARIMA model.
In summary, in this lesson I provided the univariate modeling for the river flow data example.
286
3.5.3 VAR(X) Model Fitting
In this last lesson, I will finalize the analysis of the river flow data with an analysis using multivariate time series modeling. In this study, we'll forecast, again, the flow of the Chattahoochee River.
We'll begin with a classic VAR model applied to the trivariate time series, including the log transformations of river flow, temperature and rainfall. As before, we first select the order for the VAR model, which is p equal to 1 for this data. When fitting the VAR model, we consider a seasonal deterministic component, specified as season equal to 12, and we assume a trend, specified as type equal to both, which means fitting both a constant and a linear trend. We can also further reduce the model using the restrict R command. We'll next look at the outputs for the river flow for both the VAR and the restricted VAR.
287
This is the output from the VAR model for modeling the river flow time series. The first three rows in the output matrix of estimates are the coefficients for the lag (t-1) time series. The next coefficients are for the constant and for the trend, and the last 11 coefficients are for modeling seasonality. Note that we have only 11 coefficients for seasonality even though we are modeling monthly seasonality, for which we might expect 12 coefficients. Similar to modeling seasonality using dummy variables, as introduced in the first lecture of this class, we only include 11 of the 12 dummy variables: because the 12 dummy variables are linearly dependent with the column of 1s, and since we include a constant, or intercept, we only include 11 of the dummy variables.
What can we learn from this output? First, the coefficients for lagged river flow and rain are statistically significant. Second, the coefficient corresponding to the trend is not statistically significant, given that all other terms modeling seasonality and serial correlation are included in the model. Among the seasonality dummy variables, some of the coefficients are statistically significant and some are not. The R-squared of the model is about 0.57; using the interpretation from linear regression analysis, 57% of the variability in the response is explained by the predicting variables included in the model.
288
This is the output from the restricted VAR model for modeling the river flow time series. Note that the restricted VAR simply drops those coefficients which are not statistically significant, which is not the correct approach to model selection, since, as we learned in regression analysis, when removing one predicting variable the statistical significance of the remaining ones can be very different from the model including that predicting variable. To understand this concept, I recommend learning about variable selection from the online regression analysis course. Based on this output, we can see that the R-squared has increased significantly, to 99%. This is much higher than for the full model.
289
Do temperature and rainfall Granger-cause river flow? Here is the output from performing the Wald test for the null hypothesis that all coefficients corresponding to the lag (t-1) values of the two time series are 0. Based on this test, the p-value is small; hence, we reject the null hypothesis. This test says that at least one of the two time series Granger-causes river flow.
290
Let's compare the predictions based on the unrestricted and restricted VAR models versus the observed time series for 2016 and the first three months of 2017, hence predicting 15 lags ahead. This time, I'm using the predict R command instead of the forecast R command to show you a different implementation of the prediction. First, we consider the predictions only for the river flow, hence the dollar sign followed by the name of the time series.
Note that predict will provide predictions for each of the three time series, but we are only interested in the predictions of the river flow. Also note that the output of the predict R command is a list, and only the first element in the list provides the predictions. The object called pred in this slide provides not only the point predictions, or forecasts, but also the prediction intervals for the predictions. We apply this implementation for both models.
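A sketch of this prediction step, with mod.var and mod.varres denoting the unrestricted and restricted fits and flow the assumed name of the river flow component:

pred     <- predict(mod.var,    n.ahead = 15, ci = 0.95)
pred.res <- predict(mod.varres, n.ahead = 15, ci = 0.95)
# predict() returns forecasts for all series; keep only the river flow component.
# Each element of $fcst is a matrix with columns fcst, lower, upper and CI.
flow.fc    <- pred$fcst$flow[, "fcst"]
flow.lower <- pred$fcst$flow[, "lower"]
flow.upper <- pred$fcst$flow[, "upper"]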
Here are the predictions and the prediction intervals. The points in red and green are the point predictions from the two models, and the dotted lines in red and blue are the prediction bands. Both models perform similarly. The point predictions are quite different from the observed data. More importantly, the observed data fall outside of the prediction intervals; in fact, the models over-predict, pointing to poor prediction overall.
291
Next, I will illustrate the implementation of the VARX model, in which I fit a VAR model to the bivariate time series including the log transformations of the river flow and rainfall, while temperature is now an exogenous factor. This implementation is used only to illustrate the VARX model.
The implementation uses the VAR R command with similar inputs as before, except that this time I'm including another option in the VAR function specifying the exogenous variable, exogen equal to temperature. The rest of the implementation is just like the VAR model without exogenous variables.
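A sketch of the VARX fit with the vars package is below; the object names (river.log.train, rain.train, temp.train, temp.test) are assumptions for illustration.

library(vars)
endo      <- cbind(flow = river.log.train, rain = rain.train)      # endogenous bivariate series
exo.train <- matrix(temp.train, ncol = 1, dimnames = list(NULL, "temp"))
exo.test  <- matrix(temp.test,  ncol = 1, dimnames = list(NULL, "temp"))
mod.varx  <- VAR(endo, p = 1, type = "both", season = 12, exogen = exo.train)
# Forecasting then requires future values of the exogenous variable via dumvar:
predict(mod.varx, n.ahead = 15, dumvar = exo.test)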
292
This is the output of the VARX model, which now includes temperature as an exogenous variable. The regression coefficient for temperature as an exogenous variable is not statistically significant, given that we accounted for seasonality, trend and serial correlation in the river flow modeling. Importantly, the estimated coefficients based on this model are similar to those from the model where temperature is an endogenous variable.
293
In this slide, I'm comparing the predictions of log river flow based on the three models: VAR, ARIMA and ARIMAX. ARIMAX and VAR perform similarly, because ARIMAX accounts for lagged relationships of temperature and rainfall. ARIMA does not predict worse than VAR or ARIMAX, but it does have more uncertainty, with a wider prediction band that includes the observed values, and hence, in that sense, has better predictions.
In summary, I concluded the analysis of the river flow data with the modeling using multivariate time series, and I compared the forecasts based on univariate and multivariate time series models.
294
Unit 4: Modeling Heteroskedasticity
First, we'll learn about the difference between conditional and unconditional variance, given a time series. The conditional variance of the time series at time t is the variance of the time series given the past values of the time series. In contrast, the unconditional variance of the time series at time t is simply its variance, disregarding the past data. We commonly refer to the conditional variance as volatility.
In the simple case when the time series data are independent (what happens at time t does not depend on what happened in the past), the conditional variance is the same as the unconditional variance, since the variability at time t is independent of the variability of the past data. Independence implies lack of dependence, or lack of correlation.
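In symbols, for a time series X_t:

\operatorname{Var}(X_t \mid X_{t-1}, X_{t-2}, \ldots) \;\; \text{(conditional variance, or volatility)}, \qquad
\operatorname{Var}(X_t) \;\; \text{(unconditional variance)},

and under independence the two coincide.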
295
On the other hand, consider the case when the time series data are uncorrelated but dependent. The covariance between any two variables in the time series is zero, meaning that there isn't a linear dependence between any pair of variables in the time series. However, it is possible to find a transformation f of the variables in the time series such that f(x1), f(x2), ..., f(xn) are in fact correlated; that is, there is a nonlinear dependence.
One common transformation we'll use in this course is the power transformation, for example the square power transformation. Under uncorrelated, but not independent, data, we do not have that the conditional variance is equal to the unconditional variance.
4.1.1 (continued)
How can we diagnose independence versus uncorrelated but dependent data? We have independence when the ACF plot of the time series, as well as of transformations of the time series, resembles the ACF of white noise. On the other hand, we have uncorrelated but still dependent data if the ACF plot of the time series resembles that of white noise, but the ACF plot of a transformation of the time series does not.
The difference between independent and uncorrelated data becomes apparent when we study the residuals of an ARMA model applied to data with non-constant conditional variance. We'll see in some illustrative data examples that while the residuals are white noise, the squared residuals may not be white noise.
296
4.1.1 (continued)
ARMA models are used to model the conditional expectation of a process given the past data. But in an ARMA model, the conditional variance given the past data is assumed constant.
The class of heteroskedastic models is a generalization of the models considered so far, obtained by relaxing the assumption of constant conditional variance of the time series, and therefore modeling a nonlinear time series.
What does this mean for, say, modeling stock returns? For example, suppose we have noticed that recently the daily returns have been unusually volatile. We might expect that tomorrow's return is also more variable than usual. However, an ARMA model cannot capture this type of behavior, because its conditional variance is constant.
So we need different time series models if we want to model the non-constant variance, or volatility. We need to consider more general models in which both the conditional expectation and the conditional variance depend on past time series data.
297
To be more specific, suppose we begin with an ARMA model as provided on the slide, where the residual
process Zt is assumed to be white noise with constant variance sigma squared.
However, in many applications, time series data often exhibit volatility clustering, where the time
series shows periods of high volatility and periods of low volatility, indicating non-constant
variance. If that's the case, we can rewrite the error term Zt as the product between a
deterministic but time-varying component sigma t and a random process Rt with mean 0 and
variance 1.
4.1.1 (continued)
298
In the formulation provided in the previous slide, sigma t is of interest as it models the
conditional variance of the time series. Sigma t is often referred to in the financial literature as
volatility.
What are the important features of sigma t? The function sigma t may be high for certain time
periods and low for other periods, reflecting clusters of variability. The function sigma t
varies with time in a continuous manner; in other words, there are no discontinuities or
anomalies in the volatility. The function sigma t varies within a specific range; that means it is
not infinite, it takes a finite range of values. The function sigma t is commonly assumed to have
the so-called leverage property; in other words, it reacts differently to large amplitude changes
than to small amplitude changes in the time series.
We can then use nonparametric regression on the log of the squared residuals to estimate the log of
the conditional variance, that is, log of sigma t squared. To obtain the estimate of the conditional
variance, we can transform back by taking the exponential of the estimate of log sigma t squared.
This approach can provide a good estimate of the conditional
299
variance if it changes smoothly over time. For modeling more complex
structures in the volatility, other models can be considered, as introduced in this lecture.
4.1.1 (continued)
Modeling heteroskedasticity has become important in the analysis of time series data,
particularly in financial and economic applications. These models are especially useful when the
goal of the study is to analyze and forecast volatility. If we can effectively forecast volatility, then
we'll be able to price, for example, more accurately, create more sophisticated risk management
tools or come up with new strategies that take advantage of the volatility.
Classic applications in finance where the estimation of volatility is needed include, for example,
option pricing, specifically the Black-Scholes model for option pricing, which
depends on the volatility of the underlying financial instrument. It is also needed for
tradable securities, where volatility can be traded directly, through the introduction of the CBOE
volatility index, abbreviated VIX.
To be even more specific, for financial data it is very difficult to predict the mean of the price of a
financial instrument such as a stock price. If it were easy, other investors would predict
the price as well and trade in such a way as to remove the advantage of the prediction. As a
result, due to the removal of the predictable information from the financial instrument by
investors, almost all that is left is a random walk with a weak trend. The long-term growth rate
reflects how to trade a financial instrument: if it has large volatility in the future, the returns need
to be larger to attract investment, to compensate for the risk of a potential loss.
300
Other applications of the estimation of volatility, or the conditional variance, are in
risk management, where volatility is a direct measure of risk in management and investment.
More generally, volatility is a measure of uncertainty in any decision making. Similar principles
as in investment in financial instruments apply within other settings where risk or uncertainty
plays an important role in decision making.
To simulate a time series with non-constant conditional variance in R, we
perform a simulation approach similar to that used for the random walk simulation. That is, we
need to simulate a vector w for the random white noise, then a different vector, called here eps,
for our time series values, and finally a vector, called sigsq in this simulation, for the
conditional variance. The coefficients a0 and a1 capture MA effects and b0 captures
AR effects in the conditional variance, as provided in the formula for sigma squared, the
conditional variance. The last two R commands plot the ACF of the time series, as
well as of the squared time series.
The simulation model is described here. The time series is epsilon t. Sigma t is the conditional
standard deviation, or the square root of the conditional variance.
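As a minimal sketch of the simulation just described (the coefficient values and the GARCH(1,1)-type form of the variance recursion are assumptions for illustration; the lecture's own script may differ):

set.seed(1)
n  <- 1000
a0 <- 0.1; a1 <- 0.4; b0 <- 0.2        # assumed coefficient values
w     <- rnorm(n)                      # white noise R_t
eps   <- rep(0, n)                     # simulated time series
sigsq <- rep(0, n)                     # conditional variance sigma_t^2
for (t in 2:n) {
  sigsq[t] <- a0 + a1 * eps[t - 1]^2 + b0 * sigsq[t - 1]
  eps[t]   <- w[t] * sqrt(sigsq[t])
}
acf(eps)       # ACF of the time series
acf(eps^2)     # ACF of the squared time series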
301
The ACF plot of the time series is in the upper panel and the ACF plot of the squared time
series is in the lower panel. While the ACF plot of the time series resembles that of white noise, the
ACF plot of the squared time series does not. We see that the first three lags are outside of the
confidence band, indicating non-zero serial correlation in the squared time series. This is an
example of uncorrelated but not independent data, since the squared time series is correlated.
302
We'll consider one data example of a financial instrument, the stock price of a company. I
selected here a company with very large volatility, PDC Energy (ticker PDCE), which is a crude oil and
natural gas producer with headquarters in Denver, Colorado. Daily stock prices for more than ten
years of data, starting with January 2007, are considered in this data example. The reason for
being interested in assessing volatility for this company is that it is highly dependent on the
crude oil price.
For this analysis, I'm using R functions in the quantmod package for R, which is designed to
assist quantitative traders in the development, testing, and deployment of
statistically based trading models. Using the R command getSymbols, it is possible to
load data from a variety of sources, such as Yahoo Finance, the Federal Reserve Bank, and Google Finance;
these data include stock prices and other financial instruments. In this implementation, the source is
Google, and the R command from this package used for plotting the time series is
candleChart.
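A minimal sketch of this step (the ticker symbol is an assumption; since Google Finance is no longer a supported source for getSymbols, Yahoo Finance is used here instead):

library(quantmod)
getSymbols("PDCE", src = "yahoo", from = "2007-01-03")   # daily stock price
candleChart(PDCE, theme = "white")                       # time series plot
logret <- diff(log(Cl(PDCE)))[-1]                        # return of the log closing price
plot(logret)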
303
The time series plot is here. The time series is clearly non-stationary with a nonlinear trend, but
how about the volatility? It seems that there are periods of higher volatility.
A better way to visualize volatility is by plotting the return of the log price, and here is the plot of the
return time series. This time series clearly shows large volatility in the first
quarter of 2008, for example. This time series indicates that we have non-constant conditional
variance.
304
We'll first fit an ARIMA model. For this, we'll select the orders of the ARIMA model, including the
ARMA orders and the order of the integrated part of the model. The code provided
here is slightly different from the one we used before in fitting an ARIMA model, because here we are
only interested in selecting the order, without saving the AIC values. We begin with zero orders,
then search for p and q orders between zero and ten. For the difference
order, we select between zero and one.
We fit the ARIMA model for all combinations of orders and update final.order and the final ARIMA
fit whenever the AIC value improves, that is, decreases. The selected orders are an AR
order p equal to 5 and an MA order q equal to 6. The selected order for the
difference of the time series is zero. Last, we obtain the residuals for the final model.
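A sketch of the AIC-based order search described above, assuming the return series is stored in an object called logret (the object and variable names are illustrative):

final.aic   <- Inf
final.order <- c(0, 0, 0)
for (p in 0:10) for (d in 0:1) for (q in 0:10) {
  fit <- tryCatch(arima(logret, order = c(p, d, q), method = "ML"),
                  error = function(e) NULL)
  if (!is.null(fit) && AIC(fit) < final.aic) {
    final.aic   <- AIC(fit)        # keep the model only if the AIC decreases
    final.order <- c(p, d, q)
    final.arima <- fit
  }
}
resids <- resid(final.arima)       # residuals of the final model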
The residual plots are here. The ACF plot of the residuals looks like that of a white noise
process. On the other hand, the ACF plot of the squared residuals suggests that there is serial
dependence in the squared residuals. Thus, the residuals are linearly uncorrelated, but not
independent, since the transformed residuals, specifically the squared residuals, are correlated.
305
Let's test whether the residuals and squared residuals are correlated using hypothesis testing.
We apply here the Box-Ljung test. Note that the test is not for testing independence, but for
testing whether the data are uncorrelated, where the null hypothesis is that the data are
uncorrelated versus the alternative that the data are correlated. Based on this test, the p-value for
testing uncorrelated squared residuals is very small, indicating that we reject the null
hypothesis and thus conclude that the squared residuals are correlated. The p-value for
testing whether the residuals are uncorrelated is also small, suggesting that an ARIMA model
does not lead to uncorrelated residuals for this particular example.
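A sketch of the two tests described above, applied to the ARIMA residuals (the lag value is an assumption; fitdf reflects the eleven ARMA coefficients of the selected ARIMA(5,0,6) model):

Box.test(resids,   lag = 12, type = "Ljung-Box", fitdf = 11)   # residuals
Box.test(resids^2, lag = 12, type = "Ljung-Box")               # squared residuals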
306
Last, we estimate the variance using nonparametric regression. As presented in the previous
lesson, we take the log of the squared residuals and apply a nonparametric regression approach. We
can apply a splines regression model using the gam function in the mgcv library of R.
In this implementation, the response is the log of the squared residuals. Here, I'm replacing the
close price in the PDCE price object with the square root of the exponential of the fitted values, in
order to maintain the same data structure.
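A minimal sketch of the nonparametric estimate of the volatility described above, using the gam function from the mgcv package (object names are illustrative):

library(mgcv)
n         <- length(resids)
time.pts  <- (1:n) / n                              # rescaled time points
gam.fit   <- gam(log(resids^2) ~ s(time.pts))       # splines regression on log squared residuals
sigma.hat <- sqrt(exp(fitted(gam.fit)))             # back-transform to the volatility scale
plot(sigma.hat, type = "l", ylab = "Estimated volatility")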
The plot of the fitted volatility is here. We can see that, indeed, the volatility has a nonlinear trend,
with three years of high volatility.
307
To see whether I captured the high volatility periods using this estimation approach, I'm contrasting here
the estimated volatility with the time series of the return log price. By contrasting these two plots, we
clearly see that the periods during which we estimate high peaks in the volatility correspond to
years of high variability in the return price.
In this lesson, I'll introduce one of the simplest time series models used to model
the conditional variance, the so-called ARCH model.
Let's assume we begin with fitting an ARMA model to a time series Yt as provided on the
slide, where the residual process Zt is assumed to be white noise with constant variance
sigma squared. However, in many applications the time series data often exhibit
volatility clustering, where the time series shows periods of high volatility and periods of low
volatility, indicating non-constant variance. If so, we can rewrite the error term Zt as the
308
product between a deterministic but time-varying component sigma t and a random
process Rt with mean 0 and variance 1.
We'll begin with one of the simplest time series models, introduced by Robert Engle
in 1982: the autoregressive conditional heteroskedasticity, or in short ARCH, model, used to
model the time-varying volatility often observed in economic time series data. For this
contribution, he won the 2003 Nobel Prize in Economics. ARCH models assume the variance of
the current error term, or innovation, to be a function of the actual sizes of the previous time
periods' error terms. Often the variance is related to the squares of the previous innovations or
error terms.
309
4.1.3 (continued)
The ARCH family of models assumes that the volatility, or sigma t squared, is a linear function
of lagged values of the mean equation errors Zt. Thus, the time-series dynamic of the
volatility is like an AR process. Specifically, the squared errors Zt squared are modeled using an AR
process, where gamma 0, gamma 1, and so on are the AR coefficients, and omega t
is the AR error, assumed to be uncorrelated with mean 0 and variance lambda squared.
The expectation of Zt squared, given the past history of the error process, is equal to the AR
polynomial applied to the past squared errors. But the conditional expectation of Zt squared is also equal to sigma t squared.
Hence, we model the conditional variance as a linear combination of past squared error terms:
gamma 0 + gamma 1 times Zt-1 squared + gamma 2 times Zt-2 squared, and so on.
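Written out with the notation used on the slides, the ARCH(p) relations just described can be sketched as (a sketch, not a verbatim copy of the slide):

$$
Z_t^2 = \gamma_0 + \gamma_1 Z_{t-1}^2 + \cdots + \gamma_p Z_{t-p}^2 + \omega_t,
\qquad
\sigma_t^2 = \mathrm{E}\!\left[Z_t^2 \mid Z_{t-1}, Z_{t-2}, \ldots\right]
= \gamma_0 + \sum_{i=1}^{p} \gamma_i Z_{t-i}^2 .
$$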
In introductory statistics courses, it is often mentioned that independence implies lack of serial
correlation, but not vice versa. A process such as the ARCH process, where the conditional
mean is constant but the conditional variance is not, is an example of an uncorrelated but
dependent process. The dependence of the conditional variance on the past
causes the process to be dependent.
4.1.3 (continued)
310
Thus, in ARCH models, the conditional variance has a structure very similar to the structure of
the conditional expectation in an AR model. I explain this concept with the simplest ARCH
model: ARCH of order 1. We'll consider here Yt to be a process with a mean that
does not vary with time, plus the Zt process, which is modeled using an ARCH(1) model.
That is, the conditional variance is equal to gamma 0 + gamma 1 times Zt-1 squared.
311
Note that sigma t is assumed deterministic; the past Zt are assumed given and hence realized
at time t. However, we do not observe volatility with certainty. The true variability of Zt is sigma t
squared, which is deterministic, plus omega t, which is random, where omega t has mean 0 and
variance 1. Those two components are also assumed uncorrelated. Together this leads to the
AR(1) equation for Zt squared, as provided on this slide.
This equation is crucial in understanding how ARCH processes work. If Zt-1 has an unusually
large absolute value, then sigma t is larger than usual, and so Zt is also expected to have an
unusually large magnitude. This volatility propagates, since when Zt has a large deviation it
makes sigma t+1 squared large, so that Zt+1 tends to be large, and so on. Similarly, if
Zt-1 squared is unusually small, then sigma t squared is small, and Zt squared is also
expected to be small, and so forth. Because of this behavior, unusual volatility in Zt tends to
persist, though not forever.
4.1.3 (continued)
312
Under stationarity, the unconditional variance of Zt is gamma0/(1-gamma1), where gamma0 and gamma1 are the AR(1) coefficients for
Zt squared.
We also need to have that Yt, and hence Zt, are uncorrelated. Because we need the
variance to be finite and positive under stationarity, gamma 1 needs to be smaller than 1.
These are the conditions for the so-called weak stationarity.
In order to ensure strong stationarity, we need to impose additional constraints on gamma 1,
to ensure that the fourth moment of Zt is finite. The fourth moment, assuming normality, is
provided on the slide. In order for this to be finite, we also need to impose the
additional constraint that 3 times gamma 1 squared is smaller than 1. This is needed in the derivation
of the kurtosis, which is a measure of how fat the tails are. The kurtosis of the ARCH(1) model
under the assumption of stationarity is larger than 3, which means that an ARCH(1) model for
Zt has fat tails, and thus it is more likely to produce outliers than simple normal white noise.
313
4.1.3 (continued)
This is a comparison of distributions with different kurtosis. For a distribution with a kurtosis
coefficient larger than three, such as the one in the middle, the curvature is high in the middle of the
distribution and the tails are fatter than those of the normal distribution provided on the right. A
distribution of data with fat tails is frequently observed in financial markets.
We can combine the ARCH model with an ARMA model, as illustrated in the first slide of this
lesson. Some specific examples are provided here. We can begin by assuming an AR(1)
314
process for modeling Yt, with mean mu and AR coefficient phi. Then the error process Zt can be
assumed to follow an ARCH(1) model; here the conditional variance is gamma 0 + gamma 1
times Zt-1 squared.
A second example provided on this slide is the ARMA(1,1) and ARCH(2) model, where Yt is
modeled by an ARMA(1,1) with the error process Zt being modeled by an ARCH(2) model.
Hence, the conditional variance is gamma 0 + gamma 1 times Zt-1 squared + gamma 2 times
Zt-2 squared.
It's important to note that when we have joint models, as provided here on this slide, we need to
estimate them jointly. It's not good practice to first estimate the ARMA process for Yt and then to
apply an ARCH model to the residual process. We will discuss this aspect with a data example
in a different lesson.
4.1.3 (continued)
How do we estimate the ARCH coefficients, or the ARCH model? The common approach is
maximum likelihood estimation. Because the variables in a time
series Yt are not independent, the likelihood function cannot be written as the
product of the individual marginal distributions of Yt. Instead, it needs to be decomposed
as a product of conditional distributions. I'm introducing here the concept of conditional
likelihood.
As discussed in the first lecture of this course, reviewing essential concepts in statistics
such as the decomposition of the joint distribution, we can decompose the joint distribution of the
315
random variables Yt, Yt-1 up to Y1, modeled by an ARCH(p) model, using the unconditional
likelihood as defined on the slide.
The second term in the decomposition is the joint distribution of the first p random variables in the
time series. Except for small order ARCH models, this last term is difficult to express.
Therefore, we ignore it if p is much smaller than t. Thus, in fitting an ARCH(p) model, we
instead maximize the conditional likelihood defined on the slide. The maximization is with
respect to the AR coefficients in the ARCH model. Since the estimation requires a numerical
algorithm, it is only done using statistical software such as R.
In summary, in this lesson I introduced one of the simplest models used to model the conditional
variance, or heteroskedasticity, in a time series: the so-called ARCH model.
316
4.1.4 ARCH: Data Examples
This lesson is on modeling heteroskedasticity in a time series. And in this lesson I'll focus on a
specific data example for illustrating the ARCH model.
We'll return to the data example in which we model the stock price of a company. I selected
here a company with very large volatility, PDC Energy (ticker PDCE), which is a crude oil and natural gas
producer with headquarters in Denver, Colorado. Daily stock prices for more than ten years of
data, starting with January 2007, are considered in this data example. The reason for being
interested in estimating the volatility for this stock price is that it is highly dependent on the
crude oil price.
In the lecture on ARMA modeling, we learned that we can identify the order of an AR process
using the PACF plot. Since ARCH is an AR process for the squared Zt process, or the
squared residual process, we can use the PACF plot to identify the order of the ARCH model.
Thus, in this example, I used the R command pacf with the squared residual process as input.
And this is the PACF plot. From this plot we see that the PACF is large for the first seven orders,
but decreases afterwards. However, there are still some large PACF values outside of the
confidence band for larger orders. Using this approach, we may select an order equal to seven,
as fitted in the command on this slide using the garch R command. In this command, we
input the process to be modeled, in this case the residual process, and the order for the ARCH
model. Please pay attention to how the ARCH order is specified in this R command. We'll see
other implementations of ARCH and GARCH models where the order is specified differently.
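A sketch of this step, using the garch function from the tseries package and assuming the ARMA residuals are stored in resids (object names are illustrative):

library(tseries)
pacf(resids^2)                                # identify the ARCH order
arch.fit <- garch(resids, order = c(0, 7))    # order = c(0, 7): no GARCH terms, ARCH order 7
summary(arch.fit)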
317
Last, we evaluate the goodness of fit of the ARCH fit by first obtaining the residuals from the
ARCH model. Note that I remove the first seven values from the residual vector since they will
be NAs; this is because we fit an ARCH model of order seven. I plotted here the ACF of the
residuals and of the squared residuals from the ARCH model to check whether the residuals are
uncorrelated and whether there is an ARCH effect left in the residuals. To complement this
analysis, I also applied the Ljung-Box hypothesis testing procedure, testing for serial
correlation and for an ARCH effect in the residuals. The inputs lag equal to eight and fitdf equal to
seven are used because we applied the test to residuals derived from a model with seven
coefficients.
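A sketch of the goodness-of-fit checks described above (object names follow the previous sketch and are illustrative):

arch.res <- resid(arch.fit)[-(1:7)]          # drop the first 7 NA residuals
acf(arch.res)                                # serial correlation in the residuals
acf(arch.res^2)                              # ARCH effect left in the residuals
Box.test(arch.res,   lag = 8, type = "Ljung-Box", fitdf = 7)
Box.test(arch.res^2, lag = 8, type = "Ljung-Box", fitdf = 7)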
4.1.4 (continued)
The output of the ARCH fit is here. It is similar to that of an AR fit. The output provides the
estimated AR coefficients, along with inference on the statistical significance of the AR
coefficients. According to this output, all coefficients are statistically significant; hence, an
order of seven is appropriate. The output also provides the results from hypothesis testing
procedures for uncorrelated residuals and for an ARCH effect in the residuals. According to these
tests, we reject the null hypothesis of uncorrelated residuals, but we do not reject the null
hypothesis in the test for an ARCH effect in the residuals.
318
Applying the Box-Ljung test we have some evidence to reject the null hypothesis for both tests,
and the p-values are both small, although not smaller than 0.01. Given this level of p-values, we
reject the null hypothesis at the significance level of 0.05, but not at the significance level 0.01.
319
4.1.4 (continued)
The ACF plots, on the other hand, look like those of white noise, since the ACF values are within
the confidence band except at lag 0.
I will note here that there are other R packages to fit ARCH, and more generally GARCH, models,
which will be discussed in other lessons of this lecture. Two such implementations are the
garch function from the tseries package and the garchFit function from
the fGarch package. The garch function from the tseries package is fast but does not always
find a solution. The garchFit function from the fGarch package is slower but converges more
320
consistently. So far, I illustrated an implementation using the garch function in the tseries
library.
This slide also presents the implementation using the garchFit function. We can fit the
ARMA-ARCH model using the garchFit function in two ways. One is to fit an ARMA process on
the time series first, then model the residuals from the ARMA fit using an ARCH model. However,
this approach does not provide efficient estimates for the ARMA-ARCH joint model, not as
efficient as when we estimate the two models simultaneously.
We can also fit the joint model using the garchFit function, by specifying the orders of the
ARMA model through the arma option plus the orders of the ARCH model specified by the
garch option. Note that for this implementation the first order is for the AR part of the
conditional variance model, and thus it is specified as garch(7, 0). This is different from the
implementation using the garch R function in the tseries library, where we specify order (0, 7).
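A sketch of the two garchFit approaches described above (fGarch package); the data objects resids and logret follow the earlier sketches and are illustrative:

library(fGarch)
# (a) ARCH(7) fitted to the ARMA residuals:
fit.res <- garchFit(~ garch(7, 0), data = as.numeric(resids), trace = FALSE)
summary(fit.res)
# (b) Joint ARMA(5,6)-ARCH(7) model fitted to the returns directly:
fit.joint <- garchFit(~ arma(5, 6) + garch(7, 0), data = as.numeric(logret), trace = FALSE)
summary(fit.joint)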
4.1.4 (continued)
This slide compares the outputs for fitting the ARCH model to the residual process using the two
implementations: the garch function on the left and the garchFit function on the
right. The omega coefficient in garchFit corresponds to the a0 coefficient in garch.
The coefficients alpha1 to alpha7 correspond to the coefficients a1 to a7. The estimates from the two
models are almost identical. The difference between the outputs is that the garchFit function also
provides the estimate for the mean of the fitted process, the mu parameter. In this case, the
estimate of the mean parameter mu is not statistically
321
significant. This is not surprising, since we applied the ARCH model to the residuals of an
ARMA fit. Based on this output, the fitted ARCH model is provided here on this slide.
This is the output from the joint modeling of ARMA and ARCH applied to the time series of the
return stock price, using the garchFit function in R. The fitted model, as provided by this output, is
shown here on the right. The first part of the output is for the ARMA model, the estimated
ARMA coefficients; hence the output is different from the output of the ARCH fit applied to the residual
process. In fact, when we compare the estimates from the ARMA process that we initially estimated
while disregarding the ARCH portion of the model to the
estimated coefficients from the joint ARMA-ARCH model, we find that the estimated ARMA
coefficients from the joint ARMA-ARCH fit are different from the
estimated coefficients of the ARMA fit alone. On the other hand, if we compare the ARCH model
fitted within the ARMA-ARCH fit versus the one fitted in the previous slide, where we fitted the
ARCH model on the residuals, the fitted models are similar.
In summary, in this lesson I provided a data example to illustrate how to fit an ARCH model.
322
4.2: GARCH Models
The ARCH model, while easy to understand and interpret, has some limitations. The ARCH
model is rather restrictive in that it assumes that positive and negative shocks have
the same effect on volatility, because the conditional variance depends on the squares of the previous shocks. In
practice, however, it is well known that many economic and financial indicators, but also other
phenomena or other measures with time dynamics, respond differently to positive and negative
shocks, particularly because negative and positive shocks may be of a different nature. For
example, some negative shocks can be political or environmental, while some positive shocks
can be economically driven.
Moreover, ARCH models are likely to overpredict the volatility because they respond slowly to
large isolated shocks to the return series, and because, when modeled with ARCH, large
variance effects disappear quickly, certainly no longer than the specified ARCH order.
Given that the impact of high variance may last for a longer period of time, we
would prefer a model in which the impact of large variance can be expressed for much longer
than the order of the model equations, that is, the order of the ARCH.
323
Last, the ARCH model does not provide any new insight for understanding the source of
variations of a time series. It provides a mechanical way to describe the behavior of the
conditional variance. It gives no indication about what causes such behavior to happen.
Such limitations can be addressed with a more general model, GARCH, or extensions of this
model, as we'll see in a different lesson. GARCH stands for generalized autoregressive
conditional heteroskedasticity. The intuition behind this model is that the conditional variance is a weighted average of
past squared residuals, but with declining weights that never go completely to zero.
Hence, it models the impact of large variance for a longer period of time than the specified
orders.
The most widely used GARCH specification asserts that the best predictor of the variance in the
next period is a weighted average of the long-run average variance, the variance predicted for
this period, and the new information for this period, which consists of the most recent squared
residual. Such an updating rule is a simple description of an adaptive or learning behavior.
For ease of understanding, I'm presenting on this slide the simple GARCH(1,1) model, which is
often sufficient to model heteroskedasticity. The conditional variance sigma t squared is modeled
using the lagged squared noise process Zt-1 squared, similar to the ARCH model, but it also includes
an additional lagged relationship with past variability through sigma t-1 squared. This addition
allows for modeling the impact of the variance for longer than an order-one ARCH model.
More generally, a GARCH(m, n), with orders m and n, that is, of AR order m and MA order n, is
when the conditional variance depends linearly on the lagged squared noise process through
324
Zt-1 squared, Zt-2 squared, up to Zt-m squared, but also on the past variability through sigma t-1 squared up to
sigma t-n squared.
Thus the interpretation of the two orders is as follows: m refers to how many autoregressive lags
or ARCH terms appear in the equation while n refers to how many moving average lags are
specified which here is often called the number of GARCH terms. This model was introduced by
Bollerslev in 1986.
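In symbols, the GARCH(m, n) conditional variance just described can be sketched as follows (using gamma for the ARCH coefficients and beta for the GARCH coefficients, consistent with the slides):

$$
Z_t = \sigma_t R_t, \qquad
\sigma_t^2 = \gamma_0 + \sum_{i=1}^{m} \gamma_i Z_{t-i}^2 + \sum_{j=1}^{n} \beta_j \sigma_{t-j}^2 .
$$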
4.2.1 (continued)
If Zt is modeled by a GARCH model, then Zt squared is an ARMA process, but not one with IID, or
independent identically distributed, white noise. The volatility of the Zt process cannot be
observed fully, since it depends on the lagged Zt squared time series; in this example, it
depends on Zt-1 squared. Thus, we express Zt squared as the sum of the volatility and
omega t.
Starting with this formula, we can replace sigma t-1 squared with Zt-1 squared minus omega t-1.
Separating the terms in Zt-1 squared and omega, we have an ARMA(1,1) process. Adding
together the terms with Zt-1 squared then leads to an ARMA model for Zt squared, where
the AR coefficient is gamma 1 + beta 1 and the MA coefficient is -beta 1.
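A sketch of the algebra described above for GARCH(1,1), writing $\omega_t = Z_t^2 - \sigma_t^2$:

$$
Z_t^2 = \sigma_t^2 + \omega_t
      = \gamma_0 + \gamma_1 Z_{t-1}^2 + \beta_1 \sigma_{t-1}^2 + \omega_t
      = \gamma_0 + (\gamma_1 + \beta_1) Z_{t-1}^2 + \omega_t - \beta_1 \omega_{t-1},
$$

so that $Z_t^2$ follows an ARMA(1,1) with AR coefficient $\gamma_1 + \beta_1$ and MA coefficient $-\beta_1$.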
325
Given that Yt is the sum of the mean mu and the GARCH process of orders 1 and 1,
Yt is stationary if its expectation mu is constant for all time points and the variance of Yt, which
is equal to the expectation of Zt squared, is computed as on this slide.
From this derivation, we have a condition relating the marginal, or unconditional, variance of Yt and the
GARCH coefficients gamma 0, gamma 1, and beta 1. Specifically, the unconditional variance is
equal to gamma 0 divided by 1 - gamma 1 - beta 1. For this variance to be finite and positive, as
required for stationarity of Yt, we need 1 - gamma 1 - beta 1 to be positive, or the sum
of the two coefficients to be smaller than 1, and the two coefficients also to be positive.
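Taking expectations in the ARMA(1,1) representation above gives the unconditional variance stated on the slide; a sketch of the calculation:

$$
\mathrm{E}\!\left[Z_t^2\right] = \gamma_0 + (\gamma_1 + \beta_1)\,\mathrm{E}\!\left[Z_{t-1}^2\right]
\;\Longrightarrow\;
\mathrm{Var}(Y_t) = \mathrm{E}\!\left[Z_t^2\right] = \frac{\gamma_0}{1 - \gamma_1 - \beta_1},
$$

which is finite and positive when $\gamma_0 > 0$, $\gamma_1, \beta_1 \ge 0$, and $\gamma_1 + \beta_1 < 1$.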
The last condition is that the covariance between Yt and Yt-1 is 0; that is, Yt and Yt-1 are
uncorrelated. Note that this does not mean they are independent. In fact, for a GARCH model,
they are not independent.
Last, I'll point out that the second and the third conditions will be different for other orders of the
GARCH model.
4.2.1 (continued)
326
These are some important characteristics of a GARCH(1,1) model, one of the most popular
models for modeling heteroskedasticity. The first condition was derived in the previous slide, and
it guarantees stationarity. The second characteristic is that GARCH models capture volatility that
comes in clusters, that is, periods of high volatility followed by periods of low volatility. Last, the
GARCH model has heavy tails, as measured by the kurtosis; that is, the kurtosis is larger than 3 if
we have this condition on the GARCH coefficients. A heavy-tailed distribution means that GARCH
can capture outlying periods of large volatility, or large variability.
327
We can expand the GARCH model by considering joint models, for example ARMA-GARCH
models. For an AR(1)-GARCH(1,1) model, the model equations are on the slide. Yt is modeled
by an AR(1) model with AR coefficient phi, whereas the noise Zt is modeled by a GARCH(1,1)
model, which means that the conditional variance of Zt, sigma t squared, follows the formula of a
GARCH(1,1) model. Another example provided on this slide, for illustration of such joint models,
is the ARMA(1,1) with GARCH(1,2) joint model.
4.2.1 (continued)
The estimation of the parameters of GARCH models uses the maximum likelihood estimation
approach. The parameters to be estimated are the coefficients gamma 0, gamma 1, up to
gamma m, and the coefficients beta 1 up to beta n, for a GARCH model of orders m and n.
If we assume normality, the conditional likelihood function of Yt, given the past history up to time
t, can be expressed as a normal distribution with mean 0, if the mean of Yt is 0, and variance
sigma t squared. The PDF, the probability density function, is as on this slide.
If we replace sigma t squared, the conditional distribution is now a
function of the GARCH coefficients. As we discussed in the first lesson of the first lecture of this
course, we commonly estimate such models by using a conditional likelihood, which is the
product of such conditional distributions. As you would expect, maximizing the resulting
likelihood function is very complicated, requiring numerical algorithms. This is why, when we
estimate GARCH models in R, it can take a minute or two to get the fitted model. There are also
different numerical algorithms that can be used. Some can be slower but reach convergence,
328
some are faster but do not guarantee convergence. This is the case for the two implementations
of the GARCH model discussed in the previous lesson.
In the previous analysis of the PDCE price return, we first fitted an ARMA model with the
selected orders 5 and 6, then applied an ARCH model to the residual process. We also
compared it to a joint ARMA-ARCH model.
Here, we consider modeling the return price using a joint ARMA-GARCH model. We fix the
ARMA orders to the orders (5,6) selected using the AIC approach in the previous lesson.
We'll next select the order of the GARCH model. To select the order, we loop over all
combinations of orders m and n with values between 0 and 3. Note that we consider somewhat
smaller orders for GARCH than we used for ARMA modeling, since a small GARCH model
is commonly sufficient for modeling the volatility.
329
In this implementation, I first define the model's specifics through the ugarchspec R function.
The model's specifics are as follows: the ARMA orders are 5 and 6, the GARCH orders are m
and n, we consider the mean to be nonzero, and we assume the t-distribution instead of the normal
distribution through the specification of the option distribution.model.
Then the model is fitted using ugarchfit, with the formulation specified in the
ugarchspec. Note that this is yet another R function, or another R
implementation, that can be used to fit a joint ARMA-GARCH model. All R functions used on
this slide are available in the rugarch library. This library is one of, if not the, most recent
implementation of the ARMA-GARCH model. It provides many more options for
modeling the conditional mean and the conditional variance jointly, as I will show in the rest of
the lessons of this lecture.
Instead of using AIC, I now use the BIC, since BIC prefers smaller, less complex models.
This is important because we fit an ARMA-GARCH joint model, which is already very complex.
The final order keeps the values of m and n for which the BIC does not get smaller for larger
values of m and n beyond the selected orders. The selected orders in this implementation are
m equal to 1 and n equal to 2.
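A sketch of the BIC-based order search described above using the rugarch package; the data object logret and the fixed ARMA(5,6) orders follow the discussion, while the other names and the solver choice are assumptions:

library(rugarch)
best.bic   <- Inf
best.order <- c(1, 1)
for (m in 0:3) for (n in 0:3) {
  spec <- ugarchspec(
    variance.model     = list(garchOrder = c(m, n)),
    mean.model         = list(armaOrder = c(5, 6), include.mean = TRUE),
    distribution.model = "std")                      # t-distribution
  fit <- tryCatch(ugarchfit(spec, data = logret, solver = "hybrid"),
                  error = function(e) NULL)
  if (!is.null(fit)) {
    bic <- infocriteria(fit)[2]                      # second entry is the BIC
    if (bic < best.bic) { best.bic <- bic; best.order <- c(m, n) }
  }
}
best.order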
However, when fitting ARMA-GARCH simultaneously, jointly, we should determine both the
ARMA and the GARCH orders jointly. If the process is indeed well approximated by the
ARMA/GARCH model, then considering the conditional mean model ARMA while neglecting the
330
conditional variance model GARCH, as we did in the first implementation, will lead to inefficient
estimators. Similarly, when considering the conditional variance model, you should not neglect
the model for the conditional mean. This is because neither a conditional mean model, nor the
conditional variance model can be estimated consistently if taken separately. Joint estimation
will typically be more efficient, and this is why it is preferred.
Unfortunately, the task of jointly determining the ARMA and GARCH orders is difficult, as it
requires fitting many more models. For example, if we assume orders from 0 to 5 for both
orders of the ARMA, the AR and MA orders, for the conditional mean, and orders from 0 to 3 for
both orders of the GARCH for the conditional variance, we would have to fit 576 models.
Fitting an ARMA-GARCH model is computationally expensive, and thus, without distributed
computing, fitting all models would take more than a day of computation time.
Alternatively, we can select the orders for ARMA and GARCH separately, but refine the order
selection through multiple iterations. Thus, here, we'll apply order selection for ARMA given
the orders of GARCH selected in the previous slide. That is, we fix the orders for the GARCH
to be equal to 1 and 2, and now select the orders p and q for ARMA. I applied the same code as
in the previous slide, except that now I'm varying the orders for ARMA. The selected orders for
ARMA are 0 and 0; that is, the conditional mean does not depend on time.
Next, I refine the orders of GARCH, assuming an ARMA(0,0) as selected in the previous slide.
The selected GARCH is the classic GARCH(1,1), commonly used to model stock price returns.
331
Let's now compare the three models. The three models are as follows: the joint model
where the ARMA orders are (5,6) and the GARCH orders are (1,2); the second model, where
the ARMA orders are (0,0) and the GARCH orders are (1,2); and the third model, where the
ARMA orders are (0,0) and the GARCH orders are (1,1). We'll compare them with respect to
AIC, BIC, and other information criteria.
We can compute the information criteria of a model estimated using ugarchfit from the
rugarch library using the infocriteria R function. This function applies only to
models fitted with ugarchfit from the rugarch library.
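A sketch of the comparison, assuming fit1, fit2, and fit3 are the three ugarchfit objects:

cbind(infocriteria(fit1), infocriteria(fit2), infocriteria(fit3))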
332
The output is here. The first two values are the Akaike information criterion, or AIC, and
the Bayes information criterion, or BIC. In this example, the values across
the criteria are very similar for the three models. Since we prefer less complex
models, we would choose the last model, where we have an ARMA with orders (0,0) and a
GARCH with orders (1,1).
333
These are the ACF plots for the residuals and the squared residuals of the selected model, the
less complex model. Based on these plots, the ACF plot of the residuals resembles the ACF of white noise.
However, this is not the case for the squared residuals, suggesting that a higher order
GARCH model may be considered.
The R code on this slide is for obtaining the predictions for the conditional mean and for the
conditional variance, or so-called volatility. The prediction is performed one lag ahead on a rolling
basis, starting at the beginning of 2017. The code on the slide is for the first model, but I
applied the same code to obtain the predictions using the other two models also. The output
consists of forecasts of the return price for the conditional mean and of the conditional
variance.
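A sketch of a one-lag-ahead rolling forecast with rugarch, as described above; the test period, the spec object from the earlier sketch, and the variable names are assumptions, and refitting at every time point is computationally heavy:

n.total <- length(logret)
n.test  <- sum(index(logret) >= as.Date("2017-01-01"))   # points to predict
mean.fc <- var.fc <- rep(NA, n.test)
for (i in 1:n.test) {
  train <- logret[1:(n.total - n.test + i - 1)]          # data up to the prediction point
  fit   <- ugarchfit(spec, data = train, solver = "hybrid")
  fc    <- ugarchforecast(fit, n.ahead = 1)
  mean.fc[i] <- fitted(fc)                               # one-step-ahead conditional mean
  var.fc[i]  <- sigma(fc)^2                              # one-step-ahead conditional variance
}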
334
These are the prediction accuracy measures for the conditional mean based on the three
models. Model one performs best across all measures.
This is the R code we can use to compare the predictions to the observed data using a
visual display.
335
We see that the predictions for the conditional mean are close to the zero line. This is because
in this example the volatility predominates over the conditional mean.
In this code, I'm contrasting the squared time series with the predictions for the conditional
variance.
336
And this is the plot. The predictions capture some of the volatility. However, all models
considered underpredict the conditional variance, the volatility.
337
4.2.3(7) Other GARCH Models
In this lesson I will conclude the introduction of the concepts of GARCH modeling with other
extensions of the GARCH model.
A first extension of the GARCH was introduced because of one of the limitations of the GARCH model.
Particularly, the residuals from GARCH models often display fat tails, suggesting that a
distribution with fatter tails than the normal distribution should be considered, for example the
t-distribution. We have seen this illustrated in the GARCH modeling with the
t-distribution for the return price of the PDCE stock.
In 1991, Daniel Nelson took this idea further by proposing a model with a transformation of the
conditional variance, specifically the log transformation. Thus, according to this model, also
called EGARCH, or exponential GARCH, the log of sigma t squared is modeled using a
GARCH formulation. This transformation also dealt with asymmetric tails, since the model
allows the variance to have asymmetric behavior, reflecting asymmetries in the data. This
addresses another limitation of GARCH, particularly that GARCH models the impact of
positive and negative shocks similarly.
338
Another model which allows for asymmetry in the impact of negative and positive shocks on the
volatility is APARCH. The intuition of this model is that, for some time series, large negative
observations appear to increase volatility more than do positive observations of the same
magnitude. This is called the leverage effect. The solution is to replace the square function with
a flexible class of non-negative functions that includes asymmetric functions. The APARCH
model is an ARCH model formulation for the conditional standard deviation.
Specifically, it models a power of the standard deviation. If delta, the power, is equal to two and
the gamma coefficients in the APARCH model are zero, then we have the GARCH model.
339
Another extension of GARCH is the integrated GARCH, which applies when the conditional
variance is nonstationary. The difference between GARCH and IGARCH is that in IGARCH the GARCH
parameters add to one. The most popular is the IGARCH with orders 1 and 1, as presented
on the slide, where the coefficient for Zt-1 squared and the coefficient for the
lagged variance sigma t-1 squared add to 1. EGARCH, APARCH, and IGARCH are
just three of the many extensions. The rest of the slide points to many other references
where such extensions have been introduced.
340
4.3: Case Study 1: Modeling Currency Exchange
For this analysis, I selected to analyze the exchange rates between the US dollar and three
other currencies: the Euro, which is the currency in most countries of the European Union, the
Brazilian Real, and the Chinese Yuan. The exchange rate between the US dollar and the Euro
should best reflect the Purchasing Power Parity theory. Brazil is a developing
country, and hence there may be some fluctuations in the exchange rate due to
political instabilities. Last, I selected the Chinese Yuan because of the recent debate on currency
manipulation.
Generally, the study of temporal dependencies in exchange rates can also be used for
forecasting daily fluctuations in rates for investments. Large contracts between countries can be
placed given the level of exchange rates. Moreover, investments in a country are also highly
dependent on the volatility of the exchange rate with the currency of that country.
341
Data for the exchange rate between the Euro and USD is available only since 1999, because the Euro was
introduced in the market in January 1999. Data for the daily exchange rate for the Chinese Yuan is
available since 1999, or at least this is the data I was able to find. And I only found daily data on
the exchange rate for the Brazilian Real starting with 2000.
Our first step is to read the data using the read.csv R function. Each currency rate is provided in
a separate file, but all files have the same structure, and hence the R code in this
slide, although written for the Euro only, applies to the other currencies as well. Note that I'm using here yet
another approach to process dates, which is different from the commands used before. This is
to show you that there are many R functions for processing time series data. The code on this slide
also provides the R lines of code for plotting the time series data along with the first-order
differenced time series.
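A sketch of this step for one currency; the file name and column names are assumptions:

euro <- read.csv("usd_eur_rate.csv", header = TRUE)
euro$Date <- as.Date(euro$Date, format = "%Y-%m-%d")     # yet another way to process dates
plot(euro$Date, euro$Rate, type = "l",
     xlab = "Date", ylab = "EUR per USD")                # time series plot
diff.rate <- diff(euro$Rate)                             # first-order differenced series
plot(euro$Date[-1], diff.rate, type = "l",
     xlab = "Date", ylab = "Differenced rate")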
342
The time series plots for the three currencies are here. The exchange rate between USD and
the Euro starts at a rate of 0.84, meaning 0.84 Euros for one dollar, and continues to increase in
favor of the US dollar, reaching a level as high as 1.25 in August 2000, then slowly decreasing over
a long period of time with large volatility. Interestingly, at the end of the period of the study, the
exchange rate reached a level similar to the rate when the Euro was introduced.
Similarly, the rate between the US dollar and the Brazilian Real had ups and downs, varying over a
large range, a much larger range than the exchange rate between the US dollar and the Euro, pointing
to higher volatility and hence higher risk.
Last, the exchange rate between the US dollar and the Chinese Yuan has a surprising pattern. For many
years, the exchange rate presented little variation and change, until a big jump at the beginning of
1994. It then again showed low variation and change until 2007, then slowly decreased until the end of
2015, when it started to increase again.
343
Differences in the volatility are better reflected in the differenced time series. Their plots are on
this slide.
The smallest volatility is for the currency exchange between the US dollar and the Euro. For this
exchange rate we see some clusters of high volatility, but also an outlying volatility value in 2004. We
observe a similar clustering-of-volatility pattern for the exchange rate between the US dollar and the
Brazilian Real, although the volatility is several times larger.
For the currency exchange between the US dollar and the Chinese Yuan, there is an extreme value in
1994, and thus the volatility is hard to evaluate because of this outlier.
344
These plots are again for the differenced time series, but re-scaled so that all
three plots have the same scale. They show the differenced time series across all
three currencies. Once the 1994 outlier for the Chinese Yuan is removed, the largest volatility is
now for the Brazilian Real. In comparison, on the same scale, the volatility for the exchange rate
for the Euro is quite small. However, for the Chinese Yuan we identify periods with extremely
small changes over long periods of time, but also periods of high volatility. Extremely small
changes over long periods could be interpreted as control over the currency's purchasing power.
However, we do note that the volatility in the exchange rate for the Chinese Yuan is larger than
for the Euro in recent years.
345
Next, we will assess the autocorrelation and partial autocorrelation plots for the three time series.
The ACF plots are on the first row and the PACF plots are on the second row. The ACF plots
point to nonstationarity, in the sense of a trend, for all three time series. The PACF plots only
show large partial autocorrelation at lag one.
However, the ACF plots for the differenced time series resemble those of white noise, with the
sample autocorrelation being small, within the confidence band, for lags one and higher. The
sample partial autocorrelation is small for all lags. Thus, it is plausible
for the differenced time series to be uncorrelated for all three currencies.
In summary, in this lesson I introduced an example that will be used to illustrate modeling
heteroskedasticity. The example is related to the study of the exchange rate between the US
dollar and three other currencies.
346
4.3.2 ARMA Modeling
In this lesson, I will first focus on the application of ARMA for modeling the
conditional mean within this case study.
The R code on this slide fits an ARMA model to the differenced time series of one currency
and is generalizable to all three currencies. I applied here ARMA modeling to the differenced
time series, instead of ARIMA modeling to the time series itself, since the joint ARMA-GARCH model
fitting approach in R allows for ARMA, not ARIMA, modeling.
The rest of the R code performs the residual analysis. This code applies, again, to all
three currencies. The R code for the three currencies is provided in three different
files, but the results will be presented together for comparison in this lecture.
347
The upper plots are the residual plots, and the lower plots are the squared residual plots
for all three currencies. The residuals for the Euro show some large variability clusters extending
over longer periods of time, while the residuals for the Brazilian Real present larger variability
clusters over shorter periods of time.
Moreover, the residual plot for the Chinese Yuan presents some extreme values, or outliers.
Because of these outliers, it is difficult to assess the variability over time. Thus, across these
currencies, the volatility shows different patterns: long versus short periods of high
volatility, and outliers in volatility. Temporal patterns in the squared residuals are difficult to
evaluate because of the outliers. Overall, we identify non-constant variance in the residuals
for all three currencies.
348
These are the ACF plots for the residuals and squared residuals. While the ACF plots of
the residuals resemble the ACF of white noise, the ACF plots of the squared residuals do
not for the Euro and the Brazilian Real, indicating that the squared
residuals are correlated.
Last, I applied the testing procedure for uncorrelated data using the Box.test R command, for
serial correlation and for an ARCH effect in the residuals. The R command lines are for the Euro
currency; since the ARMA model has orders six and three, we use nine for the degrees of freedom of
the fit (fitdf) and ten for the number of lags. This input will be different for the other two currencies,
since the ARMA models fitted for the other two currencies have different orders.
The p-values for the serial correlation test are large for the Euro and the Chinese Yuan, indicating no serial
correlation, but the p-value is small for the Brazilian Real. The p-values for the ARCH effect are small for
the Euro and the Brazilian Real, suggesting correlated squared residuals, but not for the Chinese Yuan.
A potential explanation for not detecting an ARCH effect in the residuals for the Chinese
Yuan is that the variability in the residuals presents outliers but is otherwise constant over time.
We'll explore the volatility of the three currencies in the next lesson.
349
4.3.3 ARCH Modeling
This lecture is on a case study for modeling heteroskedasticity, and I'll continue this case study
on the modeling of the exchange rate between the Euro and the US dollar using an ARCH model.
By checking the PACF of the squared residuals from the ARMA model fit for the currency
exchange rate for the Euro versus the US dollar, we would identify an order equal to eight or nine, since
the first 8 to 9 PACF values are outside of the confidence band. Note that we use the PACF to
identify the AR order for a time series. I consider an ARCH model of order eight.
I'm also using here the simple R command garch, from the tseries library, since we are
considering a simple ARCH model.
The rest of the R code is for evaluating the goodness of fit of the ARCH model, similar to
the residual analysis performed for the residuals of the ARMA models in the previous lesson. Note
that this is the R code for the Euro vs. USD exchange rate. A similar code can be used
for the analysis of the ARMA residuals of the Brazilian Real or the Chinese Yuan.
350
This is the summary of the fit for the ARCH model applied to the ARMA residuals for the Euro-
USD exchange rate. Recall that the ARCH order is eight, and thus we have eight AR
coefficients, but also a0, which is the estimate of the intercept parameter. All
coefficients are statistically significant, since the p-values of the tests for significance are all very
small, indicating that an order of eight for the ARCH model is appropriate. A larger-order ARCH model
may be explored.
The tests in the second part of the output are for serial correlation in the residuals and in the
squared residuals. According to these tests, we reject the null hypothesis of uncorrelated residuals,
but we do not reject the null hypothesis of uncorrelated squared residuals.
351
This is the output of the summary fit for the ARCH model applied to the ARMA residuals for the
Brazilian Real. For this model, I used an ARCH of order six. All coefficients are statistically
significant, since the p-values of the tests for significance are all very small, indicating that an
order of six for the ARCH model is appropriate. A larger-order ARCH model
could be explored.
Similarly to the model fit for the Euro-USD exchange rate in the previous slide, we reject the
null hypothesis of uncorrelated residuals, but we do not reject the null hypothesis of
uncorrelated squared residuals.
The ACF plots of the residuals and squared residuals resemble the ACF plot of white noise,
except for a large ACF value at lag nine for the squared residuals of the ARCH model for the
Brazilian Real. Thus, these plots indicate a good fit.
352
4.3.4 GARCH Modeling
In this lesson, I'll continue the analysis of the exchange rates with joint ARMA-GARCH
modeling for all three currencies.
Recall that the selected orders for ARMA for the Euro-USD exchange rate differenced time series
were an AR order equal to 6 and an MA order equal to 3. Assuming the ARMA orders are fixed for
the conditional mean, we now select the order of the GARCH model for the conditional variance,
assuming an ARMA-GARCH joint model fit. Note that this time, I'm selecting the orders on the
training data, which is a subset of the initial time series without the last months of the time
series, July and August. The last two months of data are reserved for testing the prediction accuracy
of the model.
To select the GARCH orders, we loop through all combinations of m and n orders taking values
between 0 and 3, fit the ARMA with orders 6 and 3 and a GARCH model with orders m and n,
and then select the orders m and n of the model with the smallest BIC. Note that I am using
here the BIC, or the Bayesian information criterion, to select a parsimonious GARCH model,
which is preferable when fitting an ARMA-GARCH joint model, which is already very complex.
The selected orders for the GARCH models for the conditional variance are m = 1 and n = 2.
We'll continue with refining the orders for ARMA, fixing the orders for GARCH.
353
Here, I'm selecting p and q for the ARMA model for the conditional mean, but this time
fitting a joint ARMA-GARCH model. Recall that fitting ARMA-GARCH simultaneously also
means determining both the ARMA and the GARCH orders simultaneously. Neither the conditional
mean model nor the conditional variance model can be estimated consistently if taken
separately. Joint estimation will typically be more efficient, and that is why it is preferred.
Unfortunately, the task of jointly determining the ARMA and GARCH orders is difficult. Thus,
here, we select the orders of ARMA and GARCH separately, by fine-tuning the order selection.
We assume that the orders of GARCH are as selected in the previous slide, and allow only the
ARMA orders to vary within the range 0 to 6. The selected orders are now AR = 0 and MA = 0,
suggesting that the conditional mean of the differenced process of the Euro-USD currency
exchange rate does not depend on time when we jointly model the conditional mean and
the conditional variance.
I will note here that, depending on the solver, there may be combinations of orders for which you
will get an error message, due to the lack of convergence of the numerical algorithm. If
that is the case, the for loop will stop, and you have to restart the search excluding the
combinations of orders for which convergence is not attained. This was the case for p = 1 and q
= 2 in this model selection implementation.
354
Last, assuming the ARMA orders are fixed to the values selected for ARMA in the last step, we
now fine-tune the orders of GARCH by selecting again among all combinations of m
and n taking values between 0 and 3. The orders selected are m = 1 and n = 2, thus with no
change from the orders selected when the ARMA orders were 6 and 3.
For all three currencies, while we start with a complex ARMA model, the selected ARMA, after
accounting for the joint modeling with GARCH for the conditional variance, reduces to an
ARMA(0,0) for the first two currencies and an ARMA(1,0) for the third currency. There is also
no improvement in the order selection of GARCH when considering the more complex ARMA
versus the less complex ARMA. And thus, for all three currencies, we'll only compare two
models.
355
Next, I compare the two models using multiple information criteria. Again, for the euro-USD
exchange rate time series, the two models to be compared are the ARMA with orders 6, 3 and
GARCH with orders 2, 1, and the ARMA with orders 0, 0 and GARCH with orders 2, 1. The R code
in this slide is for the comparison of these two models. It will be similar for the other two
currencies, except for the ARMA and GARCH order specification.
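A sketch of how such a comparison could look in R is shown below; the GARCH order convention (2, 1) follows the wording on the slide and, together with the object name train, is an assumption for illustration.

# Fit the two candidate models on the training data and compare their
# information criteria (AIC, BIC, Shibata, Hannan-Quinn) side by side.
spec.1 <- ugarchspec(variance.model = list(garchOrder = c(2, 1)),
                     mean.model     = list(armaOrder = c(6, 3)))
spec.2 <- ugarchspec(variance.model = list(garchOrder = c(2, 1)),
                     mean.model     = list(armaOrder = c(0, 0)))
fit.1 <- ugarchfit(spec.1, data = train, solver = "hybrid")
fit.2 <- ugarchfit(spec.2, data = train, solver = "hybrid")
cbind(infocriteria(fit.1), infocriteria(fit.2))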
These are the outputs for the comparison of the information criteria for the two models for each
of the three currencies. Across all currencies, both models perform similarly, suggesting that the
less complex model with low AR and MA orders for the conditional mean performs similarly
to the more complex ARMA with higher AR and MA orders.
356
4.3.5 Prediction
In this lesson I will continue the analysis of the exchange rate with a focus on the prediction.
For this example, we predict the currency exchange rate for the last two months of the time
period, the months of July and August 2017. The predictions are one lag ahead, obtained on a
rolling basis. That is, we loop through all the time points to be predicted. We set the past data
as being the time series observed up to the time point at which a prediction will be made. Then
we fit the ARMA-GARCH model with the specifications provided for the two models in the
previous lesson, but this time the model is fitted on the time series observed up to the time point
at which a prediction will be made. Thus, with each prediction, the training data set changes.
After the model fit, the one-lag predictions for both the mean and variance are obtained using
the ugarchforecast R function. We thus obtain predictions at each time point within the time
period of the two months. We apply the same code for both models, and for all three currencies.
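Below is a minimal sketch of this rolling scheme for one of the models; the objects eur.diff, train, test and spec.2 follow the earlier sketches and are illustrative assumptions. Running the same loop with the other specification gives the forecasts for the first model.

# One-lag-ahead rolling forecasts: refit the model on all data observed up to
# each forecast origin, then store the forecasted mean and volatility (sigma).
n.test  <- length(test)
mean.fc <- sigma.fc <- numeric(n.test)
for (i in 1:n.test) {
  past <- eur.diff[1:(length(train) + i - 1)]      # data up to the forecast origin
  fit  <- ugarchfit(spec.2, data = past, solver = "hybrid")
  fc   <- ugarchforecast(fit, n.ahead = 1)
  mean.fc[i]  <- as.numeric(fitted(fc))            # forecast of the conditional mean
  sigma.fc[i] <- as.numeric(sigma(fc))             # forecast of the conditional st. dev.
}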
357
The prediction accuracy measures for the forecast of the mean for the differenced
Euro-USD currency time series are on this slide. The measures are computed for both models,
the first model being the ARMA with orders 6, 3 and GARCH with orders 2, 1, and the second
model being ARMA 0, 0 and GARCH 2, 1. The prediction accuracy measures are lower for
the second model across all measures, indicating that the model where the conditional mean
does not depend on time performs better than the more complex model for the conditional
mean.
As pointed out in the previous lectures, the precision measure is generally most appropriate for
comparing prediction accuracy within this type of modeling. The smaller this measure is, the
more accurate a prediction is. Here the measure is approximately 1.04, indicating that the ratio
of the variability in the prediction to the variability in the new data is 1.04. Thus,
the variability in the prediction is similar to that in the data.
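As a hedged illustration, such measures could be computed as below; the precision measure (PM) here follows the description above, the variability of the prediction errors relative to the variability in the new data, and the object names come from the earlier sketches.

obs <- as.numeric(test)                          # observed differenced test data
err <- mean.fc - obs                             # prediction errors for one model
mae  <- mean(abs(err))                           # mean absolute error
rmse <- sqrt(mean(err^2))                        # root mean squared error
pm   <- sum(err^2) / sum((obs - mean(obs))^2)    # precision measure
c(MAE = mae, RMSE = rmse, PM = pm)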
358
These are the prediction accuracy measures for the mean prediction for the Brazilian Real-US
Dollar currency exchange. Similar results are noted as for the Euro-USD currency,
although the prediction accuracy is slightly worse for the Brazilian Real currency than for the Euro.
These are the prediction accuracy measures for the mean prediction for the Chinese Yuan-USD
currency exchange. Again, similar results are noted, with the prediction accuracy slightly
worse than for the other two currencies.
359
This is the R code for comparing the mean predictions to the observed time series by
overlaying the predicted means for the two models on the observed time series.
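An illustrative version of that overlay is sketched below; mean.fc.model1 and mean.fc.model2 stand for the rolling mean forecasts from the two models and are hypothetical names.

# Observed differenced series in black, with the two models' predicted means on top.
plot(index(test), as.numeric(test), type = "l", col = "black",
     xlab = "Time", ylab = "Differenced exchange rate")
lines(index(test), mean.fc.model1, col = "red")     # ARMA(6,3)-GARCH forecasts
lines(index(test), mean.fc.model2, col = "blue")    # ARMA(0,0)-GARCH forecasts
legend("topleft", c("Observed", "Model 1", "Model 2"),
       col = c("black", "red", "blue"), lty = 1, cex = 0.8)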
Here are the plots. The predicted means are close to zero, with some variations in the
predictions performed with the more complex ARMA-GARCH model for the Brazilian Real and
Chinese Yuan currencies. Note that the mean predictions do not capture the variations in the
differenced time series, since these variations are due to the variance, not to the mean of the
process.
360
Last, we'll take a closer look at the predictions for the volatility. The comparison now is between
the squared time series data and the predictions of the conditional variance or volatility.
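A sketch of this comparison for one currency is shown below; sigma.fc.model1 and sigma.fc.model2 denote the rolling volatility forecasts from the two models and are again hypothetical names.

# Squared observed test data in black against each model's forecasted
# conditional variance (the squared sigma forecasts).
plot(index(test), as.numeric(test)^2, type = "l", col = "black",
     xlab = "Time", ylab = "Squared differenced series")
lines(index(test), sigma.fc.model1^2, col = "red")
lines(index(test), sigma.fc.model2^2, col = "blue")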
The plots for the three currencies are here. The volatility is not well captured by the forecasted
conditional variance for the Euro currency, but it is captured much better for the other two
currencies, particularly for the Chinese Yuan.
To conclude, we have seen that a joint ARMA-GARCH model performs differently across
different time series. Note also that the volatility for the Euro currency was the smallest among
the three currencies, while the predicted volatility captures the conditional variance much better
in those time series with higher volatility.
361
The GARCH model introduced in this section provides the basis for the development of many
other extensions of heteroskedasticity models. In the last lesson in this lecture introducing
GARCH models, I pointed out several other extensions, such as exponential GARCH or
integrated GARCH. These extensions and others provided on this slide are meant to deal with
different distributions of the volatility: for example, asymmetric response to positive and
negative shocks, large or small kurtosis (meaning more or less heavy-tailed distributions),
or non-linear relationships with the past conditional variance, among others.
The good news is that many of these models are already implemented in the R statistical software,
and most of them can be applied through the ugarchspec() and ugarchfit() functions from the
rugarch library used to fit the joint ARMA-GARCH models. In order to specify the other
model types, you will need to specify the model type in the model= option of ugarchspec(). The
default is the standard GARCH model.
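For example, to the best of my understanding of the rugarch interface, an exponential GARCH or integrated GARCH specification can be requested through the model entry of the variance.model list, as sketched below; the orders shown are placeholders.

# Exponential GARCH (eGARCH) specification for the conditional variance
spec.egarch <- ugarchspec(variance.model = list(model = "eGARCH", garchOrder = c(1, 1)),
                          mean.model     = list(armaOrder = c(0, 0)))
# Integrated GARCH (iGARCH) specification
spec.igarch <- ugarchspec(variance.model = list(model = "iGARCH", garchOrder = c(1, 1)),
                          mean.model     = list(armaOrder = c(0, 0)))
# These are then fitted with ugarchfit() exactly as before, e.g.:
# fit.egarch <- ugarchfit(spec.egarch, data = train, solver = "hybrid")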
362
Let's compare these models using the differenced currency exchange rate time series for the
Euro, US dollar. Because we have learned from the joint modeling of the conditional mean and
conditional variance, that the conditional mean does not vary with time, I focus here on the
modeling of the conditional variance. The code for forecasting the conditional variance is not
different from the one provided in the previous lesson, and thus is not provided here.
This is the plot of the predicted variance, along with the observed squared time series
shown in black. We see that the other models for heteroskedasticity have not improved the
estimation of the volatility, since none of the predictions captured the ups and downs in the
squared differenced time series. It's possible that such variations are due to randomness alone
rather than serial correlation.
I'll note here that running the forecasting code with all five models, and for a large number of
time points at which we obtain predictions, took more than a day of computation time. Such
models are computationally expensive, and hence, real-time predictions are not possible
even using more efficient computations. This is one drawback in the implementation of
heteroskedasticity models in general.
363