1. Introduction
have been made with environmental data. We have time series from different locations and we want to form groups of locations with the same behavior. See for instance Bohte et al. (1980), Cowpertwait and Cox (1992), Gantert (1994), Walden (1994) and Macchiato et al. (1995). There are several problems not completely solved in the application of cluster analysis to time series. The standard approach for splitting a sample of multivariate data into clusters is to assume that the multivariate observations have been generated by a mixture of multivariate normal distributions with different means and covariance matrices and unknown mixture probabilities. If the number of populations were known, the parameters could be estimated by the EM algorithm or by MC$^2$ (Markov chain Monte Carlo) Bayesian methods. As the number of populations is unknown, a model selection procedure, such as the BIC or AIC criterion, is applied to select the number of populations involved. The generalization of this approach to time series is to assume that the data have been generated by some set of possible multivariate time series models or data generating processes, $M_1, \ldots, M_k$, with unknown probabilities, and then the cluster problem is closely related to the discrimination problem. However, this approach has not yet been fully explored.
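As an illustration of this standard static approach, the following minimal Python sketch fits Gaussian mixtures by the EM algorithm for several candidate numbers of populations and selects among them by BIC; the data and all settings are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical data: two bivariate normal populations with different means.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(4.0, 1.5, size=(100, 2))])

best_k, best_bic = None, np.inf
for k in range(1, 6):                    # candidate numbers of populations
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)  # EM algorithm
    bic = gm.bic(X)                      # BIC criterion for this k
    if bic < best_bic:
        best_k, best_bic = k, bic
print("selected number of populations:", best_k)
```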
The problem of dimensionality reduction is very important for dynamic data since for vector ARMA models, as well as for simultaneous equations econometric models, the number of parameters to estimate grows rapidly with the number of observed variables. An interesting extension of the idea of principal components for time series is the canonical analysis of Box and Tiao (1977). Instead of finding linear combinations of maximum (or minimum) variability, these authors studied the problem of finding linear combinations of maximum (or minimum) predictability. They showed that the canonical variables are useful for understanding and simplifying the dynamic structure present in the vector of time series. Factor analysis of time series was studied by Geweke and Singleton (1981), Brillinger (1981), Engle and Watson (1981), Molenaar (1985), Peña and Box (1987), Molenaar et al. (1992) and Peña and Poncela (1998), among others. An alternative approach to dimension reduction is the reduced rank approach of Velu et al. (1986) and Ahn and Reinsel (1988). In the nonstationary case, estimating the nonstationary factors is equivalent to testing for cointegration in the econometrics field (whose vast literature we do not pretend to review here), since the number of cointegration relations among the components of a vector of time series is the dimension of the vector minus the number of nonstationary common factors (see Escribano and Peña, 1994). An alternative useful approach for model simplification is the scalar components approach of Tiao and Tsay (1989). Finally, the state space approach to time series includes procedures for dimension reduction (Hannan and Deistler, 1988; Aoki, 1990).
This paper describes some of the developments of these procedures in the time domain. The reader interested in the developments in the frequency domain is advised to read chapter 5 of Shumway and Stoffer (2000), which contains a good review of this field. The article is organized as follows. In the next section the problem of discrimination in time series is presented. The standard discriminant
analysis is seen as a model fitting exercise and it is shown that in practice, when the parameters are unknown, discriminant analysis for time series is closely related to the model selection problem, which has been the subject of an important area of research in time series. In Section 3 we present the clustering problem and discuss some of the measures of distance among time series that have been proposed in the literature. Some suggestions for further research in this field are also included. As the literature on model simplification and dimension reduction is very large, we have decided to consider in Section 4 only the extensions of standard multivariate methods. Thus, in that section we present the extension of the principal component idea of Box and Tiao (1977) and the dynamic factor model. The relationship between both approaches is discussed and we also relate the nonstationary factor model to the cointegration literature. Section 5 presents some concluding remarks.
2. Discrimination in Time Series

The classical approach assumes that both covariance matrices are equal, $\Sigma_1 = \Sigma_2 = \Sigma$, but the means are unequal. Thus, we assume that the difference between the two marginal means is due to some deterministic function. For instance, if $\mu_{ji} = b_{0j} + b_{1j} i$, the series have different deterministic trends, and if $b_{1j} = 0$ the series have different marginal means. The Neyman-Pearson lemma for the hypotheses $H_1 : x \in M_1$ versus $H_2 : x \in M_2$ leads to the following rule for accepting $H_1$:

$$\frac{p(x \mid H_1)}{p(x \mid H_2)} > K, \qquad (2)$$

for some value $K$ that takes into account the probabilities of misclassifying the time series. Assuming that the costs of misclassification are the same and that the a priori probabilities of each model are also the same, we will classify the observation in the model that has the maximum likelihood. This is equivalent to accepting the hypothesis $H_1$ if
$$\log p(x \mid H_1) > \log p(x \mid H_2). \qquad (3)$$

For Gaussian models, the quadratic form in the likelihood can be written as a sum of squared standardized one-step-ahead forecast errors,

$$x' \Sigma_j^{-1} x = \sum_{t=1}^{T} a_{jt}^2, \qquad (5)$$

and the likelihood will only depend on the one-step-ahead forecasting errors, which are equal to the residuals $a_t$. Note that for linear time series we can write the zero mean process $\pi(B) e_t = \epsilon_t$, where $\pi(B) = 1 - \pi_1 B - \pi_2 B^2 - \cdots$, as $\Pi e = \epsilon$, where

$$\Pi = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ -\pi_1 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ -\pi_{T-1} & -\pi_{T-2} & \cdots & 1 \end{pmatrix}.$$
Suppose that the covariance matrix of $\epsilon$ is $\sigma^2 I$. Then, calling $\Sigma$ the covariance matrix of $e$, we have

$$\Sigma = \sigma^2 (\Pi' \Pi)^{-1}$$

and, therefore,

$$\Sigma^{-1} = \frac{1}{\sigma^2} \Pi' \Pi$$

and

$$e' \Sigma^{-1} e = e' \frac{1}{\sigma^2} (\Pi' \Pi) e = \frac{1}{\sigma^2} \sum_{t=1}^{T} \epsilon_t^2 = \sum_{t=1}^{T} a_t^2,$$

in agreement with (5). Thus, discriminant analysis can be viewed as assigning the observed time series $x$ to the model (population) that, when fitted to the time series, produces the smallest one-step-ahead squared forecast error.
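To illustrate this view of discriminant analysis, the following minimal sketch (with illustrative AR(1) candidate models and simulated data, not the authors' example) assigns a series to the candidate model with the largest Gaussian likelihood, computed from its one-step-ahead forecast errors.

```python
import numpy as np

def minus2loglik_ar1(x, phi, sigma2):
    """-2 log-likelihood (up to a constant) of x under a zero-mean AR(1)."""
    a = x[1:] - phi * x[:-1]             # one-step-ahead forecast errors
    return len(a) * np.log(sigma2) + np.sum(a ** 2) / sigma2

rng = np.random.default_rng(1)
x = np.zeros(300)
for t in range(1, 300):                  # simulate a series from model M1
    x[t] = 0.8 * x[t - 1] + rng.normal()

models = {"M1": (0.8, 1.0), "M2": (-0.5, 1.0)}   # (phi, sigma2) of each model
scores = {name: minus2loglik_ar1(x, *pars) for name, pars in models.items()}
print("assigned to:", min(scores, key=scores.get))
```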
An alternative interpretation is in terms of interpolation. The interpolator of each observation given the rest of the series can be written as $\hat{x}_t = -\sum_{j \neq 0} \rho_j^D x_{t-j}$, where $\rho_j^D$ are the coefficients of the dual autocorrelation function of the model, given by

$$\rho^D(B) = \frac{\sigma^2 \pi(B) \pi(F)}{V_D},$$

with $\pi(B)$ the autoregressive form of the model, $V_D = \sigma^2 \sum_{i=0}^{\infty} \pi_i^2$ and $F = B^{-1}$, the forward operator. Then, Galeano and Peña (2000) showed that

$$x' \Sigma_j^{-1} x = \frac{(x - \hat{x})'(x - \hat{x})}{MSE(\hat{x})},$$
where $MSE(\hat{x})$ denotes the mean square interpolation error. That is, the series $x_t$ is assigned to the model that produces the smallest interpolation error or, in other words, the model that best fits the data.
As the distribution of $Q(x)$ is difficult to find, Shumway (1982) suggests that under $H_j$, $j = 1, 2$, and for large values of $T$, $Q(x)$ can be approximated by a normal distribution with mean $\mathrm{tr}\left((\Sigma_2^{-1} - \Sigma_1^{-1}) \Sigma_j\right)$ and variance $2 \, \mathrm{tr}\left(((\Sigma_2^{-1} - \Sigma_1^{-1}) \Sigma_j)^2\right)$, where $\mathrm{tr}$ denotes trace. The main inconvenience of this method is that the eigenvalues must be obtained numerically, and since the matrices $(\Sigma_2^{-1} - \Sigma_1^{-1}) \Sigma_j$ are very large, a numerical solution is very difficult to obtain.
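The following minimal sketch computes this mean and variance numerically for two illustrative AR(1) covariance matrices; the models and the sample size are assumptions for illustration only.

```python
import numpy as np

def ar1_cov(phi, T, sigma2=1.0):
    """T x T covariance matrix of a stationary AR(1) process."""
    idx = np.arange(T)
    return sigma2 / (1 - phi ** 2) * phi ** np.abs(idx[:, None] - idx[None, :])

T = 50
Sigma1, Sigma2 = ar1_cov(0.8, T), ar1_cov(-0.5, T)
D = np.linalg.inv(Sigma2) - np.linalg.inv(Sigma1)

for j, Sigma_j in enumerate((Sigma1, Sigma2), start=1):
    M = D @ Sigma_j
    mean_Q = np.trace(M)                 # tr((Sigma2^-1 - Sigma1^-1) Sigma_j)
    var_Q = 2 * np.trace(M @ M)          # 2 tr(((Sigma2^-1 - Sigma1^-1) Sigma_j)^2)
    print(f"under H{j}: mean = {mean_Q:.2f}, variance = {var_Q:.2f}")
```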
When the covariance matrices are different, the optimal discriminant rule is not linear. An alternative approach in these situations is to obtain a good linear discriminant rule according to some criterion. This is the idea of the admissible linear procedures introduced by Anderson and Bahadur (1962). For Gaussian populations, under $H_i$, a linear discriminant rule, $a'x$, has a univariate normal distribution with mean $a'\mu_i$ and variance $a'\Sigma_i a$. Therefore, for a rule that assigns $x$ to the first population when $a'x > K$, the probabilities of misclassifying an observation are given by $\Phi(y_1)$ and $\Phi(y_2)$, where $\Phi(x)$ is the cdf of the $N(0,1)$ distribution and

$$y_1 = \frac{K - a'\mu_1}{(a'\Sigma_1 a)^{1/2}}, \qquad y_2 = \frac{a'\mu_2 - K}{(a'\Sigma_2 a)^{1/2}}.$$

The objective is to make these values as small as possible. The set of desirable procedures are those that: (1) minimize the probability of one error when the other is specified, or (2) minimize the maximum probability of error, or (3) minimize the probability of error when a priori probabilities of the two populations are specified. The solutions to these problems are the set of admissible linear procedures. The set of solutions that minimizes $y_1$ for each given $y_2$ is characterized by

$$(t_1 \Sigma_1 + t_2 \Sigma_2) a = \delta, \qquad (6)$$

for nonnegative scalars $t_1$ and $t_2$, where $\delta = \mu_1 - \mu_2$. Related criteria choose $a$ to maximize an information measure between the two projected populations, such as the Kullback-Leibler information $I(1:2, a'x)$.
The values of $a$ that maximize $I(1:2, a'x)$, $I(2:1, a'x)$ or $J(1:2, a'x)$ are of the form $\Sigma_1 a - \lambda \Sigma_2 a = \gamma \delta$, where $\delta = \mu_1 - \mu_2$, for some values of the scalars $\lambda$ and $\gamma$. As a consequence of this, the procedures based on the Kullback-Leibler information and on the divergence are admissible linear procedures. Chaudhuri et al. (1991) obtained linear discriminant procedures through the maximization of the Bhattacharyya distance for Gaussian processes with unequal covariance matrices.
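A minimal numerical sketch of such admissible linear rules, assuming the characterization (6): it solves $(t_1 \Sigma_1 + t_2 \Sigma_2) a = \delta$ over a grid of $(t_1, t_2)$ and evaluates both misclassification probabilities. All means, covariance matrices and the cut-off are illustrative.

```python
import numpy as np
from scipy.stats import norm

mu1, mu2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
Sigma1 = np.array([[1.0, 0.3], [0.3, 1.0]])
Sigma2 = np.array([[2.0, -0.2], [-0.2, 0.5]])
delta = mu1 - mu2
K = 0.0                                   # cut-off of the rule a'x > K

for t in np.linspace(0.1, 0.9, 5):        # t1 = t, t2 = 1 - t
    a = np.linalg.solve(t * Sigma1 + (1 - t) * Sigma2, delta)
    y1 = (K - a @ mu1) / np.sqrt(a @ Sigma1 @ a)
    y2 = (a @ mu2 - K) / np.sqrt(a @ Sigma2 @ a)
    print(f"t1 = {t:.1f}: P(2|1) = {norm.cdf(y1):.3f}, P(1|2) = {norm.cdf(y2):.3f}")
```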
If $(\Omega, \mathcal{B}, \nu)$ is a measure space and $\mathcal{P}$ is the set of all the probability measures on $\mathcal{B}$ which are absolutely continuous with respect to $\nu$, then the Bhattacharyya distance between two probability measures with density functions $p_1$ and $p_2$ belonging to $\mathcal{P}$ is defined by

$$B(p_1, p_2) = -\log \int \sqrt{p_1 p_2} \, d\nu.$$
When the parameters of the models are unknown, discrimination is closely related to model selection: the observed series is assigned to the model selected by a criterion, where the function $f$ defining the penalty depends on the criterion. For instance, for ARMA models the AIC of Akaike is

$$AIC = n \log \hat{\sigma}^2 + 2(p + q),$$

where $(p + q)$ is the number of parameters in the model. The BIC criterion due to Schwarz (1978) is

$$BIC = n \log \hat{\sigma}^2 + (p + q) \log n.$$

This last criterion has been shown to have a very good performance in many model selection problems.
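As an illustration, the following minimal sketch selects an AR order by minimizing $BIC = n \log \hat{\sigma}^2 + k \log n$, with each AR($k$) fitted by least squares on a simulated series; the data-generating model is illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.zeros(400)
for t in range(2, 400):                   # simulate an AR(2) series
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()

def bic_ar(x, k):
    """BIC of an AR(k) fitted to x by least squares."""
    y = x[k:]
    X = np.column_stack([x[k - 1 - j : len(x) - 1 - j] for j in range(k)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.mean((y - X @ beta) ** 2)  # residual variance estimate
    return len(y) * np.log(sigma2) + k * np.log(len(y))

orders = range(1, 7)
print("selected AR order:", min(orders, key=lambda k: bic_ar(x, k)))
```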
Some Bayesian approaches have been proposed that do not adopt a formal Bayes rule via a loss function. For example, Broemeling and Son (1987) consider how to assign an observed time series to one of several possible autoregressive sources with a common known order and unknown parameters and error variance. Using a vague prior density for the parameters and the variance, the observed time series data is assigned to one class by using the marginal posterior mass function of a classification vector,

$$\lambda = (\lambda_1, \ldots, \lambda_k)',$$

with $k$ mass points $(1, 0, \ldots, 0)', \ldots, (0, \ldots, 0, 1)'$. The realization is assigned to the process $i$ if the posterior mass function of $\lambda$ has its largest value at the $i$-th mass point. Marco et al. (1988) consider the case of different autoregressive classes. The data is assigned to the class $k$ that has the greatest predictive probability.
Finally, there exist other approaches for discrimination in which discriminant functions are not used. For instance, Kedem and Slud (1982) proposed transforming a stationary time series into binary arrays that retain only the signs of the $j$-th difference series. These binary series are used for discriminating among different models. Li (1996) proposed a generalization of this method through the use of parametric filtering, i.e., using a family of filters indexed by a parameter. The series is filtered and the information provided by the autocorrelation function is used to discriminate the series into different models.
3. Clustering of Time Series

Suppose that we have a large set of time series following different models. In a nonparametric approach each series is considered as a point in $\mathbb{R}^T$, where $T$ is the length of the series. A straightforward generalization of the standard cluster methods is to obtain groups of series by looking at the distance between these points in the space. In order to identify groups we can work directly with a distance metric in $\mathbb{R}^T$, or we can try to work in a smaller space by projecting the points according to some optimality criterion. This criterion should be related to the possibility of identifying clusters in the projected cloud of points.
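The following minimal sketch of the nonparametric approach treats each length-$T$ series as a point in $\mathbb{R}^T$ and applies standard hierarchical clustering to the Euclidean distances; the simulated series and the choice of Ward linkage are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)

def ar1(phi, T=200):
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = phi * x[t - 1] + rng.normal()
    return x

# Two groups of five series each, generated by different AR(1) models.
series = np.array([ar1(0.9) for _ in range(5)] + [ar1(-0.5) for _ in range(5)])
Z = linkage(series, method="ward")        # Euclidean distances in R^T
print(fcluster(Z, t=2, criterion="maxclust"))  # two-group partition
```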
A parametric approach would proceed by first fitting time series models to the data and then representing the series by the vector of estimated parameters. If ARMA models are fitted, each series can be represented by the coefficients of its AR($\infty$) form, $\pi_i = (\pi_{i1}, \pi_{i2}, \ldots)'$, $i = 1, \ldots, k$, and the distance between two series $x$ and $y$ can be measured by the Euclidean metric between these coefficients,

$$d(x, y) = \sqrt{\sum_{j=1}^{\infty} (\pi_{xj} - \pi_{yj})^2},$$

which always exists for every $x, y \in \ell$, the zero element being the sequence $(0, \ldots, 0)$. We notice that a dual metric can be defined by

$$d^D(x, y) = \sqrt{\sum_{j=1}^{\infty} w_j (\rho_{xj}^D - \rho_{yj}^D)^2},$$

for some weighting function $w_j$ that can be used to give weights to the coefficients that decrease with the lag. This measure is related to the parametric approach described above.
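A minimal sketch of a metric of this type, assuming the distance is the Euclidean distance between truncated AR($\infty$) coefficients: the $\pi$-weights of each ARMA model are obtained by power-series division of $\phi(B)$ by $\theta(B)$. The model parameters are illustrative.

```python
import numpy as np

def pi_weights(phi, theta, n=60):
    """First n coefficients of pi(B) = phi(B)/theta(B) for an ARMA model
    phi(B) x_t = theta(B) a_t, obtained by power-series division."""
    phi_c = np.zeros(n); phi_c[0] = 1.0
    phi_c[1:len(phi) + 1] = -np.asarray(phi, dtype=float)
    theta_c = np.zeros(n); theta_c[0] = 1.0
    theta_c[1:len(theta) + 1] = -np.asarray(theta, dtype=float)
    pi_c = np.zeros(n)
    pi_c[0] = phi_c[0]
    for j in range(1, n):
        pi_c[j] = phi_c[j] - np.dot(theta_c[1:j + 1], pi_c[j - 1::-1])
    return pi_c

# Distance between an AR(1) with phi = 0.8 and an ARMA(1,1) model.
d = np.sqrt(np.sum((pi_weights([0.8], []) - pi_weights([0.3], [0.4])) ** 2))
print(f"pi-weight distance: {d:.4f}")
```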
4. Dimension Reduction
Thus, $m$ must be the eigenvector associated with the largest eigenvalue of the matrix $Q$ obtained as the product of the matrix of explained variability and the inverse of the matrix of total variability. From

$$Q m_i = \lambda_i m_i,$$

the eigenvectors provide the required linear combinations and the eigenvalues the predictability of these linear combinations. Building the matrix $M = [m_1, \ldots, m_p]$, where $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$, and the transformation

$$z_t = M' x_t,$$

a new vector of time series is obtained with components ordered from most to least predictable. The components are contemporaneously uncorrelated because it is easy to show that the matrices $M' \Gamma_0 M$ and $M' \Sigma M$ are both diagonal.
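For a VAR(1), $x_t = \Phi x_{t-1} + a_t$, the explained variability is $\Phi \Gamma_0 \Phi'$ and the canonical directions solve the generalized eigenproblem $(\Phi \Gamma_0 \Phi') m = \lambda \Gamma_0 m$. The following minimal sketch computes the components for an illustrative $\Phi$ and $\Sigma_a$; it is a sketch of the idea, not the authors' computational procedure.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov, eig

Phi = np.array([[0.9, 0.1], [0.0, 0.2]])
Sigma_a = np.eye(2)
Gamma0 = solve_discrete_lyapunov(Phi, Sigma_a)  # Gamma0 = Phi Gamma0 Phi' + Sigma_a

# Canonical directions: (Phi Gamma0 Phi') m = lambda Gamma0 m.
lam, M = eig(Phi @ Gamma0 @ Phi.T, Gamma0)
order = np.argsort(lam.real)[::-1]              # most to least predictable
lam, M = lam.real[order], M.real[:, order]
print("predictabilities:", np.round(lam, 3))
# z_t = M' x_t has components ordered from most to least predictable.
```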
where $\Phi(B) = I - \Phi_1 B - \cdots - \Phi_p B^p$ and $\Theta(B) = I - \Theta_1 B - \cdots - \Theta_q B^q$ are $r \times r$ polynomial matrices, the roots of $|\Phi(B)| = 0$ are on or outside the unit circle and those of $|\Theta(B)| = 0$ are outside the unit circle. The sequence $a_t$ is serially uncorrelated with zero mean and covariance matrix $\Sigma_a$. The components of the vector of common factors can be either stationary or nonstationary.
The specific dynamic structure associated with each of the observed series is included in the vector $n_t$ of idiosyncratic components. Some components of this vector can be white noise, while others can have stationary dynamic structure.
In general, we assume that $n_t$ follows the vector ARMA model

$$\Phi_n(B) n_t = \Theta_n(B) \epsilon_t, \qquad (11)$$

where $\Phi_n(B)$ and $\Theta_n(B)$ are $m \times m$ diagonal polynomial matrices. The sequence of vectors $\epsilon_t$ is normally distributed with zero mean and diagonal covariance matrix $\Sigma_\epsilon$. Therefore, each component follows a univariate ARMA($p_i$, $q_i$), $i = 1, 2, \ldots, m$, with $p = \max(p_i)$ and $q = \max(q_i)$. We assume that the noises from the common factors and the specific components are also uncorrelated for all lags, $E(a_t \epsilon_{t-h}') = 0$, $\forall h$.
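The following minimal sketch simulates this factor model with $r = 1$ common AR(1) factor and white-noise specific components, and shows that the leading eigenvector of the sample covariance matrix points along the loading direction; loadings and parameters are illustrative, and this is not the estimation procedure discussed below.

```python
import numpy as np

rng = np.random.default_rng(4)
T, m = 300, 4
P = np.array([[1.0], [0.8], [-0.5], [0.3]])     # m x r loading matrix, r = 1

F = np.zeros(T)
for t in range(1, T):                           # common factor: AR(1)
    F[t] = 0.7 * F[t - 1] + rng.normal()

n = rng.normal(size=(T, m))                     # specific parts: white noise
X = F[:, None] * P.T + n                        # x_t = P F_t + n_t

# The leading eigenvector of the sample covariance recovers P up to sign/scale.
eigval, eigvec = np.linalg.eigh(np.cov(X.T))
print(np.round(eigvec[:, -1], 2))
```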
The model as stated is not identified, because for any $r \times r$ nonsingular matrix $H$ the observed series can be expressed in terms of a new set of factors, $F_t^* = H F_t$, with loading matrix $P^* = P H^{-1}$ and

$$\Phi^*(B) = H \Phi(B) H^{-1}, \quad \Theta^*(B) = H \Theta(B) H^{-1}, \quad \Sigma_a^* = H \Sigma_a H'.$$

To solve this identification problem, we can always choose either $\Sigma_a = I$ or $P'P = I$. Note that as

$$P^{*\prime} P^* = (H^{-1})' P' P H^{-1},$$

if $P'P = I$ then $P^{*\prime} P^* = (H^{-1})' H^{-1}$, which will only be the identity matrix if $H$ is orthogonal. Therefore the model is not yet identified under rotations, and we need to introduce a restriction to estimate the model. The standard restriction used to solve this problem in static factor analysis is that $P' \Sigma_n^{-1} P$ should be diagonal. Harvey (1989) imposes that $p_{ij} = 0$ for $j > i$, where $P = [p_{ij}]$. This condition is not restrictive, since the factor model can be rotated for a better interpretation when needed (see Harvey, 1989, for a brief discussion).
Peña and Poncela (1998) showed that the model presented is fairly general and also includes the case where lagged factors are present in equation (9). For instance, for ease of exposition assume a stationary model with no specific dynamic components, but with lagged factors in the observation equation, such as

$$x_t = P v(B) F_t + n_t,$$

where $v(B) = I + v_1 B + \cdots + v_l B^l$, $l < \infty$, and $F_t$ follows a VARMA model

$$F_t = \Psi(B) a_t, \qquad \Psi_0 = I.$$

This model can be rewritten in the standard form presented in (9) by stacking the lagged factors into an enlarged factor vector with the corresponding enlarged loading matrix, and a transformation based on this loading matrix gives $r$ linear combinations of the time series components that will recover the factors and $m - r$ combinations that will be white noise.
This model has been studied in the nonstationary case by Peña and Poncela (1998). They showed that the identification of the nonstationary I($d$) factors can be made through the common eigenstructure of some generalized covariance matrices, properly normalized. The number of common nonstationary factors is the number of nonzero eigenvalues. Thus, a similar identification procedure can be applied in the stationary and in the nonstationary case. Once we have a preliminary estimate of the dimension of the system, we can estimate the factor loading matrix and the parameters of the VARIMA factor representation by writing the model in state space form and using the EM algorithm.
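A minimal sketch in the spirit of this idea, assuming (as an illustration, not the authors' exact normalization) that a generalized covariance matrix for $I(1)$ series can be estimated as $C_0 = n^{-2} \sum_t (x_t - \bar{x})(x_t - \bar{x})'$: stationary directions then produce eigenvalues near zero, and the number of sizable eigenvalues suggests the number of common nonstationary factors.

```python
import numpy as np

rng = np.random.default_rng(5)
T, m = 500, 3
F = np.cumsum(rng.normal(size=T))               # one common random-walk factor
P = np.array([1.0, 0.5, -1.0])                  # illustrative loadings
X = F[:, None] * P[None, :] + rng.normal(size=(T, m))

Xc = X - X.mean(axis=0)
C0 = (Xc.T @ Xc) / T ** 2                       # assumed n^-2 normalization
eigvals = np.linalg.eigvalsh(C0)[::-1]
print(np.round(eigvals, 4))                     # one sizable eigenvalue -> one factor
```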
Consider now the case of nonstationary common factors. Since the observed series are generated as $x_t = P F_t + n_t$, the covariance matrix verifies

$$\Gamma_x(0) = P \Gamma_F(0) P' + \Sigma_n,$$

and if $\beta$ is a matrix such that $\beta' P = 0$, then

$$\beta' x_t = \beta' n_t \quad \text{is stationary,}$$

that is, $x_t \sim I(1)$ but $\beta' x_t \sim I(0)$. The matrix $\beta$ is called the cointegration matrix and there will be $p$ linear combinations that lead to stationary processes.
In order to see the implications of this property, suppose the simplest $I(1)$ model, $x_t = x_{t-1} + \epsilon_t$. We can write this model as

$$\nabla x_t = \Pi x_{t-1} + \epsilon_t \qquad (12)$$

and note that if $x_t$ follows the multivariate random walk the value of $\Pi$ in this equation is zero, and this implies no cointegration. However, if the process is a stationary VAR(1) process we can write the model as

$$\nabla x_t = (\Phi - I) x_{t-1} + \epsilon_t;$$

in this equation $\Pi$ is a full rank matrix because $\Pi = \Phi - I$, where $\Phi$ is the AR matrix that must have eigenvalues smaller than one in modulus for the process to be stationary. Thus, saying that $\Pi$ is a full rank matrix implies that $x_t$ follows a VAR(1) and all the components are stationary, that is, they are $I(0)$. A third intermediate possibility is that $\Pi$ is neither a zero matrix nor a full rank matrix but has rank $p$. Let us show that this implies cointegration, that is, that in this case some linear combinations of the vector of time series are stationary whereas some others will be nonstationary. To show this property note that if $\Pi$ has rank $p$ it can be written as

$$\Pi = \alpha \beta',$$

where $\alpha$ and $\beta$ are $m \times p$ matrices of rank $p < m$. Now if we multiply (12) by $\beta'$, we have

$$\nabla \beta' x_t = (\beta' \alpha) \beta' x_{t-1} + \beta' \epsilon_t,$$

and calling $z_t = \beta' x_t$ we have that

$$\nabla z_t = \Pi^* z_{t-1} + \beta' \epsilon_t,$$

where $\Pi^* = \beta' \alpha$ is a full rank matrix and $z_t$ is stationary. Thus, the $p$ linear combinations $\beta' x_t$ will be stationary, whereas if the $m \times (m - p)$ matrix $\alpha_\perp$ belongs to the null space of $\alpha$, that is, it verifies $\alpha_\perp' \alpha = 0$, the $m - p$ combinations $\alpha_\perp' x_t$ are nonstationary.
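This rank argument is easy to check by simulation. The following minimal sketch generates two $I(1)$ series sharing one random-walk factor, estimates $\Pi$ by regressing $\nabla x_t$ on $x_{t-1}$, and inspects its singular values; one dominant singular value indicates a single cointegration relation. All settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
T = 1000
f = np.cumsum(rng.normal(size=T))               # common I(1) factor
x1 = f + rng.normal(size=T)
x2 = 0.5 * f + rng.normal(size=T)
X = np.column_stack([x1, x2])

dX = np.diff(X, axis=0)                         # nabla x_t
Xlag = X[:-1]
Pi_T, *_ = np.linalg.lstsq(Xlag, dX, rcond=None)  # least-squares estimate of Pi'
print(np.round(np.linalg.svd(Pi_T, compute_uv=False), 3))  # ~rank 1: one relation
```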
There is a close connection between cointegration and the factor model. Escribano and Peña (1994) showed that the following two propositions are equivalent:

(1) The individual components of $x_t$ are $I(1)$ but there are $p$ cointegration relationships, $\beta' x_t$, that are $I(0)$.

(2) $x_t$ can be written as generated by $m - p$ common factors that are $I(1)$.

Thus, cointegration implies common factors and common nonstationary factors imply cointegration. From the practical point of view, if the dimension $m$ is large it is simpler to look for a few factors than for many cointegration relations.
5. Conclusions
Acknowledgment
This work has been partially supported by DGES, grant PB96-0111, and Cátedra BBVA de Métodos para la Mejora de la Calidad.
References
CHAUDHURI, G., BORWANKER, J. D. AND RAO, P. R. K. (1991). Distance based linear discriminant function for stationary time series, Communications in Statistics (Theory and Methods), 20, 2195-2205.
CHAUDHURI, G. (1992). Linear discriminant function for complex normal time series, Statist. Probab. Lett., 15, 277-279.
DARGAHI-NOUBARY, G. R. (1992). Discrimination between Gaussian time series based on their spectral differences, Communications in Statistics (Theory and Methods), 21, 2439-2458.
DARGAHI-NOUBARY, G. R. AND LAYCOCK, P. J. (1981). Spectral ratio discriminants and information theory, J. Time Ser. Anal., 2, 71-86.
ENGLE, R. F. AND WATSON, M. W. (1981). A one-factor multivariate time series model of metropolitan wage rates, J. Am. Statist. Ass., 76, 774-781.
ESCRIBANO, A. AND PEÑA, D. (1994). Cointegration and common factors, J. Time Ser. Anal., 15, 577-586.
GALEANO, P. AND PEÑA, D. (2000). Discrimination in time series and the interpolation error. Working Paper, Universidad Carlos III de Madrid.
GANTERT, C. (1994). Classification of trends via the linear state space model, Biometrical Journal, 36, 825-839.
GERSCH, W., MARTINELLI, F., YONEMOTO, J., LOW, M. D. AND MCEWAN, J. A. (1979). Automatic classification of electroencephalograms: Kullback-Leibler nearest neighbor rules, Science, 205, 193-195.
GEWEKE, J. F. AND SINGLETON, K. J. (1981). Maximum likelihood confirmatory factor analysis of economic time series, International Economic Review, 22, 37-54.
HANNAN, E. J. AND QUINN, B. J. (1979). The determination of the order of an autoregression, J. R. Statist. Soc. B, 41, 190-195.
HANNAN, E. J. AND DEISTLER, M. (1988). The Statistical Theory of Linear Systems. New York: John Wiley.
HARVEY, A. (1989). Forecasting, Structural Time Series Models and the Kalman Filter (2nd edn). Cambridge: Cambridge University Press.
HURVICH, C. M. AND TSAI, C. L. (1989). Regression and time series model selection in small samples, Biometrika, 76, 297-307.
KAKIZAWA, Y., SHUMWAY, R. H. AND TANIGUCHI, M. (1998). Discrimination and clustering for multivariate time series, J. Am. Statist. Ass., 93, 328-340.
KEDEM, B. AND SLUD, E. (1982). Time series discrimination by higher order crossings, Annals of Statistics, 10, 786-794.
Pedro Galeano
Teaching Assistant and Ph.D. Student
Departamento de Estadística y Econometría
Universidad Carlos III de Madrid
Spain
pgaleano@est-econ.uc3m.es

Daniel Peña
Professor of Statistics
Departamento de Estadística y Econometría
Universidad Carlos III de Madrid
Spain
dpena@est-econ.uc3m.es