A Similarity Measure for Sequences of Categorical Data
C. Gómez-Alonso and A. Valls
Abstract. Similarity measures are usually used to compare items and identify
pairs or groups of similar individuals. The appropriate similarity measure strongly
depends on the type of values to be compared. We face the problem in which the
information about each individual is a sequence of events (e.g. the sequence of web
pages visited by a certain user, or a personal daily schedule). Some measures
for numerical sequences exist, but very few methods consider sequences of categorical
data. In this paper, we present a new similarity measure for sequences of
categorical labels and compare it with previous approaches.
1 Introduction
In recent years there has been increasing interest in developing techniques to deal with
sequences of data, and temporal data mining algorithms have been developed for
this type of data [3,6]. Understanding sequence data is becoming very important, and
the treatment of such sequences is expected to enable novel classes of applications in
the coming years [1]. For example, telecommunication companies store spatio-temporal
data daily; these sequences contain detailed information about personal or vehicular
behaviour, which can be used to find interesting patterns for many different
applications, such as traffic control. Similarly, people surf the Internet, another
great potential source of sequences of users' actions (e.g. web pages visited). The study
of behaviour on the Internet can also lead to interesting applications, such as
intrusion detection. Other domains also produce temporal sequences [4]:
protein sequences that describe the amino acid composition of proteins and represent
their structure and function, gene information (DNA) that encodes the genetic
makeup, electronic health records that store the clinical history of patients, etc.
However, this type of data requires an adaptation of the classical data mining and
decision making techniques applied to static data. Data are called static if all their feature
values do not change with time, or change negligibly. In contrast, sequence data
analysis studies the changes in the values over time in order to identify interesting
temporal patterns.
In [8] three different approaches to deal with time series are presented: (1) to work
directly with raw data, (2) to convert a raw series data into a feature vector of lower
dimension and (3) to represent the sequence with a certain number of model parameters.
V. Torra and Y. Narukawa (Eds.): MDAI 2008, LNAI 5285, pp. 134–145, 2008.
© Springer-Verlag Berlin Heidelberg 2008
In this section, we present the most commonly used dissimilarity functions for numerical
variables. Let us take two objects i and j represented by the corresponding vectors of
values i = (x_{i1}, ..., x_{iK}) and j = (x_{j1}, ..., x_{jK}).
Euclidean Distance. It is the sum of the squares of the differences of the values.

d_2^2(i, j) = \sum_{k=1}^{K} (x_{ik} - x_{jk})^2 \qquad (1)
City-Block or Manhattan Distance. It is the sum of the absolute differences for all
the attributes of the two objects.
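As a minimal Python sketch of these two point-wise distances (the function names are ours), assuming equal-length value vectors:

```python
import math

def euclidean(i, j):
    """Euclidean (L2) distance: the square root of the sum in Eq. (1)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(i, j)))

def manhattan(i, j):
    """City-block (L1) distance: the sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(i, j))
```

For example, euclidean([1, 2, 3], [4, 6, 3]) gives 5.0 and manhattan([1, 2, 3], [4, 6, 3]) gives 7.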
With respect to sequences of numerical values, the most common similarity measures
are the following ones (in [8] these and other approaches are presented):
Short Time Series Distance. It is the sum of the squared differences of the slopes in
the two time series being compared.

d_{STS}^2(i, j) = \sum_{k=1}^{K} \left( \frac{x_{j(k+1)} - x_{j(k)}}{t_{(k+1)} - t_{(k)}} - \frac{x_{i(k+1)} - x_{i(k)}}{t_{(k+1)} - t_{(k)}} \right)^2 \qquad (4)
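A sketch of the squared STS distance (function name ours), for two series sampled at the same time points t:

```python
def d_sts_squared(xi, xj, t):
    """Squared Short Time Series distance: sum of squared slope differences
    over the K segments defined by the common sampling times t."""
    return sum(
        ((xj[k + 1] - xj[k]) / (t[k + 1] - t[k])
         - (xi[k + 1] - xi[k]) / (t[k + 1] - t[k])) ** 2
        for k in range(len(t) - 1)
    )
```

For instance, d_sts_squared([0, 1, 2], [0, 2, 4], [0, 1, 2]) gives 2.0, since the slopes differ by 1 on each of the two segments.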
Hamming Distance. It is the number of positions at which the two objects differ. It
can only be applied when the two sequences have identical lengths.
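As an illustrative sketch (function name ours):

```python
def hamming(i, j):
    """Number of positions at which two equal-length sequences differ."""
    if len(i) != len(j):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(a != b for a, b in zip(i, j))
```

For example, hamming("karolin", "kathrin") gives 3.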
with ordered or unordered scales [7]. But, in fact, there are other data representations
that can be considered, such as textual, spatial, image or multimedia data.
As far as temporal series are concerned, other distinctions must be made as to whether
the data is uniformly or non-uniformly sampled, univariate or multivariate, and whether
the sequences are of equal or unequal length [8].
In this work we deal with sequences of events that represent the behaviour of a user
in a particular context, for example tourists visiting a city, where the sequences
show the itinerary that each person has followed to visit the interesting locations in that
city. A private real data set of tourists' itineraries, provided by Dr. Shoval, has been used
for testing. This data set concerns a city with 25 interesting places and contains about 40
itineraries with lengths that range from 10 to 30 items.
Another data set we have considered is the list of sequences of visits to the Microsoft
web site. The data was obtained by sampling and processing the www.microsoft.com
logs. The data set records the use of the site by 38,000 anonymous, randomly selected
users. For each user, the data lists all the areas of the web site (Vroots) that he/she
visited in a one-week time frame. The number of Vroots is 294, and the mean number
of Vroot visits per user is 3. This data is publicly available at the UCI Machine Learning
Repository [2].
These two examples of event sequences have the following common characteristics:
· Events are categorical values that belong to a finite set of linguistic labels (city
locations, web pages).
· The sequences have been uniformly sampled in the sense that time slopes are not
taken into account.
· The sequences are univariate, only one concept is studied.
· The lengths of the sequences of the individuals are not equal.
· Events can be repeated in the sequence (for example, a certain tourist visited the
same place, Main Street, more than one time during his holidays).
To facilitate the analysis of the results, the categorical values indicating places or web
pages have been substituted by simpler identifiers. An example of 14 different event
sequences is given in Table 1. Each character may represent a physical place or a web
page.
Table 1. Example of event sequences

Id  Sequence    Id  Sequence
 1  ab           8  cgdabc
 2  bc           9  d
 3  abc         10  da
 4  cab         11  db
 5  dabc        12  cbcbcb
 6  edabc       13  bcbcbc
 7  fedabc      14  ebebeb
The former allows us to measure if the two individuals have done the same things,
that is, if they have visited the same web pages or have gone to the same places. The
latter takes into account the temporal sequence of the events, that is, if two tourists have
visited the Main Street after going to the City Hall or not. This second measure should
also take into account if two events have taken place consecutively or not.
For example, let T1 and T2 be tourists who have visited some places of the same
city: T1 = {a, b, c} and T2 = {c, a, b, d}. Notice that there are 3 common places and
that both have visited a before b. So, we could say that they are quite similar.
In this paper we present a new approach to calculate the similarity that takes into
account these two aspects. It is called Ordering-based Sequence Similarity (OSS) and
consists, on one hand, in finding the common elements of the two sequences, and on
the other hand, in comparing the positions of those elements in both sequences. The
'elements' that are the basis of this measure can be either single events or sub-sequences
of events that are considered as indivisible groups (i.e. patterns). When working
with patterns, they must have a minimum length of two and a maximum length equal to
the length of the shortest sequence.
d_{OSS}(i, j) = \frac{f(i, j) + g(i, j)}{card(i) + card(j)} \qquad (9)

where

g(i, j) = card(\{x_{ik} \mid x_{ik} \notin j\}) + card(\{x_{jk} \mid x_{jk} \notin i\}) \qquad (10)

and

f(i, j) = \frac{\sum_{k=1}^{n} \sum_{p=1}^{\Delta} \left| i^{(l_k)}(p) - j^{(l_k)}(p) \right|}{\max\{card(i), card(j)\}} \qquad (11)

where i^{(l_k)} = \{t \mid i(t) = l_k\} and \Delta = \min(card(i^{(l_k)}), card(j^{(l_k)})).
This function has two parts: g counts the number of non-common elements, and f
measures the similarity in the positions of the common elements in the sequences (the
ordering). The function f is calculated in the symbols space L. So, first, each event in
the sequence i is projected into L, obtaining a numerical vector for each symbol:
i^{(l_1)} .. i^{(l_n)}. Each of these new vectors stores the positions of the corresponding
symbol in the sequence i.
The same is done with sequence j, obtaining j(l1 ) .. j(ln ) . Then the projections of the two
sequences i and j into L are compared, and the difference in the positions is calculated
and normalised by the maximum cardinality of the sequences i and j.
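A minimal Python sketch of dOSS under these definitions (the name d_oss is ours; note that g is computed over occurrences rather than distinct symbols, which is what the worked example later in this section implies):

```python
from collections import Counter

def d_oss(i, j):
    """Ordering-based Sequence Similarity: a dissimilarity in [0, 1]."""
    # g (Eq. 10): occurrences of one sequence that are not matched in the
    # other, computed as a multiset (Counter) difference.
    ci, cj = Counter(i), Counter(j)
    g = sum((ci - cj).values()) + sum((cj - ci).values())
    # f (Eq. 11): positional differences of the common symbols; zip truncates
    # at delta = min(card(i_lk), card(j_lk)) automatically.
    total = 0
    for symbol in set(i) & set(j):
        pi = [p for p, e in enumerate(i) if e == symbol]
        pj = [p for p, e in enumerate(j) if e == symbol]
        total += sum(abs(a - b) for a, b in zip(pi, pj))
    f = total / max(len(i), len(j))
    return (f + g) / (len(i) + len(j))
```

With this sketch, d_oss(list("abca"), list("cadbcac")) evaluates to 4/11 ≈ 0.36, the value obtained in the worked example later in this section.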
If two sequences are equal, the result of dOSS is zero, because the positions are always
equal (f = 0) and there are no uncommon elements (g = 0). Conversely, if the two
sequences do not share any element, then g = card(i) + card(j) and f = 0, so dOSS
is equal to 1 after dividing by card(i) + card(j). The Ordering-based Sequence
Similarity function therefore always gives values between 0 and 1.
The function has the following properties:
· Symmetry: dOSS (i, j) = dOSS ( j, i)
· Positivity: dOSS (i, j) ≥ 0 for all i, j
· Reflexivity: dOSS (i, j) = 0 iff i = j
However, it does not fulfil the triangle inequality, dOSS(i, j) ≤ dOSS(i, k) + dOSS(k, j) for
all i, j, k. From these properties, it is clear that dOSS is a dissimilarity but not a distance.
Proof. The proof of Symmetry, Positivity and Reflexivity is trivial from Definition 1. The
Triangle Inequality does not hold, as the following counterexample proves.
Let A, B and C be three sequences defined by A = {b, c}, B = {d, a} and C =
{d, a, b, c}. In this case, dOSS(A, B) = 1.0 because they do not share any item. However,
dOSS(A, C) = 0.5 because they have two elements in common. B and C also have two
common elements (in exactly the same positions), so dOSS(B, C) = 0.33.
Consequently, dOSS(A, C) + dOSS(B, C) = 0.83, which is less than dOSS(A, B) = 1.0;
this proves that the triangle inequality is not fulfilled.
As has been pointed out, this measure can be applied to different items: single events
or groups of events. Given the sequence {a, b, a, c, d}, in the first case i = (a, b, a, c, d),
so x_ij is any individual event in the sequence. In the second case i = (ab, ba, ac, cd), so
x_ij is any pair of consecutive items, and i = (aba, bac, acd) for triplets.
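The rewriting of a sequence into consecutive patterns of length n can be sketched as (helper name ours):

```python
def to_patterns(seq, n):
    """Rewrite a sequence of events as its overlapping n-grams (patterns)."""
    return ["".join(seq[k:k + n]) for k in range(len(seq) - n + 1)]
```

For instance, to_patterns(list("abacd"), 2) gives ['ab', 'ba', 'ac', 'cd'] and to_patterns(list("abacd"), 3) gives ['aba', 'bac', 'acd'], matching the example above.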
The following example illustrates how dOSS is calculated for single events (dOSS-1)
and for pairs of events (dOSS-2). Let us take the two following sequences: A = {a, b, c, a},
B = {c, a, d, b, c, a, c}, with card(A) = 4 and card(B) = 7.
The similarity considering single items gives dOSS-1(A, B) = 0.36. This result is
obtained in the following way: symbols a, b and c are common to both sequences
A and B. The projections on the symbol a are A^(a) = {0, 3} and B^(a) = {1, 5}, so
f_a(A, B) = |0 − 1| + |3 − 5| = 3. For symbol b: A^(b) = {1}, B^(b) = {3} and f_b(A, B) =
|1 − 3| = 2. For c: A^(c) = {2}, B^(c) = {0, 4, 6} and f_c(A, B) = |2 − 0| = 2. So,
f(A, B) = (f_a(A, B) + f_b(A, B) + f_c(A, B))/7 = 1. Counting the non-common elements,
we have g(A, B) = 3. Finally, dOSS-1(A, B) = (f(A, B) + g(A, B))/(4 + 7) = (1 + 3)/11 = 0.36.
Considering the same example with patterns of length 2, we obtain a greater
dissimilarity, dOSS-2(A, B) = 0.629. In this case, the sequences are A = {ab, bc, ca}
and B = {ca, ad, db, bc, ca, ac}, with cardinalities 3 and 6. They share 2 elements: for
the pair bc we have A^(bc) = {1}, B^(bc) = {3} and f_bc(A, B) = |1 − 3| = 2, while for
the pair ca: A^(ca) = {2}, B^(ca) = {0, 4} and f_ca(A, B) = |2 − 0| = 2. So,
f(A, B) = (f_bc(A, B) + f_ca(A, B))/6 = 0.66. Counting the non-common elements,
we have g(A, B) = 5. Finally, dOSS-2(A, B) = (f(A, B) + g(A, B))/(3 + 6) = (0.66 + 5)/9 = 0.629.
5 Experiments
In this section we present the results obtained with a test set of 14 registers,
corresponding to the event sequences presented in Table 1. This data set contains sequences
of different lengths. Register 9 is an extreme case, with a single event in the sequence.
Sequences in registers 12, 13 and 14 show the case of having repeated events.
We have tested the Ordering-based Sequence Similarity considering each event
separately, OSS-1 (Table 2), and with patterns of two events, OSS-2 (Table 3).
Notice that the OSS applied to pairs of events gives higher dissimilarity values in
many cases (see the number of 1’s in Table 3). This is due to the fact that finding
common pairs of events is much more difficult than finding common single events.
For OSS-2, the sequences in the register id=6 {e, d, a, b, c} and id=7 { f , e, d, a, b, c}
are the most similar ones (0.19), because they share 4 common pairs in very similar
positions (ed, da, ab, bc) and 1 uncommon pair ( f e). The next ones are id=5 {d, a, b, c}
and id=6 that have 3 common pairs and 1 uncommon pair. And in third place we find
sequences id=12 {c, b, c, b, c, b} and id=13 {b, c, b, c, b, c}, that also share 4 common
pairs and 2 uncommon ones.
On the contrary, OSS-1 considers that the similarity between registers id=12 and id=13
is higher (0.08) than that between id=6 and id=7 (0.16). This is because in the first
case there are 6 common symbols (all of them), whereas in the second case there are only
5 common symbols and 1 uncommon one. From our point of view, OSS-1 captures
the degrees of similarity between sequences of events better than OSS-2, since it is
less important that events happen consecutively than that they occupy similar positions.
After analysing the OSS results, the behaviour of the OSS function has been compared
to the Edit Distance, a measure that is quite popular for comparing sequences [4].
The Edit Distance counts the number of changes needed to transform one sequence
into another. To scale the values into the unit interval, we have divided the number of
changes by the length of the longest sequence. Table 4 presents the results of the
normalised Edit Distance (ED) on the same data set.

Table 2. Dissimilarity matrix obtained with OSS-1

Id.Reg. 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 0 0.62 0.2 0.33 0.41 0.54 0.62 0.62 1 0.62 0.5 0.75 0.77 0.75
2 0.62 0 0.33 0.4 0.5 0.59 0.66 0.60 1 1 0.62 0.54 0.5 0.77
3 0.2 0.33 0 0.22 0.25 0.4 0.5 0.48 1 0.66 0.6 0.59 0.59 0.77
4 0.33 0.4 0.22 0 0.25 0.4 0.5 0.40 1 0.6 0.66 0.57 0.61 0.79
5 0.41 0.5 0.25 0.25 0 0.19 0.33 0.35 0.6 0.33 0.37 0.66 0.66 0.81
6 0.54 0.59 0.4 0.4 0.19 0 0.16 0.37 0.7 0.48 0.51 0.72 0.72 0.66
7 0.62 0.66 0.5 0.5 0.33 0.16 0 0.40 0.76 0.58 0.60 0.77 0.77 0.72
8 0.62 0.60 0.48 0.40 0.35 0.37 0.40 0 0.76 0.58 0.60 0.58 0.59 0.87
9 1 1 1 1 0.6 0.7 0.76 0.76 0 0.33 0.33 1 1 1
10 0.62 1 0.66 0.6 0.33 0.48 0.58 0.58 0.33 0 0.5 1 1 1
11 0.5 0.62 0.6 0.66 0.37 0.51 0.60 0.60 0.33 0.5 0 0.75 0.77 0.75
12 0.75 0.54 0.59 0.57 0.66 0.72 0.77 0.58 1 1 0.75 0 0.08 0.5
13 0.77 0.5 0.59 0.61 0.66 0.72 0.77 0.59 1 1 0.77 0.08 0 0.54
14 0.75 0.77 0.77 0.79 0.81 0.66 0.72 0.87 1 1 0.75 0.5 0.54 0

Table 3. Dissimilarity matrix obtained with OSS-2 (register 9 contains a single event, so it has no pairs and is marked '-')

Id.Reg. 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 0 1 0.33 0.5 0.58 0.7 0.76 0.76 - 1 1 1 1 1
2 1 0 0.5 1 0.66 0.75 0.8 0.8 - 1 1 0.7 0.66 1
3 0.33 0.5 0 0.62 0.33 0.5 0.59 0.59 - 1 1 0.71 0.74 1
4 0.5 1 0.62 0 0.6 0.70 0.77 0.77 - 1 1 1 1 1
5 0.58 0.66 0.33 0.6 0 0.25 0.4 0.4 - 0.5 1 0.77 0.8 1
6 0.7 0.75 0.5 0.70 0.25 0 0.19 0.39 - 0.65 1 0.82 0.84 1
7 0.76 0.8 0.59 0.77 0.4 0.19 0 0.4 - 0.73 1 0.86 0.88 1
8 0.76 0.8 0.59 0.77 0.4 0.39 0.4 0 - 0.73 1 0.86 0.88 1
9 - - - - - - - - - - - - - -
10 1 1 1 1 0.5 0.65 0.73 0.73 - 0 1 1 1 1
11 1 1 1 1 1 1 1 1 - 1 0 1 1 1
12 1 0.7 0.71 1 0.77 0.82 0.86 0.86 - 1 1 0 0.28 1
13 1 0.66 0.74 1 0.8 0.84 0.88 0.88 - 1 1 0.28 0 1
14 1 1 1 1 1 1 1 1 - 1 1 1 1 0

Table 4. Dissimilarity matrix obtained with the normalised Edit Distance (ED)

Id.Reg. 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 0 1 0.33 0.33 0.5 0.6 0.66 0.66 1 1 0.5 0.83 0.83 0.83
2 1 0 0.33 1 0.5 0.6 0.66 0.66 1 1 1 0.66 0.66 0.83
3 0.33 0.33 0 0.66 0.25 0.4 0.5 0.5 1 1 0.66 0.66 0.66 0.83
4 0.33 1 0.66 0 0.5 0.6 0.66 0.5 1 0.66 0.66 0.66 0.66 0.83
5 0.5 0.5 0.25 0.5 0 0.2 0.33 0.33 0.75 0.5 0.5 0.66 0.66 0.83
6 0.6 0.6 0.4 0.6 0.2 0 0.16 0.33 0.8 0.6 0.6 0.66 0.66 0.66
7 0.66 0.66 0.5 0.66 0.33 0.16 0 0.33 0.83 0.66 0.66 0.83 0.66 0.83
8 0.66 0.66 0.5 0.5 0.33 0.33 0.33 0 0.83 0.66 0.66 0.66 0.66 1
9 1 1 1 1 0.75 0.8 0.83 0.83 0 0.5 0.5 1 1 1
10 1 1 1 0.66 0.5 0.6 0.66 0.66 0.5 0 0.5 1 1 1
11 0.5 1 0.66 0.66 0.5 0.6 0.66 0.66 0.5 0.5 0 0.83 0.83 0.83
12 0.83 0.66 0.66 0.66 0.66 0.66 0.83 0.66 1 1 0.83 0 0.33 0.5
13 0.83 0.66 0.66 0.66 0.66 0.66 0.66 0.66 1 1 0.83 0.33 0 0.66
14 0.83 0.83 0.83 0.83 0.83 0.66 0.83 1 1 1 0.83 0.5 0.66 0
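A sketch of the normalised Edit Distance used for this comparison (function names ours; the classic Levenshtein recurrence with unit costs):

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and
    substitutions needed to transform sequence a into sequence b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

def normalised_ed(a, b):
    """Edit distance scaled into [0, 1] by the length of the longest sequence."""
    return edit_distance(a, b) / max(len(a), len(b))
```

For instance, edit_distance("aabb", "bbaa") gives 4, i.e. 1.0 once normalised, which is the extreme case discussed below.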
The first significant difference between the measures is the pair of registers that
achieves the minimum dissimilarity in each case. ED considers that the most similar
ones are {e, d, a, b, c} (id=6) and {f, e, d, a, b, c} (id=7), because they are the longest
sequences with a single difference, the introduction of a new symbol. OSS-1 gives
the same dissimilarity value to this pair, 0.16, but it finds another, even more similar,
pair of sequences: {c, b, c, b, c, b} (id=12) and {b, c, b, c, b, c} (id=13), which are the
longest sequences that share exactly the same symbols in very similar positions. In fact,
they both contain the sequence {c, b, c, b, c}, with b added before or after it.

Table 5. Registers with minimum and maximum dissimilarity for each sequence, under OSS-1, OSS-2 and ED

                 OSS-1                     OSS-2                               ED
Id Seq      MIN Id  Value  MAX Id     MIN Id  Value  MAX Id               MIN Id   Value  MAX Id
1  ab       3       0.2    9          3       0.33   2,10,11,12,13,14     3,4      0.33   2,9,10
2  bc       3       0.33   9,10       3       0.5    1,4,10,11,14         3        0.33   1,4,9,10,11
3  abc      1       0.2    9          1,5     0.33   10,11,14             5        0.25   9,10
4  cab      3       0.22   9          1       0.5    2,10,11,12,13,14     1        0.33   2,9
5  dabc     6       0.19   14         6       0.25   11,14                6        0.2    14
6  edabc    7       0.16   12,13      7       0.19   11,14                7        0.16   9
7  fedabc   6       0.16   12,13      6       0.19   11,14                6        0.16   9,12,14
8  cgdabc   5       0.35   14         6       0.39   11,14                5,6,7    0.33   14
9  d        10,11   0.33   1,2,3,4,   -       -      -                    10,11    0.5    1,2,3,4,
                           12,13,14                                                       12,13,14
10 da       5,9     0.33   2,12,13,14 5       0.5    1,2,3,4,11,12,13,14  5,9,11   0.5    1,2,3,12,13,14
11 db       9       0.33   13         all     1.0    all                  1,5,9,10 0.5    2
12 cbcbcb   13      0.08   9,10       13      0.28   1,4,10,11,14         13       0.33   9,10
13 bcbcbc   12      0.08   9,10       12      0.28   1,4,10,11,14         12       0.33   9,10
14 ebebeb   12      0.5    9,10       all     1.0    all                  12       0.5    8,9,10
If we now consider the first row of the matrices, the one that compares the register
{a, b} (id=1) with the rest, it can be seen that the behaviour of the Edit Distance is
quite different from that of OSS-1. ED considers that sequence {a, b} (id=1) is
equally similar to {a, b, c} (id=3) and {c, a, b} (id=4). In contrast, OSS-1 considers
the former more similar to register id=1, because both individuals have started the
sequence with the same event a, followed by b; the difference is that individual
id=1 has stopped while the other has continued one step more. Sequence id=4, however,
has not started with the same event. This difference can only be captured if the relative
ordering of the events is considered.
The difficulty of distinguishing two sequences that contain different items from two
sequences that contain the same items in a different order is the main drawback of the
Edit Distance [13]. An extreme case is the result given for sequences like {a, a, b, b}
and {b, b, a, a}: ED gives a dissimilarity of 4 changes (or 1 if it is normalised). In
this case OSS-1 clearly improves on ED, giving a dissimilarity value of 0.125, which is
closer to what common sense would indicate.
To make a deeper analysis of the behaviour of the similarity matrix for clustering
purposes, we need to identify the closest sequence to each given one. Table 5
shows the identifier of the register(s) with minimum dissimilarity to each of the 14 case
studies, for the 3 measures, together with the corresponding dissimilarity value that
links those pairs of sequences. Finally, the sequence with maximum dissimilarity is
also shown.
An analysis of the table suggests that the dissimilarity function OSS-1 is more precise
than ED and OSS-2 at determining the minimum and maximum values: it usually
identifies unique values.
Acknowledgments
The authors wish to thank Dr. N. Shoval for his collaboration. This work has
been supported by the Spanish research projects E-AEGIS (TSI-2007-65406-C03) and
Consolider-Ingenio 2010 ARES (CSD2007-00004).
References
1. Abul, O., Atzori, M., Bonchi, F., Giannotti, F.: Hiding sequences. In: ICDE Workshops, pp.
147–156. IEEE Computer Society, Los Alamitos (2007)
2. Asuncion, A., Newman, D.: UCI machine learning repository (2007),
http://archive.ics.uci.edu/ml/
3. Dietterich, T.G.: Machine learning for sequential data: A review. In: Caelli, T., Amin, A.,
Duin, R.P.W., Kamel, M.S., de Ridder, D. (eds.) SPR 2002 and SSPR 2002. LNCS, vol. 2396,
pp. 15–30. Springer, Heidelberg (2002)
4. Dong, G., Pei, J.: Sequence Data Mining. Advances in Database Systems, vol. 33. Springer,
US (2007)
5. Figueira, J., Greco, S., Ehrgott, M.: Multiple Criteria Decision Analysis: State of the Art
Surveys. ISOR & MS, vol. 78. Springer, Heidelberg (2005)
6. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edn. The Morgan Kaufmann
Series in Data Management Systems. Morgan Kaufmann Publishers, San Francisco (2006)
7. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Computing
Surveys 31(3), 264–323 (1999)
8. Liao, T.W.: Clustering of time series data–a survey. Pattern Recognition 38(11), 1857–1874
(2005)
9. Mount, D.W.: Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor
Laboratory Press (September 2004)
10. Nin, J., Torra, V.: Extending microaggregation procedures for time series protection. In:
Greco, S., Hata, Y., Hirano, S., Inuiguchi, M., Miyamoto, S., Nguyen, H.S., Slowinski, R.
(eds.) RSCTC 2006. LNCS (LNAI), vol. 4259, pp. 899–908. Springer, Heidelberg (2006)
11. Notredame, C.: Recent evolutions of multiple sequence alignment algorithms. PLoS
Computational Biology 3(8), e123+ (2007)
12. Wallace, I.M., Blackshields, G., Higgins, D.G.: Multiple sequence alignments. Current
Opinion in Structural Biology 15(3), 261–266 (2005)
13. Yang, J., Wang, W.: Cluseq: Efficient and effective sequence clustering. In: 19th International
Conference on Data Engineering (ICDE 2003), vol. 00, p. 101 (2003)