
A Similarity Measure for Sequences of Categorical Data

Based on the Ordering of Common Elements

Cristina Gómez-Alonso and Aida Valls

iTAKA Research Group - Intelligent Tech. for Advanced Knowledge Acquisition


Department of Computer Science and Mathematics
Universitat Rovira i Virgili
43007 Tarragona, Catalonia, Spain
{cristina.gomez,aida.valls}@urv.cat

Abstract. Similarity measures are usually used to compare items and identify
pairs or groups of similar individuals. The similarity measure strongly depends
on the type of values to be compared. We have faced the problem in which the
information about the individuals is a sequence of events (e.g. the sequence of web
pages visited by a certain user, or a personal daily schedule). Some measures
for numerical sequences exist, but very few methods consider sequences of
categorical data. In this paper, we present a new similarity measure for sequences of
categorical labels and compare it with previous approaches.

1 Introduction
In recent years there has been an increasing interest in developing techniques to deal with
sequences of data. Temporal data mining algorithms have been developed to deal with
this type of data [3,6]. Understanding sequence data is becoming very important and
the treatment of those sequences is expected to enable novel classes of applications in
the next years [1]. For example, telecommunication companies store spatio-temporal
data daily; these sequences contain detailed information about personal or vehicular
behaviour, which makes it possible to find interesting patterns that can be used in many different
applications, such as traffic control. Similarly, people surf the Internet, which is another
great potential source of sequences of users' actions (e.g. web pages visited). The study
of behaviour on the Internet can also lead to interesting applications, such as intrusion
detection. There are other domains that also produce temporal sequences [4]:
protein sequences that describe the amino acid composition of proteins and represent
the structure and function of proteins, gene information (DNA) that encodes the genetic
makeup, electronic health records that store the clinical history of patients, etc.
However, this type of data requires an adaptation of the classical data mining and de-
cision making techniques applied to static data. Data are called static if all their feature
values do not change with time, or change negligibly. In contrast, sequence data analy-
sis is interested in studying the changes in the values in order to identify interesting
temporal patterns.
In [8] three different approaches to deal with time series are presented: (1) to work
directly with the raw data, (2) to convert the raw series data into a feature vector of lower
dimension, and (3) to represent the sequence with a certain number of model parameters.

V. Torra and Y. Narukawa (Eds.): MDAI 2008, LNAI 5285, pp. 134–145, 2008.
© Springer-Verlag Berlin Heidelberg 2008

The feature-based and model-based approaches permit the application of conventional algorithms,
since there is no need to deal directly with the sequential data. However, sometimes it is not
possible to build those feature vectors or models. In this work we are interested in the
first approach, which requires adapting the classical techniques in order to be able to
deal with the particularities of sequential data.
In this paper we consider the problem of measuring the similarity of two time sequences
of items (i.e. events). Comparing elements is a basic key point in many methods
for analysing data, such as clustering techniques (which build clusters of similar
objects), classification of objects into existing clusters, characterisation of prototypes,
recommender systems or decision making methods (such as those based on dominance
rough sets that consider dominance, indiscernibility and similarity relations [5]).
In [8] a survey of similarity/distance measures for sequential data is given. Nine
measures are defined and most of them can only be applied to numerical values. In the
examples of temporal sequences presented before, the items of the sequence are not
numbers but categorical values (places, web pages, proteins, etc.). Although sequences
of categorical values are very important nowadays, there still exist few attempts to work
with them due to the inherent complexity of dealing with non-numerical values.
In this paper we present a similarity measure between two categorical sequences that
is based on the comparison of the common items in the two sequences and the positions
where they appear.
First, in section 2, a review of other approaches to similarity measurement in time
series is introduced. Section 3 presents different features that must be taken into account
for working with sequences and then describes the type of sequences that we
have considered. In section 4 a new similarity function is defined. Section 5 shows a
case study where different similarity measures for sequences are compared. Finally,
section 6 gives the conclusions and outlines the future research lines.

2 Review of Dissimilarity Measures


A dissimilarity function d on two objects i and j must satisfy the following conditions:
1. Symmetry: d(i, j) = d( j, i)
2. Positivity: d(i, j) ≥ 0 for all i, j
If the conditions:
3. Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j) for all i, j, k; and
4. Reflexivity: d(i, j) = 0 iff i = j
also hold, then d is called a metric or distance function.
Moreover, d is a normalized distance function if 0 ≤ d(i, j) ≤ 1 for all objects i and j.
Dissimilarity functions can be classified, according to the type of value they can deal
with, into numerical, categorical or mixed functions. In this section, some of the classical
distance functions for static data are presented. After this, the existing distance
measures for sequential data are reviewed. The cases of numerical and categorical
information are presented separately.

2.1 Dissimilarity Functions for Numerical Variables

In this section, we present the most commonly used dissimilarity functions for numerical
variables. Let us take two objects i and j represented by the corresponding vectors of
values i = (x_{i1}, ..., x_{iK}) and j = (x_{j1}, ..., x_{jK}).

Euclidean Distance. It is the square root of the sum of the squared differences of the values.

    d_2(i, j) = \sqrt{\sum_{k=1}^{K} (x_{ik} - x_{jk})^2}    (1)

City-Block or Manhattan Distance. It is the sum of the absolute differences for all
the attributes of the two objects.

    d_1(i, j) = \sum_{k=1}^{K} |x_{ik} - x_{jk}|    (2)

Minkowski Distance. It is a generic distance defined as the q-th root of the sum of the
q-th powers of the absolute differences of the values of the two objects. Note
that the Euclidean distance and the Manhattan distance are particular cases for q = 2
and q = 1, respectively.

    d_q(i, j) = \left( \sum_{k=1}^{K} |x_{ik} - x_{jk}|^q \right)^{1/q}    (3)
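As a quick illustration of these static numerical distances, here is a minimal sketch (the helper names are ours, not the paper's) of the Minkowski distance of eq. (3), from which the Euclidean (q = 2) and Manhattan (q = 1) distances of eqs. (1) and (2) follow as particular cases:

```python
# Minimal sketch of the Minkowski family of distances (eq. 3).
def minkowski(i, j, q=2):
    """Minkowski distance between two equal-length numerical vectors."""
    assert len(i) == len(j), "static objects are compared attribute by attribute"
    return sum(abs(xi - xj) ** q for xi, xj in zip(i, j)) ** (1.0 / q)

euclidean = lambda i, j: minkowski(i, j, q=2)   # eq. (1)
manhattan = lambda i, j: minkowski(i, j, q=1)   # eq. (2)
```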

With respect to sequences of numerical values, the most common similarity measures
are the following ones (in [8] these and other approaches are presented):

Short Time Series Distance. It is the square root of the sum of the squared differences of the
slopes of the two time series being compared.

    d_{STS}(i, j) = \sqrt{ \sum_{k=1}^{K} \left( \frac{x_{j(k+1)} - x_{j(k)}}{t_{(k+1)} - t_{(k)}} - \frac{x_{i(k+1)} - x_{i(k)}}{t_{(k+1)} - t_{(k)}} \right)^2 }    (4)

where t_k is the time point for the data values x_{ik} and x_{jk}.
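A minimal sketch of eq. (4), assuming the two series are sampled at the same time points t (all names are illustrative):

```python
# Sketch of the Short Time Series distance (eq. 4): compare the slopes of the
# two series segment by segment and accumulate the squared differences.
import math

def sts_distance(x_i, x_j, t):
    slopes_i = [(x_i[k + 1] - x_i[k]) / (t[k + 1] - t[k]) for k in range(len(t) - 1)]
    slopes_j = [(x_j[k + 1] - x_j[k]) / (t[k + 1] - t[k]) for k in range(len(t) - 1)]
    return math.sqrt(sum((sj - si) ** 2 for si, sj in zip(slopes_i, slopes_j)))
```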

Dynamic Time Warping Distance. It consists in the alignment of two series Q =
(q_1, q_2, ..., q_n) and R = (r_1, r_2, ..., r_m) in order to minimize their difference. To this
end, an n · m matrix is built, where the (i, j) element of the matrix contains the distance
d(q_i, r_j) (generally the Euclidean distance). A warping path W = w_1, w_2, ..., w_K
is calculated, where max(m, n) ≤ K ≤ m + n − 1. Then, the minimum distance
between the two series is calculated as:

    d_{DTW}(i, j) = \min \left\{ \frac{\sum_{k=1}^{K} w_k}{K} \right\}    (5)
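The optimal warping path is usually obtained by dynamic programming over the n · m matrix. A minimal sketch for scalar series, assuming the classical recurrence with an absolute-difference local cost (names are ours); it returns the accumulated cost of the best path, and the division by the path length K of eq. (5) is left to the caller:

```python
# Dynamic-programming sketch of DTW for two scalar series Q and R.
def dtw(Q, R):
    n, m = len(Q), len(R)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]   # accumulated cost matrix
    D[0][0] = 0.0
    for a in range(1, n + 1):
        for b in range(1, m + 1):
            cost = abs(Q[a - 1] - R[b - 1])          # local distance d(q_a, r_b)
            D[a][b] = cost + min(D[a - 1][b],        # insertion
                                 D[a][b - 1],        # deletion
                                 D[a - 1][b - 1])    # match
    return D[n][m]
```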

2.2 Dissimilarity Functions for Categorical Variables


Now, the case of categorical variables for static and sequential data is outlined. For the
static case, the following two distances are well-known:

Chi-Squared χ² Distance. It is based on the number of objects in the dataset that
have the same value as object i for the k-th variable, denoted I_{ki}.

    \chi^2(i, j) = \sum_{k=1}^{K} d_k(i, j)    (6)

where d_k(i, j) is 0 when x_{ik} = x_{jk}, and \frac{1}{I_{ki}} + \frac{1}{I_{kj}} otherwise.

Hamming Distance. It is the number of positions where the two objects differ.
It is limited to cases where the objects have identical lengths.

    d_H(i, j) = \sum_{k=1}^{K} d_k(i, j)    (7)

where d_k is 0 when x_{ik} = x_{jk} and 1 when x_{ik} ≠ x_{jk}.
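A small sketch of these two static categorical measures, assuming the objects are equal-length tuples and, for the Chi-squared distance, that both objects belong to the given dataset (all names are illustrative):

```python
# Sketch of the Hamming distance (eq. 7) and the Chi-squared distance (eq. 6).
def hamming(i, j):
    # Number of positions where the two equal-length objects differ.
    return sum(1 for xi, xj in zip(i, j) if xi != xj)

def chi_squared(i, j, dataset):
    # I_ki = number of objects in `dataset` sharing object i's value for variable k;
    # i and j are assumed to be part of `dataset`, so the counts are never zero.
    d = 0.0
    for k, (xi, xj) in enumerate(zip(i, j)):
        if xi != xj:
            I_ki = sum(1 for obj in dataset if obj[k] == xi)
            I_kj = sum(1 for obj in dataset if obj[k] == xj)
            d += 1.0 / I_ki + 1.0 / I_kj
    return d
```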


In the case of sequences of categorical values, there are three approaches: Hamming
distance (an extension of eq. 7), String metrics and Alignment-based distances.
With respect to String metrics, we have:

Edit or Levenshtein Distance. It calculates the minimum number of edit operations needed
to transform S_1 into S_2, where an edit operation is an insertion, deletion or substitution
of a single character.

Damerau-Levenshtein Distance. A modification of the Levenshtein distance that adds the
transposition operation, which swaps two adjacent elements of a sequence.
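As an illustration, a standard dynamic-programming sketch of the Levenshtein distance (the Damerau variant would add a transposition case to the same recurrence; names are ours):

```python
# Sketch of the Edit (Levenshtein) distance between two sequences S1 and S2.
def levenshtein(S1, S2):
    n, m = len(S1), len(S2)
    prev = list(range(m + 1))          # distances from the empty prefix of S1
    for a in range(1, n + 1):
        curr = [a] + [0] * m
        for b in range(1, m + 1):
            sub = 0 if S1[a - 1] == S2[b - 1] else 1
            curr[b] = min(prev[b] + 1,        # deletion
                          curr[b - 1] + 1,    # insertion
                          prev[b - 1] + sub)  # substitution
        prev = curr
    return prev[m]
```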

Kullback-Leibler Divergence. It measures the difference between two probability distributions.

    d_{KL}(i, j) = \sum_{k=1}^{K} (P_i(x|X) - P_j(x|X)) \log \frac{P_i(x|X)}{P_j(x|X)}    (8)

where P_i denotes the conditional probability distribution for S_i.
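A hedged sketch of eq. (8), assuming that each sequence S_i is summarised by the relative frequency of each symbol (one simple way to obtain P_i; the paper does not fix how the distributions are estimated) and using a small smoothing constant to avoid division by zero:

```python
# Sketch of the symmetrised Kullback-Leibler divergence (eq. 8) between two
# categorical sequences, using smoothed symbol frequencies as distributions.
import math
from collections import Counter

def kl_divergence(S1, S2, eps=1e-9):
    language = set(S1) | set(S2)
    c1, c2 = Counter(S1), Counter(S2)
    d = 0.0
    for symbol in language:
        p = c1[symbol] / len(S1) + eps
        q = c2[symbol] / len(S2) + eps
        d += (p - q) * math.log(p / q)
    return d
```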
On the other hand, sequence alignment techniques come from studies of DNA, RNA or protein
sequences. The main characteristic of all these cases is that the elements of the sequences are
characters. In this case, methods are based on the Dynamic Time Warping Distance
(see eq. 5). A more detailed analysis can be found in [9,11,12].

3 Description of the Data Sequences to Compare


As it usually happens in many Artificial Intelligence techniques, the nature of the val-
ues in the data set determines the characteristics of the method that can be applied. The
usual main classification distinguishes: numerical values versus categorical values. Nu-
merical scores can be continuous, discrete or intervals, and can represent quantitative
measurements, ratios or ordinal scales. Categorical values represent qualitative features

with ordered or unordered scales [7]. But, in fact, there are other data representations
that can also be considered, such as textual, spatial, image or multimedia data.
As far as temporal series are concerned, other distinctions must be made as to whether
the data is uniformly or non-uniformly sampled, univariate or multivariate, and whether
the sequences are of equal or unequal length [8].
In this work we want to deal with sequences of events that represent the behaviour of
the user in a particular context. For example, tourists visiting a city, where the sequences
show the itinerary that each person has followed to visit the interesting locations in that
city. A private real data set of tourists' itineraries provided by Dr. Shoval has been tested.
This data set is about a city with 25 interesting places and contains about 40 itineraries with
lengths that range from 10 to 30 items.
Another data set we have considered is the list of sequences of visits to the Microsoft
web site. The data was obtained by sampling and processing the www.microsoft.com
logs. The data set records the use of 38000 anonymous, randomly selected users. For
each user, the data lists all the areas of the web site (Vroots) that he/she visited in a
one-week time frame. The number of Vroots is 294, and the mean number of Vroot visits
per user is 3. This data is publicly available at the UCI Machine Learning Repository [2].
These two examples of event sequences have the following common characteristics:

· Events are categorical values that belong to a finite set of linguistic labels (city
locations, web pages).
· The sequences have been uniformly sampled in the sense that time slopes are not
taken into account.
· The sequences are univariate, only one concept is studied.
· The lengths of the sequences of the individuals are not equal.
· Events can be repeated in the sequence (for example, a certain tourist visited the
same place, Main Street, more than one time during his holidays).

To facilitate the analysis of the results, the categorical values indicating places or web
pages have been substituted by simpler identifiers. An example of 14 different event
sequences is given in Table 1. Each character may represent a physical place or a web
page.

Table 1. Example of data sequences

Id Sequence Id Sequence
1 ab 8 cgdabc
2 bc 9 d
3 abc 10 da
4 cab 11 db
5 dabc 12 cbcbcb
6 edabc 13 bcbcbc
7 fedabc 14 ebebeb
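For convenience in the experiments of section 5, the sequences of Table 1 can be written down directly, for instance as a Python dictionary (a plain transcription of the table, not part of the original data sets):

```python
# The 14 example event sequences of Table 1, keyed by register id.
SEQUENCES = {
    1: "ab",      8: "cgdabc",
    2: "bc",      9: "d",
    3: "abc",    10: "da",
    4: "cab",    11: "db",
    5: "dabc",   12: "cbcbcb",
    6: "edabc",  13: "bcbcbc",
    7: "fedabc", 14: "ebebeb",
}
```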

4 A New Similarity Measure: Ordering-Based Sequence Similarity


The similarity measures for sequences of categorical values presented in section 2
are quite simple and are not adequate to deal with temporal event sequences like the
web logs or the tourists’ itineraries. To compare this type of sequences two issues are
important:

1. The number of common elements in the two sequences
2. The order between the common elements

The former allows us to measure if the two individuals have done the same things,
that is, if they have visited the same web pages or have gone to the same places. The
latter takes into account the temporal sequence of the events, that is, if two tourists have
visited the Main Street after going to the City Hall or not. This second measure should
also take into account if two events have taken place consecutively or not.
For example, let T1 and T2 be two tourists who have visited some places of the same
city: T1 = {a, b, c} and T2 = {c, a, b, d}. Notice that there are 3 common places and
that both have visited a before b. So, we could say that they are quite similar.
In this paper we present a new approach to calculate the similarity that takes into
account these two aspects. It is called Ordering-based Sequence Similarity (OSS) and
consists, on one hand, in finding the common elements in the two sequences and, on
the other hand, in comparing the positions of the elements in both sequences. The
'elements' that are the basis of this measure can be either single events or sub-sequences
of events that are considered as indivisible groups (i.e. patterns). In case of working
with patterns, they must have a minimum length of two and a maximum length equal to
the length of the shortest sequence.

Definition 1. Let i and j be two sequences of items of different lengths, i = (x_{i,1}, ..., x_{i,card(i)}) and j = (x_{j,1}, ..., x_{j,card(j)}). Let L = {l_1, ..., l_n} be a set of n symbols to represent all the possible elements of those sequences (L is called a language). Then, the Ordering-based Sequence Similarity (OSS) is defined as:

    d_{OSS}(i, j) = \frac{f(i, j) + g(i, j)}{card(i) + card(j)}    (9)

where

    g(i, j) = card(\{x_{ik} \mid x_{ik} \notin j\}) + card(\{x_{jk} \mid x_{jk} \notin i\})    (10)

and

    f(i, j) = \frac{\sum_{k=1}^{n} \left( \sum_{p=1}^{\Delta} |i^{(l_k)}(p) - j^{(l_k)}(p)| \right)}{\max\{card(i), card(j)\}}    (11)

where i^{(l_k)} = \{t \mid i(t) = l_k\} and \Delta = \min(card(i^{(l_k)}), card(j^{(l_k)})).

This function has two parts: g counts the number of non-common elements, and f
measures the similarity in the positions of the elements in the sequences (the ordering).
The function f is calculated in the symbol space L. So, first, each event in the sequence
i is projected into L, obtaining a numerical vector for each symbol: i^(l_1)...i^(l_n). Each of
these new vectors stores the positions of the corresponding symbol in the sequence i.

The same is done with sequence j, obtaining j^(l_1)...j^(l_n). Then the projections of the two
sequences i and j into L are compared, and the difference in the positions is calculated
and normalised by the maximum cardinality of the sequences i and j.
If two sequences are equal, the result of d_OSS is zero, because the positions are always
equal (f = 0) and there are no uncommon elements (g = 0). Conversely, if the two
sequences do not share any element, then g = card(i) + card(j) and f = 0, and d_OSS
is equal to 1 when it is divided by card(i) + card(j). The Ordering-based Sequence
Similarity function always gives values between 0 and 1.
The function has the following properties:
· Symmetry: dOSS (i, j) = dOSS ( j, i)
· Positivity: dOSS (i, j) ≥ 0 for all i, j
· Reflexivity: dOSS (i, j) = 0 iff i = j
However, it does not fulfil the triangle inequality: dOSS (i, j) ≤ dOSS (i, k) + dOSS (k, j) for
all i, j, k. From these properties, it is clear that dOSS is a dissimilarity but not a distance.
Proof. The proof of Symmetry, Positivity and Reflexivity is trivial from Definition 1. The
Triangle Inequality does not hold, as shown by the following counterexample.
Let A, B and C be three sequences defined by A = {b, c}, B = {d, a} and C =
{d, a, b, c}. In this case, d_OSS(A, B) = 1.0 because they do not share any item. However,
d_OSS(A, C) = 0.5 because they have two elements in common. B and C also have two
common elements (and they are also in the same positions), so d_OSS(B, C) = 0.33.
Consequently, d_OSS(A, C) + d_OSS(B, C) = 0.83, which is less than d_OSS(A, B) = 1.0;
this proves that the triangle inequality is not fulfilled. □

As it has been pointed out, this measure can be applied to different items: single events
or groups of events. Having a sequence {a, b, a, c, d}, in the first case i = (a, b, a, c, d),
so x_{ij} is any individual event in the sequence. In the second case i = (ab, ba, ac, cd), so
x_{ij} is any pair of consecutive items, and i = (aba, bac, acd) for triplets.
The following example illustrates how d_OSS is calculated for single events (d_OSS-1)
and for pairs of events (d_OSS-2). Let us take the two following sequences: A = {a, b, c, a}
and B = {c, a, d, b, c, a, c}, with card(A) = 4 and card(B) = 7.
The similarity considering single items gives d_OSS-1(A, B) = 0.36. This result is
obtained in the following way: the symbols a, b and c are common to both sequences
A and B. The projections on the symbol a are A^(a) = {0, 3} and B^(a) = {1, 5}, so
f_a(A, B) = |0 − 1| + |3 − 5| = 3. For symbol b: A^(b) = {1}, B^(b) = {3} and f_b(A, B) =
|1 − 3| = 2. For c: A^(c) = {2}, B^(c) = {0, 4, 6} and f_c(A, B) = |2 − 0| = 2. So,
f(A, B) = (f_a(A, B) + f_b(A, B) + f_c(A, B)) / 7 = 1. Counting the non-common elements
(the d in B plus the two occurrences of c in B that have no matching occurrence in A),
we have g(A, B) = 3. Finally, d_OSS-1(A, B) = (f(A, B) + g(A, B)) / (4 + 7) = (1 + 3) / 11 = 0.36.
Considering the same example with patterns of length 2, we obtain a greater
dissimilarity, d_OSS-2(A, B) = 0.629. In this case, the sequences are A = {ab, bc, ca}
and B = {ca, ad, db, bc, ca, ac}, with cardinalities 3 and 6. They share 2 elements: for
the bc pair we have A^(bc) = {1}, B^(bc) = {3} and f_bc(A, B) = |1 − 3| = 2, while for
the pair ca: A^(ca) = {2}, B^(ca) = {0, 4} and f_ca(A, B) = |2 − 0| = 2. So, f(A, B) =
(f_bc(A, B) + f_ca(A, B)) / 6 = 0.66. Counting the non-common elements (ab in A, plus
ad, db, ac and the second unmatched occurrence of ca in B), we have g(A, B) = 5.
Finally, d_OSS-2(A, B) = (f(A, B) + g(A, B)) / (3 + 6) = (0.66 + 5) / 9 = 0.629.
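To make the computation concrete, the following sketch implements d_OSS as we read Definition 1 together with the worked examples above; in particular, g is computed as the number of occurrences left unmatched after pairing up to Δ occurrences of each symbol, which is what reproduces g(A, B) = 3 and g(A, B) = 5 above. The function and parameter names are ours, not the authors'.

```python
# Sketch of the Ordering-based Sequence Similarity (Definition 1), for single
# events (pattern_length=1, i.e. OSS-1) or consecutive patterns (e.g. OSS-2).
def oss(seq_i, seq_j, pattern_length=1):
    # Rewrite each sequence as the list of its patterns of the chosen length.
    i = [tuple(seq_i[t:t + pattern_length]) for t in range(len(seq_i) - pattern_length + 1)]
    j = [tuple(seq_j[t:t + pattern_length]) for t in range(len(seq_j) - pattern_length + 1)]
    language = set(i) | set(j)

    f, matched = 0.0, 0
    for symbol in language:
        pos_i = [t for t, x in enumerate(i) if x == symbol]   # projection i^(l_k)
        pos_j = [t for t, x in enumerate(j) if x == symbol]   # projection j^(l_k)
        delta = min(len(pos_i), len(pos_j))
        f += sum(abs(pos_i[p] - pos_j[p]) for p in range(delta))
        matched += delta
    f /= max(len(i), len(j))

    # g: occurrences of i and j left unmatched after the Delta pairings above.
    g = (len(i) - matched) + (len(j) - matched)
    return (f + g) / (len(i) + len(j))
```

Under these assumptions, oss("abca", "cadbcac") returns ≈ 0.36 and oss("abca", "cadbcac", pattern_length=2) returns ≈ 0.63, in agreement with d_OSS-1 and d_OSS-2 above.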

5 Experiments
In this section we present the results obtained with a test set of 14 registers, corre-
sponding to the event sequences presented in Table 1. This dataset contains sequences
of different lengths. Register 9 is an extreme case, with a single event in the sequence.
Sequences in registers 12, 13 and 14 show the case of having repeated events.
We have tested the Ordering-based Sequence Similarity considering each event sep-
arately, OSS-1 (Table 2) and with patterns of two events, OSS-2 (Table 3).
Notice that the OSS applied to pairs of events gives higher dissimilarity values in
many cases (see the number of 1’s in Table 3). This is due to the fact that finding
common pairs of events is much more difficult than finding common single events.
For OSS-2, the sequences in the register id=6 {e, d, a, b, c} and id=7 { f , e, d, a, b, c}
are the most similar ones (0.19), because they share 4 common pairs in very similar
positions (ed, da, ab, bc) and 1 uncommon pair ( f e). The next ones are id=5 {d, a, b, c}
and id=6 that have 3 common pairs and 1 uncommon pair. And in third place we find
sequences id=12 {c, b, c, b, c, b} and id=13 {b, c, b, c, b, c}, that also share 4 common
pairs and 2 uncommon ones.
On the contrary, OSS-1 considers that the similarity between registers id=12 and id=13
(a dissimilarity of 0.08) is higher than the one between id=6 and id=7 (0.16). This is because in the first
case there are 6 common symbols (all of them) and in the second case there are only 5 common
symbols and 1 uncommon one. From our point of view, OSS-1 is able to better capture
the degrees of similarity between the sequences of events than OSS-2, since it is not as
important that common events happen consecutively as it is that they happen in similar positions.
After analysing the OSS results, the behaviour of the OSS function has been compared
to the Edit Distance, which is a measure that is quite popular for comparing
sequences [4]. The Edit Distance counts the number of changes that need to be applied to

Table 2. Results of the OSS applied to individual events (OSS-1)

Id.Reg. 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 0 0.62 0.2 0.33 0.41 0.54 0.62 0.62 1 0.62 0.5 0.75 0.77 0.75
2 0.62 0 0.33 0.4 0.5 0.59 0.66 0.60 1 1 0.62 0.54 0.5 0.77
3 0.2 0.33 0 0.22 0.25 0.4 0.5 0.48 1 0.66 0.6 0.59 0.59 0.77
4 0.33 0.4 0.22 0 0.25 0.4 0.5 0.40 1 0.6 0.66 0.57 0.61 0.79
5 0.41 0.5 0.25 0.25 0 0.19 0.33 0.35 0.6 0.33 0.37 0.66 0.66 0.81
6 0.54 0.59 0.4 0.4 0.19 0 0.16 0.37 0.7 0.48 0.51 0.72 0.72 0.66
7 0.62 0.66 0.5 0.5 0.33 0.16 0 0.40 0.76 0.58 0.60 0.77 0.77 0.72
8 0.62 0.60 0.48 0.40 0.35 0.37 0.40 0 0.76 0.58 0.60 0.58 0.59 0.87
9 1 1 1 1 0.6 0.7 0.76 0.76 0 0.33 0.33 1 1 1
10 0.62 1 0.66 0.6 0.3 0.48 0.58 0.58 0.33 0 0.5 1 1 1
11 0.5 0.62 0.6 0.66 0.37 0.51 0.60 0.60 0.33 0.5 0 0.75 0.77 0.75
12 0.75 0.54 0.59 0.57 0.66 0.72 0.77 0.58 1 1 0.75 0 0.08 0.5
13 0.77 0.5 0.59 0.61 0.66 0.72 0.77 0.59 1 1 0.77 0.08 0 0.54
14 0.75 0.77 0.77 0.79 0.81 0.66 0.72 0.87 1 1 0.75 0.5 0.54 0

Table 3. Results of the OSS applied to pairs of events (OSS-2)

Id.Reg. 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 0 1 0.33 0.5 0.58 0.7 0.76 0.76 - 1 1 1 1 1
2 1 0 0.5 1 0.66 0.75 0.8 0.8 - 1 1 0.7 0.66 1
3 0.33 0.5 0 0.62 0.33 0.5 0.59 0.59 - 1 1 0.71 0.74 1
4 0.5 1 0.62 0 0.6 0.70 0.77 0.77 - 1 1 1 1 1
5 0.58 0.66 0.33 0.6 0 0.25 0.4 0.4 - 0.5 1 0.77 0.8 1
6 0.7 0.75 0.5 0.70 0.25 0 0.19 0.39 - 0.65 1 0.82 0.84 1
7 0.76 0.8 0.59 0.77 0.4 0.19 0 0.4 - 0.73 1 0.86 0.88 1
8 0.76 0.8 0.59 0.77 0.4 0.39 0.4 0 - 0.73 1 0.86 0.88 1
9 - - - - - - - - - - - - - -
10 1 1 1 1 0.5 0.65 0.73 0.73 - 0 1 1 1 1
11 1 1 1 1 1 1 1 1 - 1 0 1 1 1
12 1 0.7 0.71 1 0.77 0.82 0.86 0.86 - 1 1 0 0.28 1
13 1 0.66 0.74 1 0.8 0.84 0.88 0.88 - 1 1 0.28 0 1
14 1 1 1 1 1 1 1 1 - 1 1 1 1 0

Table 4. Results according to the Edit Distance (ED)

Id.Reg. 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 0 1 0.33 0.33 0.5 0.6 0.66 0.66 1 1 0.5 0.83 0.83 0.83
2 1 0 0.33 1 0.5 0.6 0.66 0.66 1 1 1 0.66 0.66 0.83
3 0.33 0.33 0 0.66 0.25 0.4 0.5 0.5 1 1 0.66 0.66 0.66 0.83
4 0.33 1 0.66 0 0.5 0.6 0.66 0.5 1 0.66 0.66 0.66 0.66 0.83
5 0.5 0.5 0.25 0.5 0 0.2 0.33 0.33 0.75 0.5 0.5 0.66 0.66 0.83
6 0.6 0.6 0.4 0.6 0.2 0 0.16 0.33 0.8 0.6 0.6 0.66 0.66 0.66
7 0.66 0.66 0.5 0.66 0.33 0.16 0 0.33 0.83 0.66 0.66 0.83 0.66 0.83
8 0.66 0.66 0.5 0.5 0.33 0.33 0.33 0 0.83 0.66 0.66 0.66 0.66 1
9 1 1 1 1 0.75 0.8 0.83 0.83 0 0.5 0.5 1 1 1
10 1 1 1 0.66 0.5 0.6 0.66 0.66 0.5 0 0.5 1 1 1
11 0.5 1 0.66 0.66 0.5 0.6 0.66 0.66 0.5 0.5 0 0.83 0.83 0.83
12 0.83 0.66 0.66 0.66 0.66 0.66 0.83 0.66 1 1 0.83 0 0.33 0.5
13 0.83 0.66 0.66 0.66 0.66 0.66 0.66 0.66 1 1 0.83 0.33 0 0.66
14 0.83 0.83 0.83 0.83 0.83 0.66 0.83 1 1 1 0.83 0.5 0.66 0

one sequence to obtain another one. To scale the values into the unit interval, we have
divided the number of changes by the length of the longest sequence. Table 4 presents
the results of the normalised Edit Distance (ED) on the same data set.
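A minimal sketch of this normalisation, reusing the levenshtein sketch from section 2.2 (an illustration of ours, not the authors' code):

```python
# Normalised Edit Distance: number of edit operations divided by the length
# of the longest of the two sequences, so the result lies in [0, 1].
def normalised_edit_distance(S1, S2):
    return levenshtein(S1, S2) / max(len(S1), len(S2))
```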
The first significant difference in the results of Table 4 is the pair of registers that
achieves the minimum dissimilarity in each measure. ED considers that the most similar
ones are {e, d, a, b, c} (id=6) and {f, e, d, a, b, c} (id=7), because they are the longest
sequences with a single difference, the introduction of a new symbol. OSS-1 gives
the same similarity value to this pair, 0.16, but it finds another most similar pair of

Table 5. Comparison of the results (for each register: the register(s) with minimum dissimilarity, its value, and the register(s) with maximum dissimilarity, under each measure)

Id  Seq      OSS-1 Min (value) / Max          OSS-2 Min (value) / Max              ED Min (value) / Max
1   ab       3 (0.2) / 9                      3 (0.33) / 2,10,11,12,13,14          3,4 (0.33) / 2,9,10
2   bc       3 (0.33) / 9,10                  3 (0.5) / 1,4,10,11,14               3 (0.33) / 1,4,9,10,11
3   abc      1 (0.2) / 9                      1,5 (0.33) / 10,11,14                5 (0.25) / 9,10
4   cab      3 (0.22) / 9                     1 (0.5) / 2,10,11,12,13,14           1 (0.33) / 2,9
5   dabc     6 (0.19) / 14                    6 (0.25) / 11,14                     6 (0.2) / 14
6   edabc    7 (0.16) / 12,13                 7 (0.19) / 11,14                     7 (0.16) / 9
7   fedabc   6 (0.16) / 12,13                 6 (0.19) / 11,14                     6 (0.16) / 9,12,14
8   cgdabc   5 (0.35) / 14                    6 (0.39) / 11,14                     5,6,7 (0.33) / 14
9   d        10,11 (0.33) / 1,2,3,4,12,13,14  - (-) / -                            10,11 (0.5) / 1,2,3,4,12,13,14
10  da       5,9 (0.33) / 2,12,13,14          5 (0.5) / 1,2,3,4,11,12,13,14        5,9,11 (0.5) / 1,2,3,12,13,14
11  db       9 (0.33) / 13                    all (1.0) / all                      1,5,9,10 (0.5) / 2
12  cbcbcb   13 (0.08) / 9,10                 13 (0.28) / 1,4,10,11,14             13 (0.33) / 9,10
13  bcbcbc   12 (0.08) / 9,10                 12 (0.28) / 1,4,10,11,14             12 (0.33) / 9,10
14  ebebeb   12 (0.5) / 9,10                  all (1.0) / all                      12 (0.5) / 8,9,10

sequences: {c, b, c, b, c, b} (id=12) and {b, c, b, c, b, c} (id=13), which are the longest
sequences that share exactly the same symbols in very similar positions. In fact, they
have the same sequence {c, b, c, b, c} but adding b before or after it.
If we now consider the first row of the matrices, the one that compares the register
{a, b} (id=1) with the rest, it can be seen that the behaviour of the Edit Distance is
quite different from that of OSS-1. ED considers that sequence {a, b} (id=1) is
equally similar to {a, b, c} (id=3) and {c, a, b} (id=4). In contrast, OSS-1 considers that
the former is more similar to register id=1, because both individuals have started the
sequence with the same event a, followed by b; the only difference is that individual
id=1 has stopped while the other has continued one step more. However, sequence id=4
does not start with the same event. This difference can be captured if the relative
ordering of the events is considered.
This difficulty in distinguishing two sequences that have different items from two
sequences that have the same items but in a different order is the main drawback of the
Edit Distance [13]. An extreme case is the result given for sequences like {a, a, b, b}
and {b, b, a, a}. ED will give a dissimilarity of 4 changes (or 1 if it is normalised). In
this case OSS-1 clearly improves on ED, giving a dissimilarity value of 0.125, which is
closer to what common sense would indicate.
To make a deeper analysis of the behaviour of the similarity matrix for clustering
purposes, it is necessary to identify which is the closest sequence to a given one. Table 5
shows the identifier of the register(s) with minimum dissimilarity to each of the 14 case
studies, for the 3 measures, together with the corresponding dissimilarity value that links those
pairs of sequences. Finally, the sequence(s) with maximum dissimilarity are shown.
An analysis of the table suggests that the dissimilarity function OSS-1 is more precise
than ED and OSS-2 in determining the minimum and maximum values: it usually
identifies unique values.

6 Conclusions and Future Work


In this paper we have proposed a new measure of dissimilarity for sequences of categorical
values that considers two main criteria: which symbols are common and not common
to the two sequences, and what the difference is in the ordering of the common symbols in both
sequences. The rationale for establishing these criteria is that the sequences to be compared
contain an ordered list of events (e.g. an itinerary that indicates the places visited
by a tourist), and we are interested in capturing this sequentiality of the items.
The paper shows that, for event sequences, the Ordering-based Sequence Similarity
(OSS) gives better results than the Edit Distance. The results show that OSS is an appropriate
approach for taking into account both the number of common elements and their order. It can also be
easily seen that OSS behaves better than the Hamming distance, which compares
the sequences position by position and charges each mismatch 1 unit in the dissimilarity
value without taking into account whether the symbol appears in another nearby position. One of
the future analyses to be done is a comparison with alignment-based approaches.
We are now interested in using the Ordering-based Sequence Similarity for clustering.
Building clusters is interesting in many problems, as has been mentioned in the
introduction. It can be used to learn the underlying structure of a domain. In this sense,
clustering of event sequences can lead to identifying groups of individuals that behave in a
similar way [3]. Another use of clustering methods in which we are particularly interested
is the field of privacy preservation. Microaggregation is one of the standard tools for
numerical database protection commonly in use in National Statistical Offices. In the
last years, research on the protection of numerical time series has started [10], due to
the increasing amount of sequence data available, as argued in the introduction of this
paper. The OSS similarity is a first step towards the extension of microaggregation to
the case of categorical event sequences.

Acknowledgments
The authors want to specially thank the collaboration of Dr. N. Shoval. This work has
been supported by the Spanish research projects E-AEGIS (TSI-2007-65406-C03) and
Consolider-Ingenio 2010 ARES (CSD2007-00004).

References
1. Abul, O., Atzori, M., Bonchi, F., Giannotti, F.: Hiding sequences. In: ICDE Workshops, pp.
147–156. IEEE Computer Society, Los Alamitos (2007)
2. Asuncion, A., Newman, D.: UCI machine learning repository (2007),
http://archive.ics.uci.edu/ml/
3. Dietterich, T.G.: Machine learning for sequential data: A review. In: Caelli, T., Amin, A.,
Duin, R.P.W., Kamel, M.S., de Ridder, D. (eds.) SPR 2002 and SSPR 2002. LNCS, vol. 2396,
pp. 15–30. Springer, Heidelberg (2002)
4. Dong, G., Pei, J.: Sequence Data Mining. Advances in Database Systems, vol. 33. Springer,
US (2007)
5. Figueira, J., Greco, S., Ehrgott, M.: Multiple Criteria Decision Analysis: State of the Art
Surveys. ISOR & MS, vol. 78. Springer, Heidelberg (2005)
6. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edn. The Morgan Kaufmann
Series in Data Management Systems. Morgan Kaufmann Publishers, San Francisco (2006)
7. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Computing Surveys 31(3),
264–323 (1999)
8. Liao, T.W.: Clustering of time series data–a survey. Pattern Recognition 38(11), 1857–1874
(2005)
9. Mount, D.W.: Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory
Press (September 2004)
10. Nin, J., Torra, V.: Extending microaggregation procedures for time series protection. In:
Greco, S., Hata, Y., Hirano, S., Inuiguchi, M., Miyamoto, S., Nguyen, H.S., Slowinski, R.
(eds.) RSCTC 2006. LNCS (LNAI), vol. 4259, pp. 899–908. Springer, Heidelberg (2006)
11. Notredame, C.: Recent evolutions of multiple sequence alignment algorithms. PLoS Computational
Biology 3(8), e123+ (2007)
12. Wallace, I.M., Blackshields, G., Higgins, D.G.: Multiple sequence alignments. Current Opin-
ion in Structural Biology 15(3), 261–266 (2005)
13. Yang, J., Wang, W.: Cluseq: Efficient and effective sequence clustering. In: 19th International
Conference on Data Engineering (ICDE 2003), vol. 00, p. 101 (2003)
