A Similarity Measure for Sequences of Categorical Data
C. Gómez-Alonso and A. Valls
Abstract. Similarity measures are usually used to compare items and identify
pairs or groups of similar individuals. The appropriate similarity measure strongly
depends on the type of values to be compared. We face the problem in which the
information about each individual is a sequence of events (e.g. the sequence of web
pages visited by a certain user, or a personal daily schedule). Some measures
for numerical sequences exist, but very few methods consider sequences of categorical
data. In this paper, we present a new similarity measure for sequences of
categorical labels and compare it with previous approaches.
1 Introduction
In recent years there has been increasing interest in developing techniques to deal with
sequences of data, and temporal data mining algorithms have been developed for
this type of data [3,6]. Understanding sequence data is becoming very important, and
the treatment of such sequences is expected to enable novel classes of applications in
the coming years [1]. For example, telecommunication companies store spatio-temporal
data daily; these sequences contain detailed information about personal or vehicular
behaviour, which can be used to find interesting patterns for many different
applications, such as traffic control. Similarly, people surf the Internet, another
great potential source of sequences of users' actions (e.g. web pages visited). The study
of behaviour on the Internet can also lead to interesting applications, such as
intrusion detection. Other domains also produce temporal sequences [4]:
protein sequences that describe the amino acid composition of proteins and represent
their structure and function, gene information (DNA) that encodes the genetic
makeup, electronic health records that store the clinical history of patients, etc.
However, this type of data requires an adaptation of the classical data mining and
decision making techniques applied to static data. Data are called static if all their feature
values do not change with time, or change negligibly. In contrast, sequence data
analysis studies the changes in the values over time in order to identify interesting
temporal patterns.
In [8] three different approaches to deal with time series are presented: (1) to work
directly with raw data, (2) to convert a raw series data into a feature vector of lower
dimension and (3) to represent the sequence with a certain number of model parameters.
V. Torra and Y. Narukawa (Eds.): MDAI 2008, LNAI 5285, pp. 134–145, 2008.
© Springer-Verlag Berlin Heidelberg 2008
In this section, we present the most commonly used dissimilarity functions for numerical
variables. Let us take two objects i and j represented by the corresponding vectors of
values i = (x_{i1}, ..., x_{iK}) and j = (x_{j1}, ..., x_{jK}).
Euclidean Distance. It is the sum of the squares of the differences of the values.

d_2^2(i, j) = \sum_{k=1}^{K} (x_{ik} - x_{jk})^2 \qquad (1)
City-Block or Manhattan Distance. It is the sum of the absolute differences for all
the attributes of the two objects.
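As a minimal Python sketch of these two point-wise distances (the function names are ours), assuming equal-length value vectors:

```python
import math

def euclidean(i, j):
    """Euclidean (L2) distance: the square root of the sum in Eq. (1)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(i, j)))

def manhattan(i, j):
    """City-block (L1) distance: the sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(i, j))
```

For example, euclidean([1, 2, 3], [4, 6, 3]) gives 5.0 and manhattan([1, 2, 3], [4, 6, 3]) gives 7.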
With respect to sequences of numerical values, the most common similarity measures
are the following ones (in [8] these and other approaches are presented):
Short Time Series Distance. It is the sum of the squared differences of the slopes in
the two time series being compared.

d_{STS}^2(i, j) = \sum_{k=1}^{K} \left( \frac{x_{j(k+1)} - x_{j(k)}}{t_{(k+1)} - t_{(k)}} - \frac{x_{i(k+1)} - x_{i(k)}}{t_{(k+1)} - t_{(k)}} \right)^2 \qquad (4)
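A sketch of the squared STS distance (function name ours), for two series sampled at the same time points t:

```python
def d_sts_squared(xi, xj, t):
    """Squared Short Time Series distance: sum of squared slope differences
    over the K segments defined by the common sampling times t."""
    return sum(
        ((xj[k + 1] - xj[k]) / (t[k + 1] - t[k])
         - (xi[k + 1] - xi[k]) / (t[k + 1] - t[k])) ** 2
        for k in range(len(t) - 1)
    )
```

For instance, d_sts_squared([0, 1, 2], [0, 2, 4], [0, 1, 2]) gives 2.0, since the slopes differ by 1 on each of the two segments.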
Hamming Distance. It is the number of positions at which the two objects differ. It
can only be applied when the two sequences have identical lengths.
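As an illustrative sketch (function name ours):

```python
def hamming(i, j):
    """Number of positions at which two equal-length sequences differ."""
    if len(i) != len(j):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(a != b for a, b in zip(i, j))
```

For example, hamming("karolin", "kathrin") gives 3.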
with ordered or unordered scales [7]. But, in fact, there are other data representations
that can be considered, such as textual, spatial, image or multimedia data.
As far as temporal series are concerned, other distinctions must be made as to whether
the data is uniformly or non-uniformly sampled, univariate or multivariate, and whether
the sequences are of equal or unequal length [8].
In this work we deal with sequences of events that represent the behaviour of a user
in a particular context, for example tourists visiting a city, where the sequences
show the itinerary that each person has followed to visit the interesting locations in that
city. A private real data set of tourists' itineraries, provided by Dr. Shoval, has been used
for testing. This data set concerns a city with 25 interesting places and contains about 40
itineraries with lengths that range from 10 to 30 items.
Another data set we have considered is the list of sequences of visits to the Microsoft
web site. The data was obtained by sampling and processing the www.microsoft.com
logs. The data set records the use of the site by 38,000 anonymous, randomly selected
users. For each user, the data lists all the areas of the web site (Vroots) that he/she
visited in a one-week time frame. The number of Vroots is 294, and the mean number
of Vroot visits per user is 3. This data is publicly available at the UCI Machine Learning
Repository [2].
These two examples of event sequences have the following common characteristics:
· Events are categorical values that belong to a finite set of linguistic labels (city
locations, web pages).
· The sequences have been uniformly sampled in the sense that time slopes are not
taken into account.
· The sequences are univariate, only one concept is studied.
· The lengths of the sequences of the individuals are not equal.
· Events can be repeated in the sequence (for example, a certain tourist visited the
same place, Main Street, more than one time during his holidays).
To facilitate the analysis of the results, the categorical values indicating places or web
pages have been substituted by simpler identifiers. An example of 14 different event
sequences is given in Table 1. Each character may represent a physical place or a web
page.
Table 1. Example of event sequences

Id  Sequence    Id  Sequence
 1  ab           8  cgdabc
 2  bc           9  d
 3  abc         10  da
 4  cab         11  db
 5  dabc        12  cbcbcb
 6  edabc       13  bcbcbc
 7  fedabc      14  ebebeb
The former allows us to measure if the two individuals have done the same things,
that is, if they have visited the same web pages or have gone to the same places. The
latter takes into account the temporal sequence of the events, that is, if two tourists have
visited the Main Street after going to the City Hall or not. This second measure should
also take into account if two events have taken place consecutively or not.
For example, let T1 and T2 be tourists who have visited some places of the same
city: T1 = {a, b, c} and T2 = {c, a, b, d}. Notice that there are 3 common places and
that both have visited a before b. So, we could say that they are quite similar.
In this paper we present a new approach to calculate the similarity that takes into
account these two aspects. It is called Ordering-based Sequence Similarity (OSS) and
consists, on one hand, in finding the common elements of the two sequences, and on
the other hand, in comparing the positions of those elements in both sequences. The
'elements' that are the basis of this measure can be either single events or sub-sequences
of events that are considered as indivisible groups (i.e. patterns). When working
with patterns, they must have a minimum length of two and a maximum length equal to
the length of the shortest sequence.
d_{OSS}(i, j) = \frac{f(i, j) + g(i, j)}{card(i) + card(j)} \qquad (9)

where

g(i, j) = card(\{x_{ik} \mid x_{ik} \notin j\}) + card(\{x_{jk} \mid x_{jk} \notin i\}) \qquad (10)

and

f(i, j) = \frac{\sum_{k=1}^{n} \sum_{p=1}^{\Delta} \left| i^{(l_k)}(p) - j^{(l_k)}(p) \right|}{\max\{card(i), card(j)\}} \qquad (11)

where i^{(l_k)} = \{t \mid i(t) = l_k\} and \Delta = \min(card(i^{(l_k)}), card(j^{(l_k)})).
This function has two parts: g counts the number of non-common elements, and f
measures the similarity in the positions of the common elements in the sequences (the
ordering). The function f is calculated in the symbols space L. So, first, each event in
the sequence i is projected into L, obtaining a numerical vector for each symbol:
i^{(l_1)} .. i^{(l_n)}. Each of these new vectors stores the positions of the corresponding
symbol in the sequence i.
The same is done with sequence j, obtaining j(l1 ) .. j(ln ) . Then the projections of the two
sequences i and j into L are compared, and the difference in the positions is calculated
and normalised by the maximum cardinality of the sequences i and j.
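A minimal Python sketch of dOSS under these definitions (the name d_oss is ours; note that g is computed over occurrences rather than distinct symbols, which is what the worked example later in this section implies):

```python
from collections import Counter

def d_oss(i, j):
    """Ordering-based Sequence Similarity: a dissimilarity in [0, 1]."""
    # g (Eq. 10): occurrences of one sequence that are not matched in the
    # other, computed as a multiset (Counter) difference.
    ci, cj = Counter(i), Counter(j)
    g = sum((ci - cj).values()) + sum((cj - ci).values())
    # f (Eq. 11): positional differences of the common symbols; zip truncates
    # at delta = min(card(i_lk), card(j_lk)) automatically.
    total = 0
    for symbol in set(i) & set(j):
        pi = [p for p, e in enumerate(i) if e == symbol]
        pj = [p for p, e in enumerate(j) if e == symbol]
        total += sum(abs(a - b) for a, b in zip(pi, pj))
    f = total / max(len(i), len(j))
    return (f + g) / (len(i) + len(j))
```

With this sketch, d_oss(list("abca"), list("cadbcac")) evaluates to 4/11 ≈ 0.36, the value obtained in the worked example later in this section.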
If two sequences are equal, the result of dOSS is zero, because the positions are always
equal (f = 0) and there are no uncommon elements (g = 0). Conversely, if the two
sequences do not share any element, then g = card(i) + card(j) and f = 0, so dOSS
is equal to 1 after dividing by card(i) + card(j). The Ordering-based Sequence
Similarity function therefore always gives values between 0 and 1.
The function has the following properties:
· Symmetry: dOSS (i, j) = dOSS ( j, i)
· Positivity: dOSS (i, j) ≥ 0 for all i, j
· Reflexivity: dOSS (i, j) = 0 iff i = j
However, it does not fulfil the triangle inequality, dOSS(i, j) ≤ dOSS(i, k) + dOSS(k, j) for
all i, j, k. From these properties, it is clear that dOSS is a dissimilarity but not a distance.
Proof. The proof of Symmetry, Positivity and Reflexivity is trivial from Definition 1. The
Triangle Inequality does not hold, as the following counterexample proves.
Let A, B and C be three sequences defined by A = {b, c}, B = {d, a} and C =
{d, a, b, c}. In this case, dOSS(A, B) = 1.0 because they do not share any item. However,
dOSS(A, C) = 0.5 because they have two elements in common. B and C also have two
common elements (in exactly the same positions), so dOSS(B, C) = 0.33.
Consequently, dOSS(A, C) + dOSS(B, C) = 0.83, which is less than dOSS(A, B) = 1.0;
this proves that the triangle inequality is not fulfilled.
As has been pointed out, this measure can be applied to different items: single events
or groups of events. Given the sequence {a, b, a, c, d}, in the first case i = (a, b, a, c, d),
so x_ij is any individual event in the sequence. In the second case i = (ab, ba, ac, cd), so
x_ij is any pair of consecutive items, and i = (aba, bac, acd) for triplets.
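The rewriting of a sequence into consecutive patterns of length n can be sketched as (helper name ours):

```python
def to_patterns(seq, n):
    """Rewrite a sequence of events as its overlapping n-grams (patterns)."""
    return ["".join(seq[k:k + n]) for k in range(len(seq) - n + 1)]
```

For instance, to_patterns(list("abacd"), 2) gives ['ab', 'ba', 'ac', 'cd'] and to_patterns(list("abacd"), 3) gives ['aba', 'bac', 'acd'], matching the example above.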
The following example illustrates how dOSS is calculated for single events (dOSS-1)
and for pairs of events (dOSS-2). Let us take the two following sequences: A = {a, b, c, a},
B = {c, a, d, b, c, a, c}, with card(A) = 4 and card(B) = 7.
The similarity considering single items gives dOSS-1(A, B) = 0.36. This result is
obtained in the following way: symbols a, b and c are common to both sequences
A and B. The projections on the symbol a are A^(a) = {0, 3} and B^(a) = {1, 5}, so
f_a(A, B) = |0 − 1| + |3 − 5| = 3. For symbol b: A^(b) = {1}, B^(b) = {3} and f_b(A, B) =
|1 − 3| = 2. For c: A^(c) = {2}, B^(c) = {0, 4, 6} and f_c(A, B) = |2 − 0| = 2. So,
f(A, B) = (f_a(A, B) + f_b(A, B) + f_c(A, B))/7 = 1. Counting the non-common elements,
we have g(A, B) = 3. Finally, dOSS-1(A, B) = (f(A, B) + g(A, B))/(4 + 7) = (1 + 3)/11 = 0.36.
Considering the same example with patterns of length 2, we obtain a greater
dissimilarity, dOSS-2(A, B) = 0.629. In this case, the sequences are A = {ab, bc, ca}
and B = {ca, ad, db, bc, ca, ac}, with cardinalities 3 and 6. They share 2 elements: for
the pair bc we have A^(bc) = {1}, B^(bc) = {3} and f_bc(A, B) = |1 − 3| = 2, while for
the pair ca: A^(ca) = {2}, B^(ca) = {0, 4} and f_ca(A, B) = |2 − 0| = 2. So,
f(A, B) = (f_bc(A, B) + f_ca(A, B))/6 = 0.66. Counting the non-common elements,
we have g(A, B) = 5. Finally, dOSS-2(A, B) = (f(A, B) + g(A, B))/(3 + 6) = (0.66 + 5)/9 = 0.629.
5 Experiments
In this section we present the results obtained with a test set of 14 registers,
corresponding to the event sequences presented in Table 1. This data set contains sequences
of different lengths. Register 9 is an extreme case, with a single event in the sequence.
Sequences in registers 12, 13 and 14 show the case of having repeated events.
We have tested the Ordering-based Sequence Similarity considering each event
separately, OSS-1 (Table 2), and with patterns of two events, OSS-2 (Table 3).
Notice that the OSS applied to pairs of events gives higher dissimilarity values in
many cases (see the number of 1’s in Table 3). This is due to the fact that finding
common pairs of events is much more difficult than finding common single events.
For OSS-2, the sequences in the register id=6 {e, d, a, b, c} and id=7 { f , e, d, a, b, c}
are the most similar ones (0.19), because they share 4 common pairs in very similar
positions (ed, da, ab, bc) and 1 uncommon pair ( f e). The next ones are id=5 {d, a, b, c}
and id=6 that have 3 common pairs and 1 uncommon pair. And in third place we find
sequences id=12 {c, b, c, b, c, b} and id=13 {b, c, b, c, b, c}, that also share 4 common
pairs and 2 uncommon ones.
On the contrary, OSS-1 considers that the similarity between registers id=12 and id=13
is higher (0.08) than that between id=6 and id=7 (0.16). This is because in the first
case there are 6 common symbols (all of them), whereas in the second case there are only
5 common symbols and 1 uncommon one. From our point of view, OSS-1 captures
the degrees of similarity between sequences of events better than OSS-2, since it is
less important that events happen consecutively than that they occupy similar positions.
After analysing the OSS results, the behaviour of the OSS function has been compared
to the Edit Distance, a measure that is quite popular for comparing sequences [4].
The Edit Distance counts the number of changes needed to transform one sequence
into another. To scale the values into the unit interval, we have divided the number of
changes by the length of the longest sequence. Table 4 presents the results of the
normalised Edit Distance (ED) on the same data set.

Table 2. Dissimilarity matrix obtained with OSS-1

Id.Reg. 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 0 0.62 0.2 0.33 0.41 0.54 0.62 0.62 1 0.62 0.5 0.75 0.77 0.75
2 0.62 0 0.33 0.4 0.5 0.59 0.66 0.60 1 1 0.62 0.54 0.5 0.77
3 0.2 0.33 0 0.22 0.25 0.4 0.5 0.48 1 0.66 0.6 0.59 0.59 0.77
4 0.33 0.4 0.22 0 0.25 0.4 0.5 0.40 1 0.6 0.66 0.57 0.61 0.79
5 0.41 0.5 0.25 0.25 0 0.19 0.33 0.35 0.6 0.33 0.37 0.66 0.66 0.81
6 0.54 0.59 0.4 0.4 0.19 0 0.16 0.37 0.7 0.48 0.51 0.72 0.72 0.66
7 0.62 0.66 0.5 0.5 0.33 0.16 0 0.40 0.76 0.58 0.60 0.77 0.77 0.72
8 0.62 0.60 0.48 0.40 0.35 0.37 0.40 0 0.76 0.58 0.60 0.58 0.59 0.87
9 1 1 1 1 0.6 0.7 0.76 0.76 0 0.33 0.33 1 1 1
10 0.62 1 0.66 0.6 0.33 0.48 0.58 0.58 0.33 0 0.5 1 1 1
11 0.5 0.62 0.6 0.66 0.37 0.51 0.60 0.60 0.33 0.5 0 0.75 0.77 0.75
12 0.75 0.54 0.59 0.57 0.66 0.72 0.77 0.58 1 1 0.75 0 0.08 0.5
13 0.77 0.5 0.59 0.61 0.66 0.72 0.77 0.59 1 1 0.77 0.08 0 0.54
14 0.75 0.77 0.77 0.79 0.81 0.66 0.72 0.87 1 1 0.75 0.5 0.54 0

Table 3. Dissimilarity matrix obtained with OSS-2 (register 9 contains a single event, so it has no pairs and is marked '-')

Id.Reg. 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 0 1 0.33 0.5 0.58 0.7 0.76 0.76 - 1 1 1 1 1
2 1 0 0.5 1 0.66 0.75 0.8 0.8 - 1 1 0.7 0.66 1
3 0.33 0.5 0 0.62 0.33 0.5 0.59 0.59 - 1 1 0.71 0.74 1
4 0.5 1 0.62 0 0.6 0.70 0.77 0.77 - 1 1 1 1 1
5 0.58 0.66 0.33 0.6 0 0.25 0.4 0.4 - 0.5 1 0.77 0.8 1
6 0.7 0.75 0.5 0.70 0.25 0 0.19 0.39 - 0.65 1 0.82 0.84 1
7 0.76 0.8 0.59 0.77 0.4 0.19 0 0.4 - 0.73 1 0.86 0.88 1
8 0.76 0.8 0.59 0.77 0.4 0.39 0.4 0 - 0.73 1 0.86 0.88 1
9 - - - - - - - - - - - - - -
10 1 1 1 1 0.5 0.65 0.73 0.73 - 0 1 1 1 1
11 1 1 1 1 1 1 1 1 - 1 0 1 1 1
12 1 0.7 0.71 1 0.77 0.82 0.86 0.86 - 1 1 0 0.28 1
13 1 0.66 0.74 1 0.8 0.84 0.88 0.88 - 1 1 0.28 0 1
14 1 1 1 1 1 1 1 1 - 1 1 1 1 0

Table 4. Dissimilarity matrix obtained with the normalised Edit Distance (ED)

Id.Reg. 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 0 1 0.33 0.33 0.5 0.6 0.66 0.66 1 1 0.5 0.83 0.83 0.83
2 1 0 0.33 1 0.5 0.6 0.66 0.66 1 1 1 0.66 0.66 0.83
3 0.33 0.33 0 0.66 0.25 0.4 0.5 0.5 1 1 0.66 0.66 0.66 0.83
4 0.33 1 0.66 0 0.5 0.6 0.66 0.5 1 0.66 0.66 0.66 0.66 0.83
5 0.5 0.5 0.25 0.5 0 0.2 0.33 0.33 0.75 0.5 0.5 0.66 0.66 0.83
6 0.6 0.6 0.4 0.6 0.2 0 0.16 0.33 0.8 0.6 0.6 0.66 0.66 0.66
7 0.66 0.66 0.5 0.66 0.33 0.16 0 0.33 0.83 0.66 0.66 0.83 0.66 0.83
8 0.66 0.66 0.5 0.5 0.33 0.33 0.33 0 0.83 0.66 0.66 0.66 0.66 1
9 1 1 1 1 0.75 0.8 0.83 0.83 0 0.5 0.5 1 1 1
10 1 1 1 0.66 0.5 0.6 0.66 0.66 0.5 0 0.5 1 1 1
11 0.5 1 0.66 0.66 0.5 0.6 0.66 0.66 0.5 0.5 0 0.83 0.83 0.83
12 0.83 0.66 0.66 0.66 0.66 0.66 0.83 0.66 1 1 0.83 0 0.33 0.5
13 0.83 0.66 0.66 0.66 0.66 0.66 0.66 0.66 1 1 0.83 0.33 0 0.66
14 0.83 0.83 0.83 0.83 0.83 0.66 0.83 1 1 1 0.83 0.5 0.66 0
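A sketch of the normalised Edit Distance used for this comparison (function names ours; the classic Levenshtein recurrence with unit costs):

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and
    substitutions needed to transform sequence a into sequence b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

def normalised_ed(a, b):
    """Edit distance scaled into [0, 1] by the length of the longest sequence."""
    return edit_distance(a, b) / max(len(a), len(b))
```

For instance, edit_distance("aabb", "bbaa") gives 4, i.e. 1.0 once normalised, which is the extreme case discussed below.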
The first significant difference between the measures is the pair of registers that
achieves the minimum dissimilarity in each case. ED considers that the most similar
ones are {e, d, a, b, c} (id=6) and {f, e, d, a, b, c} (id=7), because they are the longest
sequences with a single difference, the introduction of a new symbol. OSS-1 gives
the same dissimilarity value to this pair, 0.16, but it finds another, even more similar,
pair of sequences: {c, b, c, b, c, b} (id=12) and {b, c, b, c, b, c} (id=13), which are the
longest sequences that share exactly the same symbols in very similar positions. In fact,
they both contain the sequence {c, b, c, b, c}, with b added before or after it.

Table 5. Registers with minimum and maximum dissimilarity for each sequence, under OSS-1, OSS-2 and ED

                 OSS-1                     OSS-2                               ED
Id Seq      MIN Id  Value  MAX Id     MIN Id  Value  MAX Id               MIN Id   Value  MAX Id
1  ab       3       0.2    9          3       0.33   2,10,11,12,13,14     3,4      0.33   2,9,10
2  bc       3       0.33   9,10       3       0.5    1,4,10,11,14         3        0.33   1,4,9,10,11
3  abc      1       0.2    9          1,5     0.33   10,11,14             5        0.25   9,10
4  cab      3       0.22   9          1       0.5    2,10,11,12,13,14     1        0.33   2,9
5  dabc     6       0.19   14         6       0.25   11,14                6        0.2    14
6  edabc    7       0.16   12,13      7       0.19   11,14                7        0.16   9
7  fedabc   6       0.16   12,13      6       0.19   11,14                6        0.16   9,12,14
8  cgdabc   5       0.35   14         6       0.39   11,14                5,6,7    0.33   14
9  d        10,11   0.33   1,2,3,4,   -       -      -                    10,11    0.5    1,2,3,4,
                           12,13,14                                                       12,13,14
10 da       5,9     0.33   2,12,13,14 5       0.5    1,2,3,4,11,12,13,14  5,9,11   0.5    1,2,3,12,13,14
11 db       9       0.33   13         all     1.0    all                  1,5,9,10 0.5    2
12 cbcbcb   13      0.08   9,10       13      0.28   1,4,10,11,14         13       0.33   9,10
13 bcbcbc   12      0.08   9,10       12      0.28   1,4,10,11,14         12       0.33   9,10
14 ebebeb   12      0.5    9,10       all     1.0    all                  12       0.5    8,9,10
If we now consider the first row of the matrices, the one that compares the register
{a, b} (id=1) with the rest, it can be seen that the behaviour of the Edit Distance is
quite different from that of OSS-1. ED considers that sequence {a, b} (id=1) is
equally similar to {a, b, c} (id=3) and {c, a, b} (id=4). In contrast, OSS-1 considers
the former more similar to register id=1, because both individuals have started the
sequence with the same event a, followed by b; the difference is that individual
id=1 has stopped while the other has continued one step more. Sequence id=4, however,
has not started with the same event. This difference can only be captured if the relative
ordering of the events is considered.
The difficulty of distinguishing two sequences that contain different items from two
sequences that contain the same items in a different order is the main drawback of the
Edit Distance [13]. An extreme case is the result given for sequences like {a, a, b, b}
and {b, b, a, a}: ED gives a dissimilarity of 4 changes (or 1 if it is normalised). In
this case OSS-1 clearly improves on ED, giving a dissimilarity value of 0.125, which is
closer to what common sense would indicate.
To make a deeper analysis of the behaviour of the similarity matrix for clustering
purposes, we need to identify the closest sequence to each given one. Table 5
shows the identifier of the register(s) with minimum dissimilarity to each of the 14 case
studies, for the 3 measures, together with the corresponding dissimilarity value that
links those pairs of sequences. Finally, the sequence with maximum dissimilarity is
also shown.
An analysis of the table suggests that the dissimilarity function OSS-1 is more precise
than ED and OSS-2 at determining the minimum and maximum values: it usually
identifies unique values.
Acknowledgments
The authors wish to thank Dr. N. Shoval for his collaboration. This work has
been supported by the Spanish research projects E-AEGIS (TSI-2007-65406-C03) and
Consolider-Ingenio 2010 ARES (CSD2007-00004).
References
1. Abul, O., Atzori, M., Bonchi, F., Giannotti, F.: Hiding sequences. In: ICDE Workshops, pp.
147–156. IEEE Computer Society, Los Alamitos (2007)
2. Asuncion, A., Newman, D.: UCI machine learning repository (2007),
http://archive.ics.uci.edu/ml/
3. Dietterich, T.G.: Machine learning for sequential data: A review. In: Caelli, T., Amin, A.,
Duin, R.P.W., Kamel, M.S., de Ridder, D. (eds.) SPR 2002 and SSPR 2002. LNCS, vol. 2396,
pp. 15–30. Springer, Heidelberg (2002)
4. Dong, G., Pei, J.: Sequence Data Mining. Advances in Database Systems, vol. 33. Springer,
US (2007)
5. Figueira, J., Greco, S., Ehrgott, M.: Multiple Criteria Decision Analysis: State of the Art
Surveys. ISOR & MS, vol. 78. Springer, Heidelberg (2005)
6. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edn. The Morgan Kaufmann
Series in Data Management Systems. Morgan Kaufmann Publishers, San Francisco (2006)
7. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Computing
Surveys 31(3), 264–323 (1999)
8. Liao, T.W.: Clustering of time series data–a survey. Pattern Recognition 38(11), 1857–1874
(2005)
9. Mount, D.W.: Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor
Laboratory Press (September 2004)
10. Nin, J., Torra, V.: Extending microaggregation procedures for time series protection. In:
Greco, S., Hata, Y., Hirano, S., Inuiguchi, M., Miyamoto, S., Nguyen, H.S., Slowinski, R.
(eds.) RSCTC 2006. LNCS (LNAI), vol. 4259, pp. 899–908. Springer, Heidelberg (2006)
11. Notredame, C.: Recent evolutions of multiple sequence alignment algorithms. PLoS
Computational Biology 3(8), e123+ (2007)
12. Wallace, I.M., Blackshields, G., Higgins, D.G.: Multiple sequence alignments. Current
Opinion in Structural Biology 15(3), 261–266 (2005)
13. Yang, J., Wang, W.: Cluseq: Efficient and effective sequence clustering. In: 19th International
Conference on Data Engineering (ICDE 2003), vol. 00, p. 101 (2003)