
A Formal Algebra for OLAP¹

Bart Kuijpers² and Alejandro Vaisman³


arXiv:1609.05020v1 [cs.DB] 16 Sep 2016

Abstract
Online Analytical Processing (OLAP) comprises tools and algorithms that allow
querying multidimensional databases. It is based on the multidimensional model,
in which data can be seen as a cube whose cells each contain one or more measures
that can be aggregated along dimensions. Despite the extensive corpus of work in the field,
a standard language for OLAP is still needed, since there is no well-defined, accepted
semantics for many of the usual OLAP operations. In this paper, we address this
problem and present a set of operations for manipulating a data cube. We clearly
define the semantics of these operations and prove that they can be composed, yielding
a language powerful enough to express complex OLAP queries. We express these
operations as sequences of atomic transformations over a fixed multidimensional matrix,
whose cells contain a sequence of measures. Each atomic transformation produces a
new measure. When a sequence of transformations defines an OLAP operation, a flag
is produced indicating which cells must be considered as input for the next operation.
In this way, an elegant algebra is defined. Our main contribution, with respect to other
similar efforts in the field, is that, for the first time, a formal proof of the correctness
of the operations is given, thus providing a clear semantics for them. We believe the
present work will serve as a basis to build more solid practical tools for data analysis.

Keywords: OLAP; Data Warehousing; Algebra; Data Cube; Dimension Hierarchy



1 Introduction
Online Analytical Processing (OLAP) [5] comprises a set of tools and algorithms that allow
efficiently querying multidimensional (MD) databases containing large amounts of data,
usually called Data Warehouses (DW). Conceptually, in the MD model, data can be seen
as a cube, where each cell contains one or more measures of interest, that quantify facts.
Measure values can be aggregated along dimensions, which give context to facts. At the
logical level, OLAP data are typically organized as a set of dimension and fact tables.
Current database technology allows alphanumerical warehouse data to be integrated, for
example, with geographical or social network data for decision making. In the era of
so-called “Big Data”, the kinds of data that can be handled by data management tools
are likely to increase in the near future. Moreover, OLAP and Business Intelligence (BI)
tools allow users to capture, integrate, manage, and query different kinds of information, for
example, alphanumerical data coming from a local DW, spatial data (e.g., temperature)
represented as rasterized images, and/or economic data published on the semantic web.

¹ Extended abstract. Full version to appear in Intelligent Data Analysis, 21(5), 2017.
² Databases and Theoretical Computer Science Research Group, Hasselt University and Transnational University of Limburg; email: bart.kuijpers@uhasselt.be
³ Instituto Tecnológico de Buenos Aires, Buenos Aires, Argentina; email: avaisman@itba.edu.ar

Ideally, a BI user would just like to deal with what she knows well, namely the data cube,
using only the classical OLAP operators, like Roll-up, Drill-down, Slice, and Dice (among
other ones), regardless of the cube’s underlying data type. Data types should only be handled
at the logical and physical levels, not at the conceptual level. Building on this idea, Ciferri
et al. [2] proposed a conceptual, user-oriented model, independent of OLAP technologies.
In this model, the user only manipulates a data cube. Associated with the model, there is a
query language providing high-level operations over the cube. This language, called Cube
Algebra, was sketched informally in the mentioned work. Extensive examples of the use
of Cube Algebra, presented in [7], suggest that this idea can lead to a language much more
intuitive and simple than MDX, the de facto standard for OLAP. Nevertheless, these works
do not give any evidence of the correctness of the languages and operations proposed, other
than examples at various degrees of comprehensiveness. In fact, surprisingly, and in spite
of the large corpus of work in the field, a formally defined reference language for OLAP
is still needed [6]. There is not even a well-defined, accepted semantics for many of the
usual OLAP operations. We believe that, far from being just a problem of classical OLAP,
this formalization is also needed in current “Big Data” scenarios, where there is a need to
efficiently perform real-time OLAP operations [3] that, of course, must be well defined.

Contributions In this paper we (a) introduce a collection of operators that manipulate
a data cube, and clearly define their semantics; and (b) prove, formally, that our operators
can be composed, yielding a language powerful enough to express complex queries and
cube navigation (“à la OLAP”) paths.
We achieve this by representing the data cube as a fixed d-dimensional matrix
with a set of k measures, and by expressing each OLAP operation as a sequence of atomic
transformations. Each transformation produces a new measure and, additionally, when a
sequence forms an OLAP operation, a flag that indicates which cells must be
considered as input for the next operation. This formalism allows us to elegantly define
an algebra as a collection of operations, and to give a series of properties that show their
correctness. We limit ourselves to the most
usual operations, namely slice, dice, roll-up and drill-down, which constitute the core of
all practical OLAP tools. We call these the classical OLAP operations. This allows us
to focus on our main interest, which is to prove the feasibility of the approach. Other,
less usual operations are left for future work.
The main contribution of our work, with respect to other similar efforts in the field, is
that, for the first time, a formal proof of correctness is given for practical OLAP operations, so that the present work
can serve as a basis to build more solid tools for data analysis. Existing work lacks either
formalism or applicability, and none of it gives a sound mathematical
proof of its claims. In this extended abstract we present the main properties, and leave
the proofs for the full paper.
The remainder of the paper is organized as follows. In Section 2, we present our
MD data model, on which we base the rest of our work. Section 3 presents the atomic
transformations that we use to build the OLAP operations. In Section 4 we discuss the
classical OLAP operations in terms of the transformations, and show how they can be composed
to address complex queries. We conclude in Section 5.

2 The OLAP Data Model
In this section we describe the OLAP data model we use in the sequel.

2.1 Multidimensional Matrix


We next give the definitions of multidimensional matrix schema and instance. In the sequel,
d, with d ≥ 1, is a natural number representing the number of dimensions of a data cube.
Definition 1 (Matrix Schema). A d-dimensional matrix schema is a sequence (D1, D2, ...,
Dd) of d dimension names. □

Dimension names can be considered to be strings. As illustrated in the following
example, the convention will be that dimension names start with a capital letter.
Example 1. The running example we use in this paper deals with sales information of
certain products, at certain locations, at certain moments in time. For this purpose, we
will define a 3-dimensional matrix schema (D1, D2, D3) = (Product, Location, Time). □

Definition 2 (Matrix Instance). A d-dimensional matrix instance (matrix, for short) over
the d-dimensional matrix schema (D1, D2, ..., Dd) is the product dom(D1) × dom(D2) ×
· · · × dom(Dd), where, for i = 1, 2, ..., d, dom(Di) is a non-empty, finite, ordered set, called
the domain, that is associated with the dimension name Di. For all i = 1, 2, ..., d, we
denote by < the order that we assume on the elements of dom(Di). For a1 ∈ dom(D1),
a2 ∈ dom(D2), ..., ad ∈ dom(Dd), we call the tuple (a1, a2, ..., ad) a cell of the matrix. □

The cells of a matrix serve as placeholders for the measures that are contained in
the data cube (see Definition 7 below). Note that, as is common practice in OLAP, we
assume an order < on the domain. The role of the order is further discussed in Section 2.4.
As a notational convention, elements of the domains dom(Di) start with a lower case
letter, as shown in the following example.
Example 2. For the 3-dimensional matrix schema (D1, D2, D3) = (Product, Location,
Time) of Example 1, the non-empty sets dom(D1) = {lego, brio, apples, oranges},
dom(D2) = {antwerp, brussels, paris, marseille}, and dom(D3) = {1/1/2014, ...,
31/1/2014} produce the matrix instance dom(D1) × dom(D2) × dom(D3). The cells of the
matrix will contain the sales for each combination of values in the domains. In dom(D2), we
have, for instance, the order antwerp < brussels < paris < marseille. Over the dimension
Time, we have the usual temporal order. □
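
As an informal illustration (not part of the formal development), the matrix instance of Example 2 can be sketched in Python as a Cartesian product of ordered lists. The variable names and the helper `precedes` are assumptions made for this sketch; list position is used to encode the order <.

```python
from itertools import product

# Ordered domains from Example 2 (list position encodes the order <).
dom_product = ["lego", "brio", "apples", "oranges"]
dom_location = ["antwerp", "brussels", "paris", "marseille"]
dom_time = [f"{day}/1/2014" for day in range(1, 32)]

# The matrix instance is the Cartesian product of the domains;
# each tuple (a1, a2, a3) is one cell of the matrix.
cells = list(product(dom_product, dom_location, dom_time))

def precedes(domain, a, b):
    """Return True if a < b in the order assumed on the given domain."""
    return domain.index(a) < domain.index(b)
```

With these domains the matrix has 4 · 4 · 31 cells, and `precedes(dom_location, "antwerp", "paris")` holds.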

2.2 Level Instance, Hierarchy Instance and Dimension Graph


We now define the notions of dimension schema and instance.
Definition 3 (Dimension Schema, Hierarchy and Level). Let D be a name for a dimension.
A dimension schema σ(D) for D is a lattice, with a unique top-node, called All (which has
only incoming edges) and a unique bottom-node, called Bottom (which has only outgoing
edges), such that all maximal-length paths in the graph go from Bottom to All. Any path
from Bottom to All in a dimension schema σ(D) is called a hierarchy of σ(D). Each node
in a hierarchy (i.e., in a dimension schema) is called a level (of σ(D)). □

Figure 1: Dimension schemas for the dimensions Location, in (a), and Time, in (b).

As a convention, level names start with a capital letter. Note that the Bottom node is
often renamed, depending on the application.
Example 3. Fig. 1 gives examples of dimension schemas σ(Location) and σ(Time) for
the dimensions Location and Time in Example 1. For the dimension Location, we have
Bottom = City, and there is only one hierarchy, denoted City → Region → Country →
All. The node Region is an example of a level in this hierarchy. For the dimension Time,
we have Bottom = Day, and two hierarchies, namely Day → Month → Semester →
Year → All and Day → Week → All. □

Definition 4 (Level Instance, Hierarchy Instance, Dimension Graph). Let D be a dimension
with schema σ(D), and let ℓ be a level of σ(D). A level instance of ℓ is a non-empty,
finite set dom(D.ℓ). If ℓ = All, then dom(D.All) is the singleton {all}. If ℓ = Bottom, then
dom(D.Bottom) is the domain of the dimension D, that is, dom(D) (as in Definition 2).
A dimension graph (or instance) I(σ(D)) over the dimension schema σ(D) is a directed
acyclic graph with node set $\bigcup_{\ell} \mathrm{dom}(D.\ell)$, where the union is taken over all levels ℓ in σ(D).
The edge set of this directed acyclic graph is defined as follows. Let ℓ and ℓ′ be two levels
of σ(D), and let a ∈ dom(D.ℓ) and a′ ∈ dom(D.ℓ′). Then, there can be a directed edge in
I(σ(D)) from a to a′ only if there is a directed edge from ℓ to ℓ′ in σ(D).
If H is a hierarchy in σ(D), then the hierarchy instance (relative to the dimension
instance I(σ(D))) is the subgraph of I(σ(D)) with nodes from dom(D.ℓ), for ℓ appearing
in H. This subgraph is denoted IH(σ(D)). □

As a notational convention, the names of objects in a set dom(D.ℓ) start with a lower
case character. We remark that a hierarchy instance IH (σ(D)) is always a (directed) tree.
Also, if a and b are two nodes in a hierarchy instance IH (σ(D)), such that (a, b) is in the
transitive closure of the edge relation of IH (σ(D)), we will say that a rolls-up to b and we
denote this by ρH (a, b) (or ρ(a, b) if H is clear from the context).

Figure 2: An example of a dimension graph (or instance) I(σ(Location)).

Example 4. Consider the Location dimension, whose schema σ(Location) is given in
Fig. 1 (a). From Example 2, we have dom(Location) = {antwerp, brussels, paris,
marseille}, which is dom(Location.Bottom), or dom(Location.City).
An example of a dimension instance I(σ(Location)) is depicted in Fig. 2. This example
expresses, for instance, that the city brussels is located in the region capital, which is part
of the country belgium, meaning that brussels rolls-up to capital and to belgium, that is,
ρ(brussels, capital) and ρ(brussels, belgium). □
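
The rolls-up relation ρ can be sketched in Python as reachability in the dimension graph. This is only an illustration: the edge list below is reconstructed from the text of Examples 4 and 5 (Fig. 2 itself is not reproduced here), and the names `edges` and `rolls_up` are assumptions of this sketch.

```python
# Edges of I(sigma(Location)), stored as child -> list of parents.
# Reconstructed from Examples 4 and 5; an assumption of this sketch.
edges = {
    "antwerp": ["flanders"], "brussels": ["capital"],
    "paris": ["north"], "marseille": ["south"],
    "flanders": ["belgium"], "capital": ["belgium"],
    "north": ["france"], "south": ["france"],
    "belgium": ["all"], "france": ["all"],
    "all": [],
}

def rolls_up(a, b):
    """rho(a, b): True if (a, b) is in the transitive closure of the edges."""
    stack, seen = list(edges[a]), set()
    while stack:
        node = stack.pop()
        if node == b:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(edges[node])
    return False
```

Because ρ is defined through the transitive closure, this sketch is not reflexive: `rolls_up("brussels", "brussels")` is false, while `rolls_up("brussels", "belgium")` is true.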

In a dimension graph, we must guarantee that rolling-up through different paths gives
the same results. This is formalized by the concept of “sound” dimension graph.
Definition 5 (Sound Dimension Graph). Let I(σ(D)) be a dimension graph (as in
Definition 4). We call this dimension graph sound if, for any level ℓ in σ(D), any two
hierarchies H1 and H2 that reach ℓ from the Bottom level, any a ∈ dom(D), and any
b1, b2 ∈ dom(D.ℓ), we have that ρH1(a, b1) and ρH2(a, b2) imply that b1 = b2. □

In this paper, we assume that dimension graphs are always sound.
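
One possible way to test soundness is sketched below in Python, under the assumption that a dimension graph is given as a child-to-parents dictionary together with a map from objects to their levels (both representations are assumptions of this sketch, not the paper's). The check verifies that each Bottom-level element reaches at most one object per level, whichever path is followed.

```python
def is_sound(bottom, parents, level_of):
    """Check Definition 5: every Bottom-level element reaches at most one
    object per level, regardless of the path (hierarchy) that is followed."""
    for a in bottom:
        reached = {}  # level -> set of objects of that level reachable from a
        stack = list(parents[a])
        while stack:
            node = stack.pop()
            reached.setdefault(level_of[node], set()).add(node)
            stack.extend(parents[node])
        if any(len(objects) > 1 for objects in reached.values()):
            return False
    return True

# Hypothetical, deliberately unsound graph: city x reaches two countries
# through two different paths (Region and Province are illustrative levels).
parents = {"x": ["r", "p"], "r": ["c1"], "p": ["c2"],
           "c1": ["all"], "c2": ["all"], "all": []}
level_of = {"r": "Region", "p": "Province",
            "c1": "Country", "c2": "Country", "all": "All"}
unsound = not is_sound(["x"], parents, level_of)
```

In the hypothetical graph, x rolls up to both c1 and c2 at the Country level, so the check fails; redirecting p to c1 would make it pass.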

2.3 Multidimensional Data Cube


Essentially, a data cube is a matrix in which the cells are filled with measures that are taken
from some value domain Γ. For many applications, Γ will be the set of real or rational
numbers, although some other ones may include, e.g., spatial regions or geometric objects.
Definition 6 (Data Cube Schema). A d-dimensional data cube schema consists of (a) a
d-dimensional matrix schema (D1 , D2 , ..., Dd ); and (b) a hierarchy schema σ(Di ) for each
dimension Di, with i = 1, 2, ..., d. □

Definition 7 (Data Cube Instance). Let Γ be a non-empty set of “values”. A d-dimensional,
k-ary data cube instance (or data cube, for short) D over the d-dimensional
matrix schema (D1, D2, ..., Dd) and hierarchy schemas σ(Di) for Di, for i = 1, 2, ..., d, with
values from Γ, consists of (a) a d-dimensional matrix instance M(D) over the matrix schema
(D1, D2, ..., Dd); (b) for each i = 1, 2, ..., d, a sound dimension graph I(σ(Di)) over
σ(Di); (c) k measures µ1, µ2, ..., µk, which are functions from dom(D1) × dom(D2) × · · · ×
dom(Dd) to the value domain Γ; and (d) a flag ϕ, which is a function from dom(D1) ×
· · · × dom(Dd) to the set {0, 1}. □

In the remainder of this paper we assume that Γ = Q, the set of the rational numbers.
For most applications, this suffices. Also, as a notational convention, we use calligraphic
characters, like D, to represent data cube instances.
The flag ϕ can be considered as a (k + 1)-st Boolean measure. The role of ϕ is to
indicate which of the matrix cells are currently “active”. The active cells have a flag value
1 and the others have a flag value 0. When we operate over a data cube, flags are used to
indicate the input or output parts of the matrix of the cube. Typically, in the beginning
of the operations, all cells have a flag value of 1. The role of flags will become clearer
in the next sections, when we discuss OLAP transformations and operations.

2.4 Ordered Domains and the Representation of Higher-level Objects


When performing OLAP transformations and operations, we may need to store aggregate
information about certain measures up to some level above the Bottom one. We do not
want to use extra space for this in the data cube. Instead, we use the available cells of the
original data cube to store this information. For this, we make use of the order assumed
in Definition 2, for the representation of high-level objects by Bottom-level objects.
Definition 8. Let D ∈ {D1 , D2 , ..., Dd } be an arbitrary dimension with domain dom(D) =
dom(D.Bottom). Let ℓ be a level of σ(D). An element b ∈ dom(D.ℓ) is represented by the
smallest element a ∈ dom(D) (according to <) for which ρ(a, b) holds. We denote this as
rep(b) = a, and say that a represents b. □

Example 5. Continuing with the previous examples, we consider the dimension Location
with dom(Location) = {antwerp, brussels, paris, marseille} (i.e., dom(Location.City)).
On this set, we assume the order antwerp < brussels < paris < marseille. For this
dimension, we have the hierarchy and the dimension instance given in Figs. 1 and 2,
respectively. At the Bottom = City level, cities represent themselves. At higher levels,
regions and countries are represented by their “first” city in dom(Location) (according to
<). Thus, flanders and belgium are represented by antwerp, france is represented by
paris, and south is represented by marseille. At the level All, antwerp represents all. □
Note that the Bottom-level representatives of higher-level objects, will be flagged 1,
and other cells flagged 0. Also, in our example, if we aggregate information at level
Region, with dom(Location.Region) = {f landers, capital, north, south}, then all cities
in dom(Location) become flagged. Thus, it would not be clear if the cube contains infor-
mation at the level City or at the level Region. To solve this, we could keep a log of the
OLAP operations that are performed, making the level of aggregation clear. The following
property shows how the order on the Bottom level induces an order on higher levels.
Property 1. Let D ∈ {D1, D2, ..., Dd} be a (sound) dimension of a data cube D and let
ℓ be a level in the dimension schema σ(D). The order < on dom(D) induces an order on
dom(D.ℓ) as follows. If b1, b2 ∈ dom(D.ℓ), then b1 < b2 if and only if rep(b1) < rep(b2). □
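
Definition 8 and Property 1 can be illustrated with a short Python sketch on the Location dimension. The `ancestors` table is reconstructed from Examples 4 and 5, and the function names are assumptions of this sketch; following Example 5, a city is taken to represent itself.

```python
# Order on dom(Location) from Example 5 (list position encodes <).
dom_location = ["antwerp", "brussels", "paris", "marseille"]

# Higher-level objects each city rolls up to (reconstructed from Examples 4 and 5).
ancestors = {
    "antwerp": {"flanders", "belgium", "all"},
    "brussels": {"capital", "belgium", "all"},
    "paris": {"north", "france", "all"},
    "marseille": {"south", "france", "all"},
}

def rep(b):
    """Smallest city a (with respect to <) such that rho(a, b) holds."""
    if b in dom_location:  # at the Bottom level, cities represent themselves
        return b
    return next(a for a in dom_location if b in ancestors[a])

def induced_order(level_objects):
    """Sort objects of one level by the order induced through rep (Property 1)."""
    return sorted(level_objects, key=lambda b: dom_location.index(rep(b)))
```

For instance, `induced_order(["south", "capital", "north", "flanders"])` lists the regions in the order of their representatives antwerp, brussels, paris and marseille.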

3 OLAP Transformations and Operations


A typical OLAP user manipulates a data cube by means of well-known operations. For
instance, using our running example, the query “Total sales by region, for regions in Belgium
or France” is actually expressed as a sequence of operations, whose semantics should
be clearly defined, and which can be applied in different orders. For example, we can first
apply a Roll-Up (i.e., an aggregation) to the Country level, and once at that level apply a
Dice operation, which keeps the cube cells corresponding to Belgium or France. Finally, a
Drill-Down can be applied to disaggregate the sales down to the level Region, returning the
desired result. In what follows, we characterize OLAP operations as the result of sequences
of “atomic” OLAP transformations, which are measure-creating updates to a data cube.

3.1 Introduction to OLAP Transformations and Operations


An atomic OLAP transformation acts on a data cube instance, by adding a measure to the
existing data cube measures. OLAP operations like the ones informally introduced above
are defined, in our approach, as a sequence of transformations. The process of OLAP
transformations starts from a given input data cube Din . We assume that this original
data cube has k given measures µ1 , µ2 , ..., µk (as in Definition 7). These k measures have
a special status in the sense that they are “protected” and can never be altered (see
Section 3.2). However, there is one exception to this protection. These original measures
can be “destroyed” in some cells (see further on), for instance, as the result of slice or dice
operations, which are destructive by nature. Operations of these types destroy the content
of some matrix cells and remove even the protected measures in them.
Typically, the input-flag ϕ of the original data cube Din is set to 1 in every cell and
signals that every cell of M (Din ) is part of the input cube.
Atomic OLAP transformations can be applied to data cubes. They add (or create)
new measures to the sequence of existing measures by adding new measure values in each
cell of the data cube’s matrix. At any moment in this process, we may assume that the
data cube D has k + l measures µ1 , µ2 , ..., µk ; τ1 , ..., τl , where the first k are the original
measures of Din , and the last l (with l ≥ 0) ones have been created subsequently by l
OLAP transformations (where τ1 , ..., τl is the empty sequence of τ ’s, for l = 0). The next
OLAP transformation adds a new measure τl+1 to the matrix cells.
We have said that we use OLAP transformations to compute OLAP operations. We
indicate that the computation of an OLAP operation O is finished by creating an m-ary
output flag $\varphi_O^{(m)}$. This output flag is a Boolean measure that is created via atomic
OLAP transformations. It indicates which of the cells of M(D) should be considered as
belonging to the output of O. It is m-ary in the sense that it keeps the last m created
measures $\tau_{l-m+1}, \tau_{l-m+2}, \ldots, \tau_l$ and “trashes” the rest. It also removes the previous flag,
which it replaces. The initial measures µ1, µ2, ..., µk of the input data cube Din are never
removed (unless they are “destroyed” in some cells). They remain in the cube throughout
the process of applying one OLAP operation after another to Din, and can be used at
any stage. Summarizing, after an OLAP operation of output arity m is completed on
some cube D, the measures in the cells of the output data cube D′ = O(D) are of the
form $\underline{\mu_1}, \underline{\mu_2}, \ldots, \underline{\mu_k}; \tau_{l-m+1}, \tau_{l-m+2}, \ldots, \tau_l; \varphi_O^{(m)}$. Here, the underlining indicates the protected
status of these measures. After each OLAP operation, we do a “cleaning” by renaming
the unprotected measures with the symbols $\tau_1, \tau_2, \ldots, \tau_m$, and the output measures become
$\underline{\mu_1}, \underline{\mu_2}, \ldots, \underline{\mu_k}; \tau_1, \tau_2, \ldots, \tau_m; \varphi_O^{(m)}$. The next OLAP operation O′ can then act on D′ and use
in its computation all the measures above. We remark that the dimensions, the hierarchy
schemas and instances of D remain unaltered during the entire OLAP process.
We end this description with a remark on destructors. A destructor optionally
precedes the creation of an output flag. A destructor δ takes the value 1 for some cells of
the matrix of a data cube, and 0 on other cells. When δ is invoked (and activated by the
output flag that follows it) on a data cube D with measures $\mu_1, \mu_2, \ldots, \mu_k; \tau_1, \tau_2, \ldots, \tau_m$ and
flag $\varphi_O^{(m)}$, it empties all cells for which the value of the destructor δ is 0 by removing all
measures from them, even the protected ones, thereby effectively “destroying” these cells.
This is the only case where the protected measures are altered (see operations Slice or Dice,
later). The output of a destructive operation O looks like $\mu_1, \mu_2, \ldots, \mu_k; \tau_1, \tau_2, \ldots, \tau_l; \delta; \varphi_O^{(m)}$,
in which the destructor precedes the output flag. The effect of the presence of a
destructor is the following. A cell such that δ = 0 is emptied, after which it contains no
more measures or flag. For cells with δ = 1, the sequence of measures
$\mu_1, \mu_2, \ldots, \mu_k; \tau_1, \tau_2, \ldots, \tau_l; \delta; \varphi_O^{(m)}$ is transformed to $\mu_1, \mu_2, \ldots, \mu_k; \tau_{l-m+1}, \tau_{l-m+2}, \ldots, \tau_l; \varphi_O^{(m)}$, which is
renamed as $\mu_1, \mu_2, \ldots, \mu_k; \tau_1, \tau_2, \ldots, \tau_m; \varphi$ before the next transformation takes place. This
transformation will act, cell per cell, on the matrix of a cube, and it does nothing with
emptied cells. That is, no new measure can ever be added to a destroyed cell.
The following definition specifies how an OLAP transformation acts on a data cube.
We then address in detail each atomic OLAP transformation appearing in this definition.

Definition 9 (OLAP Transformation). Let D be a d-dimensional, (k + l)-ary data cube
instance with given (or protected) measures µ1, µ2, ..., µk, created measures τ1, ..., τl (with
l ≥ 0) and flag ϕ over some value domain Γ. An OLAP transformation T, applied to
D, results in the creation of a new measure τ_{l+1} in D. Transformation T adds measure
τ_{l+1} to the non-empty cells of M(D); τ_{l+1} is produced from µ1, µ2, ..., µk (in non-empty cells),
ϕ (in non-empty cells), τ1, τ2, ..., τl (in non-empty cells), and the hierarchy schemas and
instances of D; and T belongs to one of the following classes: (a) Arithmetic transformations
(Definition 11); (b) Boolean transformations (Definition 12); (c) Selectors (Definition 13);
(d) Counting, sum, min-max (Definitions 14, 19); (e) Grouping (Definition 18).
An OLAP transformation can also result in the creation of a measure that is an output
flag $\varphi^{(m)}$ of arity m. This should be a measure with a Boolean value. To indicate that it is
a flag of arity m, we use the reserved symbol $\varphi^{(m)}$ instead of τ_{l+1}. An output flag $\varphi^{(m)}$ may
(optionally) be preceded by a destructor δ. This should be a measure with a Boolean value
(to indicate which cells are destroyed). We use the reserved symbol δ instead of τ_{l+1}. □

3.2 OLAP Operations and their Composition


Before we give the definition of an OLAP operation, we describe the input to the OLAP pro-
cess (this process may involve multiple OLAP operations). Such input is a d-dimensional,
k-ary data cube instance Din , with measures µ1 , µ2 , ..., µk and flag ϕ. These measures are
protected in the sense that they remain the first k measures throughout the entire OLAP
process and are never altered or removed unless they are destroyed in some cells. The cube
Din also has a Boolean flag ϕ, which typically has value 1 in all cells of M(Din). Thus, the
measures of the input cube Din are denoted µ1 , µ2 , ..., µk ; ϕ.
After applying a sequence of OLAP operations to Din , we obtain a data cube D.

Definition 10 (OLAP Operation). Let D be a d-dimensional, (k + l)-ary input data
cube instance with given measures µ1, µ2, ..., µk, computed measures τ1, ..., τl and flag ϕ.
The data cube D acts as the input of an OLAP operation O (of arity m), which consists
of a sequence of n consecutive OLAP transformations that create the additional measures
$\tau_{l+1}, \ldots, \tau_{l+n}$, followed by the creation of an m-ary flag $\varphi_O^{(m)}$ (possibly preceded by a
destructor δ). As the result of the creation of $\varphi_O^{(m)}$, the measures in the cells of the data cube are
changed from $\mu_1, \mu_2, \ldots, \mu_k; \tau_1, \ldots, \tau_l; \varphi; \tau_{l+1}, \ldots, \tau_{l+n}$ to $\mu_1, \mu_2, \ldots, \mu_k; \tau_{l+n-m+1}, \ldots, \tau_{l+n}; \varphi_O^{(m)}$,
which become $\mu_1, \mu_2, \ldots, \mu_k; \tau_1, \ldots, \tau_m; \varphi$ after renaming. The output cube D′ = O(D) has
the same dimensions, hierarchy schemas and instances as D, and measures $\mu_1, \mu_2, \ldots, \mu_k; \tau_1, \ldots, \tau_m; \varphi$.
In the case where $\varphi_O^{(m)}$ is preceded by a destructor δ, the same procedure is
followed, except for the cells of M(D) for which δ takes the value 0. These cells of M(D)
are emptied, contain no measures, and become inaccessible for future transformations. □
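
The bookkeeping at the end of an OLAP operation can be sketched in Python as follows. This is a minimal illustration under assumed data structures (a dictionary from cell coordinates to per-cell measure lists, with `None` for an emptied cell); the function name `finish_operation` is hypothetical and not part of the paper.

```python
def finish_operation(cells, m, flag, destructor=None):
    """Close an OLAP operation of arity m (sketch of Definition 10).

    cells: dict mapping cell coordinates to
           {"mu": [...], "tau": [...], "flag": 0 or 1}, or to None if emptied.
    flag, destructor: dicts mapping cell coordinates to 0/1 values.
    """
    out = {}
    for coord, cell in cells.items():
        if cell is None or (destructor is not None and destructor[coord] == 0):
            out[coord] = None  # emptied cells stay inaccessible
            continue
        out[coord] = {
            "mu": list(cell["mu"]),                       # protected measures are kept
            "tau": list(cell["tau"][-m:]) if m > 0 else [],  # last m measures, renamed by position
            "flag": flag[coord],                          # the new flag replaces the old one
        }
    return out
```

Renaming to τ1, ..., τm is implicit here: after truncation, the position in the `tau` list plays the role of the index.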

3.3 Atomic OLAP Transformations


We now address the five classes of atomic OLAP transformations of Definition 9. We use the
following notational convention. For a measure α, we write α(x1 , x2 , ..., xd ) to indicate the
value of α in the cell (x1 , x2 , ..., xd ) ∈ dom(D1 )×dom(D2 )×· · ·×dom(Dd ). We remark that
α(x1 , x2 , ..., xd ) does not exist for empty cells and it is thus not considered in computations.
Also, we assume that there are protected measures µ1 , µ2 , ..., µk , and computed measures
τ1 , ..., τl in the non-empty cells, and call τl+1 the next computed measure.

3.3.1 Arithmetic Transformations


Definition 11 (Arithmetic Transformations). The following creations of a new measure
τl+1 are arithmetic transformations:

1. (Rational constant) τl+1 = α, with α ∈ Q, a rational number.

2. (Sum) τl+1 = α + β, with α, β ∈ {µ1 , µ2 , ..., µk , τ1 , τ2 , ..., τl }.

3. (Product) τl+1 = α · β, with α, β ∈ {µ1 , µ2 , ..., µk , τ1 , τ2 , ..., τl }.

4. (Quotient) τ_{l+1} = α/β, with α, β ∈ {µ1, µ2, ..., µk, τ1, τ2, ..., τl}. Here, by convention,
a/0 := a for all a ∈ Q. □
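
A cell-per-cell reading of the arithmetic transformations can be sketched in Python with exact rational arithmetic. Measures are modelled as dictionaries from cells to `Fraction` values; this representation and the function names are assumptions of the sketch.

```python
from fractions import Fraction

def constant(cells, value):
    """Rational constant: the same value in every cell."""
    return {c: Fraction(value) for c in cells}

def add(alpha, beta):
    """Sum, computed cell per cell."""
    return {c: alpha[c] + beta[c] for c in alpha}

def multiply(alpha, beta):
    """Product, computed cell per cell."""
    return {c: alpha[c] * beta[c] for c in alpha}

def quotient(alpha, beta):
    """Quotient, cell per cell, with the convention a/0 := a."""
    return {c: alpha[c] if beta[c] == 0 else Fraction(alpha[c]) / beta[c]
            for c in alpha}
```

For example, with α = 6 and β = 4 in one cell the quotient is 3/2, and with α = 5 and β = 0 in another cell the convention yields 5.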

3.3.2 Boolean Transformations


Definition 12 (Boolean Transformations). The following creations of a new measure τ_{l+1}
are Boolean transformations:

1. (Equality test on measures) τ_{l+1} = (α = β), with α, β ∈ {µ1, µ2, ..., µk, τ1, τ2, ...,
τl}. Here, the result of (α = β) is a Boolean 1 or 0 (cell per cell in the non-empty
cells of the matrix).

2. (Comparison test on measures) τ_{l+1} = (α < β), with α, β ∈ {µ1, µ2, ..., µk, τ1, τ2,
..., τl}. Here, the result of the comparison (α < β) is a Boolean 1 or 0 (cell per cell
in the non-empty cells of the matrix).

3. (Equality test on levels) For a level ℓ in the dimension schema σ(Di) of dimension
Di, and a constant object c ∈ dom(Di.ℓ), τ_{l+1}(x1, x2, ..., xd) = (ℓ = c) is an “equality”
test. Here, the result of (ℓ = c) is a Boolean 1 or 0 (cell per cell in the non-empty
cells of the matrix) such that τ_{l+1}(x1, x2, ..., xd) is 1 if and only if xi rolls-up to c at
level ℓ, that is, ρ(xi, c).

4. (Comparison test on levels) For a level ℓ in the dimension schema σ(Di) of dimension
Di, and a constant c ∈ dom(Di.ℓ), τ_{l+1}(x1, x2, ..., xd) = (ℓ <_ℓ c) is a “comparison”
test. The result of (ℓ <_ℓ c) is a Boolean 1 or 0 (cell per cell in the non-empty
cells of the matrix), such that τ_{l+1}(x1, x2, ..., xd) is 1 if and only if xi rolls-up to an
object b at level ℓ for which b <_ℓ c. The order <_ℓ can be any order that is defined on
level ℓ. Transformation τ_{l+1}(x1, x2, ..., xd) = (c <_ℓ ℓ) is defined similarly. □

Example 6. We illustrate the use of Boolean transformations by means of a sequence
of transformations that implement a “dice” (see Section 4.2 for more details). The query
DICE(D, sales > 50) asks for the cells in the matrix of D which contain sales that are higher
than 50. This query can be implemented by the following sequence of transformations,
where µ1 = sales:

• τ1 = 50 (rational constant);

• τ2 = (τ1 < sales) (comparison test on measures);

• τ3 = µ1 · τ2 (product);

• δ = τ2 (destructor); and

• $\varphi^{(1)}$ = τ2 (unary flag).

The measure τ3 contains the sales values larger than 50 (and a 0 if the
sales are not larger than 50). The destructor δ destroys the cells that contain a 0 in τ2. Finally, the flag
$\varphi^{(1)}$ selects the remaining cells from the input as output cells (it contains a 1 for all cells that
satisfy the condition), and concludes the DICE(D, sales > 50) operation. The output of
this operation is sales; τ3; $\varphi^{(1)}$, which is then renamed to sales; τ1; ϕ. □
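
A dice on a measure condition of this kind can be traced in Python as below. The sales values and cell names are hypothetical, and the sketch uses the constant 50 so that the strict comparison matches the condition sales > 50 exactly; emptied cells are modelled as `None`.

```python
from fractions import Fraction

# Hypothetical sales values (mu_1) for four cells; illustrative data only.
sales = {"c1": Fraction(30), "c2": Fraction(50),
         "c3": Fraction(75), "c4": Fraction(120)}

tau1 = {c: Fraction(50) for c in sales}              # rational constant
tau2 = {c: int(tau1[c] < sales[c]) for c in sales}   # comparison test on measures
tau3 = {c: sales[c] * tau2[c] for c in sales}        # product
delta = tau2                                         # destructor
flag = tau2                                          # unary flag

# Apply the destructor, keep the last computed measure (arity 1) and rename it.
output = {
    c: {"sales": sales[c], "tau1": tau3[c], "flag": flag[c]} if delta[c] == 1 else None
    for c in sales
}
```

Cells c1 and c2 are emptied (a sale of exactly 50 does not satisfy the strict condition), while c3 and c4 keep their sales value, the computed measure, and a flag of 1.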

3.3.3 Selectors
Definition 13 (Selector Transformations). The following creations of a new measure τ_{l+1}
are selector transformations (or selectors), and their definition is cell per cell of M(D):

1. (Constant selector) For a level ℓ in the dimension schema σ(Di) of a dimension
Di, and c ∈ dom(Di.ℓ), τ_{l+1} can be a constant-selector for c, denoted σ_{Di.ℓ=c}, and it
corresponds to the equality test on levels τ_{l+1}(x1, x2, ..., xd) = (ℓ = c).

2. (Level selector) For a level ℓ in the dimension schema σ(Di) of a dimension Di,
τ_{l+1} can be a level-selector for ℓ, denoted by σ_{Di.ℓ}, which means that we have, for all
xj ∈ dom(Dj) with j ≠ i,

$$\tau_{l+1}(x_1, \ldots, x_{i-1}, a, x_{i+1}, \ldots, x_d) = \begin{cases} 1 & \text{if } a = \mathrm{rep}(b) \text{ for some } b \in \mathrm{dom}(D_i.\ell), \\ 0 & \text{otherwise.} \end{cases}$$


The constant selector in Definition 13 corresponds to the equality test on levels (see
item 3 in Definition 12). We repeat this transformation here because it appears with a
different functionality, for which we reserve a special notation. Also, note that the level selector
selects all representatives (at the Bottom level) of objects at level ℓ of dimension Di.
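
Both selectors can be sketched in Python on a reduced two-dimensional cube. The shortened Product domain, the `ancestors` and `levels` tables (reconstructed from Examples 4 and 5), and the function names are assumptions of this sketch. The constant selector is written so that a city also matches itself, which is what the selection Location.City = antwerp requires.

```python
from itertools import product

dom_product = ["lego", "brio"]  # shortened for brevity
dom_location = ["antwerp", "brussels", "paris", "marseille"]  # position encodes <
ancestors = {
    "antwerp": {"flanders", "belgium", "all"},
    "brussels": {"capital", "belgium", "all"},
    "paris": {"north", "france", "all"},
    "marseille": {"south", "france", "all"},
}
levels = {"Region": ["flanders", "capital", "north", "south"],
          "Country": ["belgium", "france"]}
cells = list(product(dom_product, dom_location))

def constant_selector(c):
    """1 in the cells whose city is c or rolls up to c, 0 elsewhere."""
    return {cell: int(cell[1] == c or c in ancestors[cell[1]]) for cell in cells}

def level_selector(level):
    """1 in the cells whose city represents some object of the level, 0 elsewhere."""
    reps = {min((a for a in dom_location if b in ancestors[a]),
                key=dom_location.index)
            for b in levels[level]}
    return {cell: int(cell[1] in reps) for cell in cells}
```

At the Country level only the cells of antwerp and paris are selected, since these cities represent belgium and france; at the Region level every city is a representative, so all cells are selected.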

Example 7. The query DICE(D, Location.City = antwerp OR Location.City = brussels),


asks for the sales in the cities of antwerp and brussels. It can be implemented by the fol-
lowing sequence of transformations, where τ3 can take values 0 or 1, since the cities antwerp
and brussels do not overlap:

• τ1 = σLocation.City=antwerp (constant selector);

• τ2 = σLocation.City=brussels (constant selector);

• τ3 = τ1 + τ2 (sum);

• τ4 = τ3 · µ1 (product);

• δ = τ3 (destroys the cells outside antwerp and brussels);

• ϕ(1) = τ3 (unary flag creation).
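The sequence of Example 7 can be traced on a small cube. The following is a minimal sketch, not the authors' implementation: the cube contents, the dictionary representation of measures, and the helper name `constant_selector` are all assumptions made for illustration, and destroying a cell is modelled simply as dropping it.

```python
from itertools import product

# Hypothetical toy cube (made-up data, not the paper's running example).
products = ["p1", "p2"]
cities = ["antwerp", "brussels", "paris", "marseille"]
times = ["t1", "t2"]
cells = list(product(products, cities, times))
sales = {cell: 10 * (i + 1) for i, cell in enumerate(cells)}  # mu_1


def constant_selector(city):
    """sigma_{Location.City=city}: 1 in the cells of that city, 0 elsewhere."""
    return {cell: int(cell[1] == city) for cell in cells}


tau1 = constant_selector("antwerp")
tau2 = constant_selector("brussels")
tau3 = {c: tau1[c] + tau2[c] for c in cells}   # sum; stays in {0, 1}
tau4 = {c: tau3[c] * sales[c] for c in cells}  # product with mu_1

# Destructor delta = tau3: cells holding a 0 are destroyed (here: dropped).
surviving = {c: tau4[c] for c in cells if tau3[c] != 0}
# Unary flag phi = tau3 on the surviving cells.
flag = {c: tau3[c] for c in surviving}
```

Because the two selectors never hold a 1 in the same cell, their sum behaves as a logical OR; for overlapping conditions a plain sum would no longer be Boolean.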



3.3.4 Count, Sum and Min-Max


Definition 14 (Counting, Sum, and Min-Max Transformations). The creations of a new
measure τl+1 defined next, are denoted counting, sum and min-max transformations:

1. (Count-Distinct) τl+1 = #≠ (α), α ∈ {µ1 , µ2 , ..., µk , τ1 , τ2 , ..., τl } counts the number
of distinct values of measure α in the complete matrix M (D) of the data cube.

2. (d-dimensional sum) $\tau_{l+1} = \sum_{(x_1, x_2, \ldots, x_d) \in M(D)} \alpha(x_1, x_2, \ldots, x_d)$, with α ∈ {µ1 , µ2 , ...,
µk , τ1 , τ2 , ..., τl }, gives the sum of the measure α over all non-empty matrix cells. We
abbreviate this operation by writing τl+1 = SUMd (α), and call this transformation
the d-dimensional sum.

3. (Min-Max) τl+1 = min(α), with α ∈ {µ1 , µ2 , ..., µk , τ1 , τ2 , ..., τl }, gives the smallest
value of the measure α in non-empty cells of the matrix M (D). Similarly, τl+1 =
max(α), gives the largest value of the measure α in the matrix M (D). ⊔

It is important to remark that the above transformations create the same new measure
value for all cells of the matrix M (D).

Example 8. Now, we look at the query “total sales in antwerp”. The query can be com-
puted as follows, given µ1 = sales:

• τ1 = σLocation.City=antwerp (constant selector on antwerp);

• τ2 = τ1 · µ1 (product that selects the sales in antwerp, puts a 0 in all other ones);

• τ3 = SUM3 (τ2 ) (this is the total sales in antwerp in every cell);

• τ4 = τ3 · τ1 (this is the total sales in antwerp in the cells of antwerp);
• ϕ(1) = τ1 (this flag creation selects the cells of antwerp).
The output measures are sales; τ4 ; ϕ(1) , which are renamed sales; τ1 ; ϕ. Thus, the value of
the total of sales in antwerp is now available in every cell corresponding to antwerp. For
the cells outside antwerp there is a 0. We remark that this example can be modified with
a destructor that effectively empties cells outside antwerp. ⊔
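As an illustration of Example 8, the sketch below replays the five steps on a small cube. The data and the dictionary encoding of measures are assumptions made for this illustration only; the point is that SUM3 writes the same total into every cell, and the final product with the selector confines it to the antwerp cells.

```python
from itertools import product

# Hypothetical toy cube (made-up data).
products = ["p1", "p2"]
cities = ["antwerp", "brussels", "paris", "marseille"]
times = ["t1", "t2"]
cells = list(product(products, cities, times))
sales = {cell: 10 * (i + 1) for i, cell in enumerate(cells)}  # mu_1

tau1 = {c: int(c[1] == "antwerp") for c in cells}  # constant selector
tau2 = {c: tau1[c] * sales[c] for c in cells}      # sales in antwerp, 0 elsewhere
total = sum(tau2.values())                         # value computed by SUM_3
tau3 = {c: total for c in cells}                   # same value in every cell
tau4 = {c: tau3[c] * tau1[c] for c in cells}       # total kept in antwerp cells only
flag = dict(tau1)                                  # unary flag on antwerp cells
```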

3.3.5 Grouping
The most common OLAP operations (e.g., roll-up, slice) require grouping data before ag-
gregating them. For example, typically we will ask queries like “total sales by city”, which
requires grouping facts by city and, for each group, summing all of its sales. Therefore, we
need a transformation to express “grouping”. To deal with grouping, we use the concept of
“prime labels” for sets and products of sets. We will use these labels to identify elements
in dimensions and in dimension levels. Before giving the definition of the grouping trans-
formations, we elaborate on prime labels and product of prime labels. As we show, these
prime labels work in the context of measures that take rational values (as it is often the
case, in practice). The following definition specifies our infinite supply of prime labels.
Definition 15 (Prime Labels). Let pn denote the n-th prime number, for n ≥ 1. We define
the sequence of prime labels as follows: 1, √2, √3, √5, √7, √11, ..., √pn , .... We denote the
set of all prime labels by P. ⊔

Definition 16 (Prime Labeling of Sets). Let A, A1 , A2 , ..., An be (finite) sets. A prime
labeling of the set A is an injective function w : A → P. For a ∈ A, we call w(a) the
prime label of a (for the prime labeling w).
Let I be a subset of {1, 2, ..., n}, which serves as an index set. A prime product I-
labeling of the Cartesian product A1 × A2 × · · · × An consists of prime labelings wi of the
sets Ai , for i ∈ I, that satisfy the condition that wi (Ai ) ∩ wj (Aj ) is empty for i, j ∈ I
and i ≠ j. For (a1 , a2 , ..., an ) ∈ A1 × A2 × · · · × An , we call $\prod_{i \in I} w_i(a_i)$ the prime product
I-label of (a1 , a2 , ..., an ) (given the prime labelings wi , for i ∈ I). When I is a strict subset
of {1, 2, ..., n}, we speak about a partial prime product labeling and when I = {1, 2, ..., n},
we speak about a full prime product labeling. ⊔

If we view a Cartesian product A1 × A2 × · · · × An as a finite matrix, whose cells
contain rational-valued measures, we can use prime (product) labelings as follows in the
aggregation process. Let us assume that the cells of A1 × A2 × · · · × An contain rational
values of a measure µ and let us denote the value of this measure in the cell (a1 , a2 , ..., an )
by µ(a1 , a2 , ..., an ). If we have a full prime product labeling on A1 × A2 × · · · × An , then
we can consider the sum over this Cartesian product of the product of the prime product
labels with the value of µ:
$$\sum_{(a_1, a_2, \ldots, a_n) \in A_1 \times A_2 \times \cdots \times A_n} \mu(a_1, a_2, \ldots, a_n) \cdot w_1(a_1) \cdot w_2(a_2) \cdots w_n(a_n). \qquad (\dagger_1)$$

Since each cell of A1 × A2 × · · · × An has a unique prime product label, and since
these labels are rationally independent (see Property 2), this sum enables us to retrieve
the values µ(a1 , a2 , ..., an ).

If we have a partial prime product labeling on A1 × A2 × · · · × An , determined by
an index set I, then, again, we can consider the sum over this Cartesian product of the
product of the partial prime product labels with the value of µ:
$$\sum_{(a_1, a_2, \ldots, a_n) \in A_1 \times A_2 \times \cdots \times A_n} \mu(a_1, a_2, \ldots, a_n) \cdot \prod_{i \in I} w_i(a_i). \qquad (\dagger_2)$$

Now, all cells in A1 × A2 × · · · × An above a cell in the projection of A1 × A2 × · · · × An
on its components with indices in I, receive the same prime label. This means that these
cells are “grouped” together and the above sum allows us to retrieve the part of the sum
that belongs to each group. The following definition gives a name to the above sums.
Definition 17 (Prime Sums). We call sums of type (†1 ) full prime sums and sums of type
(†2 ) partial prime sums (over I). ⊔

The following property can be derived from the well-known fact that the field extension
Q(√2, √3, ..., √pn ) has degree 2^n over Q, with the square roots of the square-free products
of 2, 3, ..., pn as a basis, and from corollaries of this fact (see Chapter 8 in [4]). In particular,
no square root of a prime number is a rational combination of square roots of other primes.
Property 2. Let n ≥ 1 and let A1 × A2 × · · · × An be a Cartesian product of finite sets.
We assume that the cells (a1 , a2 , ..., an ) of this set contain rational values µ(a1 , a2 , ..., an )
of a measure µ. Let I be a subset of {1, 2, ..., n} and let wi be prime labelings of the sets
Ai , for i ∈ I, that form a prime product I-labeling. Then, the prime sum (†2 ) uniquely
determines the values $\sum_{\times_{i \in I^c} A_i} \mu(a_1, a_2, \ldots, a_n)$ for all cells of A1 × A2 × · · · × An . ⊔

We remark that we use these prime (product) labels in a purely symbolic way without
actually calculating the square root values in them. We are now ready to define atomic
OLAP operations that allow us to implement grouping. In what follows, we apply these
prime labels to the case where the sets Ai in A1 × A2 × · · · × An are domains of dimensions
(e.g., at the bottom level), or domains of dimensions at some level.
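To make the symbolic use of prime labels concrete, the sketch below stores a label √n as the integer n and a prime sum as a dictionary from radicand to rational coefficient, so no square root is ever evaluated. This encoding, the sets A1 and A2, the measure values, and the function name `prime_sum` are assumptions chosen for illustration; they are not taken from the paper.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# A label sqrt(n) is stored symbolically as its square-free radicand n.
A1 = ["a", "b"]
A2 = ["x", "y", "z"]
w1 = {"a": 2, "b": 3}            # sqrt(2), sqrt(3)
w2 = {"x": 5, "y": 7, "z": 11}   # sqrt(5), sqrt(7), sqrt(11): disjoint from w1
mu = {cell: Fraction(i + 1, 2) for i, cell in enumerate(product(A1, A2))}


def prime_sum(measure, labelings):
    """Prime sum over the index set given by the keys of `labelings`.

    `labelings` maps a component index to its prime labeling. The result
    maps each prime product label (as a radicand) to its rational coefficient.
    """
    total = defaultdict(Fraction)
    for cell, value in measure.items():
        radicand = 1
        for index, labeling in labelings.items():
            radicand *= labeling[cell[index]]
        total[radicand] += value
    return dict(total)


full = prime_sum(mu, {0: w1, 1: w2})   # one term per cell, as in (dagger_1)
partial = prime_sum(mu, {0: w1})       # one term per group, as in (dagger_2)
```

In the full sum each cell keeps its own coefficient, so every value of µ can be read back; in the partial sum the coefficient of a label is the group total, which is the quantity Property 2 says is uniquely determined.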
Definition 18 (Grouping Transformations). The following creations of a new measure
τl+1 are grouping transformations:
1. (Prime labels for groups in one dimension) Let Di be a dimension and ℓ a level
in the dimension schema σ(Di ) of a dimension Di . Let dom(Di .ℓ) = {b1 , b2 , ..., bm }
with induced order b1 < b2 < · · · < bm (see Property 1). If the prime labels
w1 , w2 , ..., wk have been used by previous transformations, then for all j, with j ≠ i,
and all xj ∈ dom(Dj ), we have τl+1 (x1 , ..., xi−1 , xi , xi+1 , ..., xd ) = wk+l if ρ(xi , bl ).
We denote this transformation by γDi .ℓ (x1 , ..., xi−1 , xi , xi+1 , ..., xd ) or γDi .ℓ , for short,
and call the result of such a transformation a prime labeling.
2. (Projection of a prime sum) If the result of some previous transformation τm is
a (full or partial) prime sum $\sum_{i=k}^{k+l} a_i \cdot w_i$ (over the complete matrix M (D)) in which
prime (product) labels wk , wk+1 , ..., wk+l (computed in a previous transformation τn )
are used, then τl+1 is a new measure that “projects” on the appropriate component
from the prime sum, that is, τl+1 (x1 , x2 ..., xd ) = ak+l if the prime (product) label
τn (x1 , x2 ..., xd ) = wk+l . We denote this projection transformation by τm |τn . ⊔

Example 9. Consider the query “for each country, give the total number of cities”. This
query can be implemented as follows (explained below, using the data in Example 4):
• τ1 = γLocation.Country (this gives each country a prime label);
• τ2 = γLocation.City (this gives each city a (fresh) prime label);
• τ3 = τ1 · τ2 (this gives each city a product of prime labels);
• τ4 = SUM3 (τ3 );
• τ5 = γProduct.Bottom (gives each product a different prime label);
• τ6 = #≠ (τ5 ) (counts the number of products);
• τ7 = γTime.Bottom (gives each time moment a different prime label);
• τ8 = #≠ (τ7 ) (counts the number of moments in time);
• τ9 = τ6 · τ8 (is the number of products times the number of time moments);
• τ10 = τ4 /τ9 (normalization of the sum);
• τ11 = τ10 |τ2 (projection over the prime labels of city);
• τ12 = SUM3 (τ11 ) (3-dimensional sum);
• τ13 = τ12 /τ9 (normalization of the sum);
• τ14 = τ13 |τ1 (projection over the prime labels of country);
• ϕ(1) = σLocation.Bottom (this flag creation selects all cells of the matrix).
Transformation τ1 gives each country a next available prime label. Since no labels have
been used yet, belgium gets label 1 and france gets label √2. Transformation τ2 gives
each city a next available prime label. Since 1 and √2 have been used, antwerp gets label
√3, brussels gets label √5, paris gets label √7, and marseille gets label √11.
Transformation τ3 gives antwerp the value √3 (i.e., 1 · √3), brussels the value √5 (1 · √5),
paris the value √14 (√2 · √7), and marseille the value √22 (√2 · √11). If there are 10
products and 100 time moments, then τ4 puts the value 10 · 100 · (√3 + √5 + √14 + √22)
in each cell of the matrix M (D).
Transformations τ6 and τ8 count the number of products and the number of time
moments (using fresh prime labels), and the product of these quantities is computed in τ9 .
In τ10 , τ4 is divided by this product, putting √3 + √5 + √14 + √22 in every cell.
Transformation τ11 is a projection on the prime labels of City. Since √3, √5, √7, and
√11 are the prime labels for the cities, and since √3 + √5 + √14 + √22 = 1 · √3 + 1 · √5 +
√2 · √7 + √2 · √11, this will put 1 in the cells of antwerp and brussels, and √2 in the
cells of paris and marseille.
Next, τ12 puts 10 · 100 · (2 · 1 + 2 · √2) in every cell of the cube and τ13 puts 2 · 1 + 2 · √2
in every cell of the cube. Finally, τ14 projects on the prime labels of countries, which are 1
and √2. This puts a 2 in every cell of a Belgian city and a 2 in every cell in a French city.
This is the result of the query, as the flag indicates, that is returned in every cell. Now
every cell of a city in belgium has the count of 2 cities, as has every city in france. ⊔
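The computation of Example 9 can be replayed symbolically. The sketch below is an illustration under stated assumptions, not the paper's implementation: a label √n is stored as the integer n, a measure value is a dictionary from radicand to rational coefficient, the cube has 2 products and 3 time moments (instead of 10 and 100), τ9 is taken directly from the list lengths rather than computed through τ5 to τ8, and the helpers `sum_d`, `divide` and `project` are hypothetical names. The projection shown only handles labelings whose labels are 1 or a single prime root, which is all this example needs.

```python
from fractions import Fraction
from itertools import product

products = ["p1", "p2"]
cities = ["antwerp", "brussels", "paris", "marseille"]
times = ["t1", "t2", "t3"]
country_of = {"antwerp": "belgium", "brussels": "belgium",
              "paris": "france", "marseille": "france"}
cells = list(product(products, cities, times))

country_label = {"belgium": 1, "france": 2}                            # 1, sqrt(2)
city_label = {"antwerp": 3, "brussels": 5, "paris": 7, "marseille": 11}


def sum_d(measure):
    """d-dimensional sum: the same symbolic total is written in every cell."""
    total = {}
    for value in measure.values():
        for radicand, coeff in value.items():
            total[radicand] = total.get(radicand, 0) + coeff
    return {cell: dict(total) for cell in measure}


def divide(measure, n):
    """Divide every coefficient by the rational constant n."""
    return {cell: {r: Fraction(c) / n for r, c in value.items()}
            for cell, value in measure.items()}


def project(measure, labels):
    """measure | labels: per cell, the coefficient of that cell's own label."""
    primes = {p for p in labels.values() if p != 1}
    out = {}
    for cell, value in measure.items():
        coeff = {}
        for radicand, c in value.items():
            part = 1                      # the part of the term made of `labels` primes
            for p in primes:
                if radicand % p == 0:
                    part *= p
            if part == labels[cell]:
                rest = radicand // part   # what remains is the coefficient
                coeff[rest] = coeff.get(rest, 0) + c
        out[cell] = coeff
    return out


tau1 = {c: country_label[country_of[c[1]]] for c in cells}
tau2 = {c: city_label[c[1]] for c in cells}
tau3 = {c: {tau1[c] * tau2[c]: Fraction(1)} for c in cells}
tau4 = sum_d(tau3)
tau9 = len(products) * len(times)   # stands in for tau6 * tau8
tau10 = divide(tau4, tau9)
tau11 = project(tau10, tau2)
tau12 = sum_d(tau11)
tau13 = divide(tau12, tau9)
tau14 = project(tau13, tau1)
```

In this encoding the value `{1: 2}` stands for the rational number 2, so every cell ends up holding the number of cities of its country.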

3.3.6 Counting and Min-Max Revisited
We can now extend the transformations of Definition 14, in a way that the counting,
minimum, and maximum, are taken over cells which share a common prime product label.

Definition 19. The following creations of a new measure τl+1 are generalizations of the
counting and min-max transformations:

1. (Count-Distinct) If the result of some previous transformation τm is a prime (prod-
uct) labeling of the cells of M (D), then τl+1 (x1 , x2 ..., xd ) = #≠ |τm (α), with
α ∈ {µ1 , µ2 , ..., µk , τ1 , τ2 , ..., τl } counts the number of different values of the measure
α in cells of M (D) that have the same prime product label as τm (x1 , x2 ..., xd ).

2. (Min-Max) If the result of some previous transformation τm is a prime (prod-
uct) labeling of the cells of M (D), then τl+1 (x1 , x2 ..., xd ) = min |τm (α), with
α ∈ {µ1 , µ2 , ..., µk , τ1 , τ2 , ..., τl }, gives the smallest value of the measure α in
cells of the matrix M (D) that have the same prime product label as τm (x1 , x2 ..., xd ).
And τl+1 (x1 , x2 ..., xd ) = max |τm (α) is defined similarly. ⊔

We remark that when there is only one prime label throughout M (D), the above gen-
eralization of the counting and min-max transformations corresponds to Definition 14.
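A minimal sketch of Definition 19, assuming measures and labelings are stored as dictionaries over cells (the data, the integer encoding of labels, and the helper `grouped` are illustrative assumptions): each cell receives the aggregate of α over all cells carrying the same label, and a labeling with a single label reproduces the global behaviour of Definition 14.

```python
from itertools import product

products = ["p1", "p2"]
cities = ["antwerp", "brussels", "paris", "marseille"]
times = ["t1", "t2"]
cells = list(product(products, cities, times))
# Made-up sales with deliberately repeated values.
sales = {cell: 10 * (i % 5) for i, cell in enumerate(cells)}


def grouped(aggregate, alpha, labels):
    """Apply `aggregate` to alpha within each group of equally labelled cells."""
    groups = {}
    for cell, label in labels.items():
        groups.setdefault(label, []).append(alpha[cell])
    return {cell: aggregate(groups[label]) for cell, label in labels.items()}


def count_distinct(values):
    return len(set(values))


# tau_m: a prime labeling of the cells by city (sqrt(n) stored as n).
city_label = {"antwerp": 3, "brussels": 5, "paris": 7, "marseille": 11}
tau_m = {c: city_label[c[1]] for c in cells}

distinct_per_city = grouped(count_distinct, sales, tau_m)
min_per_city = grouped(min, sales, tau_m)
max_per_city = grouped(max, sales, tau_m)

# A single label throughout M(D) gives back the global transformations.
one_label = {c: 1 for c in cells}
global_min = grouped(min, sales, one_label)
global_count = grouped(count_distinct, sales, one_label)
```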

4 The Classical OLAP Operations


In this section, we prove that the classical OLAP operations can be expressed using the
OLAP transformations from Section 3. These classic operations can be combined to express
complex analytical queries. The classical OLAP operations are Dice, Slice, Slice-and-Dice,
Roll-Up and Drill-Down (see Section 4.5). We assume in the sequel, that the input data
cube Din has k given measures µ1 , µ2 , ..., µk , and that at some point in the OLAP process
this cube is transformed to a cube D, having measures µ1 , µ2 , ..., µk ; τ1 , τ2 , ..., τl ; ϕ, where
τ1 , τ2 , ..., τl , with l ≥ 0, are created measures and ϕ is an input/output flag.

4.1 Boolean Cell-selection Condition


Before we start, we need to define the notion of a Boolean cell-selection condition, and give
a lemma about its expressiveness that we will use throughout Section 4.

Definition 20 (Boolean condition on cells). Let M (D) = dom(D1 ) × dom(D2 ) × · · · ×
dom(Dd ) be the matrix of D. A Boolean condition on the cells of M (D) is a function φ
from M (D) to {0, 1}. We say that the cells of M (D) in the set φ⁻¹({1}) are selected by φ.
We say that a Boolean condition φ is transformation-expressible if there is a sequence
of OLAP transformations τ1 , τ2 , ..., τk such that φ(x1 , x2 , ..., xd ) = τk (x1 , x2 , ..., xd ) for all
(x1 , x2 , ..., xd ) ∈ M (D). ⊔

Lemma 1. If φ, φ1 , φ2 are transformation-expressible Boolean conditions on cells, then
NOT φ, φ1 AND φ2 , and φ1 OR φ2 are transformation-expressible Boolean conditions on
cells. ⊔
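The proof of Lemma 1 is not reproduced in this section. One plausible encoding, assuming that constants, sums, differences and products of measures are available as cell-wise arithmetic transformations, is NOT φ = 1 − φ, φ1 AND φ2 = φ1 · φ2, and φ1 OR φ2 = φ1 + φ2 − φ1 · φ2. The sketch below checks these identities on a small matrix; the conditions and the function names are hypothetical.

```python
from itertools import product

cities = ["antwerp", "brussels", "paris", "marseille"]
times = ["t1", "t2"]
cells = list(product(cities, times))

# Two Boolean conditions on cells, stored as 0/1 measures.
phi1 = {c: int(c[0] in ("antwerp", "paris")) for c in cells}
phi2 = {c: int(c[1] == "t1") for c in cells}


def NOT(phi):
    return {c: 1 - phi[c] for c in phi}


def AND(p, q):
    return {c: p[c] * q[c] for c in p}


def OR(p, q):
    # Subtracting the product keeps the result in {0, 1} when both hold.
    return {c: p[c] + q[c] - p[c] * q[c] for c in p}


neither = AND(NOT(phi1), NOT(phi2))
either = OR(phi1, phi2)
```

Example 7 uses the plain sum for OR, which is the special case where the two conditions never hold in the same cell.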

4.2 Dice
Intuitively, the Dice operation selects the cells in a cube D that satisfy a Boolean condition
φ on the cells. The syntax for this operation is DICE(D, φ), where φ is a Boolean condition
over level values and measures. The resulting cube has the same dimensionality as the
original cube. This operation is analogous to a selection in the relational algebra. In a data
cube, it selects the cells that satisfy the condition φ by flagging them with a 1 in the output
cube. Our approach covers all typical cases in real-world OLAP [7]. We next formalize the
operator’s definition in terms of our transformation language. In the remainder, we use
the term OLAP operation to express a sequence of OLAP transformations.
Definition 21 (Dice). Given a data cube D, the operation DICE(D, φ), selects all cells
of the matrix M (D) that satisfy the Boolean condition φ by giving them a 1 flag in the
output. The condition φ is a Boolean combination of conditions of the form: (a) A selector
on a value b at a certain level ℓ of some dimension Di ; (b) A comparison condition at some
level ℓ from a dimension schema σ(Di ) of a dimension Di of the cube of the form ℓ < c or
c < ℓ, where c is a constant (at that level ℓ); (c) An equality or comparison condition on
some measure α of the form α = c, α < c or c < α, where c is a (rational) constant. ⊔

Property 3. Let D be a data cube and let φ be a Boolean condition on the cells of M (D)
(as in Definition 21). The operation DICE(D, φ) is expressible as an OLAP operation. ⊔

4.3 Slice
Intuitively, the Slice operation takes as input a d-dimensional, k-ary data cube D and a
dimension Di and returns as output SLICE(D, Di ), which is a “(d − 1)-dimensional” data
cube in which the original measures µ1 , ..., µk are replaced by their aggregation (sum) over
different values of elements in dom(Di ). In other words, dimension Di is removed from
the data cube, and will not be visible in the next operations. That means, for instance,
that we will not be able to dice on the levels of the removed dimension. As we will see,
the “removal” of dimensions is, in our approach, implemented by means of the destroyer
measure δ. We remark that the aggregation above is due to the fact that, in order to
eliminate a dimension Di , this dimension should have exactly one element [1], therefore a
roll-up (which we explain later in Section 4.5) to the level All in Di is performed.
Definition 22 (Slice). Given a data cube D, and one of its dimensions Di , the operation
SLICE(D, Di ) “replaces” the measures µ1 , µ2 , ..., µk by their aggregation (sum) $\mu_n^{\Sigma_i}$ (for 1 ≤
n ≤ k) as:

$$\mu_n^{\Sigma_i}(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_d) = \sum_{x_i \in \mathrm{dom}(D_i)} \mu_n(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_d),$$

for all (x1 , ..., xi−1 , xi , xi+1 , ..., xd ) ∈ M (D). Further, the operation SLICE(D, Di ) destroys
all cells except those of the representative of all for dimension Di . We abbreviate the above
1-dimensional sum as SUMDi (µn ). ⊔

Property 4. Let D be a data cube and let Di be one of its dimensions. The operation
SLICE(D, Di ) is expressible as an OLAP operation. ⊔

Example 10. Consider dimensions Product, Location, and Time, and measure µ1 =
sales, in our running example. The operation SLICE(D, Location) returns a cube with
(product, time)-cells containing the sums of µ1 for each product-time combination, over
all locations. All cells not belonging to the representative of all in the dimension Location
(i.e., antwerp), are destroyed. The query is expressed by the following transformations.

• τl+1 = γProduct.Bottom (prime labels on products);
• τl+2 = γTime.Bottom (fresh prime labels on time moments);
• τl+3 = τl+1 · τl+2 (product of the two previous prime labels);
• τl+4 = µ1 · τl+3 (product);
• τl+5 = SUM3 (τl+4 ) (3-dimensional sum);
• τl+6 = τl+5 |τl+3 (projection on prime product labels);
• τl+7 = σLocation.All (selects the representative of all in the dimension Location);
• δ = τl+7 (destroys all cells except the representative of all in dimension Location);
• ϕ(1) = σLocation.All (this flag creation selects the relevant cells of the matrix).
Transformation τl+3 gives each (product, time)-combination a unique prime product
label. In τl+4 , this label is multiplied by the sales in each cell. Then, τl+5 is the global sum over
M (D); τl+6 = τl+5 |τl+3 is the projection over the prime product labels for (product, time)-
combinations. This gives each cell above some fixed (product, time)-combination, the sum
of the sales, over all locations, for that combination. All cells of M (D) that do not belong
to antwerp (selected in τl+7 ), which represents all, are destroyed by δ. ⊔
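The transformations of Example 10 can be traced as follows. This is a sketch under assumptions: the sales data are made up, a label √n is stored as the integer n, the prime sum is kept as a dictionary from prime product label to coefficient, and the variable names only mirror the subscripts of the example.

```python
from itertools import product

products = ["p1", "p2"]
cities = ["antwerp", "brussels", "paris", "marseille"]
times = ["t1", "t2"]
cells = list(product(products, cities, times))
sales = {cell: 10 * (i + 1) for i, cell in enumerate(cells)}  # mu_1

product_label = {"p1": 2, "p2": 3}   # sqrt(2), sqrt(3)
time_label = {"t1": 5, "t2": 7}      # fresh labels sqrt(5), sqrt(7)

# tau_{l+3}: prime product label of the (product, time)-combination.
tau_l3 = {c: product_label[c[0]] * time_label[c[2]] for c in cells}

# tau_{l+4} and tau_{l+5}: sales times label, summed over the whole matrix.
prime_sum = {}
for c in cells:
    prime_sum[tau_l3[c]] = prime_sum.get(tau_l3[c], 0) + sales[c]

# tau_{l+6}: projection, each cell reads the coefficient of its own label.
tau_l6 = {c: prime_sum[tau_l3[c]] for c in cells}

# tau_{l+7} and delta: keep only the representative of all (antwerp).
tau_l7 = {c: int(c[1] == "antwerp") for c in cells}
sliced = {c: tau_l6[c] for c in cells if tau_l7[c] == 1}
```

The location coordinate is still present in the surviving cells, but it carries no information any more, which matches the "(d − 1)-dimensional" reading of the result.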

4.4 Slice and Dice


A particular case of the Slice operation occurs when the dimension to be removed already
contains a unique value at the bottom level. Then, we can avoid the roll-up to All, and
define a new operation, called Slice-and-Dice. Although this can be seen as a Dice operation
followed by a Slice one, in practice, both operations are usually applied together.
Definition 23. Given a data cube D, one of its dimensions Di and some value a in the
domain dom(Di ), the operation SLICE-DICE(D, Di , a) contains all the cells in the matrix
M (D) such that the value of the dimension Di equals a. All other cells are destroyed. ⊔

Property 5. Let D be a data cube, Di one of its dimensions, and let a ∈ dom(Di ). The
operation SLICE-DICE(D, Di , a) is expressible as an OLAP operation. ⊔

Example 11. In our running example, the operation SLICE-DICE(D, Location, antwerp)
is implemented by the output flag σLocation.City=antwerp . ⊔

4.5 Roll-Up and Drill-Down


Intuitively, Roll-Up aggregates measure values along a dimension up to a certain level,
whereas Drill-Down disaggregates measure values down to a dimension level. Although at
first sight it may appear that Drill-Down is the inverse of Roll-Up [1], this is not always
the case, e.g., if a Roll-Up is followed by a Slice or a Dice; here, we cannot just undo the
Roll-Up, but we need to account for the cells that have been eliminated on the way.
More precisely, the Roll-Up operation takes as input a data cube D, a dimension Di
and a subpath h of a hierarchy H over Di , starting in a node ℓ′ and ending in a node ℓ,
and returns the aggregation of the original cube along Di up to level ℓ for some of the
input measures α1 , α2 , ..., αr . Roll-Up uses one of the classic SQL aggregation functions,
applied to the indicated protected and computed measures α1 , α2 , ..., αr (selected from
µ1 , µ2 , ..., µk ; τ1 , ..., τl ; ϕ), namely sum (SUM), average (AVG), minimum/maximum (MIN
and MAX), count and count-distinct (COUNT and COUNT-DISTINCT). Usually, measures
have an associated default aggregation function. The typical aggregation function for the
measure sales, e.g., is SUM. We denote the above operation as ROLL-UP(D, Di , H(ℓ′ →
ℓ), {(αi , fi ) | i = 1, 2, ..., r}), where fi is one of the above aggregation functions that is
associated to αi , for i = 1, 2, ..., r. Since we are mainly interested in the expressiveness of
this operation as a sequence of atomic transformations, only the destination node ℓ in the
path h is relevant. Indeed, the result of this roll-up remains the same if the subpath h is
extended to start from the Bottom node of dimension Di . So, we can simplify the notation,
replacing H(ℓ′ → ℓ) with H(ℓ), and assume that the roll-up starts at the Bottom level.
The Drill-down operation takes as input a data cube D, a dimension Di and a subpath
h of a hierarchy H over Di , starting in a node ℓ and ending in a node ℓ′ (at a lower level in
the hierarchy), and returns the aggregation of the original cube along Di from the bottom
level up to level ℓ′ . The drill-down uses the same type of aggregation functions as the
roll-up. Again, since we are only interested in the expressiveness of this operation, the
drill-down operation DRILL-DOWN(D, Di , H(ℓ′ ← ℓ), {(αi , fi ) | i = 1, 2, ..., r}), has the
same output as ROLL-UP(D, Di , H(ℓ′ ), {(αi , fi ) | i = 1, 2, ..., r}). Therefore, we can limit
the further discussion in this section to the roll-up.
Definition 24 (Roll-Up). Given a data cube D, one of its dimensions Di , and a hierarchy
H over Di , ending in a node ℓ, the operation ROLL-UP(D, Di , H(ℓ), {(αi , fi ) | i = 1, ..., r})
computes the aggregation of the measures αi by their aggregation functions fi , for i =
1, 2, ..., r, as follows:

$$\alpha_i^{f_i}(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_d) = f_i(\{\alpha_i(x_1, \ldots, x_{i-1}, y_i, x_{i+1}, \ldots, x_d) \mid y_i \in \mathrm{dom}(D_i) \text{ and } \rho_H(y_i, b)\}),$$

for all (x1 , ..., xi−1 , xi , xi+1 , ..., xd ) ∈ M (D), for which ρH (xi , b), for some b ∈ dom(Di .ℓ).
This roll-up flags all representative Bottom-level objects as active. ⊔

Property 6. Let D be a data cube, let Di be one of its dimensions, and let H be a hierarchy
over Di ending in a node ℓ. Let {(αi , fi ) | i = 1, 2, ..., r} be a set of selected measures (taken
from the protected measures µ1 , µ2 , ..., µk and the computed measures τ1 , ..., τl of D), with
their associated aggregation functions. The operation ROLL-UP(D, Di , H(ℓ), {(αi , fi ) | i =
1, 2, ..., r}) is expressible as an OLAP operation. ⊔

Example 12. We next express the Roll-Up operation, using prime (product) labels, sums,
projections, and the 3-dimensional sum. We look at the query “total sales per country”.
We use the simplified syntax, only indicating the target level of the roll-up on the Location
dimension (i.e., Country). The query ROLL-UP(D, Location, Country, {(sales, SUM)}) is
the result of the following transformations, given the measure µ1 = sales:

1. τℓ+1 = γProduct.Bottom (prime labels on products);
2. τℓ+2 = γTime.Bottom (prime labels on time moments);

3. τℓ+3 = γLocation.Country (prime labels on countries);
4. τℓ+4 = τℓ+1 · τℓ+2 · τℓ+3 (prime product label – in one step);
5. τℓ+5 = µ1 · τℓ+4 (product of labels with sales);
6. τℓ+6 = SUM3 (τℓ+5 ) (3-dimensional sum);
7. τℓ+7 = τℓ+6 |τℓ+4 (projection on prime product labels);

8. ϕ(1) = σLocation.Country (output flag on country-representatives).

Transformation τℓ+4 gives every product-date-country combination a unique prime product
label. Normally this product takes more steps. Above, we have abbreviated it to one
transformation. The transformation τℓ+7 gives the aggregation result, and ϕ(1) is the flag
that says that only the cities antwerp and paris, which represent the level Country, are
active in the output (and nothing else of the original cube). ⊔
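Example 12 can be sketched in the same symbolic style and compared against a direct group-by. The sales data, the specific integers chosen as labels (a label √n is stored as n), and the choice of antwerp and paris as country representatives follow the text where it is explicit and are otherwise assumptions made for illustration.

```python
from itertools import product

products = ["p1", "p2"]
cities = ["antwerp", "brussels", "paris", "marseille"]
times = ["t1", "t2"]
country_of = {"antwerp": "belgium", "brussels": "belgium",
              "paris": "france", "marseille": "france"}
cells = list(product(products, cities, times))
sales = {cell: 10 * (i + 1) for i, cell in enumerate(cells)}  # mu_1

product_label = {"p1": 2, "p2": 3}
time_label = {"t1": 5, "t2": 7}
country_label = {"belgium": 11, "france": 13}

# Step 4: prime product label of the product-time-country combination.
tau4 = {c: product_label[c[0]] * time_label[c[2]] * country_label[country_of[c[1]]]
        for c in cells}

# Steps 5 and 6: sales times label, summed over the whole matrix.
prime_sum = {}
for c in cells:
    prime_sum[tau4[c]] = prime_sum.get(tau4[c], 0) + sales[c]

# Step 7: projection of the prime sum on the labels of step 4.
tau7 = {c: prime_sum[tau4[c]] for c in cells}

# Step 8: output flag on the Bottom-level representatives of the countries.
representatives = {"antwerp", "paris"}
flag = {c: int(c[1] in representatives) for c in cells}
rolled_up = {c: tau7[c] for c in cells if flag[c] == 1}
```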

4.6 The Composition of Classical OLAP Operations


The main result of this paper is the proof of the completeness of an OLAP algebra, com-
posed of the OLAP operations Dice (Section 4.2), Slice (Section 4.3), Slice-and-Dice (Sec-
tion 4.4), Roll-Up, and Drill-Down (Section 4.5). This is summarized by Theorem 1.
Theorem 1. The classical OLAP operations and their composition are expressible by
OLAP operations (that is, as sequences of atomic OLAP transformations). ⊔

We next illustrate the power and generality of our approach, combining a sequence of
OLAP operations, and expressing them as a sequence of OLAP transformations.
Example 13. An OLAP user is analyzing sales in different countries and regions. She
wants to compare sales in the north of Belgium (the Flanders region), and in the south
of France (which we, generically, have denoted south in our running example). She first
filters the cube, keeping just the cells of those two regions. This is done with the expres-
sion: DICE(D, Location.Region = flanders OR Location.Region = south). We showed
that this can be implemented as a sequence of atomic OLAP transformations. Now she
has a cube with the cells that have not been destroyed. Next, within the same navi-
gation process, she obtains the total sales in France and Belgium, only considering the
desired regions, by means of: ROLL-UP(D, Location, Country, {(sales, SUM)}). This will
only consider the valid cells for rolling up. After this, our user only wants to keep
the sales in France. Thus, she writes: DICE(D, Location.Country = france). Finally,
she wants to go back to the details, one level below in the hierarchy, so she writes:
DRILL-DOWN(D, Location, Region, {(sales, SUM)}), implemented as a roll-up from the
bottom level to Region, only considering the cells that have not been destroyed. ⊔
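The navigation of Example 13 can be mimicked with two small helpers. This sketch works directly on the surviving Bottom-level cells rather than on atomic transformations, and it relies on assumptions: the sales data are made up, the region names `brussels_region` and `ile_de_france` are hypothetical placeholders for regions the text does not name, and `dice` and `roll_up_sum` are illustrative helpers, not operators defined in the paper.

```python
from itertools import product

products = ["p1", "p2"]
times = ["t1", "t2"]
# Hypothetical hierarchy City -> Region -> Country.
region_of = {"antwerp": "flanders", "brussels": "brussels_region",
             "paris": "ile_de_france", "marseille": "south"}
country_of = {"antwerp": "belgium", "brussels": "belgium",
              "paris": "france", "marseille": "france"}
cells = list(product(products, list(region_of), times))
sales = {cell: 10 * (i + 1) for i, cell in enumerate(cells)}


def dice(cube, condition):
    """Keep only the cells satisfying the condition; the rest is destroyed."""
    return {c: v for c, v in cube.items() if condition(c)}


def roll_up_sum(cube, level_of):
    """Sum the surviving Bottom-level cells up to the level given by level_of."""
    totals = {}
    for (p, city, t), value in cube.items():
        key = (p, level_of[city], t)
        totals[key] = totals.get(key, 0) + value
    return totals


# DICE on the two regions, then ROLL-UP to Country over the surviving cells.
kept = dice(sales, lambda c: region_of[c[1]] in ("flanders", "south"))
by_country = roll_up_sum(kept, country_of)
# DICE on france, then DRILL-DOWN as a roll-up from Bottom to Region.
kept_france = dice(kept, lambda c: country_of[c[1]] == "france")
by_region = roll_up_sum(kept_france, region_of)
```

Because the cells destroyed by the first Dice never come back, the drill-down only reports the south region, which is the behaviour the example describes.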

5 Conclusion and Discussion


We have presented a formal, mathematical approach to a practical problem: providing
a formal semantics for a collection of the OLAP operations most frequently
used in real-world practice. Although OLAP is a very popular field in data analytics,
this is the first time a formalization like this is given. The need for this formalization is
clear: in a world being flooded by data of different kinds, users must be provided with
tools allowing them to have an abstract “cube view” and cube manipulation capabilities,
regardless of the underlying data types. Without a solid basis and unambiguous definition
of cube operations, the former could not be achieved. We claim that our work is the first
one of this kind, and will serve as a basis to build more robust practical tools to address
the forthcoming challenges in this field.
We have addressed the four core OLAP operations: slice, dice, roll-up, and drill-down.
Restricting ourselves to these does not harm the value of the work. On the contrary, this approach allows us to
focus on our main interest, that is, to study the formal basis of the problem. Our line of
work can be extended to address other kinds of OLAP queries, like queries involving more
complex aggregate functions like moving averages, rankings, and the like. Further, cube
combination operations, like drill-across, must be included in the picture. We believe that
our contribution provides a solid basis upon which a complete OLAP theory can be built.

Acknowledgements: Alejandro Vaisman was supported by a travel grant from Hasselt
University (Korte verblijven–inkomende mobiliteit, BOF15KV13). He was also partially
supported by PICT-2014 Project 0787.

References
[1] R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. In
Proceedings of the 15th International Conference on Data Engineering, (ICDE), pages
232–243, Birmingham, UK, 1997. IEEE Computer Society.

[2] C. Ciferri, R. Ciferri, L. Gómez, M. Schneider, A. Vaisman, and E. Zimányi. Cube
algebra: A generic user-centric model and query language for OLAP cubes. Interna-
tional Journal of Data Warehousing and Mining, 9(2):39–65, 2013.

[3] F. Dehne, Q. Kong, A. Rau-Chaplin, H. Zaboli, and R. Zhou. Scalable real-time OLAP
on cloud architectures. Journal of Parallel and Distributed Computing, 79–80:31–41,
2015. Special Issue on Scalable Systems for Big Data Management and Analytics.

[4] J.-P. Escofier. Galois Theory, volume 204 of Graduate Texts in Mathematics. Springer-
Verlag, 2001.

[5] R. Kimball. The Data Warehouse Toolkit: Practical Techniques for Building Dimen-
sional Data Warehouse. Wiley, 1996.

[6] O. Romero and A. Abelló. On the need of a reference algebra for OLAP. In Proceedings
of the 9th International Conference on Data Warehousing and Knowledge Discovery,
DaWaK’07, pages 99–110, Regensburg, Germany, 2007.

[7] A. Vaisman and E. Zimányi. Data Warehouse Systems: Design and Implementation.
Springer, 2014.
