AlgebraOLAP
Abstract
Online Analytical Processing (OLAP) comprises tools and algorithms that allow
querying multidimensional databases. It is based on the multidimensional model,
where data can be seen as a cube, in which each cell contains one or more measures
that can be aggregated along dimensions. Despite the extensive corpus of work in the field,
a standard language for OLAP is still needed, since there is no well-defined, accepted
semantics for many of the usual OLAP operations. In this paper, we address this
problem and present a set of operations for manipulating a data cube. We clearly
define the semantics of these operations, and prove that they can be composed, yielding
a language powerful enough to express complex OLAP queries. We express these
operations as a sequence of atomic transformations over a fixed multidimensional matrix,
whose cells contain a sequence of measures. Each atomic transformation produces a
new measure. When a sequence of transformations defines an OLAP operation, a flag
is produced indicating which cells must be considered as input for the next operation.
In this way, an elegant algebra is defined. Our main contribution, with respect to other
similar efforts in the field, is that, for the first time, a formal proof of the correctness
of the operations is given, thus providing a clear semantics for them. We believe the
present work will serve as a basis to build more solid practical tools for data analysis.
1 Introduction
Online Analytical Processing (OLAP) [5] comprises a set of tools and algorithms that allow
efficiently querying multidimensional (MD) databases containing large amounts of data,
usually called Data Warehouses (DW). Conceptually, in the MD model, data can be seen
as a cube, where each cell contains one or more measures of interest that quantify facts.
Measure values can be aggregated along dimensions, which give context to facts. At the
logical level, OLAP data are typically organized as a set of dimension and fact tables.
Current database technology allows alphanumerical warehouse data to be integrated, for
example, with geographical or social network data, for decision making. In the era of
so-called “Big Data”, the kinds of data that can be handled by data management tools
are likely to increase in the near future. Moreover, OLAP and Business Intelligence (BI)
tools allow users to capture, integrate, manage, and query different kinds of information, for
example, alphanumerical data coming from a local DW, spatial data (e.g., temperature)
represented as rasterized images, and/or economic data published on the semantic web.
Ideally, a BI user would just like to deal with what she knows well, namely the data cube,
using only the classical OLAP operators, like Roll-up, Drill-down, Slice, and Dice (among
others), regardless of the cube’s underlying data type. Data types should only be handled
at the logical and physical levels, not at the conceptual level. Building on this idea, Ciferri
et al. [2] proposed a conceptual, user-oriented model, independent of OLAP technologies.
In this model, the user only manipulates a data cube. Associated with the model, there is a
query language providing high-level operations over the cube. This language, called Cube
Algebra, was sketched informally in the mentioned work. Extensive examples of the use
of Cube Algebra, presented in [7], suggest that this idea can lead to a language much more
intuitive and simple than MDX, the de facto standard for OLAP. Nevertheless, these works
do not give any evidence of the correctness of the languages and operations proposed, other
than examples at various degrees of comprehensiveness. In fact, surprisingly, and in spite
of the large corpus of work in the field, a formally defined reference language for OLAP
is still needed [6]. There is not even a well-defined, accepted semantics for many of the
usual OLAP operations. We believe that, far from being just a problem of classical OLAP,
this formalization is also needed in current “Big Data” scenarios, where there is a need to
efficiently perform real-time OLAP operations [3] that, of course, must be well defined.

1 Extended abstract. Full version to appear in Intelligent Data Analysis, 21(5), 2017.
2 Databases and Theoretical Computer Science Research Group, Hasselt University and Transnational University of Limburg; email: bart.kuijpers@uhasselt.be
3 Instituto Tecnológico de Buenos Aires, Buenos Aires, Argentina; email: avaisman@itba.edu.ar
2 The OLAP Data Model
In this section we describe the OLAP data model we use in the sequel.
Figure 1: Dimension schemas for the dimensions Location, in (a), and Time, in (b).
As a convention, level names start with a capital letter. Note that the Bottom node is
often renamed, depending on the application.
Example 3. Fig. 1 gives examples of dimension schemas σ(Location) and σ(Time) for
the dimensions Location and Time in Example 1. For the dimension Location, we have
Bottom = City, and there is only one hierarchy, denoted City → Region → Country →
All. The node Region is an example of a level in this hierarchy. For the dimension Time,
we have Bottom = Day, and two hierarchies, namely Day → Month → Semester →
Year → All and Day → Week → All. □
Figure 2: An example of a dimension graph (or instance) I(σ(Location)).
In the remainder of this paper we assume that Γ = Q, the set of the rational numbers.
For most applications, this suffices. Also, as a notational convention, we use calligraphic
characters, like D, to represent data cube instances.
The flag ϕ can be considered as a (k + 1)-st Boolean measure. The role of ϕ is to
indicate which of the matrix cells are currently “active”. The active cells have a flag value
1 and the others have a flag value 0. When we operate over a data cube, flags are used to
indicate the input or output parts of the matrix of the cube. Typically, in the beginning
of the operations, all cells have a flag value of 1. The role of flags will become clearer
in the next sections, when we discuss OLAP transformations and operations.
be clearly defined, and which can be applied in different order. For example, we can first
apply a Roll-Up (i.e., an aggregation) to the Country level, and once at that level apply a
Dice operation, which keeps the cube cells corresponding to Belgium or France. Finally, a
Drill-Down can be applied to disaggregate the sales down to the level Region, returning the
desired result. In what follows, we characterize OLAP operations as the result of sequences
of “atomic” OLAP transformations, which are measure-creating updates to a data cube.
cedes the creation of an output flag. A destructor δ takes the value 1 for some cells of
the matrix of a data cube, and 0 on other cells. When δ is invoked (and activated by the
output flag that follows it) on a data cube D with measures $\mu_1, \mu_2, \ldots, \mu_k; \tau_1, \tau_2, \ldots, \tau_m$ and
flag $\varphi_O^{(m)}$, it empties all cells for which the value of the destructor δ is 0 by removing all
measures from them, even the protected ones, thereby effectively “destroying” these cells.
This is the only case where the protected measures are altered (see operations Slice or Dice,
later). The output of a destructive operation O looks like
$\mu_1, \mu_2, \ldots, \mu_k; \tau_1, \tau_2, \ldots, \tau_l; \delta; \varphi_O^{(m)}$,
in which the destructor precedes the output flag. The effect of the presence of a
destructor is the following. A cell such that δ = 0 is emptied, after which it contains no
more measures and no flag. For cells with δ = 1, the sequence of measures
$\mu_1, \mu_2, \ldots, \mu_k; \tau_1, \tau_2, \ldots, \tau_l; \delta; \varphi_O^{(m)}$
is transformed to
$\mu_1, \mu_2, \ldots, \mu_k; \tau_{l-m+1}, \tau_{l-m+2}, \ldots, \tau_l; \varphi_O^{(m)}$,
which is renamed as $\mu_1, \mu_2, \ldots, \mu_k; \tau_1, \tau_2, \ldots, \tau_m; \varphi$ before the next transformation takes place. This
transformation will act, cell per cell, on the matrix of a cube, and it does nothing with
emptied cells. That is, no new measure can ever be added to a destroyed cell.
The following definition specifies how an OLAP transformation acts on a data cube.
We then address in detail each atomic OLAP transformation appearing in this definition.
of a sequence of n consecutive OLAP transformations that create the additional measures
$\tau_{l+1}, \ldots, \tau_{l+n}$, followed by the creation of an m-ary flag $\varphi_O^{(m)}$ (possibly preceded by a
destructor δ). As the result of the creation of $\varphi_O^{(m)}$, the measures in the cells of the data cube are
changed from
$\mu_1, \mu_2, \ldots, \mu_k; \tau_1, \ldots, \tau_l; \varphi; \tau_{l+1}, \ldots, \tau_{l+n}$
to
$\mu_1, \mu_2, \ldots, \mu_k; \tau_{l+n-m+1}, \ldots, \tau_{l+n}; \varphi_O^{(m)}$,
which become $\mu_1, \mu_2, \ldots, \mu_k; \tau_1, \ldots, \tau_m; \varphi$ after renaming. The output cube $D' = O(D)$ has
the same dimensions, hierarchy schemas and instances as D, and measures
$\mu_1, \mu_2, \ldots, \mu_k; \tau_1, \ldots, \tau_m; \varphi$. In the case where $\varphi_O^{(m)}$ is preceded by a destructor δ, the same procedure is
followed, except for the cells of M(D) for which δ takes the value 0. These cells of M(D)
are emptied, contain no measures, and become inaccessible for future transformations. □
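As an illustration only, the bookkeeping at the end of an operation can be sketched in Python. The data layout is an assumption of this sketch and not part of the formal model: a cube is a dictionary from coordinates to cells, a cell is a dictionary holding the protected measures `mu`, the computed measures `tau` and the flag `phi`, and an emptied cell is represented by `None`. The name `finish_operation` is hypothetical.

```python
def finish_operation(cells, m, flag, destructor=None):
    """Close an OLAP operation on a cube given as {coordinates: cell}.

    Each cell is either None (emptied earlier) or a dict with keys
    'mu' (protected measures), 'tau' (computed measures) and 'phi' (flag).
    `flag` and `destructor` map a cell dict to 0 or 1.
    """
    out = {}
    for coords, cell in cells.items():
        if cell is None:
            out[coords] = None  # destroyed cells stay empty
        elif destructor is not None and destructor(cell) == 0:
            out[coords] = None  # delta = 0: all measures are removed, even protected ones
        else:
            out[coords] = {
                "mu": cell["mu"],                       # protected measures are kept
                "tau": cell["tau"][-m:] if m else [],   # last m computed measures, renamed tau_1..tau_m
                "phi": flag(cell),                      # the output flag becomes the new phi
            }
    return out


# Hypothetical one-dimensional cube: tau_1 is a 0/1 test, tau_2 a computed value.
cube = {
    ("antwerp",): {"mu": (60,), "tau": [1, 60], "phi": 1},
    ("paris",): {"mu": (40,), "tau": [0, 0], "phi": 1},
}
result = finish_operation(
    cube, m=1, flag=lambda c: c["tau"][0], destructor=lambda c: c["tau"][0]
)
```

In this sketch the cell of paris is emptied because its destructor value is 0, while the cell of antwerp keeps its protected measure and only the last computed measure.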
3. (Equality test on levels) For a level $\ell$ in the dimension schema $\sigma(D_i)$ of dimension
$D_i$, and a constant object $c \in \mathrm{dom}(D_i.\ell)$, $\tau_{l+1}(x_1, x_2, \ldots, x_d) = (\ell = c)$ is an “equality”
test. Here, the result of $(\ell = c)$ is a Boolean 1 or 0 (cell per cell in the non-empty
cells of the matrix) such that $\tau_{l+1}(x_1, x_2, \ldots, x_d)$ is 1 if and only if $x_i$ rolls up to c at
level $\ell$, that is, $\rho(x_i, c)$.
4. (Comparison test on levels) For a level $\ell$ in the dimension schema $\sigma(D_i)$ of
dimension $D_i$, and a constant $c \in \mathrm{dom}(D_i.\ell)$, $\tau_{l+1}(x_1, x_2, \ldots, x_d) = (\ell <_\ell c)$ is a
“comparison” test. The result of $(\ell <_\ell c)$ is a Boolean 1 or 0 (cell per cell in the non-empty
cells of the matrix), such that $\tau_{l+1}(x_1, x_2, \ldots, x_d)$ is 1 if and only if $x_i$ rolls up to an
object b at level $\ell$ for which $b <_\ell c$. The order $<_\ell$ can be any order that is defined on
level $\ell$. Transformation $\tau_{l+1}(x_1, x_2, \ldots, x_d) = (c <_\ell \ell)$ is defined similarly. □
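The two level tests can be sketched as follows. This is a minimal Python illustration under assumptions of our own: the roll-up relation ρ is given as a dictionary from Bottom-level objects to their ancestor at the chosen level, cells are coordinate tuples, and the order on the level is taken to be alphabetical. The function names and the data are hypothetical.

```python
# Hypothetical roll-up from the Bottom level (City) to the level Country.
CITY_TO_COUNTRY = {
    "antwerp": "belgium", "brussels": "belgium",
    "paris": "france", "marseille": "france",
}


def equality_test(i, rollup, c):
    """Return tau with tau(x_1..x_d) = 1 iff x_i rolls up to c at the level."""
    return lambda cell: 1 if rollup[cell[i]] == c else 0


def comparison_test(i, rollup, c, less_than):
    """Return tau with tau(x_1..x_d) = 1 iff x_i rolls up to some b with b < c."""
    return lambda cell: 1 if less_than(rollup[cell[i]], c) else 0


cells = [(city, prod) for city in CITY_TO_COUNTRY for prod in ("p1", "p2")]
in_belgium = equality_test(0, CITY_TO_COUNTRY, "belgium")
# Alphabetical order on country names is an assumption of this sketch.
before_france = comparison_test(0, CITY_TO_COUNTRY, "france", lambda a, b: a < b)

tau_eq = {cell: in_belgium(cell) for cell in cells}
tau_lt = {cell: before_france(cell) for cell in cells}
```

Both tests depend only on the coordinate in dimension i, so all cells that share that coordinate receive the same Boolean value.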
• τ3 = µ1 · τ2 (product);
• δ = τ2 (destructor); and
The measure τ3 contains the sales values larger than or equal to 50 (and a 0 if the
sales are lower). The destructor δ destroys the cells in which it contains a 0. Finally, the flag
ϕ(1) selects all cells from the input as output cells (it will contain a 1 for all such cells that
satisfy the condition), and concludes the DICE(D, sales ≥ 50) operation. The output of
this operation is sales; τ3; ϕ(1), which is then renamed to sales; τ1; ϕ. □
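A small Python sketch of this dice, keeping cells with sales of at least 50, may help to follow the steps. The first two steps are not shown in the list above and are assumptions of this sketch (a comparison test followed by a negation). The one-dimensional data are invented, and emptied cells are represented by `None`.

```python
# Hypothetical sales per city (a one-dimensional cube, for brevity).
sales = {"antwerp": 60, "brussels": 50, "paris": 40, "marseille": 75}

# Assumed step: comparison test on the measure, 1 iff sales < 50.
tau1 = {c: 1 if v < 50 else 0 for c, v in sales.items()}
# Assumed step: negation of the test, 1 iff sales >= 50.
tau2 = {c: 1 - tau1[c] for c in sales}
# Product from the text: the sales value if it is at least 50, else 0.
tau3 = {c: sales[c] * tau2[c] for c in sales}
# Destructor from the text.
delta = tau2

# Cells with delta = 0 are emptied. The other cells keep the protected
# measure, keep tau3 (renamed to tau1) and receive flag 1.
diced = {
    c: {"sales": sales[c], "tau1": tau3[c], "phi": 1} if delta[c] == 1 else None
    for c in sales
}
```

Here the cell of paris is emptied, and the cell of brussels, whose sales are exactly 50, survives.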
3.3.3 Selectors
Definition 13 (Selector Transformations). The following creations of a new measure τl+1
are selector transformations (or selectors), and their definition is cell per cell of M (D):
□
The constant selector in Definition 13 corresponds to the equality test on levels (see
item 3 in Definition 12). Here, this transformation appears with a different functionality,
so we repeat it and reserve a special notation for it. Also, note that the level selector
selects all representatives (at the Bottom level) of objects at level ℓ of dimension Di.
• τ3 = τ1 + τ2 (sum);
• τ4 = τ3 · µ1 (product);
1. (Count-Distinct) $\tau_{l+1} = \#_{\neq}(\alpha)$, with $\alpha \in \{\mu_1, \mu_2, \ldots, \mu_k, \tau_1, \tau_2, \ldots, \tau_l\}$, counts the number
of distinct values of measure α in the complete matrix M(D) of the data cube.
2. (d-dimensional sum) $\tau_{l+1} = \sum_{(x_1, x_2, \ldots, x_d) \in M(D)} \alpha(x_1, x_2, \ldots, x_d)$, with
$\alpha \in \{\mu_1, \mu_2, \ldots, \mu_k, \tau_1, \tau_2, \ldots, \tau_l\}$, gives the sum of the measure α over all non-empty matrix cells. We
abbreviate this operation by writing $\tau_{l+1} = \mathrm{SUM}_d(\alpha)$, and call this transformation
the d-dimensional sum.
3. (Min-Max) $\tau_{l+1} = \min(\alpha)$, with $\alpha \in \{\mu_1, \mu_2, \ldots, \mu_k, \tau_1, \tau_2, \ldots, \tau_l\}$, gives the smallest
value of the measure α in non-empty cells of the matrix M(D). Similarly, $\tau_{l+1} = \max(\alpha)$
gives the largest value of the measure α in the matrix M(D). □
It is important to remark that the above transformations create the same new measure
value for all cells of the matrix M (D).
Example 8. Now, we look at the query “total sales in antwerp”. The query can be com-
puted as follows, given µ1 = sales:
• τ2 = τ1 · µ1 (product that selects the sales in antwerp, puts a 0 in all other ones);
• τ4 = τ3 · τ1 (this is the total sales in antwerp in the cells of antwerp);
• ϕ(1) = τ1 (this flag creation selects the cells of antwerp).
The output measures are sales; τ4 ; ϕ(1) , which are renamed sales; τ1 ; ϕ. Thus, the value of
the total of sales in antwerp is now available in every cell corresponding to antwerp. For
the cells outside antwerp there is a 0. We remark that this example can be modified with
a destructor that effectively empties cells outside antwerp. □
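The query of Example 8 can be traced with a short Python sketch. Two steps are not shown in the list above and are assumed here: τ1 is taken to be the selector for antwerp (consistent with the flag ϕ(1) = τ1) and τ3 is taken to be the sum of τ2 over the whole matrix. The two-dimensional data are invented.

```python
# Hypothetical cube with cells (city, product) and measure sales.
sales = {
    ("antwerp", "p1"): 10, ("antwerp", "p2"): 20,
    ("brussels", "p1"): 5, ("paris", "p1"): 7,
}

# Assumed step: selector, 1 in the cells of antwerp and 0 elsewhere.
tau1 = {cell: 1 if cell[0] == "antwerp" else 0 for cell in sales}
# Product: the sales in antwerp, 0 in all other cells.
tau2 = {cell: tau1[cell] * sales[cell] for cell in sales}
# Assumed step: sum over the whole matrix, same value in every cell.
total = sum(tau2.values())
tau3 = {cell: total for cell in sales}
# Product: the total is kept only in the cells of antwerp.
tau4 = {cell: tau3[cell] * tau1[cell] for cell in sales}
# Flag creation: the cells of antwerp are the output cells.
phi = dict(tau1)
```

Every cell of antwerp now holds the total 30, and the cells outside antwerp hold 0, as described in the example.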
3.3.5 Grouping
The most common OLAP operations (e.g., roll-up, slice), require grouping data before ag-
gregating them. For example, typically we will ask queries like “total sales by city”, which
requires grouping facts by city, and, for each group, sum all of its sales. Therefore, we
need a transformation to express “grouping”. To deal with grouping, we use the concept of
“prime labels” for sets and products of sets. We will use these labels to identify elements
in dimensions and in dimension levels. Before giving the definition of the grouping trans-
formations, we elaborate on prime labels and product of prime labels. As we show, these
prime labels work in the context of measures that take rational values (as it is often the
case, in practice). The following definition specifies our infinite supply of prime labels.
Definition 15 (Prime Labels). Let $p_n$ denote the n-th prime number, for n ≥ 1. We define
the sequence of prime labels as follows: $1, \sqrt{2}, \sqrt{3}, \sqrt{5}, \sqrt{7}, \sqrt{11}, \ldots, \sqrt{p_n}, \ldots$. We denote the
set of all prime labels by $\sqrt{P}$. □
Definition 16 (Prime Labeling of Sets). Let $A, A_1, A_2, \ldots, A_n$ be (finite) sets. A prime
labeling of the set A is an injective function $w : A \to \sqrt{P}$. For a ∈ A, we call w(a) the
prime label of a (for the prime labeling w).
Let I be a subset of {1, 2, ..., n}, which serves as an index set. A prime product
I-labeling of the Cartesian product $A_1 \times A_2 \times \cdots \times A_n$ consists of prime labelings $w_i$ of the
sets $A_i$, for i ∈ I, that satisfy the condition that $w_i(A_i) \cap w_j(A_j)$ is empty for i, j ∈ I
and $i \neq j$. For $(a_1, a_2, \ldots, a_n) \in A_1 \times A_2 \times \cdots \times A_n$, we call $\prod_{i \in I} w_i(a_i)$ the prime product
I-label of $(a_1, a_2, \ldots, a_n)$ (given the prime labelings $w_i$, for i ∈ I). When I is a strict subset
of {1, 2, ..., n}, we speak about a partial prime product labeling and when I = {1, 2, ..., n},
we speak about a full prime product labeling. □
If we view a Cartesian product A1 × A2 × · · · × An as a finite matrix, whose cells
contain rational-valued measures, we can use prime (product) labelings as follows in the
aggregation process. Let us assume that the cells of A1 × A2 × · · · × An contain rational
values of a measure µ and let us denote the value of this measure in the cell (a1 , a2 , ..., an )
by µ(a1 , a2 , ..., an ). If we have a full prime product labeling on A1 × A2 × · · · × An , then
we can consider the sum over this Cartesian product of the product of the prime product
labels with the value of µ:
$$\sum_{(a_1, a_2, \ldots, a_n) \in A_1 \times A_2 \times \cdots \times A_n} \mu(a_1, a_2, \ldots, a_n) \cdot w_1(a_1) \cdot w_2(a_2) \cdots w_n(a_n). \qquad (\dagger_1)$$
Since each cell of A1 × A2 × · · · × An has a unique prime product label, and since
these labels are rationally independent (see Property 2), this sum enables us to retrieve
the values µ(a1 , a2 , ..., an ).
If we have a partial prime product labeling on A1 × A2 × · · · × An , determined by
an index set I, then, again, we can consider the sum over this Cartesian product of the
product of the partial prime product labels with the value of µ:
$$\sum_{(a_1, a_2, \ldots, a_n) \in A_1 \times A_2 \times \cdots \times A_n} \mu(a_1, a_2, \ldots, a_n) \cdot \prod_{i \in I} w_i(a_i). \qquad (\dagger_2)$$
We remark that we use these prime (product) labels in a purely symbolic way without
actually calculating the square root values in them. We are now ready to define atomic
OLAP operations that allow us to implement grouping. In what follows, we apply these
prime labels to the case where the sets Ai in A1 × A2 × · · · × An are domains of dimensions
(e.g., at the bottom level), or domains of dimensions at some level.
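The symbolic use of prime product labels can be illustrated with a minimal Python sketch. It rests on assumptions of our own: a label √p is stored as the integer p (with 1 for the label 1), a product of labels from disjoint labelings is stored as the product of these integers, and a prime sum is stored as a dictionary from such integers to rational coefficients, so that no square root is ever evaluated. The names `label_supply`, `prime_labeling` and `prime_sum` are hypothetical.

```python
from fractions import Fraction
from itertools import count


def label_supply():
    """Yield 1, 2, 3, 5, 7, ...; each integer p stands symbolically for sqrt(p)."""
    yield 1
    found = []
    for n in count(2):
        if all(n % p for p in found):  # trial division, fine for small supplies
            found.append(n)
            yield n


def prime_labeling(elements, supply):
    """Injective map from elements to fresh prime labels."""
    return {a: next(supply) for a in elements}


def prime_sum(measure, labelings):
    """Symbolic sum of mu(a) times the product of w_i(a_i) for i in I.

    `labelings` maps each index i in I to the labeling w_i. The result maps
    a prime product label (stored as an integer) to its rational coefficient.
    """
    total = {}
    for cell, value in measure.items():
        key = 1
        for i, w in labelings.items():
            key *= w[cell[i]]
        total[key] = total.get(key, Fraction(0)) + Fraction(value)
    return total


supply = label_supply()
w_city = prime_labeling(["antwerp", "paris"], supply)
w_prod = prime_labeling(["p1", "p2"], supply)
mu = {
    ("antwerp", "p1"): 10, ("antwerp", "p2"): 20,
    ("paris", "p1"): 7, ("paris", "p2"): 1,
}
full = prime_sum(mu, {0: w_city, 1: w_prod})   # full labeling, as in the first sum
partial = prime_sum(mu, {0: w_city})           # partial labeling with I = {1}
```

With the full labeling, each coefficient is the value of one cell. With the partial labeling on the first component, the coefficients are the sums per city, which is the effect that grouping relies on.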
Definition 18 (Grouping Transformations). The following creations of a new measure
$\tau_{l+1}$ are grouping transformations:
1. (Prime labels for groups in one dimension) Let $D_i$ be a dimension and $\ell$ a level
in the dimension schema $\sigma(D_i)$ of $D_i$. Let $\mathrm{dom}(D_i.\ell) = \{b_1, b_2, \ldots, b_m\}$
with induced order $b_1 < b_2 < \cdots < b_m$ (see Property 1). If the prime labels
$w_1, w_2, \ldots, w_k$ have been used by previous transformations, then for all j, with $j \neq i$,
and all $x_j \in \mathrm{dom}(D_j)$, we have $\tau_{l+1}(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_d) = w_{k+l}$ if $\rho(x_i, b_l)$.
We denote this transformation by $\gamma_{D_i.\ell}(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_d)$ or $\gamma_{D_i.\ell}$, for short,
and call the result of such a transformation a prime labeling.
2. (Projection of a prime sum) If the result of some previous transformation $\tau_m$ is
a (full or partial) prime sum $\sum_{i=k}^{k+l} a_i \cdot w_i$ (over the complete matrix M(D)) in which
prime (product) labels $w_k, w_{k+1}, \ldots, w_{k+l}$ (computed in a previous transformation $\tau_n$)
are used, then $\tau_{l+1}$ is a new measure that “projects” on the appropriate component
from the prime sum, that is, $\tau_{l+1}(x_1, x_2, \ldots, x_d) = a_{k+l}$ if the prime (product) label
$\tau_n(x_1, x_2, \ldots, x_d) = w_{k+l}$. We denote this projection transformation by $\tau_m|\tau_n$. □
Example 9. Consider the query “for each country, give the total number of cities”. This
query can be implemented as follows (explained below, using the data in Example 4):
• $\tau_1 = \gamma_{\mathit{Location.Country}}$ (this gives each country a prime label);
• $\tau_2 = \gamma_{\mathit{Location.City}}$ (this gives each city a (fresh) prime label);
• $\tau_3 = \tau_1 \cdot \tau_2$ (this gives each city a product of prime labels);
• $\tau_4 = \mathrm{SUM}_3(\tau_3)$;
• $\tau_5 = \gamma_{\mathit{Product.Bottom}}$ (gives each product a different prime label);
• $\tau_6 = \#_{\neq}(\tau_5)$ (counts the number of products);
• $\tau_7 = \gamma_{\mathit{Time.Bottom}}$ (gives each time moment a different prime label);
• $\tau_8 = \#_{\neq}(\tau_7)$ (counts the number of moments in time);
• $\tau_9 = \tau_6 \cdot \tau_8$ (is the number of products times the number of time moments);
• $\tau_{10} = \tau_4 / \tau_9$ (normalization of the sum);
• $\tau_{11} = \tau_{10}|\tau_2$ (projection over the prime labels of city);
• $\tau_{12} = \mathrm{SUM}_3(\tau_{11})$ (3-dimensional sum);
• $\tau_{13} = \tau_{12} / \tau_9$ (normalization of the sum);
• $\tau_{14} = \tau_{13}|\tau_1$ (projection over the prime labels of country);
• $\varphi^{(1)} = \sigma_{\mathit{Location.Bottom}}$ (this flag creation selects all cells of the matrix).
Transformation $\tau_1$ gives each country the next available prime label. Since no labels have
been used yet, belgium gets label 1 and france gets label $\sqrt{2}$. Transformation $\tau_2$ gives
each city the next available prime label. Since 1 and $\sqrt{2}$ have been used, antwerp gets label
$\sqrt{3}$, brussels gets label $\sqrt{5}$, paris gets label $\sqrt{7}$, and marseille gets label $\sqrt{11}$.
Transformation $\tau_3$ gives antwerp the value $\sqrt{3}$ (i.e., $1 \cdot \sqrt{3}$), brussels the value $\sqrt{5}$ ($1 \cdot \sqrt{5}$),
paris the value $\sqrt{14}$ ($\sqrt{2} \cdot \sqrt{7}$), and marseille the value $\sqrt{22}$ ($\sqrt{2} \cdot \sqrt{11}$). If there are 10
products and 100 time moments, then $\tau_4$ puts the value $10 \cdot 100 \cdot (\sqrt{3} + \sqrt{5} + \sqrt{14} + \sqrt{22})$
in each cell of the matrix M(D).
Transformations $\tau_6$ and $\tau_8$ count the number of products and the number of time
moments (using fresh prime labels), and the product of these quantities is computed in $\tau_9$.
In $\tau_{10}$, $\tau_4$ is divided by this product, putting $\sqrt{3} + \sqrt{5} + \sqrt{14} + \sqrt{22}$ in every cell.
Transformation $\tau_{11}$ is a projection on the prime labels of City. Since $\sqrt{3}$, $\sqrt{5}$, $\sqrt{7}$, and
$\sqrt{11}$ are the prime labels for the cities, and since
$\sqrt{3} + \sqrt{5} + \sqrt{14} + \sqrt{22} = 1 \cdot \sqrt{3} + 1 \cdot \sqrt{5} + \sqrt{2} \cdot \sqrt{7} + \sqrt{2} \cdot \sqrt{11}$,
this will put 1 in the cells of antwerp and brussels, and $\sqrt{2}$ in the
cells of paris and marseille.
Next, $\tau_{12}$ puts $10 \cdot 100 \cdot (2 \cdot 1 + 2 \cdot \sqrt{2})$ in every cell of the cube and $\tau_{13}$ puts $2 \cdot 1 + 2 \cdot \sqrt{2}$
in every cell of the cube. Finally, $\tau_{14}$ projects on the prime labels of countries, which are 1
and $\sqrt{2}$. This puts a 2 in every cell of a Belgian city and a 2 in every cell of a French city.
This is the result of the query, which, as the flag indicates, is returned in every cell. Now
every cell of a city in belgium has the count of 2 cities, as has every cell of a city in france. □
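The computation of Example 9 can be replayed symbolically in Python. The sketch below is not the paper's implementation and makes several assumptions: a value is a dictionary {r: c} read as the sum of c·√r over squarefree integers r, the cube has only 2 products and 3 time moments instead of 10 and 100, and the helper names (`gamma`, `times`, `sum_all`, `scale`, `project`) are hypothetical. The count-distinct steps are computed directly as set sizes.

```python
from fractions import Fraction
from itertools import product
from math import gcd

CITY_TO_COUNTRY = {"antwerp": "belgium", "brussels": "belgium",
                   "paris": "france", "marseille": "france"}
CELLS = list(product(CITY_TO_COUNTRY, ["p1", "p2"], ["d1", "d2", "d3"]))
LABELS = iter([1, 2, 3, 5, 7, 11, 13, 17, 19, 23, 29])  # r stands for sqrt(r)


def gamma(key):
    """Grouping: give each distinct key(cell) the next unused prime label."""
    fresh = {}
    for cell in CELLS:
        if key(cell) not in fresh:
            fresh[key(cell)] = next(LABELS)
    return {cell: {fresh[key(cell)]: Fraction(1)} for cell in CELLS}


def times(a, b):
    """Symbolic product: sqrt(r)*sqrt(s) = g*sqrt(r*s/g^2) with g = gcd(r, s)."""
    out = {}
    for r, c in a.items():
        for s, d in b.items():
            g = gcd(r, s)
            k = r * s // (g * g)
            out[k] = out.get(k, Fraction(0)) + c * d * g
    return out


def sum_all(tau):
    """3-dimensional sum: the same total is placed in every cell."""
    total = {}
    for value in tau.values():
        for r, c in value.items():
            total[r] = total.get(r, Fraction(0)) + c
    return {cell: dict(total) for cell in CELLS}


def scale(tau, q):
    """Multiply every cell value by the rational q."""
    return {cell: {r: c * q for r, c in v.items()} for cell, v in tau.items()}


def project(tau_sum, tau_labels):
    """Projection: keep, per cell, the coefficient of that cell's label."""
    modulus = 1
    for r in {next(iter(v)) for v in tau_labels.values()}:
        modulus *= r
    out = {}
    for cell in CELLS:
        label = next(iter(tau_labels[cell]))
        out[cell] = {r // label: c for r, c in tau_sum[cell].items()
                     if gcd(r, modulus) == label}
    return out


tau1 = gamma(lambda cell: CITY_TO_COUNTRY[cell[0]])  # labels on countries
tau2 = gamma(lambda cell: cell[0])                   # fresh labels on cities
tau3 = {cell: times(tau1[cell], tau2[cell]) for cell in CELLS}
tau4 = sum_all(tau3)
tau5 = gamma(lambda cell: cell[1])
tau6 = len({next(iter(v)) for v in tau5.values()})   # number of products
tau7 = gamma(lambda cell: cell[2])
tau8 = len({next(iter(v)) for v in tau7.values()})   # number of time moments
tau9 = tau6 * tau8
tau10 = scale(tau4, Fraction(1, tau9))
tau11 = project(tau10, tau2)
tau12 = sum_all(tau11)
tau13 = scale(tau12, Fraction(1, tau9))
tau14 = project(tau13, tau1)
```

In this representation `{1: Fraction(2)}` is the rational number 2 and `{2: Fraction(1)}` is √2. The flag creation is omitted, since it selects all cells.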
3.3.6 Counting and Min-Max Revisited
We can now extend the transformations of Definition 14, in a way that the counting,
minimum, and maximum, are taken over cells which share a common prime product label.
Definition 19. The following creations of a new measure τl+1 are generalizations of the
counting and min-max transformations:
We remark that when there is only one prime label throughout M(D), the above
generalization of the counting and min-max transformations corresponds to Definition 14.
4.2 Dice
Intuitively, the Dice operation selects the cells in a cube D that satisfy a Boolean condition
φ on the cells. The syntax for this operation is DICE(D, φ), where φ is a Boolean condition
over level values and measures. The resulting cube has the same dimensionality as the
original cube. This operation is analogous to a selection in the relational algebra. In a data
cube, it selects the cells that satisfy the condition φ by flagging them with a 1 in the output
cube. Our approach covers all typical cases in real-world OLAP [7]. We next formalize the
operator’s definition in terms of our transformation language. In the remainder, we use
the term OLAP operation to express a sequence of OLAP transformations.
Definition 21 (Dice). Given a data cube D, the operation DICE(D, φ), selects all cells
of the matrix M (D) that satisfy the Boolean condition φ by giving them a 1 flag in the
output. The condition φ is a Boolean combination of conditions of the form: (a) A selector
on a value b at a certain level ℓ of some dimension Di ; (b) A comparison condition at some
level ℓ from a dimension schema σ(Di ) of a dimension Di of the cube of the form ℓ < c or
c < ℓ, where c is a constant (at that level ℓ); (c) An equality or comparison condition on
some measure α of the form α = c, α < c or c < α, where c is a (rational) constant. □
Property 3. Let D be a data cube and let φ be a Boolean condition on the cells of M(D)
(as in Definition 21). The operation DICE(D, φ) is expressible as an OLAP operation. □
4.3 Slice
Intuitively, the Slice operation takes as input a d-dimensional, k-ary data cube D and a
dimension Di and returns as output SLICE(D, Di ), which is a “(d − 1)-dimensional” data
cube in which the original measures µ1 , ..., µk are replaced by their aggregation (sum) over
different values of elements in dom(Di ). In other words, dimension Di is removed from
the data cube, and will not be visible in the next operations. That means, for instance,
that we will not be able to dice on the levels of the removed dimension. As we will see,
the “removal” of dimensions is, in our approach, implemented by means of the destructor
δ. We remark that the aggregation above is due to the fact that, in order to
eliminate a dimension Di , this dimension should have exactly one element [1], therefore a
roll-up (which we explain later in Section 4.5) to the level All in Di is performed.
Definition 22 (Slice). Given a data cube D, and one of its dimensions $D_i$, the operation
SLICE(D, $D_i$) “replaces” the measures $\mu_1, \mu_2, \ldots, \mu_k$ by their aggregation (sum) $\mu_n^{\Sigma_i}$ (for
1 ≤ n ≤ k) as:
$$\mu_n^{\Sigma_i}(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_d) = \sum_{x_i \in \mathrm{dom}(D_i)} \mu_n(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_d),$$
for all $(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_d) \in M(D)$. Further, the operation SLICE(D, $D_i$) destroys
all cells except those of the representative of all for dimension $D_i$. We abbreviate the above
1-dimensional sum as $\mathrm{SUM}_{D_i}(\mu_n)$. □
Property 4. Let D be a data cube and let Di be one of its dimensions. The operation
SLICE(D, Di) is expressible as an OLAP operation. □
Example 10. Consider dimensions Product, Location, and Time, and measure µ1 =
sales, in our running example. The operation SLICE(D, Location) returns a cube with
(product, time)-cells containing the sums of µ1 for each product-time combination, over
all locations. All cells not belonging to the representative of all in the dimension Location
(i.e., antwerp) are destroyed. The query is expressed by the following transformations.
• $\tau_{l+1} = \gamma_{\mathit{Product.Bottom}}$ (prime labels on products);
• $\tau_{l+2} = \gamma_{\mathit{Time.Bottom}}$ (fresh prime labels on time moments);
• $\tau_{l+3} = \tau_{l+1} \cdot \tau_{l+2}$ (product of the two previous prime labels);
• $\tau_{l+4} = \mu_1 \cdot \tau_{l+3}$ (product);
• $\tau_{l+5} = \mathrm{SUM}_3(\tau_{l+4})$ (3-dimensional sum);
• $\tau_{l+6} = \tau_{l+5}|\tau_{l+3}$ (projection on prime product labels);
• $\tau_{l+7} = \sigma_{\mathit{Location.All}}$ (selects the representative of all in the dimension Location);
• $\delta = \tau_{l+7}$ (destroys all cells except the representative of all in dimension Location);
• $\varphi^{(1)} = \sigma_{\mathit{Location.All}}$ (this flag creation selects the relevant cells of the matrix).
Transformation $\tau_{l+3}$ gives each (product, time)-combination a unique prime product
label. In $\tau_{l+4}$, this label is multiplied by the sales in each cell. Then, $\tau_{l+5}$ is the global sum over
M(D), and $\tau_{l+6} = \tau_{l+5}|\tau_{l+3}$ is the projection over the prime product labels for (product, time)-
combinations. This gives each cell above some fixed (product, time)-combination the sum
of the sales, over all locations, for that combination. All cells of M(D) that do not belong
to antwerp (selected in $\tau_{l+7}$), which represents all, are destroyed by δ. □
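The effect of this slice can be sketched in Python. As a simplification that is not the construction of the example, the sketch computes the sums per (product, time)-combination directly instead of through prime labels and projection. It only illustrates the result: the aggregated measure, the destructor, and the flag. The data and the name `slice_location` are hypothetical, and emptied cells are represented by `None`.

```python
from collections import defaultdict

# Hypothetical cube with cells (product, location, time) and measure sales.
sales = {
    ("p1", "antwerp", "d1"): 10, ("p1", "paris", "d1"): 5,
    ("p1", "antwerp", "d2"): 3, ("p1", "paris", "d2"): 4,
    ("p2", "antwerp", "d1"): 8, ("p2", "paris", "d1"): 1,
}
REPRESENTATIVE = "antwerp"  # stands for `all` in Location, as in the example


def slice_location(cube):
    """Sum sales over all locations and keep only the representative's cells."""
    sums = defaultdict(int)
    for (prod, loc, day), value in cube.items():
        sums[(prod, day)] += value
    out = {}
    for (prod, loc, day), value in cube.items():
        if loc == REPRESENTATIVE:  # delta = 1: the cell survives
            out[(prod, loc, day)] = {
                "sales": value, "tau1": sums[(prod, day)], "phi": 1,
            }
        else:                      # delta = 0: the cell is emptied
            out[(prod, loc, day)] = None
    return out


sliced = slice_location(sales)
```

The protected measure sales is left unchanged in the surviving cells. The aggregated value is carried by the computed measure, in line with the quotation marks around “replaces” in Definition 22.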
and returns the aggregation of the original cube along Di up to level ℓ for some of the
input measures α1 , α2 , ..., αr . Roll-Up uses one of the classic SQL aggregation functions,
applied to the indicated protected and computed measures α1 , α2 , ..., αr (selected from
µ1 , µ2 , ..., µk ; τ1 , ..., τl ; ϕ), namely sum (SUM), average (AVG), minimum/maximum (MIN
and MAX), count and count-distinct (COUNT and COUNT-DISTINCT). Usually, measures
have an associated default aggregation function. The typical aggregation function for the
measure sales, e.g., is SUM. We denote the above operation as ROLL-UP(D, Di , H(ℓ′ →
ℓ), {(αi , fi ) | i = 1, 2, ..., r}), where fi is one of the above aggregation functions that is
associated to αi , for i = 1, 2, ..., r. Since we are mainly interested in the expressiveness of
this operation as a sequence of atomic transformations, only the destination node ℓ in the
path h is relevant. Indeed, the result of this roll-up remains the same if the subpath h is
extended to start from the Bottom node of dimension Di . So, we can simplify the notation,
replacing H(ℓ′ → ℓ) with H(ℓ), and assume that the roll-up starts at the Bottom level.
The Drill-down operation takes as input a data cube D, a dimension Di and a subpath
h of a hierarchy H over Di , starting in a node ℓ and ending in a node ℓ′ (at a lower level in
the hierarchy), and returns the aggregation of the original cube along Di from the bottom
level up to level ℓ′ . The drill-down uses the same type of aggregation functions as the
roll-up. Again, since we are only interested in the expressiveness of this operation, the
drill-down operation DRILL-DOWN(D, Di , H(ℓ′ ← ℓ), {(αi , fi ) | i = 1, 2, ..., r}), has the
same output as ROLL-UP(D, Di , H(ℓ′ ), {(αi , fi ) | i = 1, 2, ..., r}). Therefore, we can limit
the further discussion in this section to the roll-up.
Definition 24 (Roll-Up). Given a data cube D, one of its dimensions Di , and a hierarchy
H over Di , ending in a node ℓ, the operation ROLL-UP(D, Di , H(ℓ), {(αi , fi ) | i = 1, ..., r})
computes the aggregation of the measures αi by their aggregation functions fi , for i =
1, 2, ..., r, as follows:
for all (x1 , ..., xi−1 , xi , xi+1 , ..., xd ) ∈ M (D), for which ρH (yi , b), for some b ∈ dom(Di .ℓ).
This roll-up flags all representative Bottom-level objects as active. □
Property 6. Let D be a data cube, let Di be one of its dimensions, and let H be a hierarchy
over Di ending in a node ℓ. Let {(αi , fi ) | i = 1, 2, ..., r} be a set of selected measures (taken
from the protected measures µ1 , µ2 , ..., µk and the computed measures τ1 , ..., τl of D), with
their associated aggregation functions. The operation ROLL-UP(D, Di , H(ℓ), {(αi , fi ) | i =
1, 2, ..., r}) is expressible as an OLAP operation. □
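To make the intended result of a roll-up concrete, here is a minimal Python sketch that computes it as a direct group-by. This is not the sequence of atomic transformations of Property 6. It assumes that the roll-up relation is given as a dictionary from Bottom-level objects to their ancestor at the target level, and that the aggregated value is stored in every cell of a group; flags are omitted. The names `roll_up`, `group_key` and `AGGREGATES` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Aggregation functions named in the text, applied to a list of values.
AGGREGATES = {
    "SUM": sum, "AVG": mean, "MIN": min, "MAX": max, "COUNT": len,
    "COUNT-DISTINCT": lambda values: len(set(values)),
}

CITY_TO_COUNTRY = {"antwerp": "belgium", "brussels": "belgium",
                   "paris": "france", "marseille": "france"}


def group_key(cell, dim, rollup):
    """Replace coordinate `dim` of the cell by its ancestor at the target level."""
    return cell[:dim] + (rollup[cell[dim]],) + cell[dim + 1:]


def roll_up(measure, dim, rollup, fn):
    """Aggregate `measure` along dimension `dim` up to the level given by `rollup`."""
    groups = defaultdict(list)
    for cell, value in measure.items():
        groups[group_key(cell, dim, rollup)].append(value)
    aggregate = AGGREGATES[fn]
    return {cell: aggregate(groups[group_key(cell, dim, rollup)]) for cell in measure}


# Hypothetical cube with cells (city, product) and measure sales.
sales = {
    ("antwerp", "p1"): 10, ("brussels", "p1"): 5,
    ("paris", "p1"): 7, ("marseille", "p1"): 2,
    ("antwerp", "p2"): 1,
}
per_country = roll_up(sales, 0, CITY_TO_COUNTRY, "SUM")
largest = roll_up(sales, 0, CITY_TO_COUNTRY, "MAX")
```

Since a drill-down to a level has the same output as a roll-up from the Bottom level to that level, the same function covers both operations in this sketch.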
Example 12. We next express the Roll-Up operation, using prime (product) labels, sums,
projections, and the 3-dimensional sum. We look at the query “total sales per country”.
We use the simplified syntax, only indicating the target level of the roll-up on the Location
dimension (i.e., Country). The query ROLL-UP(D, Location, Country, {(sales, SUM)}) is
the result of the following transformations, given the measure µ1 = sales:
3. τℓ+3 = γLocation.Country (prime labels on countries);
4. τℓ+4 = τℓ+1 · τℓ+2 · τℓ+3 ; (prime product label – in one step);
5. τℓ+5 = µ1 · τℓ+4 (product of labels with sales);
6. τℓ+6 = SUM3 (τℓ+5 ) (3-dimensional sum);
7. τℓ+7 = τℓ+6 |τℓ+4 (projection on prime product labels);
used in real-world practice. Although OLAP is a very popular field in data analytics,
this is the first time a formalization like this is given. The need for this formalization is
clear: in a world being flooded by data of different kinds, users must be provided with
tools allowing them to have an abstract “cube view” and cube manipulation capabilities,
regardless of the underlying data types. Without a solid basis and unambiguous definition
of cube operations, the former could not be achieved. We claim that our work is the first
one of this kind, and will serve as a basis to build more robust practical tools to address
the forthcoming challenges in this field.
We have addressed the four core OLAP operations: slice, dice, roll-up, and drill-down.
This does not harm the value of the work. On the contrary, this approach allows us to
focus on our main interest, that is, to study the formal basis of the problem. Our line of
work can be extended to address other kinds of OLAP queries, like queries involving more
complex aggregate functions like moving averages, rankings, and the like. Further, cube
combination operations, like drill-across, must be included in the picture. We believe that
our contribution provides a solid basis upon which a complete OLAP theory can be built.
References
[1] R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. In
Proceedings of the 15th International Conference on Data Engineering, (ICDE), pages
232–243, Birmingham, UK, 1997. IEEE Computer Society.
[3] F. Dehne, Q. Kong, A. Rau-Chaplin, H. Zaboli, and R. Zhou. Scalable real-time OLAP
on cloud architectures. Journal of Parallel and Distributed Computing, 79–80:31–41,
2015. Special Issue on Scalable Systems for Big Data Management and Analytics.
[4] J.-P. Escofier. Galois Theory, volume 204 of Graduate Texts in Mathematics. Springer-
Verlag, 2001.
[5] R. Kimball. The Data Warehouse Toolkit: Practical Techniques for Building
Dimensional Data Warehouses. Wiley, 1996.
[6] O. Romero and A. Abelló. On the need of a reference algebra for OLAP. In Proceedings
of the 9th International Conference on Data Warehousing and Knowledge Discovery,
DaWaK’07, pages 99–110, Regensburg, Germany, 2007.
[7] A. Vaisman and E. Zimányi. Data Warehouse Systems: Design and Implementation.
Springer, 2014.