
Bab 5 Golfarelli

Uploaded by

ike ayu idiara
CHAPTER 5

Conceptual Modeling

While it is now universally recognized that a data mart is based on a multidimensional
view of data, there is still no agreement on how to implement its conceptual
design from user requirements. However, it is well known that an accurate
conceptual design is the fundamental requirement for the construction of a well-documented
database that completely meets requirements. Use of the Entity-Relationship model (ERM) is
quite widespread throughout companies as a conceptual tool for standard documentation
and design of relational databases. Many efforts have been made to use it in the design of
nonrelational systems as well (Fahrner and Vossen, 1995). However, it is oriented to queries
that follow associations among data rather than summarize the data, so it turns out to be a
bad choice in the case of data marts. According to Kimball (1996), Entity-Relationship
schemata "cannot be understood by users nor be navigated usefully by DBMS software.
Entity-Relationship schemata cannot be adopted as the basis for enterprise data warehouses."
The ERM is actually expressive enough to represent the majority of concepts necessary
for data mart modeling. However, the basic ERM is not able to accurately highlight the
distinctive features of the multidimensional model. For this reason, its use for data mart
modeling proves to be not very intuitive. Additionally, the use of the ERM in this context is
not very efficient from the notational viewpoint, as the examples in the following sections
clearly show.

In many cases, designers base their data mart designs on the logical level; that is, they
directly define star schemata that are the standard implementation of the multidimensional
model in relational systems. (See Chapter 8 for more details on star schemata.) A star
schema is nothing but a relational schema; it contains only the definition of a set of
relations and integrity constraints. Using star schemata as a support for conceptual design
equates to designing a relational database without first designing an Entity-Relationship
schema. Or, worse still, it is tantamount to starting to create complex software from the
coding phase without any static, dynamic, or functional design schema. This practice
generally leads to rather discouraging results from the viewpoint of responsiveness to
requirements, ease of maintenance, and reuse. In the case of data warehouses, star
schemata make things worse because they are almost completely denormalized and do
not properly code those functional dependencies on which the definition of hierarchies
is based.

100 Data Warehouse Design: Modern Principles and Methodologies

For all of these reasons, current literature has proposed different original approaches to
multidimensional modeling, some of which are based on ERM extensions, others on Unified
Modeling Language (UML) extensions. Table 5-1 lists the main models proposed so far and
shows whether each one is defined as a conceptual or logical model and whether it is
associated with a design methodology. It also specifies whether the conceptual models are
based on the ERM or UML, or if they are ad hoc models. The remainder of this paragraph
briefly outlines the conceptual models considered the most representative and aims at
highlighting their similarities in expressive power.
Figure 5-1 shows a class diagram for analysis of purchase orders (Luján-Mora et al., 2006).
It uses an object-based formalism, more precisely a UML extension. Purchase orders play the
role of facts. A UML class, whose attributes are the fact measures, models purchase orders.
The dimensions are the supplier, the date, and the item ordered. Both the dimensions and
the different levels of aggregation describing them are represented as classes. Aggregations
(represented in UML by small diamonds) connect the facts and the dimensions, and
many-to-one associations link the different levels of aggregation.
In contrast, the model used in Figure 5-2 is an extension of the ERM (Franconi and Sattler,
1999). The schema in the example analyzes the calls made through a telephone company. Here,
the fact is called target and the measures properties. You can represent relevant aggregations
(aggregate entities) and use the specialization hierarchies of the ERM to list the values of one
level of aggregation.

Model                            Level                   Methodology

Abelló et al. (2006)             Conceptual (UML)        No
Agrawal et al. (1995)            Logical                 No
Cabibbo and Torlone (1998)       Logical                 Yes
Datta and Thomas (1997)          Conceptual (ad hoc)     No
Franconi and Sattler (1999)      Conceptual (ERM)        No
Golfarelli et al. (1998)         Conceptual (ad hoc)     Yes
Gyssens and Lakshmanan (1997)    Logical                 No
Hüsemann et al. (2000)           Conceptual (ad hoc)     Yes
Li and Wang (1995)               Logical                 No
Luján-Mora et al. (2006)         Conceptual (UML)        Yes
Nguyen et al. (2000)             Conceptual (UML)        No
Pedersen and Jensen (1999)       Conceptual (ad hoc)     No
Sapia et al. (1998)              Conceptual (ERM)        No
Tryfona et al. (1999)            Conceptual (ERM)        No
Tsois et al. (2001)              Conceptual (ad hoc)     No
Vassiliadis (1998)               Conceptual (ad hoc)     No

TABLE 5-1 Approaches to Multidimensional Modeling

FIGURE 5-1 A UML class diagram for purchase order analysis (Luján-Mora et al., 2006)

Finally, Figure 5-3 shows a fact schema for the analysis of checking accounts (Hüsemann
et al., 2000). In addition to facts, dimensions, and measures, attributes that cannot be used
for aggregation (property attributes), optional attributes, and alternative aggregation paths
are represented. Specifically, the extensional meaning of the alternative aggregation path in
Figure 5-3 is that either a value of profession or a value of branch is linked with a value
of customerId.
The remainder of the chapter describes in depth the Dimensional Fact Model proposed
by Golfarelli, Maio, and Rizzi in 1998 and constantly enriched and refined during the
following ten years to optimally fit the variety of modeling situations that may be faced in
real projects. We will present basic concepts and more advanced constructs in sections 5.1
and 5.2, respectively. Section 5.3 deals with extensional properties and defines aggregation
semantics. Section 5.4 tackles specific features tied to the representation of time. Section 5.5
shows how to overlap fact schemata. Section 5.6 proposes a formalization of the intensional
and extensional properties of the model.
FIGURE 5-2 An ERM-based conceptual schema for phone call analysis (Franconi and Sattler, 1999)

FIGURE 5-3 A fact schema for the analysis of checking accounts (Hüsemann et al., 2000)

5.1 The Dimensional Fact Model: Basic Concepts

The Dimensional Fact Model (DFM) is a conceptual model created specifically to function as
data mart design support. It is essentially graphic and based on the multidimensional
model. The goals of the DFM are to
• lend effective support to conceptual design;
• create an environment in which user queries may be formulated intuitively;
• make communication possible between designers and end users with the goal
  of formalizing requirement specifications;
• build a stable platform for logical design;
• provide clear and expressive design documentation.

These characteristics make the DFM an optimal candidate for use in real application
contexts. In particular, we prefer this model to others presented in the preceding paragraph
because we consider its graphic formalism particularly simple, expressive, and appropriate for
communication between designers and end users. Moreover, the availability of a semiautomatic
technique to obtain conceptual schemata from operational source schemata facilitates the
designer's work, especially when conceptual design is implemented within a design tool.
The conceptual representation generated by the DFM consists of a set of fact schemata.
Fact schemata basically model facts, measures, dimensions, and hierarchies.

Fact
A fact is a concept relevant to decision-making processes. It typically models a set of
events taking place within a company.

Examples of facts in the commercial domain are sales, shipments, purchases, and
complaints. In the healthcare industry, some interesting examples are admissions,
discharges, transfers, surgeries, and access to emergency services. In the financial industries,
stock exchange transactions, checking account and credit card balances, contract creation,
loan disbursement, and the like are considered facts, as are flights, car rentals, and nights'
stay in the tourism industry.
It is essential that a fact have dynamic properties or evolve in some way over time.
Indeed, few of the concepts represented in a database are completely static. Even the
association between cities and states can change if state boundaries are modified. For this
reason, the distinction between facts and other concepts must be based on the average
frequency of change or on the specific interests of users. For example, the assignment of a
new sales manager to a department occurs less frequently than the promotion of a product.
While the association between promotions and products is a good candidate to be modeled
as a fact, the association between sales managers and departments generally is not, unless
users are interested in monitoring the transfers of sales managers to find out the correlations
between department managers and how much that department sells. See section 5.4.3 for
a more in-depth discussion on the dynamic properties of fact schemata.

Measure
A measure is a numerical property of a fact and describes a quantitative fact aspect that
is relevant to analysis.

For example, each sale is measured by the number of units sold, the unit price, and the
total receipts. The reason why measures should preferably be numeric is that they are
generally used to make calculations. A fact can also have no measures, as in the case when
you might be interested in recording only the occurrence of an event. In this case, the fact
schema is said to be empty. Section 5.3.5 discusses some specific features of empty schemata.

Dimension
A dimension is a fact property with a finite domain and describes an analysis coordinate
of the fact.

A fact generally has several dimensions that define its minimum representation
granularity. Typical dimensions for the sales fact are products, stores, and dates. In this
case, the basic information that can be represented is product sales in one store in one day.
At this level of granularity, it is not possible to distinguish between sales made by different
employees or at different times of day. Because facts are generally dynamic, a fact schema
will almost certainly have at least one temporal dimension whose granularity can vary
from the minute to the month (more probably, the day or week).
The connection between measures and dimensions is expressed at the extensional level
(that is, at a data level rather than at a schema level) by the event concept we informally
define here, while referring you to its formal properties in section 5.6.3.

Primary Event
A primary event is a particular occurrence of a fact, identified by one n-ple made up
of a value for each dimension. A value for each measure is associated with each
primary event.

In reference to sales, for example, a possible primary event records that 10 packages of
Shiny detergent were sold for total sales of $25 on 10/10/2008 in the SmartMart store. As
this example shows, dimensions are normally used to identify and select primary events.
On the basis of the concepts introduced so far, you can design a simple fact schema for
sales in this chain of stores. Figure 5-4 shows that a fact is represented by a box that displays
the fact name along with the measure names. Small circles represent the dimensions, which
are linked to the fact by straight lines.
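
The notion of primary event can be illustrated with a short sketch. The Python fragment below is only an illustration under stated assumptions: the names Coordinates, Measures, and select are hypothetical, dates are written as ISO strings for convenience, and the second event is invented. It stores each primary event of the SALE fact under the n-ple of its dimension values and uses those values to select events.

```python
from typing import NamedTuple


class Coordinates(NamedTuple):
    """One value per dimension of the SALE fact (hypothetical encoding)."""
    product: str
    store: str
    date: str  # ISO date string, an illustrative choice


class Measures(NamedTuple):
    """A subset of the SALE measures, enough for the example."""
    quantity: int
    receipts: float


# Each primary event is identified by exactly one n-ple of dimension values.
# The first entry is the example from the text; the second one is made up.
primary_events = {
    Coordinates("Shiny", "SmartMart", "2008-10-10"): Measures(10, 25.0),
    Coordinates("Shiny", "SmartMart", "2008-10-11"): Measures(4, 10.0),
}


def select(events, **fixed):
    """Return the events whose coordinates match all the given dimension values."""
    return {
        coords: measures
        for coords, measures in events.items()
        if all(getattr(coords, dim) == value for dim, value in fixed.items())
    }


hits = select(primary_events, date="2008-10-10")
```

Because the dimension values form the dictionary key, two events with the same product, store, and date cannot coexist, which mirrors the minimum granularity defined by the dimensions.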
A fact expresses a many-to-many association between dimensions. For this reason, the
Entity-Relationship schema corresponding to a fact schema consists mainly of an n-ary
relationship, which models the fact, among entities that model dimensions. The measures
are attributes of this relationship. Figure 5-5 shows the Entity-Relationship schema
corresponding to the fact schema of Figure 5-4. Clearly, though the ERM is expressive
enough to show facts, dimensions, and measures, it does not represent these concepts as
first-class citizens.

FIGURE 5-4 A simple fact schema for sales

FIGURE 5-5 The Entity-Relationship schema corresponding to the fact schema of Figure 5-4

Note that some multidimensional models in the literature are focused on the
symmetrical treatment of dimensions and measures (Agrawal et al., 1995; Gyssens and
Lakshmanan, 1997). This is an important result from the viewpoint of uniformity in the
logical model and flexibility of online analytical processing (OLAP) operators. Despite that,
we believe that you should distinguish between measures and dimensions at the conceptual
level, because this enables logical design to be aimed more at reaching the efficiency
required by data warehouse applications.
Before defining what a hierarchy means, we should introduce the concept of
dimensional attribute.

Dimensional Attribute
The general term dimensional attributes stands for the dimensions and other possible
attributes, always with discrete values, that describe them.

For example, a product is described by its type, by the category to which it belongs, by its
brand, and by the department in which it is sold. Then product, type, category, brand,
and department will be dimensional attributes. The relationships among the dimensional
attributes are expressed by hierarchies.

Hierarchy
A hierarchy is a directed tree¹ whose nodes are dimensional attributes and whose arcs
model many-to-one associations between dimensional attribute pairs. It includes a
dimension, positioned at the tree's root, and all of the dimensional attributes that
describe it.

¹ Graph theory reminds us that a tree is an acyclic connected graph (Berge, 1985). A directed tree is a tree with a
root, or a node called v0, from which you can reach all the other nodes via directed paths. Within a directed tree,
only one directed path connects the root to each of the other nodes. Given a node called v into which an arc
called a enters and from which b, c, d... arcs exit, we will call the node from which a exits the parent of v and the
nodes into which b, c, d... enter the children of v. In addition to its parent, the predecessors of v are the parents of
its parent and so on. In addition to its children, the descendants of v are the children of its children and so on.

Do not confuse the term hierarchy used in this context with the identical term used in
Entity-Relationship modeling, where it refers to specialization links between entities (IS-A
hierarchies). In the multidimensional modeling context, hierarchy refers instead to
associative links of different kinds, in a way that is not dissimilar to aggregation hierarchies
in object-oriented models. For example, in the product dimension hierarchy you will have
an arc from product to type to express the type of each product, an arc from product to
brand to express its brand, an arc from type to category to express the fact that all the
drink type products belong to the food category, an arc from category to department to
express the fact that all of the food category products are sold in the food department, and
so on. In relational terminology, each arc in a hierarchy models a functional dependency
between two attributes:

    product→type, product→brand,
    type→category, category→department

Because the transitive property applies to functional dependencies, each directed path
inside a hierarchy represents in turn a functional dependency between the start and end
attributes. For example, product→type and type→category imply product→category.
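
The transitive reading of hierarchy arcs can be made concrete with a minimal sketch. The Python fragment below is not part of the DFM; the dictionary arcs and the function determined_by are hypothetical names. It encodes the four functional dependencies listed above and collects every attribute reachable from a given one along directed paths.

```python
# Arcs of the product hierarchy, read as functional dependencies
# (source attribute -> set of target attributes).
arcs = {
    "product": {"type", "brand"},
    "type": {"category"},
    "category": {"department"},
}


def determined_by(attribute, arcs):
    """Return every attribute functionally determined by `attribute`,
    that is, every node reachable from it along directed paths."""
    reached = set()
    stack = [attribute]
    while stack:
        current = stack.pop()
        for target in arcs.get(current, ()):
            if target not in reached:
                reached.add(target)
                stack.append(target)
    return reached
```

Under these assumptions, determined_by("product", arcs) contains category even though no arc connects product to category directly, which is the implied dependency discussed in the text.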
Figure 5-6 shows how you may add hierarchies built on dimensions to enhance the fact
schema of Figure 5-4. Dimensional attributes are represented by circles and are connected
by lines that mark the hierarchy arcs and express functional dependencies. For example, the
city where a store is located defines the state to which that store belongs. Hierarchies are
structured like trees with their roots in dimensions. For this reason, you need not
explicitly show arc directions, as each one of them is implicitly oriented in a direction
moving away from the root.
Figure 5-6 shows a typical temporal hierarchy that ranges from date to year. A fact
can include more than one temporal hierarchy modeling different dynamic properties.
For example, a shipments fact schema may include a hierarchy built on the shipping date
and one built on the order date. Other frequently used hierarchies are geographical hierarchies,
hierarchies related to company organization charts, and part-component hierarchies. Figure 5-6
shows an example of a geographical hierarchy as one built on the store dimension.

FIGURE 5-6 Enhanced fact schema for sales

NOTE All of the attributes and measures within a fact schema must have different names. You can
differentiate similar names if you qualify them with the name of the dimensional attribute that
comes before them in hierarchies (for example, storeCity and brandCity).
The convention proposed by Kimball et al. (1998) provides for each name to be built
from three components: an object (client, product, city), a classification (average, total, date,
description), and a qualifier (start, primary). For example, you may then have a
checkingAccountStartDate or a primaryProductDescription.
The available functional dependencies establish many-to-one associations between the
values of a dimension and those of each dimensional attribute in the corresponding hierarchy.
For example, a many-to-one association exists between products and product types (a product
is of one type, a type identifies a product set). As a result, each n-ple of values within any set
of dimensional attributes is associated with a set of value n-ples within the dimensions, or
with a set of primary events. This makes it possible to use hierarchies to define the way you
can aggregate primary events and effectively select them for decision-making processes.
While the dimension in which a hierarchy takes root defines its finest aggregation granularity,
the other dimensional attributes correspond to a gradually increasing granularity. This
concept is set out in the definition of secondary events, as we informally propose here.

Secondary Events
Given a set of dimensional attributes (generally belonging to separate hierarchies), each
n-ple of their values identifies a secondary event that aggregates all of the corresponding
primary events. Each secondary event is associated with a value for each measure that
sums up all the values of the same measure in the corresponding primary events.

For example, sales can be grouped according to the category of products sold, to the month
when sales were made, to the city in which stores are located, or according to any combination
of those. Let's choose storeCity, product, and month as dimensional attributes for our
aggregation. The n-ple (storeCity: 'Miami', product: 'Shiny', month: '10/2008') identifies
a secondary event that aggregates all of the Shiny product sales in October 2008 in Miami
stores. In other words, it aggregates all of the primary events corresponding to the n-ples
where the product value is Shiny, the value of store is any store in Miami, and the value of
date ranges from 10/01/2008 to 10/31/2008. The value of the receipts measure in this
secondary event will be expressed as total receipts related to the sales it aggregates. See
section 5.3 for an in-depth discussion on complex problems connected to aggregations.
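
The aggregation of primary events into secondary events can be sketched as a simple roll-up. The Python fragment below is a hedged illustration, not the formal definition given later in the chapter: all data values, the EverMore store, and the names store_city, month_of, and roll_up are assumptions made for the example, and receipts are simply summed.

```python
from collections import defaultdict

# Hypothetical primary events: (product, store, date) -> receipts.
primary_events = {
    ("Shiny", "SmartMart", "2008-10-10"): 25.0,
    ("Shiny", "SmartMart", "2008-10-17"): 15.0,
    ("Shiny", "EverMore", "2008-10-10"): 10.0,
    ("Shiny", "SmartMart", "2008-11-02"): 5.0,
}

# Many-to-one association store -> storeCity (made-up values).
store_city = {"SmartMart": "Miami", "EverMore": "Orlando"}


def month_of(date):
    """Many-to-one association date -> month, for ISO date strings."""
    return date[:7]


def roll_up(events):
    """Group primary events by (storeCity, product, month) and sum receipts."""
    totals = defaultdict(float)
    for (product, store, date), receipts in events.items():
        key = (store_city[store], product, month_of(date))
        totals[key] += receipts
    return dict(totals)


secondary_events = roll_up(primary_events)
```

Each key of secondary_events plays the role of the n-ple that identifies a secondary event; the two October sales in the Miami store collapse into a single entry.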
Now we can make a quick comparison with the Entity-Relationship schema corresponding
to the fact schema in Figure 5-6, which is shown in Figure 5-7. Note that each dimensional
attribute (dimensions included) corresponds to an entity, which has that attribute as an
identifier. Also note that a many-to-one relationship represents each arc in the hierarchies.
From this viewpoint, the hierarchy notation adopted in fact schemata can be interpreted as
a simplification of the Entity-Relationship notation, where the representation of relationships
is simplified (their names and their multiplicity, which is always many-to-one, are not
shown) and where just an identifier is shown for each entity.

FIGURE 5-7 Entity-Relationship schema corresponding to the fact schema of Figure 5-6

5.2 Advanced Modeling

The concepts introduced in this paragraph, along with their corresponding graphic constructs,
are descriptive and cross-dimensional attributes; convergences; shared, incomplete, and
recursive hierarchies; multiple and optional arcs; and additivity (Rizzi, 2006). Although they
are not necessary in the simplest and most common modeling conditions, they do prove
very useful to best express the multitude of conceptual nuances that characterize actual
scenarios. In particular, we will show that the introduction of some constructs generally
makes hierarchies no longer mere trees, but graphs.

5.2.1 Descriptive Attributes

Many times, we find that it is important to give additional information on a dimensional
attribute in a hierarchy, although it may not be very interesting to use this information as
an aggregation criterion. For example, users may find it useful to know the address of each
store, but they will hardly want to group sales by store address. Descriptive attributes
represent this type of information in fact schemata.
A descriptive attribute is functionally determined by a dimensional attribute of a hierarchy
and specifies a property of this attribute. Descriptive attributes often are tied to dimensional
attributes by one-to-one associations and do not actually add useful levels of aggregation.
Sometimes, they have domains with continuous values, so they cannot be used for aggregation
at all. Some examples are a store's address and telephone number as well as product weight.
Descriptive attributes are always the "leaves" of hierarchies.² Figure 5-8 shows that
horizontal lines mark them graphically in the DFM. Figure 5-9 shows the representation of
the fact schema of Figure 5-8 according to the ERM. Note that you can model descriptive
attributes as attributes of the entities corresponding to dimensional attributes.

FIGURE 5-8 Complete fact schema for sales

² A tree leaf is a node without any children.

FIGURE 5-9 The Entity-Relationship schema corresponding to the fact schema of Figure 5-8

A descriptive attribute can also be directly connected to a fact if it describes primary events,
but it is neither possible nor interesting to use it to identify single events or even to make
calculations (otherwise, it would be a dimension or a measure, respectively). For example,
consider Figure 5-10, which shows the fact schema for shipments.

FIGURE 5-10 Fact schema for shipments

The order, shipping, and receipt dates must be represented. While it is useful to define the
order and shipping dates as dimensions and build two different temporal hierarchies on them,
you should probably not represent the receipt date as a dimension as well. However,
representing a date as a measure causes problems because the only applicable aggregation
operators would be MAX and MIN. For this reason, you can accurately represent the receipt
date as a descriptive attribute of the SHIPMENT fact. Keep in mind that in order to correctly
link a descriptive attribute directly to the fact, you will have to give it a single value for each
primary event. In the case of shipments, the representation adopted for receiptDate would
not be correct if separate shipments received by a customer on different dates could be made
for the same order and the same product.
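
The single-value requirement can be checked on sample data with a few lines of code. The sketch below is an assumption-laden illustration: the row layout, the field names order, product, and receiptDate, and the function is_single_valued are all hypothetical, and the pair (order, product) stands in for the full n-ple of dimension values that identifies a primary event.

```python
from collections import defaultdict


def is_single_valued(rows, key_fields, attribute):
    """Return True if every combination of key values carries exactly one
    value of `attribute`, as a descriptive attribute of a fact requires."""
    values = defaultdict(set)
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        values[key].add(row[attribute])
    return all(len(found) == 1 for found in values.values())


# Made-up shipment rows: each order line is received on one date only.
shipments = [
    {"order": "O1", "product": "Shiny", "receiptDate": "2008-10-12"},
    {"order": "O2", "product": "Shiny", "receiptDate": "2008-10-14"},
]

# The same order line received in two separate shipments on different dates.
split_shipments = shipments + [
    {"order": "O1", "product": "Shiny", "receiptDate": "2008-10-15"},
]
```

With these rows, the check succeeds for shipments and fails for split_shipments, which is the situation the text describes as incorrect for a descriptive attribute.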

5.2.2 Cross-Dimensional Attributes

A cross-dimensional attribute is a dimensional or descriptive attribute whose value is defined by
the combination of two or more dimensional attributes, possibly belonging to different
hierarchies. For example, if a product's value-added tax (VAT) depends both on the product
category and on the country where the product is sold, you can use a cross-dimensional
attribute to represent it. Figure 5-8 shows this example, joining the arcs that define a product's
VAT with a circular arc. Figure 5-9 shows that the equivalent Entity-Relationship
representation involves a VAT association between the COUNTRY and CATEGORY entities.
Figure 5-11 gives another example showing DFM modeling for the animal sighting fact. Here,
the area and the species attributes jointly define the residentPopulation attribute that
gives an estimate of the number of specimens of each animal species that live in each area.
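
One way to picture a cross-dimensional attribute is as a lookup keyed by a pair of attribute values taken from two hierarchies. The sketch below assumes made-up products, stores, and rates (none of them real tax figures), and the names category_of, country_of, vat, and vat_for are hypothetical.

```python
# Many-to-one associations along the two hierarchies (illustrative values).
category_of = {"Shiny": "cleaning", "Crunchy": "food"}
country_of = {"SmartMart": "USA", "EuroShop": "Italy"}

# Cross-dimensional attribute: its value depends on the PAIR
# (category, country), not on either attribute alone.
# The rates are invented for the example.
vat = {
    ("cleaning", "USA"): 0.07,
    ("cleaning", "Italy"): 0.20,
    ("food", "Italy"): 0.10,
}


def vat_for(product, store):
    """Resolve the VAT of a sale by climbing both hierarchies first."""
    return vat[(category_of[product], country_of[store])]
```

The same category yields different values in different countries, so neither hierarchy on its own could carry the attribute.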

FIGURE 5-11 Fact schema for animal sightings


5.2.3 Convergence
The concept of convergence deals with the structure of hierarchies. In particular, hierarchies
may not be real trees because two or more distinct directed paths may connect two specific
dimensional attributes, on the condition that each one of them still represents a functional
dependency. Look at the example of the store geographical hierarchy in Figure 5-8. Stores
are grouped into cities, which in turn are grouped into states belonging to countries.
Assume that stores are also grouped into sales districts and that no inclusive relationship
exists between districts and states. Assume also that each district is nevertheless part
of exactly one country, the same country to which the store city belongs. If this is the case,
each store belongs to one country alone, independent of the path followed:

    store→storeCity→state→country or
    store→salesDistrict→country

Two or more arcs belonging to the same hierarchy and ending at the same dimensional
attribute mark convergences in fact schemata. If a convergence exists, a tree structure can
no longer define arc directions uniquely. Figure 5-8 shows that you have to add arrows
to converging arcs. On the corresponding Entity-Relationship schema of Figure 5-9, the
convergence could be modeled by adding an explicit constraint stating that the association
cycle between the STORE, STATE, CITY, COUNTRY, and SALES DISTRICT entities
is redundant.
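
The condition that makes a convergence legitimate, namely that both paths lead every store to the same country, can be tested on instance data. The sketch below uses invented stores, cities, and districts; the mapping names and the function convergence_holds are assumptions made for the example.

```python
# Path 1: store -> storeCity -> state -> country (made-up values).
store_city = {"S1": "Miami", "S2": "Boston"}
city_state = {"Miami": "Florida", "Boston": "Massachusetts"}
state_country = {"Florida": "USA", "Massachusetts": "USA"}

# Path 2: store -> salesDistrict -> country (made-up values).
store_district = {"S1": "South-East", "S2": "North-East"}
district_country = {"South-East": "USA", "North-East": "USA"}


def country_via_city(store):
    """Follow the geographical path through city and state."""
    return state_country[city_state[store_city[store]]]


def country_via_district(store):
    """Follow the commercial path through the sales district."""
    return district_country[store_district[store]]


def convergence_holds(stores):
    """True if both paths assign the same country to every store."""
    return all(country_via_city(s) == country_via_district(s) for s in stores)
```

If a district were assigned to a country different from that of its stores' cities, convergence_holds would return False and the two arcs entering country could not be drawn as a convergence.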
If there are apparently similar attributes, this does not always result in a convergence.
For example, look at the brandC: ty attribute in the product dimension, which represents
the dty where products in one brand are manufactured , and at the storeCity attribute
in the store dimension. Both city attributes obviously have different meaning and must
be represented separately because a product manufactured in one city may also be sold in
other cities.
C h a p t e r 5: Conceptual Modeling 113

FIGURE 5-12 Redundant convergence (left) and its correct representation (right)
Finally, look at Figure 5-12. Note that a hierarchy like this, in which one of the
alternate paths does not include intermediate attributes, does not have a reason to exist.
The convergence is completely obvious in virtue of the transitive property holding for
functional dependencies.
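
The convergence condition can be checked on instance data. The following is a minimal
sketch, assuming each functional dependency is stored as a plain Python dictionary; the store,
city, state, and district values are hypothetical and are not taken from the figures.

```python
# Hypothetical instance data: each dict encodes one functional dependency
# (child value -> parent value). All names and values are illustrative only.
STORE_CITY = {"EverMore": "Austin", "SmartMart": "Dallas", "ProFair": "Lyon"}
CITY_STATE = {"Austin": "Texas", "Dallas": "Texas", "Lyon": "Rhone"}
STATE_COUNTRY = {"Texas": "USA", "Rhone": "France"}
STORE_DISTRICT = {"EverMore": "D1", "SmartMart": "D1", "ProFair": "D2"}
DISTRICT_COUNTRY = {"D1": "USA", "D2": "France"}


def follow(value, *mappings):
    """Follow a directed path of functional dependencies from a start value."""
    for mapping in mappings:
        value = mapping[value]
    return value


def convergence_violations(district_country=DISTRICT_COUNTRY):
    """Return the stores whose two paths to country disagree."""
    violations = []
    for store in STORE_CITY:
        via_state = follow(store, STORE_CITY, CITY_STATE, STATE_COUNTRY)
        via_district = follow(store, STORE_DISTRICT, district_country)
        if via_state != via_district:
            violations.append((store, via_state, via_district))
    return violations


# An empty list means that both paths lead every store to the same country.
print(convergence_violations())
```

If a district were assigned to a country other than the one of its stores' cities, the function
would list the offending stores, and the two arcs could no longer be drawn as a convergence.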

5.2.4 Shared Hierarchies


Entire portions of hierarchies are frequently replicated two or more times in fact schemata.
Temporal hierarchies are a classic example. Two or more date-type dimensions with
different meanings can easily exist in the same fact, and you may need to build a month-year
hierarchy on each one of them. Another example is a geographical hierarchy built on
all the city attributes in a schema. Figure 5-8 shows that two exist in the sales fact schema:
storeCity and brandCity. All of the attributes in a fact schema must have distinct
names to avoid ambiguity. This would force you to qualify your attribute names and strain
your notation uselessly: this would then result in brandState, brandCountry,
storeState, and storeCountry.
For this reason, we introduce an abbreviated graphic notation that emphasizes the
sharing of hierarchies and allows you to adopt ad hoc logical design solutions (see
section 9.1.3). Figure 5-13 shows an example of a fact schema in which the fact consists of

FIGURE 5-13 Shared hierarchy and roles in the phone calls fact schema and equivalent schema
without shared hierarchies

FIGURE 5-14 Shared hierarchies and roles in the shipments fact schema

telephone calls made with the number calling, number called, date, and hour of the call as
dimensions. A double circle represents and emphasizes the first attribute to be shared
(telNumber). It should always be implicitly clear that all descendants of shared attributes
are shared, too. If one or more descendants of the shared attributes should not be shared in
turn, you would then need to represent those hierarchies separately.
When hierarchies are shared starting from their dimensions, you need to label each incoming
arc with a role that specifies its meaning (calling and called in Figure 5-13).
Figure 5-14 shows the case of city. Here, you can instead omit the role as its parents
implicitly define it (the warehouse city, the customer city).

5.2.5 Multiple Arcs


Hierarchies most commonly include attributes connected to their parents by to-one
associations. However, it may be useful or even necessary to also include attributes that
can take multiple values for each single value of their parent attributes.
Look at the fact schema that models book sales in a bookstore. Its dimensions are date
and book. It would certainly be interesting to aggregate and select sales on the basis of book
authors. However, it would not be accurate to model author as a dimensional child
attribute of book because a book can have several authors and an author can write several
books. Figure 5-15 shows the notation that we suggest. It graphically requires you to double
the arc from book to author. In general, the meaning of a multiple arc that goes from an
attribute called a to an attribute called b is that a many-to-many association exists between
a and b. In other words, a single value you give to a can correspond to many values of b,
and vice versa. Section 5.3.4 will show you that you need to give each multiple arc a
coefficient that defines a weight for this many-to-many association so that aggregation can
be consistently defined along hierarchies that also include multiple arcs.
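
The role of the weight can be illustrated with a minimal sketch. It assumes that the weights
of the arcs leaving each book sum to 1, which is one common convention and not necessarily
the exact definition given in section 5.3.4; the book titles, author names, and receipt values
are hypothetical.

```python
from collections import defaultdict

# Hypothetical receipts per book (primary events already summed over dates).
RECEIPTS_BY_BOOK = {"Book A": 100.0, "Book B": 60.0}

# Hypothetical multiple arc book => author. Assumption: the weights of the
# arcs leaving one book sum to 1, so that no receipt is counted twice.
BOOK_AUTHOR_WEIGHT = {
    ("Book A", "Author X"): 0.5,
    ("Book A", "Author Y"): 0.5,
    ("Book B", "Author X"): 1.0,
}


def receipts_by_author(receipts, weights):
    """Distribute each book's receipts over its authors according to the weights."""
    totals = defaultdict(float)
    for (book, author), weight in weights.items():
        totals[author] += receipts.get(book, 0.0) * weight
    return dict(totals)


by_author = receipts_by_author(RECEIPTS_BY_BOOK, BOOK_AUTHOR_WEIGHT)
print(by_author)
# With normalized weights the grand total is preserved.
print(sum(by_author.values()) == sum(RECEIPTS_BY_BOOK.values()))
```

Without the weights, summing the full receipts of Book A under both of its authors would
count the same 100 twice when totals by author are added up.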
A multiple arc may enter a dimension rather than any dimensional attribute. For instance,
in the healthcare industry, if you wanted to model a fact related to hospital admissions,
relevant dimensions could be admission dates, admitted patient identifications, and the
departments into which the admission occurred. As a matter of fact, you would find it
useful to be able to aggregate and select admissions on the basis of the diagnoses issued.

FIGURE 5-15 Fact schema for book sales

FIGURE 5-16 Fact schema for hospital admissions

But multiple diagnoses often correspond to a single admission, each one belonging to a
category. Figure 5-16 shows that you can turn the arc entering the diagnosis dimension
into a multiple arc to model this scenario in that fact schema. As discussed in section 5.3.4,
when a multiple arc enters a dimension, the semantics of aggregation becomes more complex.

5.2.6 Optional Arcs


Optionality is used to model scenarios for which an association represented in a fact schema
is not defined for a subset of events. Optional arcs are marked with a dash. If r is the
optional arc, it is worth singling out two remarkable cases: (1) r determines a dimensional
attribute called a, and (2) r determines a dimension called d. In case (1), if b is the
dimensional attribute that determines a through r, the meaning of optionality is that a and
all the possible descendants in its hierarchy may be undefined for one or more values of b.
For example, Figure 5-8 shows the diet attribute in the sales fact schema. This attribute is
given a value (such as cholesterol-free, gluten-free, or sugar-free) only for food. Its value is
null for the other products. Note that the minimum multiplicity of the association from
PRODUCT to DIET is 0 rather than 1 in the Entity-Relationship representation of Figure 5-9.
In case (2), we state that the d dimension is optional; that is, some primary events exist that
are identified only by the other dimensions. For example, the promotion dimension in the
sales schema is optional. This means that the value of promotion will be 'No promotion'
for some product-store-date combinations.


When two or more optional arcs exit from the same attribute, you can specify their
coverage, that is, establish a relationship between the different options involved. In a similar
way to the coverage of specialization hierarchies in the ERM, two parameters, independent of
each other, can characterize the coverage of a set of optional arcs. Let a be a dimensional
attribute from which optional arcs exit toward the b1, ..., bm child attributes:
* The coverage is total if a value of at least one of the children is linked to each value
  of a. If, instead, values of a exist for which all of the children are undefined, the
  coverage is called partial.
* The coverage is disjoint if you have a value for at most one of the children
  corresponding to each value of a. If, instead, values of a exist linking to values of
  two or more children, the coverage is called overlapping.

Altogether, you have four types of coverage, marked with T-D, T-O, P-D, and P-O.
Figure 5-17 shows an example of optionality marked with its coverage. With three types

FIGURE 5-17 Example of coverage for a set of optional arcs

of products (food, clothing, and household), coverage proves to be partial and disjoint
because expirationDate and size are defined only for food and clothing, respectively.
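
The two coverage parameters can be derived mechanically from instance data. The sketch below
assumes that each value of the parent attribute is given as a dictionary in which an undefined
child is stored as None; the sample rows mirror the food, clothing, and household example, but
the concrete values are hypothetical.

```python
def coverage(rows, children):
    """Classify the coverage of a set of optional arcs as T-D, T-O, P-D, or P-O.

    rows     -- one dict per value of the parent attribute
    children -- names of the optional child attributes
    """
    defined_counts = [
        sum(row.get(child) is not None for child in children) for row in rows
    ]
    total = all(count >= 1 for count in defined_counts)     # no value lacks all children
    disjoint = all(count <= 1 for count in defined_counts)  # no value has two children
    return ("T" if total else "P") + "-" + ("D" if disjoint else "O")


# Hypothetical product types with their optional attributes.
PRODUCT_TYPES = [
    {"type": "food", "expirationDate": "2009-06-30", "size": None},
    {"type": "clothing", "expirationDate": None, "size": "M"},
    {"type": "household", "expirationDate": None, "size": None},
]

print(coverage(PRODUCT_TYPES, ["expirationDate", "size"]))
```

Dropping the household row would make the coverage total and disjoint, because every
remaining type would then have exactly one of the two optional attributes defined.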

5.2.7 Incomplete Hierarchies


Let a1, ..., an be a sequence of attributes that form a path in a fact schema hierarchy (for
example, city, county, state, country). We have so far implicitly taken for granted that
each value of a1 corresponds to a specific value of each one of the other path attributes. In
the example, this is certainly true for every American city. However, this assumption may
be wrong, and it quite frequently is. Think of the cases of two countries: the United
Kingdom and Vatican City. State breakdown is not defined for the United Kingdom. In the
Vatican's case, county and city breakdown are not even defined (Figure 5-18).
An incomplete hierarchy is one in which one or more levels of aggregation prove missing
in some instances (because they are not known or defined). Figure 5-19 shows that we can
mark all of the attributes whose values may be missing with a dash to distinguish an
incomplete hierarchy graphically. To designate this type of hierarchy in the literature, the
term ragged hierarchy is sometimes used (Niemi et al., 2001). Each level of aggregation has
precise and consistent semantics, but different instances may have different lengths because
one or more levels are missing, making the parent-child relationships among the levels not
uniform. For example, the parent of 'Orange' belongs on the state level and the parent of
'Norfolk' on the country level.
It is important that you understand the difference between an incomplete hierarchy
and an optional arc. In incomplete hierarchies, one or more attribute values for certain
hierarchy instances in any hierarchy position (the start and end attributes included) are missing.
If we use an optional arc, we instead model the fact that the value of an attribute and the
values of all of its descendants are missing. If only the end attribute of a hierarchy is
missing, both approaches to modeling are completely equivalent (see Figure 5-20).

FIGURE 5-18 Irregular breakdown into states and counties (levels shown: country, state,
county, city)



FIGURE 5-19 Incomplete geographic hierarchy

FIGURE 5-20 Two equivalent modeling options for a hierarchy

5.2.8 Recursive Hierarchies


Unlike incomplete hierarchies, parent-child relationships among the levels in recursive
(or unbalanced) hierarchies are consistent, but their instances can have different lengths.
For example, think of a company organization chart that maps a reporting structure of
employees. Figure 5-21 shows the hierarchy of employee roles if some branches in the
organizational chart have different lengths. Although those parent-child relationships are
consistent throughout the two branches, their levels of aggregation are not equivalent from
the conceptual viewpoint: the president's administrative assistant and the chief executive
officer have quite different responsibilities.


Figure 5-22 shows how the DFM represents a recursive hierarchy. In this fact schema, each
employee's work hours are measured daily on various projects sorted out by activity type.
Each employee has a role , and more than one employee may hold a role. It is interesting to
model hierarchy relationships among employees as implied in the company organization
chart. The loop-back in employee emphasizes that you cannot distinguish the various
levels of aggregation semantically at the fact schema level.
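
Aggregation along a recursive hierarchy can be sketched by walking the loop-back association
upward. The example below is only an illustration: the employee names, the reports-to
mapping, and the hours are hypothetical, and the mapping is assumed to contain no cycles.

```python
from collections import defaultdict

# Hypothetical loop-back on employee: employee -> direct superior (None at the top).
REPORTS_TO = {"Ann": None, "Bob": "Ann", "Cid": "Ann", "Dee": "Bob", "Eve": "Dee"}

# Hypothetical hours worked by each employee, already summed over dates and projects.
HOURS = {"Ann": 2, "Bob": 5, "Cid": 8, "Dee": 7, "Eve": 6}


def hours_with_subordinates(reports_to, hours):
    """Total hours of each employee plus all direct and indirect subordinates."""
    totals = defaultdict(int)
    for employee, worked in hours.items():
        node = employee
        while node is not None:  # climb the chart; branches may differ in length
            totals[node] += worked
            node = reports_to[node]
    return dict(totals)


print(hours_with_subordinates(REPORTS_TO, HOURS))
```

The branch through Bob is three levels deep while the branch through Cid has a single level,
yet the same traversal handles both, which is why no fixed sequence of levels is needed.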

FIGURE 5-21 Role hierarchy in an unbalanced company organization chart (roles shown:
President; CEO, President's assistant; Production manager, Marketing manager; Product
engineer, Marketing assistant; Creative)

FIGURE 5-22 Recursive hierarchy for a company organization chart

5.2.9 Additivity
Aggregation requires the definition of a suitable operator to compose the measure values
that mark primary events into values to be assigned to secondary events. From this
viewpoint, measures can be classified into three categories (Lenz and Shoshani, 1997):
* Flow Measures refer to a timeframe, at the end of which they are evaluated
  cumulatively. Some examples are the number of products sold in a day, monthly
  receipts, and yearly number of births.
* Level Measures are evaluated at particular times. Some examples are the number of
  products in inventory and the number of inhabitants in a city.
* Unit Measures are evaluated at particular times but are expressed in relative terms.
  Some examples are product unit price, discount percentage, and currency exchange.

The usable operators for aggregating different types of measures along temporal and
nontemporal hierarchies are shown in Table 5-2.

Additivity
A measure is called additive along a dimension when you can use the SUM operator
to aggregate its values along the dimension hierarchy. If this is not the case, it is called
non-additive. A non-additive measure is non-aggregable when you can use no aggregation
operator for it.

Table 5-2 clearly shows a general rule: flow measures are additive along all the
dimensions, level measures are non-additive along temporal dimensions, and unit measures
are non-additive along all the dimensions.
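
The general rule can be written down as a small lookup. This is only a sketch: the
dictionary transcribes Table 5-2, while the function and key names are introduced here for
illustration.

```python
# Valid operators per (measure type, hierarchy kind), transcribed from Table 5-2.
VALID_OPERATORS = {
    ("flow", "temporal"): {"SUM", "AVG", "MIN", "MAX"},
    ("flow", "nontemporal"): {"SUM", "AVG", "MIN", "MAX"},
    ("level", "temporal"): {"AVG", "MIN", "MAX"},
    ("level", "nontemporal"): {"SUM", "AVG", "MIN", "MAX"},
    ("unit", "temporal"): {"AVG", "MIN", "MAX"},
    ("unit", "nontemporal"): {"AVG", "MIN", "MAX"},
}


def is_additive(measure_type, hierarchy_kind):
    """A measure is additive along a hierarchy when SUM is a valid operator."""
    return "SUM" in VALID_OPERATORS[(measure_type, hierarchy_kind)]


for measure_type in ("flow", "level", "unit"):
    for kind in ("temporal", "nontemporal"):
        print(measure_type, kind, is_additive(measure_type, kind))
```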
An example of an additive flow measure in the sales schema is quantity. The quantity
sold in a month is the sum of quantities sold every day in a month. Figure 5-23 shows an
example of a non-additive level measure. The inventory level is not additive along date,
but it is additive along the other dimensions. An example of a non-additive unit measure is

TABLE 5-2 Valid Aggregation Operators for Three Types of Measures (Lenz, 1997)

                  Temporal Hierarchies    Nontemporal Hierarchies
Flow Measures     SUM, AVG, MIN, MAX      SUM, AVG, MIN, MAX
Level Measures    AVG, MIN, MAX           SUM, AVG, MIN, MAX
Unit Measures     AVG, MIN, MAX           AVG, MIN, MAX
FIGURE 5-23 Fact schema for an inventory

unitPrice in the sales schema (Figure 5-8). In any case, you can use other operators, such
as AVG, MAX, and MIN, to aggregate those non-additive measures. For example, it makes
sense to average unitPrice for more than one product, store, and date.
On the contrary, you cannot intrinsically aggregate non-aggregable measures for
conceptual reasons. For example, look at the numberOfCustomers measure in the sales
fact schema (Figure 5-8). This measure is estimated for a certain product, store, and day,
counting the number of sale receipts issued on that day in that store, which refer to that
product. Totaling or averaging the number of customers for two or more products would
lead to an inconsistent result, because the same sales receipt may also include other
products. Then numberOfCustomers is non-aggregable along the product dimension
(while it is additive along date and store). In this case, it is non-aggregable because the
association between sales receipts and products is many-to-many instead of many-to-one.
You cannot aggregate the numberOfCustomers measure consistently along the product
dimension, no matter the aggregation operator adopted, unless event granularity is set to
a finer level.
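
A small sketch makes the inconsistency concrete. The receipt identifiers and receipt lines
below are hypothetical; they stand for the sales of one store on one day.

```python
# Hypothetical receipt lines for one store and one day: (receipt id, product).
RECEIPT_LINES = [
    ("r1", "CleanHand"),
    ("r1", "Shiny"),
    ("r2", "CleanHand"),
    ("r3", "Shiny"),
]


def customers_per_product(lines):
    """numberOfCustomers at the primary level: distinct receipts per product."""
    receipts = {}
    for receipt, product in lines:
        receipts.setdefault(product, set()).add(receipt)
    return {product: len(ids) for product, ids in receipts.items()}


def customers_overall(lines):
    """Distinct receipts regardless of product; this needs the finer-grained data."""
    return len({receipt for receipt, _ in lines})


per_product = customers_per_product(RECEIPT_LINES)
print(per_product)                # one count per product
print(sum(per_product.values()))  # summing counts receipt r1 twice
print(customers_overall(RECEIPT_LINES))
```

The sum over products overstates the number of customers because receipt r1 contains both
products, and no operator applied to the per-product counts alone can recover the true figure.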
Additivity is the most frequent case. So in order to simplify graphic notation in the
DFM, you should explicitly represent only the exceptions. Figure 5-8 shows that a measure
connects to the dimensions along which it is non-additive via a dashed line labeled with
the usable aggregation operators. If a measure shows the same type of additivity on all of
the dimensions, you can mark the aggregation operator on its side. If the total amount of
non-additivity were such that it reduced schema readability, you should adopt a matrix
representation. Table 5-3 shows the equivalent additivity matrix for the fact schema of
Figure 5-8.

                     product   date   store   promotion
quantity             SUM       SUM    SUM     SUM
receipts             SUM       SUM    SUM     SUM
numberOfCustomers    -         SUM    SUM     SUM
unitPrice            AVG       AVG    AVG     AVG

TABLE 5-3 Additivity Matrix of the Sales Fact Schema

5.3 Events and Aggregation


In this section, we abandon the intensional DFM level and move on to the extensional
level in order to return to the concept of the event, the basis of aggregation. In other words,
we'll now describe the instances that populate fact schemata.
We have already shown how a primary event, the basic unit that can represent information
in a fact schema, corresponds to a specific fact occurrence, which you can clearly identify if
you give each dimension a value. We will call a primary event coordinate the n-ple of
dimension values that identifies it. Each measure takes exactly one value in a primary event.
For example, the coordinate
α = (product: 'CleanHand', store: 'EverMore', date: '5/5/08')
in the fact schema example in Figure 5-6 defines a single primary event
that represents the CleanHand soap sales in the EverMore store on May 5, 2008. The
measure values of this event could be: quantity: 20, receipts: 80, unitPrice: 4,
numberOfCustomers: 3.
A cube is a set of primary events associated with different coordinates. However, not every
possible coordinate corresponds to a primary event. In other words, cubes are generally quite sparse,
and just a small percentage of coordinates actually define events. For example, only a few products
are sold every day in all the stores. We postpone the discussion on how to evaluate sparsity to
Chapter 7, where we deal with fundamental questions on workload and data volume.
The shades of meaning in missing events can vary from one fact schema to the other.
In the sales schema of Figure 5-6, a missing event means that a product in a certain store on
a certain date remained unsold. In the shipments schema of Figure 5-10, a missing event
means that a specific warehouse did not ship a specific product from an order on a certain
date to a specific destination. In the telephone calls schema of Figure 5-13, it means that at a
certain time on a certain day, two telephone numbers were not connected. In the hospital
admissions schema of Figure 5-16, it means that on a specific day, a patient was not
admitted to a specific department. In the inventory schema of Figure 5-23, it means that a
warehouse ran out of product supplies on a specific day.

Generally speaking, the meaning of missing events depends on the existing relationship
between primary events and transactions in operational database sources. Typically, the
information modeled in a fact schema sums up data in operational databases. This means
that each primary event summarizes one or more transactions that actually occurred in an
application domain in a unit of time. If this is the case, that fact schema is said to be a
lossy-grained fact schema. The sales schema is an example of a lossy-grained schema, because each
sales event modeled in that schema sums up all the sales of one product on the same day at
the same store. If this is not the case, event representation granularity in fact schemata
coincides with that of operational databases. We would then say that this kind of fact
schema is a lossless-grained schema. An example of a lossless-grained schema is the one that
exploits the date, department, and patient dimensions to model admissions to a
hospital, if we assume that the same patient cannot be admitted twice on the same day into
the same department.
Based on this distinction, we can say the following:

• In a lossy-grained fact schema, an event is missing because its measures would
  have irrelevant values, typically equal to zero. For instance, this applies to sales,
  telephone calls, and inventory.

• In a lossless-grained fact schema, an event is missing because it did not occur in the
  application domain. For instance, this applies to admissions and shipments.
After a careful examination, we can conclude that some relevant pieces of information
may not be represented in lossy-grained schemata. Look at the sales, for example. If a
missing event shows that a product is still unsold, how can you represent that a product is
or is not for sale in a store on a certain day? From the conceptual viewpoint, the best thing
to do is to place a new empty coverage schema with product, date, and store dimensions
next to the sales schema. Each coverage schema event should represent that a specific
product was actually for sale at a specific store on a certain date. This empty schema
corresponds to the coverage table defined by Kimball (1996). To compare it with the sales
schema, see section 5.5 on overlapping procedures that allow you to answer queries such as
which products for sale are still unsold?
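
At the instance level, the question amounts to a set difference between coverage coordinates
and sales coordinates. The sketch below is a simplification of the overlap discussed in
section 5.5; the events are hypothetical, and both schemata are assumed to share the same
(product, store, date) coordinates.

```python
# Hypothetical coverage events: coordinates of products actually for sale.
COVERAGE_EVENTS = {
    ("CleanHand", "EverMore", "5/5/08"),
    ("Shiny", "EverMore", "5/5/08"),
    ("Scent", "EverMore", "5/5/08"),
}

# Hypothetical sales events: coordinate -> measure values (only sold products appear).
SALES_EVENTS = {
    ("CleanHand", "EverMore", "5/5/08"): {"quantity": 20, "receipts": 80},
    ("Shiny", "EverMore", "5/5/08"): {"quantity": 5, "receipts": 10},
}


def unsold(coverage_events, sales_events):
    """Coordinates that are covered (for sale) but have no sales event."""
    return sorted(c for c in coverage_events if c not in sales_events)


print(unsold(COVERAGE_EVENTS, SALES_EVENTS))
```

Because the coverage schema is empty (it has no measures), a set of coordinates is enough to
represent it here, while the sparse sales cube is a mapping from coordinates to measures.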
As a matter of fact, you could also use only the sales schema if you created an event
that corresponds to each product for sale on one day in one store and used those events
whose measures are null to explicitly represent the products still unsold. It is clear that
this solution leads to a remarkable amount of wasted space if the number of unsold
products is high.
As we mentioned, often you cannot analyze data at the maximum level of detail, and
this results in the need for primary events to be aggregated along different abstraction levels.
According to OLAP terminology, aggregation is called roll-up, and group-by set is the key
concept used to define it.

Group-by Set
A group-by set is any subset of dimensional attributes in a fact schema that does not
contain two attributes related by a functional dependency. The group-by set that
includes only and all of the dimensions of a fact is said to be its primary group-by set.
All of the others are called secondary group-by sets and identify potential ways to
aggregate primary events.

The primary group-by set for the sales fact schema of Figure 5-6 is G0 = {product,
store, date}. Some examples of secondary group-by sets are shown here:
G1 = {product, state, quarter}
G2 = {type, brand, store, month, day}
G3 = {country, date}
G4 = {year}
G5 = {}
Note that no attributes appear for one or more hierarchies in some of these group-by
sets. For example, no attributes appear for the product hierarchy in G3, and no attributes are
present for the product and store hierarchies in G4. The G5 group-by set is an extreme case.
It has no attributes at all, and we can call it the empty group-by set. The {product, category,
date} attribute set is not a group-by set, since product → category.
It is fundamental to note that the set of all possible group-by sets in a fact schema is
linked by a partial order relationship called roll-up.
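
The definition of a group-by set can be tested with a short sketch. The hierarchy fragment
below is an assumption reconstructed from the attributes mentioned in the text (in particular,
day is assumed to be a child of date that is unrelated to month); it is not a transcription of
Figure 5-6.

```python
# Assumed fragment of the sales hierarchies: attribute -> attributes it directly determines.
PARENTS = {
    "product": ["type", "brand"],
    "type": ["category"],
    "store": ["storeCity"],
    "storeCity": ["state"],
    "state": ["country"],
    "date": ["month", "day"],
    "month": ["quarter"],
    "quarter": ["year"],
}


def ancestors(attr):
    """All attributes functionally determined by attr (transitively)."""
    seen = set()
    stack = list(PARENTS.get(attr, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def is_group_by_set(attrs):
    """True when no two attributes in attrs are related by a functional dependency."""
    attrs = list(attrs)
    for i, a in enumerate(attrs):
        for b in attrs[i + 1:]:
            if b in ancestors(a) or a in ancestors(b):
                return False
    return True


print(is_group_by_set({"product", "state", "quarter"}))
print(is_group_by_set({"product", "category", "date"}))
```

The second call fails the test because product determines category through type, which is
exactly the reason given in the text.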

Roll-up Order
Given two group-by sets called Gi and Gj and belonging to the set of all the possible
group-by sets in a fact schema, we say that Gi ≤ Gj when Gj → Gi, that is, when a
b attribute exists in Gj for each a attribute of Gi, such that b belongs to the same
hierarchy as a and b → a.
For example, the following relationships between group-by sets hold:
{year} ≤ {type, quarter} ≤ {product, quarter} ≤ {product, store, date}
{state} ≤ {type, brand, store, month, day} ≤ {product, store, date}
A maximum element, corresponding to the primary group-by set of the fact schema, and a
minimum element, corresponding to the empty group-by set, always exist in roll-up orders.
Additionally, for each pair of group-by sets, both a superior (sup) and an inferior (inf) group-by
set always exist in roll-up orders. From the algebraic viewpoint, roll-up characterizes a lattice.
Figure 5-24 shows a small fragment of the roll-up lattice for the sales schema of Figure 5-6. In
real-world applications, it is easy to imagine that the entire lattice can be huge.
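
The roll-up order can be checked directly from the functional dependencies. The following
sketch uses an assumed fragment of the sales hierarchies (the same attributes named in the
text, with day assumed to be a child of date); the function names are introduced here for
illustration.

```python
# Assumed fragment of the sales hierarchies: attribute -> attributes it directly determines.
PARENTS = {
    "product": ["type", "brand"],
    "type": ["category"],
    "store": ["storeCity"],
    "storeCity": ["state"],
    "state": ["country"],
    "date": ["month", "day"],
    "month": ["quarter"],
    "quarter": ["year"],
}


def closure(attr):
    """attr together with every attribute it functionally determines."""
    found = {attr}
    stack = [attr]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def rolls_up_to(g_i, g_j):
    """True when g_i <= g_j: each attribute of g_i is determined by some attribute of g_j."""
    return all(any(a in closure(b) for b in g_j) for a in g_i)


PRIMARY = {"product", "store", "date"}
print(rolls_up_to({"year"}, {"type", "quarter"}))
print(rolls_up_to({"state"}, {"type", "brand", "store", "month", "day"}))
print(rolls_up_to(set(), PRIMARY))       # the empty group-by set is the minimum
print(rolls_up_to({"product"}, {"type"}))  # a finer set never rolls up to a coarser one
```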

{product, store, date}
{type, store, date}   {product, store, month, day}   {product, storeCity, date}
{type, brand, store, month, day}   {product, store, month}
{product, storeCity, month}
{product, state, month}   {product, storeCity, quarter}
{type, store, month}   {product, state, quarter}
{product, quarter}
{type, quarter}   {product, year}
{type, year}
{category, year}
{state}   {category}   {year}
{}

FIGURE 5-24 A part of the roll-up lattice for the sales fact schema

A coordinate of a group-by set G is an n-ple of values of the dimensional attributes in G.
We already showed an example of a coordinate α of the sales primary group-by set G0.
Some examples of coordinates of the G1 and G3 secondary group-by sets are, respectively:
α' = (product: 'CleanHand', state: 'Texas', quarter: 'I, 2007')
α'' = (country: 'France', date: '5/5/2008')

We recall that each dimension value determines exactly one value for each attribute in
its hierarchy by means of functional dependencies. In this way, each coordinate of the
primary group-by set determines one coordinate of each secondary group-by set, and vice
versa, one coordinate of a secondary group-by set corresponds to a set of coordinates of the
primary group-by set.
Each coordinate α of a given secondary group-by set uniquely identifies a secondary
event that aggregates all the primary events identified by the coordinates of the primary
group-by set corresponding to α. For example, the secondary event identified by the α''
coordinate of the G3 group-by set aggregates the primary events related to the sales of all
products on May 5, 2008, in the French stores. The empty group-by set G5 has one single
secondary event that aggregates all the primary events.
Each secondary event then expresses a value for each measure. In order to define how to
calculate the value of each measure for secondary events, we need to specify aggregation
semantics. We provide a gradual solution for this requirement in the following sections.
We begin with the simplest case and then introduce new situations to outline the global
framework for advanced constructs of the DFM.

5.3.1 Aggregating Additive Measures


Measures are additive along all the dimensions under the simplest (luckily most common)
conditions. If this is the case, the value of a measure m for a secondary event can be
calculated by summing the values of m in all the corresponding primary events.
It is well known that the associative property holds for the SUM operator. Thanks to this
property, you can also calculate the sum of a set of values as the total of their partial sums.
As a result, the measure values of the secondary events for the G' group-by set can also be
calculated by summing the measure values of the secondary events for any G'' secondary
group-by set such that G' ≤ G''. Section 9.2 shows that this has important consequences on
logical design because it makes it possible to answer a query using aggregate views at
different levels of aggregation.
Table 5-4 shows the values of the quantity measure in the primary events of a sales
fact schema with the product and quarter dimensions. The events are represented
within a two-dimensional matrix, as they would be in an OLAP tool. Each event is indexed
by its coordinate, which is represented by the values of its attributes (quarter and
product in Table 5-4). In relation to the values of the coordinate attributes, the values of
other relevant attributes in the same hierarchy are also shown (the year of a quarter, the
type and category of a product), as defined by the functional dependencies. Tables 5-5,
5-6, and 5-7 show the corresponding aggregations for different group-by sets. SUM is
used as the aggregation operator. Note that the {category, year} aggregation can be
equivalently obtained from any of the three other aggregations thanks to the associative
property of the SUM operator. Section 5.3.2 will show that you cannot always apply this
property when you use other aggregation operators.
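
The equivalence can be verified with a short sketch that uses the house-cleaning rows of
Table 5-4, with cell values transcribed so that they agree with the totals in Tables 5-5 to 5-7.
The function names and the data layout are introduced here for illustration only.

```python
from collections import defaultdict

QUARTERS = ["I'07", "II'07", "III'07", "IV'07", "I'08", "II'08", "III'08", "IV'08"]
YEAR_OF = {q: 2000 + int(q[-2:]) for q in QUARTERS}  # functional dependency quarter -> year

# product -> (type, quantity per quarter); house-cleaning rows of Table 5-4.
PRODUCTS = {
    "Shiny": ("Cleaner", [100, 90, 95, 90, 80, 70, 90, 85]),
    "Bleachy": ("Cleaner", [20, 30, 20, 10, 25, 30, 35, 20]),
    "Brighty": ("Cleaner", [60, 50, 60, 45, 40, 40, 50, 40]),
    "CleanHand": ("Soap", [15, 20, 25, 30, 15, 15, 20, 10]),
    "Scent": ("Soap", [30, 35, 20, 25, 30, 30, 20, 15]),
}
CATEGORY_OF = {"Cleaner": "House cleaning", "Soap": "House cleaning"}


def type_year(products):
    """Secondary events of the {type, year} group-by set, computed from primary events."""
    totals = defaultdict(int)
    for ptype, quantities in products.values():
        for quarter, quantity in zip(QUARTERS, quantities):
            totals[(ptype, YEAR_OF[quarter])] += quantity
    return dict(totals)


def category_year_from_primary(products):
    """{category, year} secondary events summed directly from primary events."""
    totals = defaultdict(int)
    for ptype, quantities in products.values():
        for quarter, quantity in zip(QUARTERS, quantities):
            totals[(CATEGORY_OF[ptype], YEAR_OF[quarter])] += quantity
    return dict(totals)


def category_year_from_type_year(type_year_events):
    """{category, year} secondary events summed from the {type, year} partial sums."""
    totals = defaultdict(int)
    for (ptype, year), quantity in type_year_events.items():
        totals[(CATEGORY_OF[ptype], year)] += quantity
    return dict(totals)


partial = type_year(PRODUCTS)
print(partial)
print(category_year_from_primary(PRODUCTS) == category_year_from_type_year(partial))
```

Both routes yield the same house-cleaning totals, which is what allows a query on
{category, year} to be answered from a precomputed {type, year} aggregate.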



year                                           2007                          2008
category        type           product         I'07  II'07  III'07  IV'07    I'08  II'08  III'08  IV'08
House cleaning  Cleaner        Shiny            100     90      95     90      80     70      90     85
House cleaning  Cleaner        Bleachy           20     30      20     10      25     30      35     20
House cleaning  Cleaner        Brighty           60     50      60     45      40     40      50     40
House cleaning  Soap           CleanHand         15     20      25     30      15     15      20     10
House cleaning  Soap           Scent             30     35      20     25      30     30      20     15
Food            Dairy product  F Slurp Milk      90     90      85     75      60     80      85     60
Food            Dairy product  U Slurp Milk      60     80      85     50      70     70      75     65
Food            Dairy product  Slurp Yogurt      20     30      40     35      30     35      35     20
Food            Drink          DrinkMe           20     10      25     30      35     30      20     10
Food            Drink          Coky              50     60      45     40      50     60      45     40

TABLE 5-4 Primary Events of the Sales Cube

year                    2007                          2008
category        I'07  II'07  III'07  IV'07    I'08  II'08  III'08  IV'08
House cleaning   225    225     220    200     190    185     215    170
Food             240    270     280    240     245    275     260    195

TABLE 5-5 The {category, quarter} Group-by Set Secondary Events

5.3.2 Aggregating Non-additive Measures


If a measure can be aggregated using the same aggregation operator along all of the dimensions,
even though the operator is not the SUM (for example, AVG or MAX), the preceding
instructions still apply. You can apply the operator to the values of measure m in all the

TABLE 5-6 The {type, year} Group-by Set Secondary Events

category        type           2007   2008
House cleaning  Cleaner         670    605
House cleaning  Soap            200    155
Food            Dairy product   750    685
Food            Drinks          280    290

TABLE 5-7 The {category, year} Group-by Set Secondary Events

category        2007   2008
House cleaning   870    760
Food            1030    975

primary events to calculate the measure values for each secondary event. However, we
should add that not all aggregation operators show the same useful associative property
holding for sums. From this viewpoint, Gray and others (1997) classified aggregation
operators into three groups:
* Distributive: Calculating aggregates from partial aggregates
* Algebraic: Requiring the usage of additional information in the form of a finite
  number of support measures to correctly calculate aggregates from partial aggregates
* Holistic: Calculating aggregates from partial aggregates only via an infinite
  number of support measures

Some examples of distributive operators are the previously mentioned SUM operator, as
well as the MIN and MAX operators; the considerations made in the preceding section
apply to these operators. Some examples of algebraic operators are the average, the standard
deviation, and the barycenter operators. If you add all the required support measures to
every secondary event, you can still calculate secondary events from other more fine-grained
secondary events.
Let's look at the case of the AVG operator. Generally, it is clear that the average of partial
averages of a set of values is different from the average of the same set. Tables 5-8, 5-9, and
5-10 show examples of the unitPrice measure of SALE. The AVG operator can be used to
aggregate unitPrice along all the dimensions. It is apparent that there is no way to obtain

year                                        2009
category        type     product     I'09  II'09  III'09  IV'09
House cleaning  Cleaner  Shiny          2      2     2.2    2.5
House cleaning  Cleaner  Bleachy      1.5    1.5       2    2.5
House cleaning  Cleaner  Brighty        -      3       3      3
House cleaning  Soap     CleanHand      1    1.2     1.5    1.5
House cleaning  Soap     Scent        1.5    1.5       2      -

TABLE 5-8 Primary Events of the Sales Cube: A Dash Stands for the Unsold Items in a Quarter

year                              2009
category        type      I'09  II'09  III'09  IV'09
House cleaning  Cleaner   1.75   2.17    2.40   2.67
House cleaning  Soap      1.25   1.35    1.75   1.50
Average:                  1.50   1.76    2.08   2.09

TABLE 5-9 The {type, quarter} Group-by Set Secondary Events

TABLE 5-10 The {category, quarter} Group-by Set Secondary Events

year                    2009
category        I'09  II'09  III'09  IV'09
House cleaning  1.50   1.84    2.14   2.38

the proper aggregation by category and quarter (Table 5-10) from the aggregation by
type and quarter (Table 5-9) unless you add a new measure that counts the number of
primary events that make up each secondary event. COUNT is the support measure for
AVG. See section 9.1.9 for more details of the impacts on logical design.
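
The role of COUNT as a support measure can be sketched in a few lines of Python. This is only an illustration, not the book's procedure: the prices are the II'09 column of Table 5-8, and the names prices_by_type and partial are hypothetical.

```python
# Unit prices for quarter II'09, read from Table 5-8 and grouped by type.
prices_by_type = {
    "Cleaner": [2.0, 1.5, 3.0],   # Shiny, Bleachy, Brighty
    "Soap": [1.2, 1.5],           # CleanHand, Scent
}

def partial(values):
    """Secondary event for one type: the AVG plus its COUNT support measure."""
    return sum(values) / len(values), len(values)

partials = {t: partial(v) for t, v in prices_by_type.items()}

# Averaging the partial averages ignores how many events each one summarizes.
naive = sum(avg for avg, _ in partials.values()) / len(partials)

# Weighting each partial average by its COUNT recovers the true average.
total_count = sum(cnt for _, cnt in partials.values())
correct = sum(avg * cnt for avg, cnt in partials.values()) / total_count

print(round(naive, 2), round(correct, 2))
```

The weighted result agrees with the House cleaning value for II'09 computed directly from the five primary events, while the plain average of the two partial averages does not.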
Some examples of holistic operators are median and rank. Secondary events must be
calculated from primary events because they cannot use a finite number of support
measures to calculate aggregates from partial aggregates.
Things get more complex when different operators are used to aggregate a measure
along different dimensions. For example, Figure 5-23 shows the level measure in the
INVENTORY fact schema. The MIN operator is used to aggregate level along the date
dimension, and the SUM operator is used to aggregate it along the warehouse dimension.
Table 5-11 shows some primary events with reference to a single product. Tables 5-12 and
5-13 show the secondary events obtained by aggregating along only one dimension. If you
instead have to aggregate simultaneously along both dimensions (for example, by month

month               March 2009
date                3/1/09  3/2/09  3/3/09  3/4/09  3/5/09  3/6/09  3/7/09  3/8/09  3/9/09
city   warehouse
Paris  Defense      10      10      8       4       20      20      15      15      12
       Elysée       5       4       4       4       2       2       2       10      10
       Reuilly      14      14      14      12      20      20      20      20      16
Lyon   Villette     4       2       2       2       10      10      10      8       8
       Ainay        4       20      20      15      15      12      12      10      9

TABLE 5-11  Primary Events in the Inventory Schema

month     March 2009
date      3/1/09  3/2/09  3/3/09  3/4/09  3/5/09  3/6/09  3/7/09  3/8/09  3/9/09
city
Paris     29      28      26      20      42      42      37      45      38
Lyon      8       22      22      17      25      22      22      18      17

TABLE 5-12  The {date, city} Group-by Set Secondary Events



TABLE 5-13  The {month, warehouse} Group-by Set Secondary Events

month               March 2009
city   warehouse
Paris  Defense      4
       Elysée       2
       Reuilly      12
Lyon   Villette     2
       Ainay        4

and city), this causes a problem. Will each secondary event have to be calculated as the
minimum of the sums or as the sum of the minimums? It is obvious that the results
obtained in both cases are generally different! In practice, you must choose a priority order
for dimensions to decide which aggregation should come first. Note that in the specific
example of Table 5-11, the result of the aggregation by month and city would have
seemed to be independent of the application order of operators if we had used the AVG
operator instead of MIN to aggregate along date. As a matter of fact, this applies only
because no primary events are missing. If this is not the case, results still depend on the
operator priority order.
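
The effect of the priority order can be checked with a short Python sketch. The inventory levels are those of Table 5-11 as reconstructed from Tables 5-11 and 5-12; the function names min_of_sums and sum_of_mins are hypothetical.

```python
# Inventory levels of one product for March 1-9, 2009 (Table 5-11).
levels = {
    ("Paris", "Defense"):  [10, 10, 8, 4, 20, 20, 15, 15, 12],
    ("Paris", "Elysee"):   [5, 4, 4, 4, 2, 2, 2, 10, 10],
    ("Paris", "Reuilly"):  [14, 14, 14, 12, 20, 20, 20, 20, 16],
    ("Lyon", "Villette"):  [4, 2, 2, 2, 10, 10, 10, 8, 8],
    ("Lyon", "Ainay"):     [4, 20, 20, 15, 15, 12, 12, 10, 9],
}

def min_of_sums(city):
    """SUM along warehouse first, then MIN along date."""
    rows = [v for (c, _), v in levels.items() if c == city]
    return min(sum(day) for day in zip(*rows))

def sum_of_mins(city):
    """MIN along date first, then SUM along warehouse."""
    return sum(min(v) for (c, _), v in levels.items() if c == city)

for city in ("Paris", "Lyon"):
    print(city, min_of_sums(city), sum_of_mins(city))
```

For both cities the two orders give different {month, city} values, which is why a priority order has to be fixed at design time.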
To conclude this section, we would like to consider the case of derived measures, whose
values can be calculated from other measures in the same schema. For example, the
receipts = unitPrice × quantity measure is derived from the unitPrice and
quantity measures. If a derived measure is additive, as in the case of receipts, you can
calculate the aggregation from partial aggregations (the annual receipts are the sum of
monthly receipts). However, you cannot calculate the aggregation from aggregations of its
component measures. For instance, you cannot multiply the amount of items sold in a year
by the average unit price during the same year to calculate annual receipts accurately!
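
A minimal numeric sketch of this point follows. The two monthly events are invented for illustration and do not come from the book's tables.

```python
# (month, unitPrice, quantity): made-up primary events for one product.
sales = [
    ("Jan", 2.0, 100),
    ("Feb", 3.0, 10),
]

# The derived measure is computed per event and then summed (it is additive).
receipts = [price * qty for _, price, qty in sales]
annual_receipts = sum(receipts)

# Combining aggregates of the component measures gives a different number.
total_quantity = sum(qty for _, _, qty in sales)
average_price = sum(price for _, price, _ in sales) / len(sales)
wrong_receipts = total_quantity * average_price

print(annual_receipts, wrong_receipts)  # 230.0 275.0
```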

5.3.3 Aggregating with Convergence and Cross-dimensional Attributes


A convergence in a fact schema is completely transparent to the goals of aggregation. For
example, if we look at the hierarchy built on the store dimension of Figure 5-8, including
a convergence based on country, we can immediately note that the aggregation along
country is clearly defined because the convergence results in each store corresponding
precisely to a country. In addition to that, you can calculate secondary events of any group-
by set that includes country from the events of any more fine-grained group-by set that
includes either storeCity, state, or salesDistrict.

If your fact schema includes a cross-dimensional attribute, you should use similar
reasoning to check for aggregation semantics. Figure 5-25 shows a simplified sales schema.
Here, the category and country parents jointly define the VAT cross-dimensional
attribute. Each primary event is associated with a product and a store, and, consequently,
a category and a country. For this reason, the secondary events of the group-by sets that
include VAT are uniquely defined because the VAT value for each primary event is clearly
defined. Figure 5-26 sums up the substantial difference between convergence and
cross-dimensional attributes from the viewpoint of roll-up relationships between group-by sets.

FIGURE 5-25  Cross-dimensional attribute in the sales fact schema

FIGURE 5-26  Roll-up lattices with convergence and cross-dimensional attribute

5.3.4 Aggregating with Optional or Multiple Arcs


If an optional arc exists between two attributes called a and b in a hierarchy, at least one value
of the domain of a corresponds to an undefined value of b. As a result, if G is the primary
group-by set and G' is a secondary group-by set that includes b, a corresponding coordinate of
G' does not exist for some coordinate(s) of G. Conceptually, you can solve this problem if you
use a fictitious 'No Value' value to expand the domain of b that you can give to b every time it
is undefined. Using this expanded domain, roll-up is defined for all the coordinates for all of
the secondary group-by sets including b or one of its descendants in the hierarchy. The optional
arc then results in a set of fictitious secondary events that group all the primary events
excluded from the normal roll-up, that is, those primary events for which b equals 'No Value'.
For example, look at the diet attribute in the sales schema (Figure 5-8), linked to
product via an optional arc. Tables 5-14 and 5-15 show an example of aggregation where
the dash means 'No Value'.
Things get more difficult for a fact schema that includes a multiple arc. Look at Figure 5-15,
which shows the book sales schema in which a multiple arc links book and author. In
particular, relevant queries are those that group sales by book or author. Table 5-16
shows the number measure of some primary events related to a certain date, and Table 5-17
shows the corresponding secondary events grouped by date and author. Note immediately
that totaling all partials of the various authors would lead to an inaccurate result (an estimated
total of 37 books sold against the 30 actually sold) when you calculate secondary events
grouped by date. For this reason, it is not correct here to exploit the distributive property of
the SUM operator, even though the number measure is additive along the book dimension.
TABLE 5-14  Primary Events in the Sales Schema

date                                            1/2/09
diet           type            product
-              Dairy product   Slurp Milk       30
Cholesterol                    Slim Yogurt      15
-                              Fatty Yogurt     20
Cholesterol    Sweets          Strict Cake      10
Macrobiotics                   Gnandi Cake      5

TABLE 5-15  The {diet, date} Group-by Set Secondary Events

date            1/2/09
diet
-               50
Cholesterol     25
Macrobiotics    5

Now let's associate each book-author pair with a weight ranging from 0 to 1 that
expresses the relevance of that pair. Table 5-18 shows a potential weight distribution,
assuming that the contribution of different authors to each book they write is equal. If you
total a weighted sum of the sales contributions for every single book to calculate the
secondary events grouped by {author, date}, you will obtain the results shown in Table 5-19.
From those results, you can easily infer that the use of weights restores the distributive
property of the SUM operator if the weights are normalized for every book, that is, if
the sum of the weights for each book is equal to 1.
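
The weighted roll-up can be sketched as follows, using the sales of Table 5-16 and equal weights among co-authors. The names sold, weights, and by_author are hypothetical, and the code is only one way to express the idea.

```python
# Copies sold per book on one date (Table 5-16).
sold = {
    "Facts & Crimes": 3,
    "Sounds Logical": 5,
    "The Right Measure": 10,
    "Facts, How and Why": 4,
    "The 4th Dimension": 8,
}

# Multiple-arc weights between book and author; each book's weights sum to 1.
weights = {
    "Facts & Crimes": {"Golfarelli": 0.5, "Rizzi": 0.5},
    "Sounds Logical": {"Golfarelli": 1.0},
    "The Right Measure": {"Rizzi": 1.0},
    "Facts, How and Why": {"Golfarelli": 0.5, "Rizzi": 0.5},
    "The 4th Dimension": {"Golfarelli": 1.0},
}

def by_author(weighted):
    """Roll up from book to author, with or without the arc weights."""
    totals = {}
    for book, authors in weights.items():
        for author, w in authors.items():
            share = sold[book] * (w if weighted else 1)
            totals[author] = totals.get(author, 0) + share
    return totals

plain = by_author(weighted=False)      # co-authored books are counted twice
balanced = by_author(weighted=True)    # totals add up to the books actually sold
print(sum(plain.values()), sum(balanced.values()))  # 37 30.0
```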

Now let's examine the case in which the multiple arc enters a dimension, as shown in the
hospital admissions schema of Figure 5-16. Here, more than one diagnosis is associated with

date                                                          1/2/09
genre            author               book
Technical        Golfarelli, Rizzi    Facts & Crimes          3
                 Golfarelli           Sounds Logical          5
Current affairs  Rizzi                The Right Measure       10
                 Golfarelli, Rizzi    Facts, How and Why      4
Science Fiction  Golfarelli           The 4th Dimension       8
Total:                                                        30

TABLE 5-16  Primary Events in the Bookseller Schema

TABLE 5-17  The {author, date} Group-by Set Secondary Events

date          1/2/09
author
Golfarelli    20
Rizzi         17
Total:        37

TABLE 5-18  Multiple Arc Weights Between book and author

                      Golfarelli   Rizzi
Facts & Crimes        0.5          0.5
Sounds Logical        1            0
The Right Measure     0            1
Facts, How and Why    0.5          0.5
The 4th Dimension     1            0

TABLE 5-19  The {author, date} Group-by Set Weighted Secondary Events

date          1/2/09
author
Golfarelli    16.5
Rizzi         13.5
Total:        30

each admission. In particular, the aggregation semantics seems to be rather confusing. More
than one coordinate (each one with a different diagnosis) corresponds to one primary event
(an admission). And, vice versa, a coordinate of the {diagnosis, date, department,
patientSSN} primary group-by set does not define a primary event, but a fraction of it
(the 'contribution' a single diagnosis makes to the admission). To evaluate the aggregation
semantics, we suggest that you first transform the fact schema into an equivalent schema. To
do this, you can follow either of the two procedures described in the following paragraphs.
Section 9.1.4 shows how each procedure leads to a corresponding logical design solution.
Figure 5-27 shows the first equivalent schema. Here, we introduce the fictitious
diagnosisGroup dimension with a domain of all the diagnosis combinations linked to
admissions. This transformed schema is largely similar to that of book sales, where the
multiple arc enters a dimensional attribute. Using a multiple arc weight allows the
admission cost to be distributed over the different diagnoses, as required by the application
domain, so that significant aggregates can be obtained for single diagnoses and for single
categories. Table 5-20 shows the cost measure of a set of primary events related to one
department and one date, and Table 5-21 shows the weights of the multiple arc.
The second schema of Figure 5-28 has been obtained from the original one of Figure 5-16
by using the ADMISSION PER DIAGNOSIS fact rather than the arc entering diagnosis
in order to model the multiplicity of the association between admissions and diagnoses.

FIGURE 5-27  Equivalent fact schema for admissions

patient                                                 Johnson   Smith
diagnosis                             diagnosisGroup
Cardiopathy, Hypertension, Asthma     G1                1,000
Fracture, Asthma                      G2                          2,000

TABLE 5-20  Primary Events in the Admissions Schema

TABLE 5-21  Multiple Arc Weights Between diagnosisGroup and diagnosis

                G1     G2
Cardiopathy     0.5    0
Hypertension    0.3    0
Asthma          0.2    0.3
Fracture        0      0.7

To achieve this result, we reduced the fact granularity. The diagnosis dimension is now
a normal "single" dimension, and each primary event no longer corresponds to an entire
admission, but to the "portion" of an admission that you can assign to a single diagnosis.
We will briefly describe this operation as the diagnosis attribute push-down. Table 5-22
shows the costPerDiagnosis measure of the primary events that instance this schema.
Note that the admission costs have been weighted according to the weights listed in
Table 5-21 to calculate the partial costs in primary events.
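
The push-down can be sketched as a small transformation from admission-level events to diagnosis-level events. The data are those of Tables 5-20 and 5-21; the variable names are hypothetical.

```python
# Admission-level primary events: patient -> (diagnosisGroup, cost), Table 5-20.
admissions = {"Johnson": ("G1", 1000), "Smith": ("G2", 2000)}

# Multiple-arc weights between diagnosisGroup and diagnosis, Table 5-21.
group_weights = {
    "G1": {"Cardiopathy": 0.5, "Hypertension": 0.3, "Asthma": 0.2},
    "G2": {"Fracture": 0.7, "Asthma": 0.3},
}

# Push-down: one event per (patient, diagnosis), carrying a share of the cost.
cost_per_diagnosis = {
    (patient, diagnosis): cost * weight
    for patient, (group, cost) in admissions.items()
    for diagnosis, weight in group_weights[group].items()
}

# diagnosis is now an ordinary dimension, so a plain SUM is safe.
asthma_cost = sum(c for (_, d), c in cost_per_diagnosis.items() if d == "Asthma")
total_cost = sum(cost_per_diagnosis.values())
print(round(asthma_cost), round(total_cost))
```

Because the weights of each group sum to 1, the total of costPerDiagnosis matches the total admission cost.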

5.3.5 Empty Fact Schema Aggregation


A fact schema is said to be empty if it does not have any measures. If this is the case, primary
events only record the occurrence of events in an application domain. For example, Figure 5-29
shows a fact schema in the university domain. Here, each primary event shows that a student
attended a specific course during a specific semester.
The information each secondary event represents in an empty fact schema is normally
the number of primary events corresponding to it, computed by the COUNT³ aggregation
operator. In other words, you can assume that an empty fact schema is described by an
FIGURE 5-28  Equivalent fact schema for admissions resulting from the diagnosis attribute push-down

³ The COUNT aggregation operator is distributive, even if the operator that enables the calculation of a COUNT
aggregate from the partial aggregates is not COUNT itself but rather SUM.

TABLE 5-22  Primary Events in the Admissions Schema after Pushing Down the diagnosis Attribute

patient         Johnson   Smith
diagnosis
Cardiopathy     500
Hypertension    300
Asthma          200       600
Fracture                  1,400

FIGURE 5-29  Fact schema for university course attendance

implicit integer measure that equals 1 if the event occurred or 0 otherwise, and that the
SUM operator aggregates the events. Then, the (course: 'Database Design', gender: 'F')
coordinate defines the secondary event that shows the total number of female students that
took the Database Design course.
An additional approach to aggregation is actually also possible. In this approach, the
information each secondary event carries is linked to the existence of the corresponding
primary events. In order to explain this concept, we can assume that you have an implicit
Boolean measure, which is TRUE if an event has occurred or FALSE if the event has not
occurred. Then you may use both the AND and OR operators for aggregation with universal
and existential semantics, respectively. The secondary event defined by the (student: 'Will
Smith', area: 'Databases', school: 'Engineering') coordinate can then mark that Smith took
at least one course on databases in the School of Engineering (OR operator), or alternatively
that he took all of the courses on databases in the School of Engineering (AND operator).
Table 5-24 compares the different aggregation options based on the simple example of
Table 5-23.
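
The three options can be sketched with Python's sum, any, and all. The attendance sets below are assumptions chosen only so that the counts match those of Table 5-24; which specific courses each student attended is not taken from the book.

```python
# Courses of the Databases area (column headers of Table 5-23).
area_courses = {"Database Design", "Information Systems",
                "Data Structures", "Advanced I.S."}

# Hypothetical attendance: one primary event per (student, course) pair.
attended = {
    "Peter Johnson": set(area_courses),
    "Paul Berry": set(),
    "Frank Armstrong": {"Database Design", "Data Structures"},
    "Mark Stephens": {"Information Systems"},
}

def aggregate(courses_taken):
    """Return the (COUNT, OR, AND) secondary events for one student and area."""
    flags = [course in courses_taken for course in sorted(area_courses)]
    return sum(flags), any(flags), all(flags)

for student, courses in attended.items():
    print(student, aggregate(courses))
```

COUNT sums the implicit 0/1 measure, OR answers "at least one course", and AND answers "all courses of the area".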

area               Databases
course             Database Design   Information Systems   Data Structures   Advanced I.S.
student
Will Smith
Peter Johnson
Paul Berry
Frank Armstrong
Mark Stephens

TABLE 5-23  Primary Events (in Gray) in the Course Attendance Schema for Engineering

area               Databases
                   COUNT   OR      AND
student
Will Smith
Peter Johnson      4       TRUE    TRUE
Paul Berry         0       FALSE   FALSE
Frank Armstrong    2       TRUE    FALSE
Mark Stephens      1       TRUE    FALSE

TABLE 5-24  The {student, area} Group-by Set Secondary Events Using the COUNT (Left), OR
(Middle), AND (Right) Operators for Aggregation

5.3.6 Aggregating with Functional Dependencies among Dimensions


One or more functional dependencies may occur between the dimensions in a fact schema.
For example, a specific store sells each product on a certain date through at most one
promotion in the sales schema of Figure 5-8. A date, product, store → promotion
functional dependency then exists. In the admissions schema (Figure 5-27), we may
reasonably assume that a patient cannot be admitted twice on the same day. For this reason,
we have patientSSN, date → department, diagnosisGroup. In the attendance schema
of Figure 5-29, if a student attends each course at most once, we have student, course →
semester.
Unlike what occurs in standard relational databases, a functional dependency in a fact
schema does not generate serious anomalies, and you should not then necessarily avoid it.
From the aggregation viewpoint, the immediate result is that the association between
coordinates of some pairs of group-by sets is one-to-one rather than many-to-one. In the
case of sales, this happens for all group-by sets that include date, store, and product.
You may note the effects of these conditions when OLAP operators query that schema. If
events are being analyzed by date, store, and product, the application of roll-up and
drill-down operators along the promotion hierarchy will not modify the query result,
because one promotion value will be determined at most and you will not be able to
execute any significant aggregation.
To conclude, you should note that a fact schema where the $a_1, \ldots, a_m$ dimensions
determine the $a_{m+1}, \ldots, a_n$ dimensions is equivalent to a schema where
$a_1, \ldots, a_m$ are the only dimensions and $a_{m+1}, \ldots, a_n$ are included as
cross-dimensional attributes jointly determined by the dimensions. It is nevertheless more
useful to represent $a_{m+1}, \ldots, a_n$ as dimensions, if you want to give more
importance to their role in aggregation.

5.3.7 Aggregating along Incomplete or Recursive Hierarchies


Hierarchies in the fact schemata are at times irregular when their occurrence results in
missing values for one or more attributes (incomplete hierarchies) and/or when the actual
hierarchy length varies from occurrence to occurrence (recursive hierarchies), as mentioned
in sections 5.2.7 and 5.2.8. From the aggregation viewpoint, this obviously has quite
considerable effects.

FIGURE 5-30  Irregular breakdown into states and counties

We first discuss the case of incomplete hierarchies and make reference to the geographic
hierarchy example shown in Figure 5-30. If a fact expresses the number of inhabitants per
city as a result of a census, Table 5-25 shows some examples of specific primary events.
Because states are not defined for the United Kingdom and for Vatican City, primary
events can be aggregated by state after balancing hierarchies in three different ways,
depending on user preferences and decision-making process features. More specifically,
data about the cities that cannot be grouped to states can be
1. completely summed up in a single value labeled 'Other', as shown in Table 5-26
(balancing by exclusion);
2. shown at the coarsest aggregation level among those finer than state for which
a value is defined: county in the example of Table 5-27 (downward balancing);
3. shown at the finest aggregation level, among those coarser than county, for which
a value is defined: country in the example in Table 5-28 (upward balancing).

TABLE 5-25  Primary Events in the Census Schema

month                                                  Jan 09
country        state        county     city
USA            California   Orange     Santa Ana       353,184
                                       Anaheim         346,823
                            Monterey   Salinas         148,350
                                       Monterey        30,641
               Colorado     Baca       Vilas           110
                                       Springfield     1,562
UK             -            Norfolk    Norwich         121,600
                                       Sheringham      7,143
                            Essex      Epping          11,047
                                       Tilbury         12,091
Vatican City   -            -          -               800

TABLE 5-26  The {month, state} Group-by Set Secondary Events in the Population Schema
with Balancing by Exclusion (We Assumed for Simplicity that No Other Cities Are Present)

month         Jan 09
state
California    878,998
Colorado      1,672
Other         152,681

TABLE 5-27  The {month, state} Group-by Set Secondary Events in the Population Schema
with Downward Balancing

month           Jan 09
state
California      878,998
Colorado        1,672
Norfolk         128,743
Essex           23,138
Vatican City    800

We deal with logical design of different types of balancing in section 9.1.6. Here, we
would like to draw your attention to two of their features:

* Balancing by exclusion fails to meet the classic roll-up semantics, which requires
that each progressive step in aggregation causes more than one group to collapse
into a single group. When you roll up from state to country in the geographic
hierarchy, the group labeled with 'Other' is broken down into two groups: UK and
Vatican City, respectively.

* You cannot always apply all three balancing solutions that we proposed. For
example, Table 5-27 shows that we could not downward balance the Vatican City
row because there is no defined city for this country.
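
The three balancing options can be compared with a small sketch over the census events of Table 5-25 (values as reconstructed from the totals in Tables 5-26 to 5-28). The function by_state and the use of None for undefined attributes are assumptions made for the example.

```python
# (country, state, county, city, inhabitants); None marks an undefined value.
census = [
    ("USA", "California", "Orange", "Santa Ana", 353184),
    ("USA", "California", "Orange", "Anaheim", 346823),
    ("USA", "California", "Monterey", "Salinas", 148350),
    ("USA", "California", "Monterey", "Monterey", 30641),
    ("USA", "Colorado", "Baca", "Vilas", 110),
    ("USA", "Colorado", "Baca", "Springfield", 1562),
    ("UK", None, "Norfolk", "Norwich", 121600),
    ("UK", None, "Norfolk", "Sheringham", 7143),
    ("UK", None, "Essex", "Epping", 11047),
    ("UK", None, "Essex", "Tilbury", 12091),
    ("Vatican City", None, None, None, 800),
]

def by_state(strategy):
    """Aggregate by state, filling undefined states per the chosen strategy."""
    totals = {}
    for country, state, county, city, inhabitants in census:
        if state is not None:
            key = state
        elif strategy == "exclusion":
            key = "Other"
        elif strategy == "downward":
            # Coarsest defined level below state; fall back to country if none.
            key = county or city or country
        else:  # "upward": finest defined level above state
            key = country
        totals[key] = totals.get(key, 0) + inhabitants
    return totals

for strategy in ("exclusion", "downward", "upward"):
    print(strategy, by_state(strategy))
```

The fallback to country in the downward case mirrors the Vatican City row of Table 5-27, which cannot be balanced downward.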

Aggregation semantics change slightly with regard to recursive hierarchies. Unlike
other hierarchies, you are not supposed to use a fixed number of distinguishable
aggregation levels to characterize them. Look at the example of Figure 5-22. The example
fact represents the activities carried out around projects. Table 5-29 shows some primary

TABLE 5-28  The {month, state} Group-by Set Secondary Events in the Population Schema
with Upward Balancing

month           Jan 09
state
California      878,998
Colorado        1,672
UK              151,881
Vatican City    800

TABLE 5-29  Primary Events in the Activity Schema (Indentation Shows Reporting Relationships)

date              5/5/2008
employee
Paul              1
  Andrea          1
    Mark          2
    Laura         5
  John            4
George            2
  Laurence        3
  Tom             2
Anna              4

events related to a specific project and a specific activity type. As you can see, each
employee works a specific number of hours on a specific day. Additionally, reporting
relationships of varying length exist between employees, as the recursive hierarchy models.
When there is no explicit subdivision in levels of aggregation with specific semantics,
the only way to aggregate primary events to secondary events is recursively. Table 5-30
shows secondary events at the first level of recursion. The total number of hours worked
links to each employee and to possible other employees that report directly to each one of
them (the "children" in the hierarchy). For example, we allocate 6 hours to Paul, calculated
by totaling 1 hour he worked, 1 hour Andrea worked, and 4 hours John worked. At the
second level of recursion (Table 5-31 shows its secondary events), we calculate instead the
total hours per employee, their "children," and their "grandchildren." The hours worked and
assigned to Paul then become 13. The result at subsequent levels of recursion remains
unchanged because the hierarchies in the example have a maximum length of 3.
See section 9.1.7 for an explanation on logical design solutions for recursive hierarchies.
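
Recursive aggregation to a given level can be sketched with a depth-limited traversal. The hours come from Table 5-29; the reports mapping is an assumption inferred from the totals of Tables 5-30 and 5-31, and total_hours is a hypothetical helper.

```python
# Hours worked by each employee on 5/5/2008 (Table 5-29).
hours = {"Mark": 2, "Laura": 5, "John": 4, "Laurence": 3, "Tom": 2,
         "Anna": 4, "Andrea": 1, "George": 2, "Paul": 1}

# Direct reports of each manager (assumed reporting structure).
reports = {
    "Paul": ["Andrea", "John"],
    "Andrea": ["Mark", "Laura"],
    "George": ["Laurence", "Tom"],
}

def total_hours(employee, depth):
    """Hours of an employee plus those of subordinates up to `depth` levels down."""
    total = hours[employee]
    if depth > 0:
        total += sum(total_hours(e, depth - 1) for e in reports.get(employee, []))
    return total

print(total_hours("Paul", 1), total_hours("Paul", 2))  # 6 13
```

Raising depth beyond 2 leaves every total unchanged here, since no reporting chain is longer than three employees.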

TABLE 5-30  Secondary Events in the Activity Schema at the First Level of Recursion

date        5/5/2008
employee
Mark        2
Laura       5
John        4
Laurence    3
Tom         2
Anna        4
Andrea      8
George      7
Paul        6

TABLE 5-31  Secondary Events in the Activity Schema at the Second Level of Recursion

date        5/5/2008
employee
Mark        2
Laura       5
John        4
Laurence    3
Tom         2
Anna        4
Andrea      8
George      7
Paul        13

5.4 Time
Time is commonly understood as a key factor in data warehousing systems, since the
decision process often relies on computing historical trends and on comparing snapshots
of the enterprise taken at different moments. In the following discussion, we collect some
notes regarding temporal dimensions in fact schemata and their semantics.

5.4.1 Transactional vs. Snapshot Schemata


Kimball (1996) introduced two basic paradigms for representing inventory-like information
in a data warehouse: the transactional model, where each increase and decrease in the
inventory level is recorded as an event, and the snapshot model, where the current inventory
level is periodically recorded. A similar characterization is proposed by Bliujute et al. (1998),
who distinguish between event-oriented data such as sales, inventory transfers, and financial
transactions, and state-oriented data such as unit prices, account balances, and inventory
levels. In this section, we generalize these paradigms to define a classification of facts based
on the conceptual role given to events (Golfarelli and Rizzi, 2007).
We start by observing that, in general terms, the facts to be monitored for decision
support fall into two broad categories according to the way they are collected and measured
in the application domain. Flow facts are monitored by collecting their occurrences during a
time interval and are cumulatively measured at the end of that period; examples of flow
facts are orders, invoices, sales, shipments, enrollments, phone calls, and so on. Stock facts
are monitored by periodically sampling and measuring their state; examples of stock facts
are those measuring the price of a share or the water level of a river.
Depending on the nature of the monitored fact and on the expected workload, two
types of fact schemata can be built whose events have different semantics:
* A transactional fact schema is one for which each event may either record a single
transaction or summarize a set of transactions that occur during the same time
interval. In this case, most measures are flow measures (see section 5.2.9), that is,
they are cumulatively evaluated at the end of each time interval and are additive
along all dimensions.

FIGURE 5-31  Transactional (top) and snapshot (bottom) fact schemata

* A snapshot fact schema is one whose events correspond to periodical snapshots of the
fact. Its measures are mostly stock measures (see section 5.2.9), that is, they refer to
an instant in time and are evaluated at that instant, so they are non-additive along
temporal dimensions (that is, their values cannot be summed when aggregating
along time, while, for instance, they can be averaged).
For a flow fact, a transactional fact schema is typically the most natural choice. For the
sales fact, for instance, a subset of events of the transactional fact schema in Figure 5-31
might be those reported in Table 5-32, each representing the total quantity of a product sold
in a store during a week.
On the other hand, for some flow facts, both transactional and snapshot schemata can
be reasonably used. This is true, for instance, for the stock inventory fact, for which a
transactional fact schema and a snapshot schema are depicted in Figure 5-32. The two
schemata are clearly identical except for the meaning of their measures. A sample set of

TABLE 5-32  Events for the Transactional SALE Fact Schema

store        week          product    quantity
SmartMart    10/1/2008     CD-R       100
SmartMart    10/1/2008     DVD+R      20
SmartMart    10/1/2008     CD-RW      80
SmartMart    10/8/2008     DVD+R      25

FIGURE 5-32  Transactional (top) and snapshot (bottom) fact schemata for the stock inventory fact

inventory events for the transactional solution is shown in Table 5-33; each event records
the net flow of items of one product over a week within one warehouse. Table 5-34 shows
how the same events could be represented within a snapshot solution; here each event
records, on a specific date, the total number of items of a product available in a warehouse.
The choice of one solution or another for a flow fact depends first of all on the expected
workload, and in particular on the relative weight of queries asking for flow and stock
information, respectively. For instance, the current inventory level for LCD TVs in Austin,
Texas, can be obtained in the transactional solution by summing up all pertinent events,
which may be costly, while in the snapshot solution it is sufficient to read a single event (the
most recent one). On the other hand, consider a query asking for the net flow of LCD TVs in

TABLE 5-33  Events for the Transactional INVENTORY Fact Schema

week          product    warehouse    flow
10/1/2008     LCD TV     Austin       +20
10/8/2008     LCD TV     Austin       -5
10/15/2008    LCD TV     Austin       +3

TABLE 5-34  Events for the Snapshot INVENTORY Fact Schema

week          product    warehouse    stock
10/1/2008     LCD TV     Austin       20
10/8/2008     LCD TV     Austin       15
10/15/2008    LCD TV     Austin       18

TABLE 5-35  Events for the Snapshot FLOW Fact Schema

hour                        waterGauge            level
9:00 a.m., Jan. 7, 2008     Westminster Bridge    8.0
10:00 a.m., Jan. 7, 2008    Westminster Bridge    8.4
11:00 a.m., Jan. 7, 2008    Westminster Bridge    7.9

Austin during a specific week. While in the transactional solution this query is answered by
reading one event, in the snapshot solution the result must be computed as the difference
between the values of stock registered in two consecutive weeks.
Differently from flow facts, stock facts naturally conform to the snapshot solution; for
instance, a sample set of events for the river flow fact in Figure 5-31 is reported in Table 5-35:
like in the transactional solution applied to a flow fact, each event records the exact value of
the measurement. In principle, adopting a transactional solution for a stock fact is still
possible, although not recommended. In fact, it would require disaggregating the (stock)
measurements made in the application domain into a net flow to be registered, which
implies that, before each new event can be registered, the current stock level must be
computed by aggregating all previous events.
In conclusion, a transactional fact schema is the best solution if, in the application domain,
events are measured as in- and out-flows. It should not be adopted when events are measured in
the form of stock levels. A snapshot fact schema is the best solution if, in the application domain,
events are measured as stock levels, but it can also be adopted when events are measured as
flows. In general, the best choice also depends on the core workload expected for the fact.
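
The relationship between the two solutions can be sketched with the LCD TV events of Tables 5-33 and 5-34. The sketch assumes an empty warehouse before the first week; the variable names are hypothetical.

```python
from itertools import accumulate

weeks = ["10/1/2008", "10/8/2008", "10/15/2008"]
flows = [20, -5, 3]            # transactional events: net flow per week

# Snapshot events: a running total of the flows (initial stock assumed 0).
stocks = list(accumulate(flows))

# Going back: the flow of a week is the difference of consecutive snapshots.
recovered_flows = [stocks[0]] + [b - a for a, b in zip(stocks, stocks[1:])]

# Current stock level: many events in one solution, one event in the other.
current_from_flows = sum(flows)
current_from_snapshots = stocks[-1]

print(dict(zip(weeks, stocks)))
```

A stock query touches every flow event but only the latest snapshot, while a flow query touches one flow event but two snapshots, which is the workload trade-off described above.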

5.4.2 Late Updates


The meaning commonly given to the time dimension in fact schemata is the so-called valid
time (Tansel et al., 1993), which stands for the moment when an event occurs in the business
world. Transaction time is the time at which a database stores an event; it is not typically
given importance in data marts because it is not considered as important for decision
support. However, this is not always true, as you will see in the following sections.
One of the underlying assumptions in data marts is that, once an event has been
registered, it is never modified, so that the only possible writing operation consists in
appending new events as they occur. While this is acceptable for a wide variety of domains,
some applications call for a different behavior. In particular, the values of one or more
measures for a specific event may change over a period of time, longer than the refresh
interval, to be finally consolidated only after the event has been registered for the first time
in the data mart. This typically happens when the early measurements made for events may
be subject to errors or when events inherently evolve over time.
As an example for this discussion, consider an educational data mart in which a tact
models the enrollment of students to university courses. Figure 5-33 shows a possible fact
schema. Each primary event records tine number of students in a specific city that registered
for a specific course tor a specific academic year on a specific date An enrollment is
completed and sent to the education secretary and then recorded as an event in the data
,

mart only when the enrollment lee is paid. If you think of the delays tied to bank payment
management and transmissions , ii is not at all uncommon that the secretary receives
payment notification more than a month after the actual enrollment date.
In this context, if you still wish to draw decision-makers' attention to the actual scenario
of the current phenomenon at the right time, you should update past events with each
population cycle to reflect new data entered. We will call this type of operation late update,
as it may imply registration of more than one measure for the same event. See section 10.4
for the implications of the population procedure.

FIGURE 5-33 Fact schema for student enrollments

The usage of valid time only is no longer enough to attain full query expressivity and
flexibility when you have to deal with late updates:

* It is the decision-maker's responsibility to justify his or her decisions. This requires
the ability to trace precise information available at the time when decisions are
made. If old events are replaced by their new versions, past decisions can no longer
be justified.
* In some scenarios, accessing only information from current versions is not enough
to ensure the accuracy of analyses. A typical case comes from those queries that
compare the advancement status of a phenomenon in progress and past statuses of
that phenomenon. Because the data recorded for the phenomenon in progress are
not yet consolidated, their comparison with past, already consolidated data is
conceptually inaccurate.

Now look at the enrollments example. Figure 5-34 shows a possible flow of data the
secretary received over an interval of six days for a specific combination of cities, academic
years, and courses. We have distinguished two temporal coordinates: the registration date,
the date when the secretary receives notification of payment, and the enrollment date, the
date to which a payment refers. Each matrix cell records the number of enrollments received
on a specific date related to the date on which they were actually completed.

FIGURE 5-34 Number of enrollments by city, course, and academic year (axes: enrollmentDate, registrationDate)

This example helps us to understand that two different temporal coordinates can be
distinguished for all of the facts potentially affected by late updates. The first temporal
coordinate refers to the time when each event actually took place and coincides with valid
time, as we previously mentioned. The second temporal coordinate refers instead to the time
when a measurement for an event was received and recorded to a data mart; this coordinate
coincides with the transaction time. As in the case of operational databases, you do not
necessarily have to maintain both coordinates, because this mainly depends on the
workload type. In particular, we can distinguish three types of queries:

* Up-to-date queries Only require currently valid measurements for the events. For
example, look at the ENROLLMENT fact. We can ask, In which months do students
tend by preference to enroll in a certain course? To answer this query accurately, you
must use the most updated data available on the number of enrollments per date of
enrollment. This means the data shown along the enrollmentDate axis at the
bottom of Figure 5-34. You do not have to record any transaction time to answer
up-to-date queries because they just use the valid time.
* Rollback queries Require, for each event, the measurement that was valid at a
time t. For example, you may find it interesting to examine the current trend of total
number of enrollments, per faculty, compared with that of the previous year. If you
answer this query on the basis of enrollment date when you compare the current
data (still partial) with past data (already consolidated), you could erroneously infer
that enrollments are declining this year. Instead, if the average delay of payment
receipt shows no change from previous years, you can base an accurate comparison
on the population date (the registrationDate axis on the right in Figure 5-34). It
is clear that the valid time alone is no longer enough to support this type of query
and it becomes essential to model transaction time as well.
* Historical queries Require more than one measurement made at different times
for each event. An example of a historical query is a query that defines the daily
distribution of the number of enrollments received on a certain enrollment date. As
in the previous case, you should also explicitly represent the transaction time here.
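
The three query types can be illustrated with a minimal Python sketch. It assumes that each notification is stored with both temporal coordinates; the `LOG` rows, the function names, and all of the numbers are hypothetical and are not taken from Figure 5-34.

```python
from datetime import date

# Hypothetical notifications (all values invented for illustration):
# (enrollment date = valid time, registration date = transaction time, enrollments received)
LOG = [
    (date(2008, 10, 21), date(2008, 10, 27), 5),
    (date(2008, 10, 21), date(2008, 11, 1), 8),
    (date(2008, 10, 22), date(2008, 10, 27), 2),
    (date(2008, 10, 22), date(2008, 11, 5), 4),
]

def up_to_date(log):
    """Up-to-date query: current total per enrollment date (valid time only)."""
    totals = {}
    for valid, _registered, received in log:
        totals[valid] = totals.get(valid, 0) + received
    return totals

def rollback(log, t):
    """Rollback query: totals per enrollment date as they were known at time t."""
    return up_to_date([row for row in log if row[1] <= t])

def historical(log, valid):
    """Historical query: enrollments for one enrollment date, per registration date."""
    per_day = {}
    for v, registered, received in log:
        if v == valid:
            per_day[registered] = per_day.get(registered, 0) + received
    return per_day
```

Only `up_to_date` can be answered without the registration date; the other two functions read the second element of each row, which is why they need transaction time to be modeled.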

Depending on whether late updates occur, and on the composition of the expected
workload, two main types of conceptual design solutions can be envisaged for a fact:
monotemporal, where only valid time is modeled as a dimension, and bitemporal, where both
valid and transaction time are modeled as dimensions.
Monotemporal solutions are commonly implemented for facts that either are not subject
to late updates or are only required to support up-to-date queries. They are the simplest
solutions: updates are done by physically overwriting the measurements taken at previous
times for the same event, so that one single measurement (the most recent one) is kept in the
database for each event. The transaction times of measurements are not represented and no
trace is left of past measurements, so only up-to-date queries are supported and, in case of
late updates, accountability is not guaranteed. For instance, the schema of the monotemporal
solution for the enrollment fact is exactly the one already shown in Figure 5-33, where the
only temporal dimension is enrollmentDate (valid time).

FIGURE 5-35 Bitemporal fact schema for enrollments

Conversely, bitemporal solutions are the most comprehensive solutions that can be
adopted when late updates might occur, and they allow all three types of queries to be
accurately answered. For each population cycle, new update measurements for previous
events may be added, and their transaction time is traced; no overwriting of previous
measurements is carried out, so nothing is lost. A bitemporal fact schema for enrollments
is shown in Figure 5-35; the shared temporal hierarchy enables the modeling of valid
time (enrollmentDate) and transaction time (in the form of a currency interval
[currencyStart, currencyEnd]). A sample set of events is depicted in Table 5-36.
The interested reader is referred to Golfarelli and Rizzi (2007) for a deeper analysis of
the issues arising in the presence of late updates and a more detailed description of the
possible conceptual design solutions for transactional and snapshot fact schemata.

5.4.3 Dynamic Hierarchies
Up to this point, we have hypothesized that the only dynamic component described in a
fact schema may be the fact itself and its events. We have attributed an exclusively static
nature to hierarchies. This is evidently not completely true. Sales managers slowly rotate
among various departments. Each month, new products are added to those already for sale,
product categories change, and their attribution to products changes. Sales districts may be
modified or a store may be moved from one district to another. We should clarify that only
the dynamic properties at the extensional level, that is, the dynamic properties of hierarchy
instances, will be considered in this section. We will not be concerned about possible
modifications that alter the structure of hierarchies, such as the addition of a new attribute
to a hierarchy or the addition of a new dimension, which are not considered routine
phenomena, but rather extraordinary events connected to data mart maintenance.

TABLE 5-36 Events for the ENROLLMENT Schema with Late Updates in Italics (columns: enrollmentDate, currencyStart, currencyEnd, degreeCourse, academicYear, numberOfStudents)

On the conceptual level, the representation of dynamic properties in hierarchies is
strictly bound to their impact on queries. When you use a dynamic hierarchy, you can
actually distinguish four different temporal scenarios in the event analysis, as proposed first
by SAP Business Warehouse. To discuss those scenarios, we will make reference to the
following example. Assume that on 1/1/08, the sales manager for the EverMore store
changed from Smith to Johnson, and that a new store, EverMore2, opened with Smith as
manager. Consider the possibilities:

* Today-for-yesterday All the events are analyzed according to the hierarchies'
current configuration. In the example, we attribute EverMore store sales before 2008
as well as the more recent ones to Johnson, while we attribute EverMore2 store sales
to Smith. This approach proves interesting if Johnson also becomes responsible for
past sales.
* Yesterday-for-today All of the events are analyzed according to the configuration
the hierarchies had at a previous time. In the example, we attribute all EverMore
store sales to Smith and we do not consider EverMore2 store sales.
* Today-or-yesterday Each event is analyzed according to the configuration the
hierarchies had at the time when the event occurred. For this reason, we attribute
the EverMore store sales prior to 2008 and all of the EverMore2 store sales to Smith,
and we attribute EverMore store sales from 2008 onward to Johnson.
* Today-and-yesterday Only the events referring to the hierarchy instances that
remain unchanged are considered. Sales in neither of the two stores are then
considered.
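
The four scenarios can be sketched in Python as a single attribution function. This is only an illustration under stated assumptions: the history of the store-to-manager arc is kept as a list of dated versions, the opening date of the EverMore store (2000) is invented, and the names `ARC_HISTORY`, `manager_at`, and `attribute_sale` are hypothetical.

```python
from datetime import date

# Assumed history of the store -> salesManager arc: (store, manager, valid_from).
# The 2000 opening date of EverMore is invented for the sake of the example.
ARC_HISTORY = [
    ("EverMore", "Smith", date(2000, 1, 1)),
    ("EverMore", "Johnson", date(2008, 1, 1)),
    ("EverMore2", "Smith", date(2008, 1, 1)),
]

def manager_at(store, when):
    """Manager of `store` on date `when`, or None if the store did not exist yet."""
    best = None
    for s, manager, valid_from in ARC_HISTORY:
        if s == store and valid_from <= when and (best is None or valid_from > best[1]):
            best = (manager, valid_from)
    return best[0] if best else None

def attribute_sale(store, sale_date, scenario, today, yesterday):
    """Manager a sale is attributed to, or None when the sale is not considered."""
    if scenario == "today-for-yesterday":
        return manager_at(store, today)
    if scenario == "yesterday-for-today":
        return manager_at(store, yesterday)
    if scenario == "today-or-yesterday":
        return manager_at(store, sale_date)
    if scenario == "today-and-yesterday":
        now, then = manager_at(store, today), manager_at(store, yesterday)
        return now if now is not None and now == then else None
    raise ValueError(f"unknown scenario: {scenario}")
```

With `today` set to a date in 2008 and `yesterday` to a date in 2007, the function reproduces the attributions described in the four bullets; how the versions are actually stored is a logical design matter.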


In the example, the dynamic properties involve an arc in a hierarchy (the arc expressing
the association between store and salesManager) because instances of the
corresponding association vary. However, dynamic attributes are very frequent as well. For
example, the name of a product category or store may change over time. Let's assume then
that the EverMore store changes its name to EvenMore on 1/1/09. We may then describe
the four scenarios as follows:

* Today-for-yesterday We attribute all of the store's sales (even those prior to 2009)
to EvenMore.
* Yesterday-for-today We attribute all store sales (even those from 2009 forward)
to EverMore.
* Today-or-yesterday We attribute the sales prior to 2009 to EverMore and those
from 2009 forward to EvenMore.
* Today-and-yesterday The store's sales are not considered.
From the viewpoint of conceptual modeling, it is important to note that you should not
consider as dynamic the creation of a new value in the attribute domain (for example,
a new product for sale or a newly opened store) or the consequent creation of a new
instance for all of the associations regarding it (the new product must be associated with a
type and brand, the new store to a city and to a sales district). All logical design solutions
act uniformly as far as the addition of new values is concerned (section 8.4).

TABLE 5-37 Table of Dynamic Properties of the Sales Schema (columns: Attributes/Arcs, Today-for-Yesterday, Yesterday-for-Today, Today-or-Yesterday, Today-and-Yesterday)


It is useful to mark the analysis scenarios of interest to the user for each arc and
attribute. We will assume by default that the only scenario of interest is today-or-
yesterday. If some attributes or arcs require different scenarios, you can prepare a table to
list them. Table 5-37 shows how to perform this task with reference to the sales fact
schema of Figure 5-8. Section 8.4 discusses how the logical design solution to be adopted
depends on the specific combination of scenarios required by users.

5.5 Overlapping Fact Schemata


Separate fact schemata represent different facts in the DFM. However, part of the queries
may require a comparison of the measures from two or more facts correlating to each other.
In OLAP terminology, these are called drill-across queries. For example, next to the sales fact
schema can be a fact schema that models the shipments sent with the quantityShipped
and cost measures, and the product, date, warehouse, and customer dimensions. In
that case, users may find it interesting to compare quantities sold with those shipped for the
same dates and products.
In this section, we will discuss the ways you can combine two or more fact schemata
into a new schema to use for drill-across queries. For simplicity of notation, we will denote
two dimensional attributes that belong to different fact schemata but share the same
semantics and have non-disjoint domains with the same name.

Comparable Fact Schemata


Two fact schemata are comparable when they have at least one group-by set (not the
empty one) in common.

Obviously, two schemata are comparable when they share at least one dimensional
attribute. Some examples of comparable schemata are SALE (Figure 5-8), SHIPMENT
(Figure 5-10), and INVENTORY (Figure 5-23). They have the majority of dimensional
attributes in their hierarchies based on date and product in common. Additionally,
SHIPMENT and INVENTORY have the warehouse attribute in common. In particular, the
group-by sets common to SALE and SHIPMENT are {product, date} plus all of the others
lower than them in the roll-up order.
Naturally, the main problem with determining comparability in real-world cases lies in
identifying common attributes. Even though universal methods to resolve this do not exist,
we list some criteria for you to follow, with differing levels of effectiveness according to
the scenario:

* Having non-disjoint domains is a necessary condition so that two attributes can
correspond. Unfortunately, this criterion is difficult to follow if you extract attribute
domains from underlying operational schemata, because the majority of attributes
will be defined on the basis of standard primitive data types.
* Determining pairs of attributes with the same name in both source fact schemata
can be useful to suggest correspondence if the designer selected names consistently
in the conceptual design phase.
* According to design rules commonly adopted, hierarchies built on the same
dimension in distinct fact schemata within the same data mart should be identical
(the so-called conformed dimensions). In cases like this, you can immediately define
a correspondence.
After defining common attributes, you can overlap two comparable schemata to create
a resulting schema.

Reducing Hierarchies
Given a hierarchy h and a subset R of its dimensional attributes, the reduction of h to R is
a set of hierarchies that include all and only the attributes in R linked to each other via
functional dependencies transitively derived from those in h.

Figure 5-36 shows an example of reduction. On the left is a hierarchy with the a
dimension. On the right are two hierarchies obtained after reducing it to the c, e, g, i, and l
attributes. For example, note that you can transitively derive the $c \to l$ functional
dependency in the result from $c \to h$ and $h \to l$ in the initial hierarchy.

FIGURE 5-36 Reducing a hierarchy to a subset of attributes
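
As a rough illustration of the reduction just defined, the following Python sketch represents a hierarchy as a set of arcs (x, y), each meaning that x functionally determines y. The arcs in `ARCS` are hypothetical: only $c \to h$ and $h \to l$ come from the text, and the rest are invented because the figure is not reproduced here. The function name `reduce_hierarchy` is also an assumption.

```python
def reduce_hierarchy(arcs, keep):
    """Reduce a hierarchy, given as arcs (x, y) meaning x -> y, to the attributes in `keep`.

    Returns the arcs of the reduced hierarchies and the set of their roots.
    """
    # Step 1: transitive closure of the functional dependencies.
    closure = set(arcs)
    changed = True
    while changed:
        changed = False
        for (x, y) in list(closure):
            for (w, z) in list(closure):
                if y == w and (x, z) not in closure:
                    closure.add((x, z))
                    changed = True
    # Step 2: keep only the dependencies among the retained attributes.
    derived = {(x, y) for (x, y) in closure if x in keep and y in keep}
    # Step 3: drop dependencies that are implied by two other retained ones.
    reduced = {
        (x, z) for (x, z) in derived
        if not any((x, y) in derived and (y, z) in derived for y in keep)
    }
    # Each retained attribute with no incoming arc is the root of one resulting hierarchy.
    roots = {a for a in keep if all(child != a for (_, child) in reduced)}
    return reduced, roots

# Hypothetical hierarchy rooted in a; only c -> h and h -> l are taken from the text.
ARCS = {("a", "b"), ("a", "c"), ("b", "e"), ("e", "i"), ("c", "g"), ("c", "h"), ("h", "l")}
reduced_arcs, roots = reduce_hierarchy(ARCS, {"c", "e", "g", "i", "l"})
```

With these assumed arcs, the result contains the derived dependency from c to l and has two roots, c and e, so the reduction yields two hierarchies.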

Overlapping Compatible Fact Schemata


Two comparable fact schemata are called compatible when reductions of their hierarchies
to common attributes are equal. Overlapping two compatible fact schemata results in an
overlapped fact schema in which (1) hierarchies are the common reduction of the
hierarchies in source fact schemata; (2) measures are the union of the measures in
source fact schemata; (3) each attribute domain is an intersection of the corresponding
attribute domains in source fact schemata.

Compatibility between two fact schemata implies that the functional dependencies they
contain are all mutually consistent. It is easy to verify that the SALE, INVENTORY, and
SHIPMENT schemata are compatible. For example, two fact schemata are incompatible if
one of them contains the $a_1 \to a_2$ dependency and the other has the $a_2 \to a_1$ dependency. As a
matter of fact, both functional dependencies could theoretically coexist. This implies that a
one-to-one association exists between $a_1$ and $a_2$. In our context, though, the fact that the same
attributes are in separate hierarchies in inverse order probably means that a design error
occurred in one or both schemata (Golfarelli et al., 1998).
Figure 5-37 shows how to overlap INVENTORY and SHIPMENT. You can use the
resulting overlapped schema to compare quantities shipped and those warehoused for each
product. See section 7.1.2 for a detailed discussion on drill-across queries on overlapped
schemata.
We repeat that hierarchies are widely shared among fact schemata in
the majority of cases. In many cases, they will even be identical (conformed
hierarchies, for example the product hierarchy). In other cases, a hierarchy dimension
will coincide with a dimensional attribute from another hierarchy (for example, a hierarchy
rooted in the customer dimension in the complaints fact schema will probably coincide
with the subhierarchy rooted in customer from the order hierarchy in the shipments
schema). Figure 5-38 shows an example of conformed hierarchies.
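
The compatibility test and the overlap operation can be sketched as follows. This is a simplified illustration, not the procedure of the book: each schema is a hypothetical dictionary with `attributes`, `arcs` (x, y meaning x determines y), and `measures`; the attribute lists of INVENTORY and SHIPMENT are abridged; attribute domains, and therefore their intersection, are left out. Comparing transitive closures restricted to the common attributes is used here as a stand-in for comparing reductions, on the assumption that hierarchies are acyclic.

```python
def closure(arcs):
    """Transitive closure of a set of arcs (x, y), each meaning x -> y."""
    result = set(arcs)
    while True:
        extra = {(x, z) for (x, y) in result for (w, z) in result if y == w} - result
        if not extra:
            return result
        result |= extra

def restricted(schema, common):
    """Dependencies of `schema` that hold among the common attributes."""
    return {(x, y) for (x, y) in closure(schema["arcs"]) if x in common and y in common}

def overlap(s1, s2):
    """Overlapped schema of two compatible schemata, or None if they cannot be overlapped."""
    common = s1["attributes"] & s2["attributes"]
    if not common:
        return None  # no common attribute: the schemata are not comparable
    deps = restricted(s1, common)
    if deps != restricted(s2, common):
        return None  # comparable, but the dependencies disagree: not compatible
    return {"attributes": common, "dependencies": deps,
            "measures": s1["measures"] | s2["measures"]}

# Abridged, assumed versions of the two schemata.
SHARED_ARCS = {("product", "type"), ("type", "category"), ("date", "month"), ("month", "year")}
INVENTORY = {"attributes": {"product", "type", "category", "date", "month", "year", "warehouse"},
             "arcs": SHARED_ARCS, "measures": {"inventoryLevel", "incomingQuantity"}}
SHIPMENT = {"attributes": {"product", "type", "category", "date", "month", "year",
                           "warehouse", "customer"},
            "arcs": SHARED_ARCS, "measures": {"shippedQuantity", "shipmentCost"}}
overlapped = overlap(INVENTORY, SHIPMENT)
```

Two schemata that order the same pair of attributes inversely, as in the $a_1 \to a_2$ versus $a_2 \to a_1$ case above, make `overlap` return `None`.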

FIGURE 5-37 Overlapping the INVENTORY and SHIPMENT schemata

FIGURE 5-38 Conformed hierarchies

5.6 Formalizing the Dimensional Fact Model


In this section we discuss how to formalize the extensional and intensional properties of
the DFM. Section 5.6.1 introduces the metamodel of a fact schema for readers familiar with
UML. Sections 5.6.2 and 5.6.3 show how to formalize the basic concepts of the DFM. We
have excluded advanced constructs of the DFM on purpose to simplify this explanation.

5.6.1 Metamodel
Figure 5-39 shows the UML metamodel of the DFM. We suggest you read one of the many
texts on this subject (for example, Arlow and Neustadt, 2005) for an explanation of the
syntax used in this figure.


FIGURE 5-39 UML metamodel of the DFM

5.6.2 Intensional Properties

Hierarchy
A hierarchy h is a pair $(A, \le_h)$ where

* $A$ is a finite set of dimensional attributes;
* $\le_h$ is a partial order on the attributes of $A$ where one and only one
maximum element always exists and is called a dimension.
Semantics of the partial order are that $a_i \le_h a_j$ when $a_j \to a_i$, that is, when $a_j$ functionally determines $a_i$.
Looking at Figure 5-6, the product hierarchy can be described as follows:
$A = \{\text{product}, \text{type}, \text{category}, \text{department}, \text{marketingGroup}, \text{brand}, \text{brandCity}\}$
Figure 5-40 shows the Hasse diagram that illustrates the partial order for the product
hierarchy. The product attribute is the dimension.

FIGURE 5-40 Hasse diagram showing the partial order of attributes in the product hierarchy

Fact Schema
We define a fact schema F as a pair $(H, M)$ in which
* $H$ is a finite set of hierarchies
* $M$ is a finite (possibly empty) set of measures

If n is the number of hierarchies in $H$, we say that F is n-dimensional. Additionally, we
will denote the set of all the attributes in the hierarchies and the set of the measures in F
with $Attr(F)$ and $Meas(F)$, respectively.
The sales fact schema in Figure 5-6 includes three hierarchies: one with the product
dimension (shown in Figure 5-40), one with the store dimension, and one with the date
dimension (both shown in Figure 5-41). The measures are quantity, receipts,
unitPrice, and numberOfCustomers.

Group-by Set
Let F be a fact schema. We define a group-by set of F as any subset of attributes, $G \subseteq Attr(F)$,
such that for each pair of distinct attributes $a_i$ and $a_j$ contained in $G$ from the same hierarchy h,
neither $a_i \le_h a_j$ nor $a_j \le_h a_i$ holds.

The group-by set including all of the dimensions of F is called the primary
group-by set of F and marked with $Gbs(F)$. All of the others are called secondary group-by sets
and identify possible ways to aggregate events, as we mentioned in section 5.5.

Roll-up Order
Given fact schema F, we define a partial roll-up order on the set of all possible group-by
sets of F as follows: $G_i \le G_j$ if and only if $G_j \to G_i$.

According to the roll-up order definition, it is $G_i \le G_j$ when, for each attribute $a$ from
hierarchy h contained in $G_j$, either $G_i$ includes an attribute $a'$ such that $a' \le_h a$ or no attribute
from h is part of $G_i$.
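
The two definitions above can be checked mechanically. The Python sketch below is one possible reading, with several assumptions: the arcs of the product and date hierarchies in `DETERMINES` are simplified guesses (the date hierarchy is reduced to date, month, year), the function names are hypothetical, and the roll-up order is tested directly as a functional dependency between group-by sets (every attribute of the coarser set must be determined by some attribute of the finer one).

```python
# Assumed arcs: attribute -> attributes it directly determines.
DETERMINES = {
    "product": ["type", "brand"],
    "type": ["category", "marketingGroup"],
    "category": ["department"],
    "brand": ["brandCity"],
    "date": ["month"],
    "month": ["year"],
}

def determined_by(a):
    """All attributes b such that b <=_h a, including a itself."""
    seen, stack = {a}, [a]
    while stack:
        for b in DETERMINES.get(stack.pop(), []):
            if b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def is_group_by_set(g):
    """True when no two distinct attributes of g are comparable in their hierarchy."""
    return not any(a != b and b in determined_by(a) for a in g for b in g)

def rolls_up_to(finer, coarser):
    """True when coarser <= finer in the roll-up order, i.e. finer -> coarser."""
    reachable = set().union(*(determined_by(a) for a in finer))
    return set(coarser) <= reachable
```

For example, {type, brand, month} is a group-by set while {product, type} is not, and the primary group-by set {product, date} rolls up to {category, year} and to the empty group-by set.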

FIGURE 5-41 Store and date hierarchies, assuming that the allocation of stores to districts does not depend on the cities to which stores belong

5.6.3 Extensional Properties

Hierarchy Instance
Given a hierarchy $h = (A, \le_h)$, an instance of h is a pair $(D, L)$ where
* $D$ is the set of domains $Dom(a_i)$ of the attributes in $A$; each domain is the
(countable) set of values that the attribute can be given;
* $L$ is a family of roll-up functions that map each value of $Dom(a_i)$ into a value of
$Dom(a_j)$ for each pair of $a_i$ and $a_j$ attributes such that $a_j \le_h a_i$:

$\mathit{up}_{a_i}^{a_j} : Dom(a_i) \to Dom(a_j)$

Consider this example:

$Dom(\text{city}) = \{\text{'Abbyville'}, \text{'Auburn'}, \text{'Pushmataha'}, \text{'Jackson'}, \text{'Castlewood'}, \text{'Deckers'}\}$

$Dom(\text{state}) = \{\text{'Kansas'}, \text{'Oklahoma'}, \text{'Colorado'}\}$

$\mathit{up}_{\text{city}}^{\text{state}}(\text{'Abbyville'}) = \text{'Kansas'}$

You should actually define a roll-up function only for the immediately adjacent pairs of
attributes in the partial order. You may transitively compose the function for the other pairs.
Given the $a_i$, $a_j$, and $a_k$ attributes such that $a_k \le_h a_j \le_h a_i$, it would actually be

$\mathit{up}_{a_i}^{a_k}(\cdot) = \mathit{up}_{a_j}^{a_k}(\mathit{up}_{a_i}^{a_j}(\cdot))$

If two or more separate paths exist in the $\le_h$ order that join two attributes $a_i$ and $a_j$, we
assume that the $\mathit{up}_{a_i}^{a_j}$ roll-up functions obtained from composing roll-up functions along the
different paths are identical to enable an accurate aggregation. In the stores hierarchy in
Figure 5-41, country and store are connected by two paths that include salesDistrict
on the one side and city and state on the other side. For this reason, it should result
in this:

$\mathit{up}_{\text{salesDistrict}}^{\text{country}}(\mathit{up}_{\text{store}}^{\text{salesDistrict}}(\cdot)) = \mathit{up}_{\text{state}}^{\text{country}}(\mathit{up}_{\text{city}}^{\text{state}}(\mathit{up}_{\text{store}}^{\text{city}}(\cdot)))$
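
A short Python sketch can make the composition of roll-up functions and the path-consistency assumption concrete. The roll-up functions are stored as dictionaries for adjacent attribute pairs only; the instance values in `UP` (an EverMore store located in Chicago, Illinois, and belonging to the NorthIllinois district) are assumptions made for illustration, as are the names `roll_up` and `paths_agree`.

```python
# Assumed roll-up functions for adjacent attribute pairs: (source, target) -> mapping.
UP = {
    ("store", "city"): {"EverMore": "Chicago"},
    ("city", "state"): {"Chicago": "Illinois"},
    ("state", "country"): {"Illinois": "USA"},
    ("store", "salesDistrict"): {"EverMore": "NorthIllinois"},
    ("salesDistrict", "country"): {"NorthIllinois": "USA"},
}

def roll_up(value, path):
    """Compose the adjacent roll-up functions along `path`, a list of attribute names."""
    for source, target in zip(path, path[1:]):
        value = UP[(source, target)][value]
    return value

def paths_agree(value, path_a, path_b):
    """True when both paths map `value` to the same value of the final attribute."""
    return roll_up(value, path_a) == roll_up(value, path_b)

via_city = ["store", "city", "state", "country"]
via_district = ["store", "salesDistrict", "country"]
```

Calling `paths_agree("EverMore", via_city, via_district)` checks, for one store, the equality stated above for the two paths between store and country; a full check would repeat it for every value of the store domain.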
Coordinate
Given a fact schema $F = (H, M)$, an instance of each hierarchy in $H$, and a group-by set
$G = \{a_1, \ldots, a_p\}$, a coordinate of $G$ is a function that maps each $a_i$ attribute of $G$ into a value of
$Dom(a_i)$. We will mark with $Dom(G)$ the set of possible coordinates of $G$:
$Dom(G) = Dom(a_1) \times \ldots \times Dom(a_p)$

Now you can define an instance of a fact as a function that associates the coordinates of
the primary group-by set with primary events.

Cube and Primary Event


Given a fact schema $F = (H, M)$ with $M = \{m_1, \ldots, m_q\}$, and given an instance of each
hierarchy in $H$, a cube instance of F is a partial function that maps from the coordinates
of $Gbs(F)$ to the measure domains of $M$:
$c : Dom(Gbs(F)) \to Dom(m_1) \times \ldots \times Dom(m_q)$

Each specific n-ple of measure values determined is called a primary event.
A primary event is a basic cell of information defined by a coordinate of the primary
group-by set. A value for each measure in the fact schema is associated with each primary
event. Let $a_1, \ldots, a_n$ be the dimensions of fact schema F and c be a cube instance of F. We will
mark the primary event corresponding to the $\alpha \in Dom(a_1) \times \ldots \times Dom(a_n)$
coordinate with $c(\alpha)$.
In the sales example, a cube may be then defined as follows:
c(product:'Shiny', store:'EverMore', date:'4/5/07') = (quantity:10, receipts:25)
c(product:'Shiny', store:'EverMore', date:'4/7/07') = (quantity:20, receipts:50)
c(product:'Scent', store:'EverMore', date:'4/5/07') = (quantity:5, receipts:10)
c(product:'Scent', store:'SmartMart', date:'4/8/07') = (quantity:10, receipts:20)

Given two group-by sets $G_i$ and $G_j$ such that $G_i \le G_j$, the functional dependency between
$G_j$ and $G_i$ and the roll-up functions in F allow each coordinate $\alpha_j$ of $G_j$ to be mapped into
exactly one coordinate $\alpha_i$ of $G_i$. Each coordinate of $G_i$ corresponds to a set of coordinates of
$G_j$. You may then extend the roll-up function definition to map from one group-by set
domain to another:

$\mathit{up}_{G_j}^{G_i} : Dom(G_j) \to Dom(G_i)$ such that $\mathit{up}_{G_j}^{G_i}(\alpha_j) = \alpha_i$

For example,

$\mathit{up}_{\{\text{type}, \text{city}\}}^{\{\text{category}, \text{state}\}}(\text{'Drink'}, \text{'Los Angeles'}) = (\text{'Food'}, \text{'California'})$

(We have omitted the attribute names that define the coordinates to simplify notation, as the
subscript and superscript group-by sets of the roll-up function already express them.)
If $G_j$ includes more attributes from the same hierarchy, the constraint on unique
compositions of roll-up functions among attributes ensures a clear definition of roll-up
functions among coordinates; here's an example:

$\mathit{up}_{\{\text{salesDistrict}, \text{city}\}}^{\{\text{country}\}}(\text{'NorthIllinois'}, \text{'Chicago'}) = (\text{'USA'})$

A hierarchy may be present in $G_j$ but not in $G_i$:

$\mathit{up}_{\{\text{product}, \text{state}, \text{quarter}\}}^{\{\text{type}, \text{year}\}}(\text{'CleanHand'}, \text{'Colorado'}, \text{'II 2007'}) = (\text{'Soap'}, \text{'2007'})$

Here we use the $\mathit{up}_{\{\text{product}, \text{state}, \text{quarter}\}}^{\{\text{type}, \text{year}\}}$
roll-up function to map all the coordinates of the
{product, state, quarter} group-by set related to soap-type products, to the four
quarters of 2007, and to all the stores into the (type:'Soap', year:'2007') coordinate.

As a last example, think of the specific case of a roll-up function that maps into the
empty group-by set. The coordinates of the empty group-by set do not exist strictly because
no attributes are displayed in the empty group-by set. Conventionally you may assume that
one fictitious coordinate exists, to which all of the coordinates of any group-by set $G$
correspond via the $\mathit{up}_{G}^{\{\}}$ roll-up function.
Now let there be a secondary group-by set $G \le Gbs(F)$. The roll-up function that
associates a set of coordinates of $Gbs(F)$ with each coordinate of $G$ enables the aggregation of
primary events of cube c by group-by set $G$. You then obtain an aggregation $c'$ where the
measure values for each coordinate sum up those in the coordinates of c corresponding via
the roll-up function.

Aggregation and Secondary Events


Let F be a fact schema and let $G \le Gbs(F)$. Let c be a cube instance of F. The aggregation
of c by $G$ is a function $c'$ that maps from the coordinates of $G$ to the measure domains
of $Meas(F)$:

$c' : Dom(G) \to Dom(m_1) \times \ldots \times Dom(m_q)$

This is defined as follows:

$$c'(\alpha').m_i = \Omega_{\alpha \,:\, \mathit{up}_{Gbs(F)}^{G}(\alpha) = \alpha'} \; c(\alpha).m_i$$

where $\alpha$ is a generic coordinate of $Gbs(F)$, $\alpha'$ is a generic coordinate of $G$, and $\Omega$ is an
aggregation operator for measure $m_i$. Each n-ple of the measure values defined is called
a secondary event.
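
The aggregation just defined can be tried out on the four primary events of the sales example. In this Python sketch, the cube is a dictionary keyed by (product, store, date) coordinates, the roll-up function is passed in as a callable, and SUM is assumed as the aggregation operator for both measures; the names `CUBE`, `aggregate`, `by_product`, and `grand_total` are hypothetical.

```python
# Primary events of the sales example, keyed by (product, store, date) coordinates.
CUBE = {
    ("Shiny", "EverMore", "4/5/07"): {"quantity": 10, "receipts": 25},
    ("Shiny", "EverMore", "4/7/07"): {"quantity": 20, "receipts": 50},
    ("Scent", "EverMore", "4/5/07"): {"quantity": 5, "receipts": 10},
    ("Scent", "SmartMart", "4/8/07"): {"quantity": 10, "receipts": 20},
}

def aggregate(cube, up, operators):
    """Aggregate `cube` by the group-by set reached through the roll-up function `up`.

    `operators` maps each measure name to its aggregation operator (a function on lists).
    """
    groups = {}
    for coordinate, event in cube.items():
        # All primary coordinates with the same image end up in the same group.
        groups.setdefault(up(coordinate), []).append(event)
    return {
        target: {m: op([e[m] for e in events]) for m, op in operators.items()}
        for target, events in groups.items()
    }

SUM_BOTH = {"quantity": sum, "receipts": sum}
# Secondary events for the {product} group-by set: the roll-up keeps only the product.
by_product = aggregate(CUBE, lambda a: (a[0],), SUM_BOTH)
# Empty group-by set: every coordinate maps to the single fictitious coordinate ().
grand_total = aggregate(CUBE, lambda a: (), SUM_BOTH)
```

Here `by_product` holds one secondary event per product, and `grand_total` holds the single event associated with the fictitious coordinate of the empty group-by set. A non-additive measure would need a different operator in `operators`.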
CHAPTER
Conceptual Design

Chapter 5 described a conceptual model for data marts, which should be used to
create a set of fact schemata. In this chapter, we discuss how to build a conceptual
schema for a data mart in order to meet user requirements and be consistent with
operational source schemata.
Chapter 2 already pointed out that three fundamental methodological approaches can
be taken to data mart design: the data-driven approach, the requirement-driven approach,
and the mixed approach. Their differences are based on the relevance given to the source
database analysis and the end-user requirement analysis phases. Opting for one of those
approaches dramatically affects the way conceptual design will be carried out.

* The data-driven approach defines the conceptual schema for a data mart in relation to
the structure of an operational data source, typically your reconciled database.
This helps you skip the complex task of linking them at a later stage. Moreover, you
can almost automatically derive a preliminary conceptual schema for your data
mart from your data source schema. When you analyze user requirements, this
gives you the immediate opportunity to have a precise idea of the maximum
performance potential of your data mart, and then to discuss specifications more
realistically.
* In a requirement-driven approach, there is more room for a user requirement analysis
because no detailed information on data sources is available or sources are too
complex to be investigated. The resulting specifications become the foundations for
a conceptual schema of a data mart. In practice, this means that designers have to be
able to steer their interviews with users to extract (i) very precise instructions
about facts to be represented; (ii) measures defining those facts; and (iii) hierarchies
for those facts to be usefully aggregated. Designers will deal with linking the resulting
conceptual schemata to operational data sources only at a later stage.
* In a mixed approach, user requirements play an active role in imposing limits on
complexity of data source analysis. The in-depth requirements analysis is formally
performed as in the requirement-based approach. The results of requirement
analysis guide the algorithm that obtains a chart structure of fact schemata from
data source schemata as in the data-driven approach.


In this chapter, we will examine conceptual design with reference to these three
approaches.
In the data-driven approach, it is extremely interesting to study how you can derive
your conceptual schema from those schemata that define a relational database. This is
because the majority of the operational databases created over the last decade are based on
the relational model. Our methodology may be applied to conceptual Entity-Relationship
schemata (see section 6.1) or relational logical schemata (see section 6.2) with minor
changes. Naturally, conceptual Entity-Relationship schemata are more expressive than
relational logical schemata. For this reason, they are generally considered a better design
resource. However, companies often provide Entity-Relationship schemata that are
incomplete and inaccurate. Very often, the only accessible documentation consists of logical
schemata of databases if no careful investigation is carried out.
Companies are increasingly considering the Internet as an integral part of their business
and communications plans. Moreover, a large part of web data is stored in eXtensible
Markup Language (XML) format. For this reason, it is important that you can integrate
XML data into your data warehousing systems. Section 6.3 shows how a data mart
conceptual schema can be derived from schemata defining XML sources. The material
presented is discussed in Golfarelli and others (2001).

NOTE See Atzeni et al., 1999 for a more detailed discussion on the Entity-Relationship model
(ERM). See Cabibbo and Torlone, 1998; Husemann et al., 2000; and Song et al., 2007 for other
approaches to conceptual design from Entity-Relationship source schemata. Jones and Song
(2005) discuss an interesting approach to conceptual design based on the application of relevant
design patterns. Other work relevant to XML sources and semantic web contexts is presented in
Vrdoljak et al., 2003; Jensen et al., 2001; and Romero and Abelló, 2007.

Section 6.4 gives a detailed description of the mixed approach to conceptual design, which
is only partially different from the data-driven approach. Finally, section 6.5 provides more
details on the main instructions for the requirement-driven approach to conceptual design.

6.1 Entity-Relationship Schema-based Design


Data-driven Conceptual Design
The technique used for the Dimensional Fact Model (DFM)-compliant conceptual
design of a data mart based on an operational source Entity-Relationship schema
includes the following steps:
1. Define facts.
2. For each fact:
   a. Build an attribute tree
   b. Prune and graft the attribute tree
   c. Define dimensions
   d. Define measures
   e. Create a fact schema

First, select relevant facts from your source schema (step 1). Then, create your attribute tree
in semi-automatic mode (step 2.a). This is a transitional structure that is useful for delimiting
the area relevant to your fact schema, for eliminating all the irrelevant attributes and
modifying the dependencies that link them (step 2.b), and for defining measures and dimensions
(steps 2.c and 2.d). The attribute tree is also very important because it links your data mart
and source schema. This link acts as a key for the data-staging process. Then it is relatively
simple to translate your attribute tree into a fact schema (step 2.e). Note that step 2.a is based
on the application of an algorithm; steps 2.c, 2.d, and 2.e are based on the objective properties
of attributes; and steps 1 and 2.b require an in-depth knowledge of the corporate business
model. For this reason, designers must pay much greater attention to steps 1 and 2.b.
Every step in this method will be described with reference to the sales example. Figure 6-1
shows a simplified Entity-Relationship schema for this example. Each instance of the SALE
relationship represents the sale of a specific product listed on a specific sale receipt. The
unitPrice attribute relates to SALE rather than PRODUCT because product prices can vary
over time. The identifier of SALES DISTRICT is the pair consisting of the districtNum
attribute and the COUNTRY entity identifier. Note that there is a redundant cycle involving the
STORE, CITY, STATE, COUNTRY, and SALES DISTRICT entities. However, the cycle
involving PRODUCT and CITY through STORE and BRAND is not redundant because the city
where a product of a particular brand is produced is usually different from the cities where
those products are sold.

6.1.1 Defining Facts


Facts are concepts of key importance for a decision-making process. They typically correspond
to events dynamically occurring in a company.

FIGURE 6-1 Conceptual Entity-Relationship schema for the sales operational database

In an Entity-Relationship schema, a fact may correspond either to an entity or to an n-ary
relationship, R, between the E_1, ..., E_n entities. In the latter case, in the interests of
simplicity, you can transform R into an entity (reification process). To do this, add a new
entity called F and replace each branch of R with a binary relationship (R_i) between F and E_i.
If you mark the minimum and maximum cardinality at which an entity E participates
in a relationship A with min(E, A) and max(E, A),¹ respectively, you get this:

\[
\min(F, R_i) = \max(F, R_i) = 1, \quad \min(E_i, R_i) = \min(E_i, R), \quad \max(E_i, R_i) = \max(E_i, R), \quad i = 1, \ldots, n
\]

The attributes of the R relationship become attributes of F. The identifier of F is the
combination of the identifiers of E_i, i = 1, ..., n. Figure 6-2 shows the reification process
applied to a ternary relationship.
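
The reification rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Branch class, the reify function, and the cardinalities given to the SALE branches are hypothetical and are not taken from the source schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Branch:
    """One side of a relationship: an entity and its (min, max) cardinality."""
    entity: str
    min_card: int
    max_card: object  # an int, or the string "n" for "many"


def reify(fact_name, branches):
    """Turn an n-ary relationship R into entity F plus binary relationships R_i.

    Returns one (name of R_i, F side, E_i side) triple per branch of R.
    """
    binary = []
    for i, b in enumerate(branches, start=1):
        f_side = Branch(fact_name, 1, 1)  # min(F, R_i) = max(F, R_i) = 1
        e_side = Branch(b.entity, b.min_card, b.max_card)  # copied from R
        binary.append((f"R{i}", f_side, e_side))
    return binary


# SALE between PRODUCT and SALE RECEIPT; these cardinalities are assumed.
sale = [Branch("PRODUCT", 0, "n"), Branch("SALE RECEIPT", 1, "n")]
reified = reify("SALE", sale)
```

Each E_i keeps the cardinalities it had in R, while the new entity takes part in every R_i exactly once.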

TIP Those entities that represent frequently updated archives, such as SALE, are good candidates for
fact definition, but those entities that represent structural domain properties corresponding to
almost static archives, such as STORE and CITY, are not.
As a matter of fact, this common-sense rule must not be applied too literally. This is
because the borderline between what should be a fact and what should not depends largely
on the application domain and the type of analysis users wish to carry out. Consider the
example in which shop assistants in a supermarket are assigned to different departments.
If each assistant is assigned to only one department at any time, is it correct to classify this
relationship as a fact? If users want to focus on the sales made by each assistant, both
assistants and departments will be modeled as hierarchy attributes. A functional dependency
will model shop assistant assignments to a specific supermarket department. On the contrary,

FIGURE 6-2 Reification of the R relationship

¹ Typically min(E, A) ∈ {0, 1} and max(E, A) ∈ {1, n}.

FIGURE 6-3 Reification of the SALE relationship

if users are more interested in studying assistants transferred from one supermarket department
to another, then you should introduce a new fact to emphasize the dynamic properties of the
assignments to specific departments. Note that the choice between one design solution and
the other does not necessarily depend on how often assistant assignments change.
Each fact identified in a source schema becomes the root of a different fact schema. In
the following sections, we will focus on one single fact that corresponds to an entity,
possibly deriving from a reification process. The most relevant fact for the analysis is the
sale of a product in the sales example. At a conceptual level, the sales fact is represented by
the SALE relationship, reified into the SALE entity, as shown in Figure 6-3.
It is important to note that different entities can sometimes be candidates for expressing
an individual fact. If this is the case, we suggest that you always choose as a fact the
entity from which you can build an attribute tree that includes as many attributes as possible.
Section 6.2.2 will give an example to clarify this concept.

6.1.2 Building Attribute Trees

Attribute Trees
Given a relevant part of an Entity-Relationship source schema and one of its entities
F classified as a fact, an attribute tree is the tree that meets the following requirements:
* Each node corresponds to a source schema attribute (simple or composite).
* The root corresponds to the identifier of the F entity.
* For each node v, the corresponding attribute functionally determines all the
attributes corresponding to the descendants of v.

You can automatically build the attribute tree corresponding to F if you run an
algorithm that recursively navigates functional dependencies expressed by identifiers and
to-one relationships in your source schema. Each time you examine an entity E, you should
create a new node v corresponding to the E identifier, and then you should add a child
node to v for every attribute of E. These also include all the individual attributes that
make up the E identifier, if it is composite. Every time you find a relationship R with a
maximum cardinality of 1 from the E entity to another entity G, you should add to the v node
a child for the G identifier and a child for each attribute of R. Then you should
repeat this procedure for the G entity. The entity from which this process is triggered is the
one you chose as the fact, F.

Figure 6-4 shows the basic features of the algorithm that builds the attribute tree in
pseudo-code. Figure 6-5 shows an example of the control flow for the translate
procedure and it gives a step-by-step explanation on how to build an attribute tree branch
for the sales example. Figure 6-6 shows the resulting tree.

FIGURE 6-4 Pseudo-code for the attribute tree building algorithm

root = newNode(ident(F));   // ident(F) is the identifier of F
// the tree root is labeled with the identifier
// of the entity set as a fact
translate(F, root);

procedure translate(E, v):
// E is the current entity of the source schema,
// v is the current node of the tree
{ for each attribute a ∈ E such that a ≠ ident(E)
      addChild(v, newNode(a));
      // adds the a child to the v node
  for each entity G linked to E by a relationship R such that max(E, R) = 1
  { for each attribute b ∈ R
        addChild(v, newNode(b));
        // adds the b child to the v node
    next = newNode(ident(G));
    // creates a new node with the name of the identifier of G...
    addChild(v, next);
    // ...adds it to v as a child...
    translate(G, next);
    // ...and triggers the recursion
  }
}

FIGURE 6-5 Operating flow of the translate procedure for the sales example

root = newNode(product+saleReceiptNum)

translate(E = SALE, v = product+saleReceiptNum):
    addChild(product+saleReceiptNum, quantity);
    addChild(product+saleReceiptNum, unitPrice);
    for G = SALE RECEIPT:
        addChild(product+saleReceiptNum, saleReceiptNum);
        translate(SALE RECEIPT, saleReceiptNum);
    for G = PRODUCT:
        addChild(product+saleReceiptNum, product);
        translate(PRODUCT, product);

translate(E = SALE RECEIPT, v = saleReceiptNum):
    addChild(saleReceiptNum, date);
    for G = STORE:
        addChild(saleReceiptNum, store);
        translate(STORE, store);

translate(E = STORE, v = store):
    addChild(store, address);
    addChild(store, telephone);
    addChild(store, salesManager);
    for G = DISTRICT:
        addChild(store, districtNum+country);
        translate(DISTRICT, districtNum+country);
    for G = CITY:
        addChild(store, city);
        translate(CITY, city);

translate(E = DISTRICT, v = districtNum+country):
    addChild(districtNum+country, districtNum);
    for G = COUNTRY:
        addChild(districtNum+country, country);
        translate(COUNTRY, country);

translate(E = COUNTRY, v = country):
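
As a minimal runnable sketch of the translate procedure, the Python below builds one branch of the sales attribute tree. The dictionary encoding of the schema (SCHEMA, with the ident, attrs, and to_one keys), the Node class, and the helper names are assumptions made for illustration. The schema is deliberately truncated, so PRODUCT and STORE are not expanded as fully as in the real example, and composite identifiers receive no special treatment.

```python
# Hypothetical encoding of part of the sales schema. For each entity:
# its identifier, its non-identifier attributes, and its to-one
# relationships as (attributes of the relationship, target entity) pairs.
SCHEMA = {
    "SALE": {"ident": "product+saleReceiptNum",
             "attrs": ["quantity", "unitPrice"],
             "to_one": [([], "SALE RECEIPT"), ([], "PRODUCT")]},
    "SALE RECEIPT": {"ident": "saleReceiptNum", "attrs": ["date"],
                     "to_one": [([], "STORE")]},
    "PRODUCT": {"ident": "product", "attrs": [], "to_one": []},
    "STORE": {"ident": "store",
              "attrs": ["address", "telephone", "salesManager"],
              "to_one": []},
}


class Node:
    def __init__(self, name):
        self.name = name
        self.children = []


def translate(entity, v, schema):
    """Expand node v, which carries the identifier of `entity`."""
    e = schema[entity]
    for a in e["attrs"]:                  # attributes of E
        v.children.append(Node(a))
    for rel_attrs, g in e["to_one"]:      # relationships with max(E, R) = 1
        for b in rel_attrs:               # attributes of R
            v.children.append(Node(b))
        nxt = Node(schema[g]["ident"])    # node for the identifier of G
        v.children.append(nxt)
        translate(g, nxt, schema)         # recursion on G


def build_attribute_tree(fact, schema):
    root = Node(schema[fact]["ident"])
    translate(fact, root, schema)
    return root


def show(node, depth=0):
    """Return the tree as a list of indented lines."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(show(child, depth + 1))
    return lines


tree = build_attribute_tree("SALE", SCHEMA)
print("\n".join(show(tree)))
```

The sketch terminates only because this toy schema has no cycles of to-one relationships.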

Naturally, the need to pay due attention to many specific elements, for which you must
check your source schema, makes the actual operation of the algorithm more complex than this.
In the following paragraphs we will make a few remarks and give you some guidelines to manage
those elements.
Your source schema may contain a cycle of many-to-one relationships. For example, one
part may be made up of many other parts. If this is the case, the translate procedure
loops forever and attempts to build an infinite branch in the attribute tree. Two solutions are
possible at the conceptual level. The first solution involves the use of a recursive hierarchy
to show that the height of the hierarchy is generally unlimited. With this solution you can
have instances of hierarchies of different heights, but bear in mind that the logical modeling
of recursive hierarchies inevitably involves complexity and performance problems. See
sections 5.2.8 and 9.1.7 for more details on conceptual and logical modeling of recursive
hierarchies. The second solution is very often preferred. This solution stops the cycle
after a specific number of iterations determined by the relevance of the relationship in your
application domain. Figure 6-7 shows a simplified schema for the transfer of personnel
within a company. There is a cycle of many-to-one relationships in this schema involving the
EMPLOYEE, the DEPARTMENT, and the DIVISION entities. There are three cycles because
you can reach DEPARTMENT twice directly from TRANSFER, chosen as a fact, and a third

FIGURE 6-6 Attribute tree for the sales schema; the root is highlighted in gray.

FIGURE 6-7 Entity-Relationship schema for the personnel transfers and two possible corresponding
attribute trees

time from EMPLOYEE. Figure 6-7 shows two possible attribute trees. In the first tree, we
used three recursive hierarchies for modeling. In the second tree, for each department d, we
chose to stop the cycle at the department d' to which the manager of d belongs, because we
assumed that the division of d' coincides with the division of d.
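
The second solution, stopping a cycle after a fixed number of iterations, can be sketched as follows. The PARTS schema, the translate_bounded function, and the max_visits parameter are hypothetical names chosen for this illustration. How many iterations to allow remains a design decision that depends on your application domain.

```python
def translate_bounded(entity, schema, max_visits, path=()):
    """Return the attribute subtree of `entity` as (identifier, children).

    An entity already met `max_visits` times on the current path is not
    expanded again, which cuts the otherwise infinite branch.
    """
    e = schema[entity]
    children = [(a, []) for a in e["attrs"]]
    path = path + (entity,)
    for g in e["to_one"]:
        if path.count(g) >= max_visits:
            continue  # stop the cycle here
        children.append(translate_bounded(g, schema, max_visits, path))
    return (e["ident"], children)


def height(tree):
    """Number of levels in a (name, children) tree."""
    return 1 + max((height(c) for c in tree[1]), default=0)


def count(tree, name):
    """Number of nodes labeled `name`."""
    return (tree[0] == name) + sum(count(c, name) for c in tree[1])


# Assumed schema: each part is a component of at most one other part.
PARTS = {"PART": {"ident": "partCode", "attrs": ["description"],
                  "to_one": ["PART"]}}

tree = translate_bounded("PART", PARTS, max_visits=3)
```

With max_visits=3 the branch holds three partCode levels. In a real tree you would rename the repeated nodes according to their role.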
While you are exploring a cyclic schema, you may happen to reach an individual E
entity twice while you go down different paths. In this way, you generate two homologous
nodes called v' and v'' in your tree. If each instance of F determines exactly one instance of
E, regardless of the path followed, you can have the v' and v'' nodes coincide in a single
node called v. In this way, you create a convergence. The same applies to each pair of
homologous children of v' and v''. If this is not the case, v' and v'' must be left separate. After
creating the fact schema, you can opt for a shared hierarchy construct. Figure 6-7 shows an
example in which the conditions for a convergence are not met because the origin and
destination departments must be separate. The sales example shows the correct conditions
for convergence instead, because the country attribute can be reached either from
districtNum or from state.
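
The convergence condition can be checked on sample data, as in the sketch below. All mappings and values are invented, and the follow and can_converge helpers are hypothetical. Passing the check on a sample is only evidence: convergence is a property of the application domain, so you should still confirm it with domain experts.

```python
def follow(instance, path, mappings):
    """Follow a chain of to-one mappings starting from one instance."""
    for step in path:
        instance = mappings[step][instance]
    return instance


def can_converge(start_instances, path_a, path_b, mappings):
    """True if both paths lead every starting instance to the same target."""
    return all(follow(x, path_a, mappings) == follow(x, path_b, mappings)
               for x in start_instances)


# Toy sales data: country reached through state or through the district.
SALES = {
    "store->city": {"S1": "Austin", "S2": "Lyon"},
    "city->state": {"Austin": "Texas", "Lyon": "Rhone-Alpes"},
    "state->country": {"Texas": "USA", "Rhone-Alpes": "France"},
    "store->district": {"S1": "USA-1", "S2": "France-7"},
    "district->country": {"USA-1": "USA", "France-7": "France"},
}
via_state = ["store->city", "city->state", "state->country"]
via_district = ["store->district", "district->country"]
sales_ok = can_converge(["S1", "S2"], via_state, via_district, SALES)

# Toy transfer data: origin and destination departments differ.
TRANSFERS = {"origin": {"T1": "Sales"}, "destination": {"T1": "R&D"}}
transfer_ok = can_converge(["T1"], ["origin"], ["destination"], TRANSFERS)
```

Here sales_ok is True, so the two country nodes may coincide, while transfer_ok is False, so the two department nodes stay separate.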
To-many relationships (max(E, R) > 1) and multiple attributes in the source
Entity-Relationship schema are not automatically inserted in trees. These may give rise to
cross-dimensional attributes or multiple arcs that designers will frame manually in fact schemata
at the end of the conceptual design phase. See section 6.1.5 for more details on
cross-dimensional attributes or multiple arcs.
It is important to show whether an optional link exists between attributes in one
hierarchy. To do this, you can mark with a dash those arcs that correspond to optional
relationships (min(E, R) = 0) or optional attributes in your Entity-Relationship schema. In
the sales example, this applies to the diet attribute.
If you carry out a reification process (section 6.1.1), you can turn an n-ary relationship in
your Entity-Relationship schema into n binary relationships. Most n-ary relationships have a
maximum multiplicity that is greater than 1 for all their branches. If this is the case, they define
n one-to-many binary relationships that you cannot insert into attribute trees. Figure 6-8 gives
an example of this point. On the contrary, a branch of an n-ary relationship whose maximum
multiplicity is equal to 1 defines a one-to-one relationship in a reified schema. In this case, you
can insert it in your tree. Figure 6-9 shows an example of this point. You can reach TEAM twice
from MATCH. Remember that you can always replace an n-ary relationship with a maximum
multiplicity of 1 for one branch in your Entity-Relationship schema with n - 1 equivalent binary
relationships without the need for reification.
Specialization hierarchies in Entity-Relationship schemata are equivalent to optional one-to-one
relationships between super-entities and sub-entities, and they may be treated as such in the
algorithm. Alternatively, if a hierarchy exists between the E super-entity and the E_1, ..., E_n
sub-entities, you can merely add a child to the node corresponding to the E identifier. This
child allows you to discriminate between different sub-entities, and it actually corresponds
to an attribute with n possible values. Figure 6-10 shows both solutions and makes reference
to the order line example. Note that the specialization attributes of the sub-entities are
optional in the second solution, which includes the productType discriminator.

FIGURE 6-8 Entity-Relationship schema for hospital admissions and the corresponding attribute tree

A composite attribute called c of the Entity-Relationship schema consists of simple
attributes, such as a_1, ..., a_m. It is inserted in the attribute tree as a c node with
children a_1, ..., a_m. See the address example in Figure 6-8. Then you can graft c or prune
its children, as mentioned in section 6.1.3. Figure 6-11 shows that composite identifiers can
also be treated in the same way.

FIGURE 6-9 Entity-Relationship schema for suspensions in football matches and the corresponding
attribute tree

FIGURE 6-10 Entity-Relationship schema for the order line example and two solutions for the attribute tree

6.1.3 Pruning and Grafting Attribute Trees


Generally, not all the attributes in the tree are relevant to the data mart. For example, a
customer's fax number can hardly be useful for decision-making purposes. For this reason,
trees have to be manipulated to delete unnecessary levels of detail.

FIGURE 6-11 Entity-Relationship schema for the telephone calls and the corresponding
attribute tree

FIGURE 6-12 Attribute tree (top), pruning (left), and grafting (right)

To prune node v, you should remove the entire subtree rooted in v. The attributes that
you remove will not be included in your fact schema, so they cannot be used for data
aggregation. Grafting is used when you need to retain the descendants of a node in your
tree, although that node shows information that is not relevant. To graft node v, whose
parent is called v', you should link all the children of v directly to v' and then remove v. As a
result, the aggregation level corresponding to v will be lost, but the levels corresponding to
its descendants will not.
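
Both operations are small tree manipulations, as the following sketch shows. The Node class and the prune, graft, and find_parent helpers are hypothetical, node names are assumed to be unique, and the root itself can be neither pruned nor grafted. The tree is a reduced version of the sales example, and pruning telephone is just an arbitrary illustration.

```python
class Node:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []


def find_parent(root, name):
    """Return the parent of the node called `name`, or None if not found."""
    for child in root.children:
        if child.name == name:
            return root
        found = find_parent(child, name)
        if found is not None:
            return found
    return None


def prune(root, name):
    """Remove the node called `name` together with its whole subtree."""
    parent = find_parent(root, name)
    parent.children = [c for c in parent.children if c.name != name]


def graft(root, name):
    """Remove the node called `name` and link its children to its parent."""
    parent = find_parent(root, name)
    idx = next(i for i, c in enumerate(parent.children) if c.name == name)
    parent.children[idx:idx + 1] = parent.children[idx].children


def names(node):
    return [c.name for c in node.children]


root = Node("product+saleReceiptNum", [
    Node("quantity"), Node("unitPrice"), Node("product"),
    Node("saleReceiptNum", [Node("date"), Node("store", [Node("telephone")])]),
])
graft(root, "saleReceiptNum")   # date and store become children of the root
store = find_parent(root, "telephone")
prune(root, "telephone")        # store loses its telephone child
```

After grafting, date and store hang directly from the root, which is why they become candidate dimensions.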


Figure 6-12 shows examples of pruning and grafting. Figure 6-13 shows how you can
transform the attribute tree in the sales example if you graft saleReceiptNum and prune
state, districtNum, and size.
When you graft an optional node, all of its children become optional. If you prune or graft
an optional node v with a parent called v', you can add to v' a new child b corresponding to a
Boolean attribute to express optionality. For this reason, the value of b is TRUE for every value
of v' for which a value of v exists. Figure 6-14 shows an example of this: The check-in
attribute is Boolean. It was added to the tree when the optional ticketNumber (CHECK-IN)
attribute was pruned. Its value is TRUE only for those tickets whose passengers have
checked in.
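
The Boolean replacement can be sketched on a few invented ticket records. The field names (checkInTicketNumber, checkIn) and the prune_optional helper are assumptions made for this example only.

```python
# Toy ticket records. checkInTicketNumber stands for the optional
# ticketNumber (CHECK-IN) attribute and is None when the passenger has
# not checked in. All names and values are invented.
tickets = [
    {"ticketNumber": "T1", "passenger": "Rossi", "checkInTicketNumber": "T1"},
    {"ticketNumber": "T2", "passenger": "Smith", "checkInTicketNumber": None},
]


def prune_optional(record, optional_key, flag_name):
    """Drop an optional attribute but keep a Boolean telling whether it existed."""
    kept = {k: v for k, v in record.items() if k != optional_key}
    kept[flag_name] = record[optional_key] is not None
    return kept


pruned = [prune_optional(t, "checkInTicketNumber", "checkIn") for t in tickets]
```

The pruned attribute disappears, but the checkIn flag still lets you separate tickets with a check-in from those without one.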
If you prune or graft a root child that corresponds to an attribute included in the
identifier of the entity chosen as a fact, this makes fact granularity coarser than in
your source schema. Moreover, if the node you graft has more than one child, this potentially
leads to an increase in the number of dimensions in your fact schema. Figure 6-6 shows

FIGURE 6-13 The attribute tree for sales after pruning and grafting

FIGURE 6-14 Entity-Relationship schema for airline ticketing and the attribute tree before and after
pruning the optional ticketNumber (CHECK-IN) attribute

a sales example: Grafting saleReceiptNum changes the granularity of sales. Before
grafting, a separate sales event existed for each product item listed on each sale receipt.
After grafting, all the sales related to a single product and to the sale receipts issued by an
individual store on a single date are aggregated and become one single event. Before
grafting, the potential dimensions were product and saleReceiptNum. After grafting,
the potential dimensions become product, date, and store.
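
The effect on granularity can be seen by aggregating a few invented sale lines under both choices of dimensions. The rows, the primary_events function, and the use of quantity as the aggregated measure are assumptions made for this sketch.

```python
from collections import defaultdict

# Invented source rows: one per product line on a sale receipt.
sale_lines = [
    {"product": "Milk", "saleReceiptNum": 1, "date": "2008-03-01",
     "store": "S1", "quantity": 2},
    {"product": "Milk", "saleReceiptNum": 2, "date": "2008-03-01",
     "store": "S1", "quantity": 1},
    {"product": "Milk", "saleReceiptNum": 3, "date": "2008-03-02",
     "store": "S1", "quantity": 4},
]


def primary_events(rows, dimensions, measure):
    """Sum `measure` over all rows sharing the same dimension values."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dimensions)] += row[measure]
    return dict(totals)


before = primary_events(sale_lines, ["product", "saleReceiptNum"], "quantity")
after = primary_events(sale_lines, ["product", "date", "store"], "quantity")
```

Before grafting there is one event per receipt line. After grafting, the two receipts issued by store S1 on the same date collapse into a single event.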
In this phase, you should think of how to deal with composite identifiers. Now look
at Figure 6-15: The E entity in the Entity-Relationship schema has a composite identifier
(it consists of the internal a_1, ..., a_m attributes and the external b_1, ..., b_t attributes).
The algorithm described in section 6.1.2 translates E into the
c = a_1 + ... + a_m + b_1 + ... + b_t node with the a_1, ..., a_m
children (the b_1, ..., b_t children will be added when the entity they identify is translated).

FIGURE 6-15 Managing composite identifiers: source schema, attribute tree, pruning (left), and
grafting (right)

You can achieve two possible results. If you have to leave the granularity of E unchanged
in your fact schema, you can retain the c node and prune one or more of its children. For
example, you can retain the districtNum+country node because you need to aggregate
by individual sales districts, but you can prune districtNum because this is not a
significant aggregation. Otherwise, you can graft c to remove it and retain all or some of its
children if the aggregation expressed by c is too fine-grained.
To conclude this section, we must emphasize that you may need to manipulate your
attribute trees further. In particular, you may need to make radical modifications to the
structure of your trees by replacing the parent of a specific node. This is the same as
adding or deleting a functional dependency, depending on whether the new parent is a
descendant or a predecessor of the original parent, respectively. Figure 6-16 shows how the
structure of a hierarchy changes when you modify functional dependencies. The left-hand
tree shows the a→b, b→c, and a→c functional dependencies (you can use transitivity
to infer the third functional dependency from a→b and b→c). The right-hand tree shows
only a→b and a→c.

In practice, you should add a functional dependency if this is not displayed in the
source schema but is inherent to the application domain. On the contrary, you could also
find it useful to delete a functional dependency if you want an attribute that is not a direct
root child to become a dimension or measure. However, you should remember that this
particular operation introduces a functional dependency between dimensions. For example,
if you do not want to sacrifice the information from the sale receipt but you also want to
have a time dimension in the sales fact, you can delete the functional dependency between
saleReceiptNum and date. Figure 6-17 shows the tree you can obtain. In this tree, you
can choose as dimensions product, saleReceiptNum, and date; the resulting fact
schema will clearly have a functional dependency between the saleReceiptNum and the
date dimensions.
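
Replacing the parent of a node can be sketched on a child-to-parent map. The reparent and functional_dependencies helpers are hypothetical, and the a, b, c tree follows the description of Figure 6-16 given in the text.

```python
def reparent(parent_of, node, new_parent):
    """Return a copy of the child -> parent map with `node` moved under `new_parent`."""
    updated = dict(parent_of)
    updated[node] = new_parent
    return updated


def functional_dependencies(parent_of):
    """Direct dependencies (parent, child) encoded by the arcs of the tree."""
    return {(p, c) for c, p in parent_of.items() if p is not None}


# Left-hand tree: a -> b -> c (a -> c follows by transitivity).
chain = {"a": None, "b": "a", "c": "b"}

# Deleting b -> c moves c under a, a predecessor of its original parent b.
flat = reparent(chain, "c", "a")

# Adding b -> c again moves c under b, a descendant of its current parent a.
restored = reparent(flat, "c", "b")
```

In the flat tree, b and c are both root children, so either one can become a dimension.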

FIGURE 6-16 Changes applied to an attribute tree after deleting or adding a functional dependency

FIGURE 6-17 An alternative sales attribute tree

6.1.4 One-to-One Relationships


A one-to-one relationship may be seen as a special case of a many-to-one relationship. For
this reason, a one-to-one relationship can be inserted in attribute trees. Despite this, if you
issue an online analytical processing (OLAP) query to drill down a one-to-one association,
this does not add any further useful detail. As a result, we suggest the following solutions:
* When the node determined by a one-to-one association in the attribute tree has
relevant descendants, you can graft it out to remove it.
* When the node determined by a one-to-one association does not have any relevant
descendants, you can represent it as a descriptive attribute.

When a one-to-one association exists, you can sometimes find it useful to invert both its
end nodes. This can be helpful when (i) a relation key is not a mnemonic code, so you do not
need to include it in your data mart, and (ii) one of the relation attributes is a description
field that can act as an alternate key. In this case, you can switch both attributes and then
prune the code attribute. Figure 6-18 shows an example of this point.

6.1.5 Defining Dimensions


Dimensions define how you can significantly aggregate events for decision-making
processes. They must be selected from the root child nodes of the attribute tree and may
correspond to discrete attributes or value ranges of discrete or continuous attributes.

FIGURE 6-18 Switching and pruning attributes in one-to-one associations

Selecting dimensions is crucial for design because it defines the granularity of primary
events. Each primary event "sums up" all the instances of F corresponding to a combination
of dimension values. If all the attributes making up the identifier of F are chosen as
dimensions, each primary event will correspond to one instance of F (lossless-grained fact
schema). Otherwise, you can prune or graft one or more attributes identifying F so that each
primary event can correspond to multiple instances (lossy-grained fact schema).

In the sales example, we have chosen the date, store, and product attributes as
dimensions. The granularity of primary events is coarser than that of the SALE source
schema entity because we grafted the saleReceiptNum node, a child of the root.
Figure 6-19 shows an interesting example of how to choose granularity. It shows part of
the Entity-Relationship schema relating to the admissions of a patient and the attribute tree
associated with admissions. Data warehousing in the health industry faces the classic
dilemma of keeping or losing patient granularity. The left-hand tree in the figure shows that
we retained the patientSSN node. For this reason, you may use it as a dimension, together
with fromDate, toDate, and department. On the contrary, we grafted the patientSSN
node and pruned firstName and surname to sacrifice the granularity of individual
patients in the right-hand tree. Now the dimensions become fromDate, toDate,
department, gender, userSegment, city, and yearOfBirth. The last dimension is of
course obtained from dateOfBirth. Figure 6-20 shows the fact schemata that you can
obtain for both cases after selecting your measures. The first schema is lossless-grained. The
second schema is lossy-grained. The second schema might seem to have a higher number of
primary events than the first one because it has more dimensions. But the opposite is true.
When the department and admission date values are equal, all the patients born in the same
year, of the same gender, resident in the same city, and belonging to the same user segment
are classified into a single event in the second schema. The first schema, instead, classifies
them as different events.
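
A quick count on invented admission rows illustrates why the schema with more dimensions can have fewer primary events. The rows and the count_primary_events helper are assumptions made for this example.

```python
# Invented admissions, one row per instance of the ADMISSION fact entity.
# P1 and P2 differ only in their SSN; P3 also differs in gender.
base = {"fromDate": "2008-05-02", "toDate": "2008-05-04",
        "department": "Cardiology", "city": "Rome",
        "userSegment": "A", "yearOfBirth": 1960}
admissions = [
    dict(base, patientSSN="P1", gender="F"),
    dict(base, patientSSN="P2", gender="F"),
    dict(base, patientSSN="P3", gender="M"),
]


def count_primary_events(rows, dimensions):
    """Number of distinct combinations of dimension values."""
    return len({tuple(row[d] for d in dimensions) for row in rows})


lossless_dims = ["patientSSN", "fromDate", "toDate", "department"]
lossy_dims = ["fromDate", "toDate", "department", "gender",
              "userSegment", "city", "yearOfBirth"]

lossless_events = count_primary_events(admissions, lossless_dims)
lossy_events = count_primary_events(admissions, lossy_dims)
```

The four lossless dimensions yield three events, while the seven lossy dimensions yield two, because P1 and P2 share every remaining attribute value.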
One particular feature of the choice of dimensions requires a separate discussion. Let us
consider the airline ticketing case of Figure 6-14. Here, the identifier of the F entity chosen as

FIGURE 6-19 Hospital admission schema and two alternative attribute trees

FIGURE 6-20 Fact schemata corresponding to the trees in Figure 6-19

a fact is simple that is, it consists of a single attribute fc, It the granularity of each single
instance of F is not relevant , you can choose the dimensions among the children of the root,
as we previously mentioned. Figure 6- 37, later in this chapter, shows an example of a ticketing
>
fact schema with the f 1ighrNumber . tlightBate , check - in, and passengerijencer
.
dimensions However, users can sometimes request the top granularity level defined by tine
k identifier. If tills is the case, you could perform analyses at the individual ticket level in the
ticketing example This request mostly conceals that users may (improperly ) wish to use their
data marts for operational queries linked to everyday life operations, which would be far
better addressed by an operational database. However, we still find it useful to show designers
how the)' can proceed if they are asked to retain maximum granularity:
1. Duplicate the root node of your attribute tree. This means that you have to add a new
   node that becomes the new root. This new root is also labeled with the name of the
   identifying fact. To connect it to the previous root, you can use an arc showing a
   one-to-one association. No additional arc exits that new root.

2. Now you can go on to selecting dimensions. You may decide to mark as a dimension the
   only direct child of the root. This will result in the creation of a one-dimensional fact
   schema. Alternatively, you may decide to delete some functional dependencies so as
   to change parents of some of the tree attributes in order to transform them into
   direct root children. Moreover, you can choose these direct root children as further
   dimensions. If this is the case, it will result in a fact schema with functional
   dependencies between dimensions.

Figure 6-21 shows an example of the air ticketing case. We modified the attribute tree
in Figure 6-14 to duplicate its root (top). If you choose ticketNumber (TICKET) as a
dimension, you can create a one-dimensional fact schema. Or you can decide to turn
flightDate and flightNumber into root children to create a fact schema with the
flightNumber, flightDate, and ticketNumber dimensions.
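The two steps can be sketched on a child-to-parent map. The tree below is a simplified, hypothetical stand-in for the ticketing tree of Figure 6-14 (which is larger), and the helper name children is an assumption made for this illustration.

```python
def children(parent, node):
    """Return the direct children of node in a child -> parent map."""
    return sorted(c for c, p in parent.items() if p == node)


# Simplified, hypothetical ticket tree (child -> parent).
old_root = "ticketNumber (TICKET)"
parent = {
    "flightNumber": old_root,
    "flightDate": old_root,
    "fare": old_root,
    "airline": "flightNumber",
}

# Step 1: duplicate the root. The new root carries the same identifier name
# and has a single outgoing arc (one-to-one) to the previous root.
new_root = "ticketNumber"
parent[old_root] = new_root
one_dimensional = children(parent, new_root)

# Step 2 (alternative): delete functional dependencies so that other
# attributes become direct children of the new root and can be dimensions.
for attribute in ("flightNumber", "flightDate"):
    parent[attribute] = new_root
three_dimensional = children(parent, new_root)

print(one_dimensional)
print(three_dimensional)
```

After step 1 the only candidate dimension is the previous root. After step 2 the new root has three children, and the resulting schema carries functional dependencies between its dimensions, because ticketNumber still determines the flight.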
FIGURE 6-21 Duplicating the root and deleting functional dependencies from the attribute tree in the air ticketing example

6.1.6 Time Dimensions


A fact schema should always contain at least one time dimension, as we previously
mentioned. To learn how to insert time dimensions, you should remember that source
operational databases can be classified into transient and temporal databases, depending
on the way they are related to time.
A transient database defines the current status of a specific application domain. Older
versions of data that vary over time are constantly being replaced by the new versions.
Transient databases do not always represent time explicitly because they imply that a
schema should display real-time data. If this is the case, you should add time (as a
dimension) to fact schemata manually and you should properly evaluate it for every event
when you carry out data population processes. For example, Figure 6-22 shows a simple
transient database schema tracing bids in online auctions. This schema stores only the latest
offer each user makes in each auction. We manually added the date attribute, which we
can choose as a dimension at a later stage, to the attribute tree. Each bid event loaded into
that data mart will be associated with the data population date.
On the contrary, a temporal database shows the evolution of an application domain
over an interval of time. Old data versions are explicitly represented and stored. When
you design a fact schema based on a temporal database, time is explicitly represented as
an attribute. For this reason, it becomes an obvious candidate to be classified as a
dimension. Figure 6-23 shows the temporal database schema corresponding to the
transient database schema of Figure 6-22. This schema records multiple bids made on
different days for an individual user-auction pair. The resulting attribute tree is identical
to the one of Figure 6-22.
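For a transient source, the time dimension has to be supplied by the population process itself. The following sketch assumes a hypothetical load_bids function and invented bid values; it only illustrates stamping each loaded event with the population date.

```python
from datetime import date


def load_bids(current_bids, population_date, fact_rows):
    """Append the bids currently stored in the transient source to the fact,
    stamping each event with the date of the population run."""
    for auction_code, user_code, value in current_bids:
        fact_rows.append(
            {
                "auctionCode": auction_code,
                "userCode": user_code,
                "date": population_date,  # not stored in the source
                "value": value,
            }
        )
    return fact_rows


fact = []
# First run: the source holds the latest bid of user U1 in auction A1.
load_bids([("A1", "U1", 50.0)], date(2024, 3, 1), fact)
# Second run: the source has overwritten that bid with a higher one.
load_bids([("A1", "U1", 65.0)], date(2024, 3, 2), fact)

for row in fact:
    print(row["date"], row["value"])
```

The source never holds both bids at once, yet the fact keeps one event per population date. With a temporal source the date would instead be read from the bid itself.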
FIGURE 6-22 Transient database schema for online auctions and its attribute tree

FIGURE 6-23 Temporal database schema for online auctions

Sometimes it is useful to graft or delete a functional dependency in order to make time
become a direct root child and, consequently, a dimension, if time is in an attribute tree as a
child of a node other than the root. For example, look at Figure 6-10, which shows the
order line Entity-Relationship schema. Figure 6-24 shows the original attribute tree (left)
and the tree obtained by deleting the functional dependency between orderCode and
date (right), where date is a candidate for becoming a dimension.
See sections 5.4.1 and 5.4.2 for more details on the choice between a representation based
on a transactional or a snapshot fact schema, and for issues raised if updates are delayed.

FIGURE 6-24 Attribute tree for order lines with/without the functional dependency between orderCode and date

6.1.7 Defining Measures


If all the attributes making up the identifier of the entity chosen as a fact are displayed
among dimensions (in other words, if you are creating a lossless-grained schema), then
measures will correspond to numeric attributes which are root children. If this is not the
case (that is, if you are creating a lossy-grained schema), to define your measures you should
apply aggregation operators, which operate on all the instances of F corresponding to each
primary event, to the numeric attributes of your tree. Generally, this is a sum/average/
maximum/minimum of an attribute expression, or a count of F instances. Remember that a
fact schema may be without measures (empty fact schemata), if the only relevant information
is a fact occurrence.
It is not absolutely essential to select measures from direct children of roots. However,
designers must be aware that they sacrifice a functional dependency when they select an
attribute that is not a direct child of a root as a measure. For example, assume that the
unitPrice attribute is placed on the PRODUCT entity, rather than on SALE in the sales
source schema. In order to be able to select this as a fact measure, you should first delete the
product→unitPrice functional dependency. In practice, this means that you turn
unitPrice into a child of the root.
Now we need to create a measure glossary that associates each measure with an expression
defining how to obtain that measure from the attributes in the source schema. Figure 6-25
shows the measure glossary of the sales example. Aggregation operators of this glossary are
applied to every instance of the SALE entity relating to an individual three-element group
(date, store, product). See section 13.3.2 for further details on measure glossaries.

FIGURE 6-25 Measure glossary of the sales schema

quantity = SUM(SALE.quantity)
receipts = SUM(SALE.quantity * SALE.unitPrice)
unitPrice = AVG(SALE.unitPrice)
numOfCustomers = COUNT(*)
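As a rough illustration of how such a glossary is evaluated, the sketch below applies the four expressions to a handful of invented SALE instances grouped by (date, store, product). The row values and variable names are assumptions made for this example only.

```python
from collections import defaultdict

# Hypothetical SALE instances; the values are invented for illustration.
sale = [
    {"date": "2024-03-01", "store": "S1", "product": "milk", "quantity": 2, "unitPrice": 1.5},
    {"date": "2024-03-01", "store": "S1", "product": "milk", "quantity": 1, "unitPrice": 2.5},
    {"date": "2024-03-01", "store": "S1", "product": "bread", "quantity": 4, "unitPrice": 1.0},
]

# Each (date, store, product) group corresponds to one primary event.
groups = defaultdict(list)
for row in sale:
    groups[(row["date"], row["store"], row["product"])].append(row)

measures = {}
for key, rows in groups.items():
    measures[key] = {
        "quantity": sum(r["quantity"] for r in rows),                    # SUM
        "receipts": sum(r["quantity"] * r["unitPrice"] for r in rows),   # SUM of an expression
        "unitPrice": sum(r["unitPrice"] for r in rows) / len(rows),      # AVG
        "numOfCustomers": len(rows),                                     # COUNT(*)
    }

print(measures[("2024-03-01", "S1", "milk")])
```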

If the fact granularity is different from that of the source schemata, you may find it
useful to define a number of measures that use different operators to aggregate the same
attribute. For example, suppose that each event corresponds to a set of airline tickets, sold
for a specific flight on a specific date, and that the fare attribute belongs to the TICKET
entity in the source schema. As a result, you can define one or more of the following
measures in the fact schema: averageFare, calculated by using the AVG operator;
maximumFare, calculated by using the MAX operator; minimumFare, calculated by using
the MIN operator; and receipts, calculated by adding the fares for individual tickets
together. Of course, the only additive measure will be receipts. To aggregate the other
measures you should use the same operator that defines them.
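A small sketch of the fare example follows. The ticket rows are invented, and only the roll-up of receipts, maximumFare, and minimumFare over dates is shown, each with the operator that defines it.

```python
# Hypothetical tickets: (flightNumber, date, fare). Values are invented.
tickets = [
    ("AZ100", "2024-03-01", 120.0),
    ("AZ100", "2024-03-01", 80.0),
    ("AZ100", "2024-03-02", 200.0),
]

# One primary event per (flight, date): collect the fares of its tickets.
fares_by_event = {}
for flight, day, fare in tickets:
    fares_by_event.setdefault((flight, day), []).append(fare)

events = {
    key: {
        "averageFare": sum(fares) / len(fares),  # AVG
        "maximumFare": max(fares),               # MAX
        "minimumFare": min(fares),               # MIN
        "receipts": sum(fares),                  # SUM, the only additive measure
    }
    for key, fares in fares_by_event.items()
}

# Rolling up over the date dimension with the defining operators.
rolled_up = {
    "receipts": sum(e["receipts"] for e in events.values()),
    "maximumFare": max(e["maximumFare"] for e in events.values()),
    "minimumFare": min(e["minimumFare"] for e in events.values()),
}

print(events[("AZ100", "2024-03-01")])
print(rolled_up)
```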

6.1.8 Generating Fact Schemata


Now you can translate the attribute tree into a proper fact schema including the dimensions
and measures specified in the preceding phases. Hierarchies correspond to sub-trees of the
attribute trees with their roots in dimensions. Fact names typically correspond to the names
of entities chosen as facts.

In this phase, you can prune and graft to eliminate any irrelevant details, but you can
also add new aggregation levels (typically, to the time dimensions) and define appropriate
ranges of numeric attributes.
You can mark attributes as descriptive if they are not used for aggregation but only for
information. These attributes also generally include attributes determined by one-to-one
associations with no descendants. As far as the attributes, which are root children but are
not chosen as dimensions or measures, are concerned, the following points apply:

• If a fact schema is lossless-grained, that is, the granularity of the primary events is
  the same as the one of the F entity, those attributes can be represented as descriptive
  attributes directly linked to the fact. A descriptive attribute linked to a fact takes a
  value for each single primary event.
• If a fact schema is lossy-grained, that is, both granularity values are different,
  those attributes absolutely need pruning; otherwise multiple values of these
  attributes would be associated with each primary event.

Figure 6-26 shows the lossless-grained fact schema for the order line example. The
attribute tree is the one on the right of Figure 6-10. The granularity value is the same as the
one of the source schema because all the attributes (orderCode and productCode) that
identify the fact entity were maintained as dimensions. For this reason, we can represent the
line number as a descriptive attribute number linked to the fact. In addition to the price
measure, which is non-additive because it shows the unit price of a product, we defined a
derived measure called receipts, which shows the mathematical product of price and
quantity for each order line. However, if we grafted the orderCode node and turned
date into a dimension together with productCode to create a lossy-grained fact schema,
we would need to prune the number attribute because it cannot be unequivocally defined
in the order line set for an individual product issued on the same day.
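Whether a root child such as number can survive as a descriptive attribute can be checked mechanically: it must take a single value for every primary event. The sketch below uses invented order lines and a hypothetical helper, is_single_valued, to compare the two choices of dimensions.

```python
# Hypothetical order lines; the values are invented for illustration.
order_lines = [
    {"orderCode": "O1", "productCode": "P1", "date": "2024-03-01", "number": 1},
    {"orderCode": "O2", "productCode": "P1", "date": "2024-03-01", "number": 3},
    {"orderCode": "O2", "productCode": "P2", "date": "2024-03-01", "number": 1},
]


def is_single_valued(rows, dimensions, attribute):
    """Return True if attribute takes exactly one value per primary event,
    where a primary event is identified by the given dimensions."""
    seen = {}
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        if seen.setdefault(key, row[attribute]) != row[attribute]:
            return False
    return True


# Lossless-grained: the dimensions are the identifier of the order line.
lossless_ok = is_single_valued(order_lines, ("orderCode", "productCode"), "number")
# Lossy-grained: orderCode is grafted and date becomes a dimension.
lossy_ok = is_single_valued(order_lines, ("date", "productCode"), "number")

print(lossless_ok, lossy_ok)
```

In the lossy-grained case product P1 appears on the same day with two different line numbers, so number has to be pruned.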
FIGURE 6-26 Fact schema for the order line example

If you find a shared hierarchy while you are building an attribute tree, you can choose
to highlight it to simplify your fact schema.
In this phase, you can also highlight any cross-dimensional attributes and multiple arcs.
It is very difficult to use source schemata as a starting point to identify these types of
attributes because you would have to navigate to-many relationships, too. However, if
to-many relationships could be navigated as well, this would extend navigation to entire
source schemata, give rise to an extremely large number of possible paths, and make cycle
management very difficult. Because you can neglect the frequency of cross-dimensional
attributes and multiple arcs in normal fact schemata, we suggest that you define them on
the basis of your user requirements. Then you should represent them in fact schemata at a
later stage. For this purpose, remember the following:
• A cross-dimensional attribute generally corresponds to an attribute of a many-to-
  many relationship called R in the source Entity-Relationship schema. Cross-
  dimensional attribute parents in fact schemata then correspond to the identifiers of
  the entities involved in R. See Figure 6-27 for an example.
• A multiple arc corresponds to a to-many relationship called R from an entity called
  E to an entity called G. It will link the E identifier of your fact with an attribute of R
  or G in your fact schema. See Figure 6-28 for an example.


In this phase, you must identify any schema elements that are non-additive or cannot
be aggregated. To do this, you need to analyze all the dimension-measure pairings. Given
an n-dimension fact schema, the question about the d dimension and the m measure to ask
is as follows:
If the (v1, ..., vk) values are given to the m measure in the k primary events corresponding
to k different values from the d domain, and from a preset value for each remaining
n-1 dimension, which aggregation operators does it make sense to use to mark all the
k events with a single value for m?

FIGURE 6-27 Entity-Relationship schema for sales and the corresponding fact schema

FIGURE 6-28 Entity-Relationship schema for hospital admissions and the corresponding fact schema

The sales schema, for example, supports the following remarks:

• Given a store and a product, if 10 items were sold yesterday and 3 items have been
  sold today, the total amount of items sold in both days would be 13. Then the
  quantity measure is additive along the date dimension.
• Given a day and a product, if 7 items were sold in the Columbus store and 5 items
  were sold in the Cleveland store, the total amount of items sold in Ohio stores would
  be 12. Then the quantity measure is also additive along the store dimension.
• Given a day and a store, if 4 boxes of Slurp milk and 3 boxes of Yum milk were sold,
  the total amount of boxes of milk sold would be 7. Then the quantity measure is
  also additive along the product dimension.

The INVENTORY schema with the product and date dimensions and the level
measure supports the following remarks instead:

• Given an item, if there were 100 items in a warehouse yesterday and there were
  95 items today, how many items would the warehouse total over both days? The
  answer 195 is obviously wrong, so level is non-additive along date. However,
  you can reasonably aggregate on the basis of AVG, MIN, or MAX.
• Given a day, if there were 40 boxes of Slurp milk and 30 boxes of Yum milk in a
  warehouse, the total amount of boxes of milk would be 70. Then the level measure
  is additive along the product dimension.
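The remarks above can be replayed numerically. The sketch below reuses the figures quoted in the text; the dictionary layout and variable names are assumptions made for this illustration.

```python
# quantity (sales): a flow measure, additive along date.
quantity_by_day = {"yesterday": 10, "today": 3}
total_quantity = sum(quantity_by_day.values())

# level (inventory): a stock measure, non-additive along date.
level_by_day = {"yesterday": 100, "today": 95}
meaningless_sum = sum(level_by_day.values())  # 195 items never existed at once
average_level = sum(level_by_day.values()) / len(level_by_day)
minimum_level = min(level_by_day.values())
maximum_level = max(level_by_day.values())

# level is still additive along product for a given day.
level_by_product = {"Slurp milk": 40, "Yum milk": 30}
total_milk_level = sum(level_by_product.values())

print(total_quantity, average_level, minimum_level, maximum_level, total_milk_level)
```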
FIGURE 6-29 Fact schema for the sales example

Figure 6-29 shows a lossy-grained fact schema obtained from the source schema in
Figure 6-1. We inserted the month, quarter, and so on, attributes and added them to the
time dimension. As far as additivity is concerned, unitPrice is non-additive, but you can
use the AVG operator to aggregate it. numOfCustomers cannot be aggregated.
Figure 6-30 shows two possible fact schemata that can be derived from the attribute tree
of Figure 6-7. In the first schema, which is lossless-grained, we left the employee
aggregation level unchanged and the fact is empty. In the second schema, we pruned the
empCode node so that we turned this schema into a lossy-grained one. In this way, we
managed to add the number measure, which counts the number of transfers made on each
day between two departments. Note the shared hierarchy in department.

FIGURE 6-30 Alternative fact schemata for the personnel transfer example

To conclude this section, we should comment that designers can sometimes "split" a
single fact schema into two or more schemata to standardize their hierarchies. This is more
properly referred to as fact schema fragmentation. From a logical viewpoint, the result obtained
is very similar to the one obtained from horizontal fragmentation (section 9.3.2). To illustrate
this, consider an example related to invoicing in a large retail chain. Assume that a company
sells products in Italy and abroad. A hierarchy for foreign customers is less informative than
the one for Italian customers. Moreover, the invoice line shows the currency dimension
only for foreign customers. Figure 6-31 shows a comparison of two conceptual design alternatives.
In the first conceptual design, the specific properties of the application domain cause the
geographical hierarchy for customers to be incomplete, because information on the foreign
customers' city and regions is missing. This also creates an optional dimension (currency).
In the second conceptual design, the decision was made to create two separate fact
schemata. The first fact schema models invoices issued to Italian customers. The second fact
schema models invoices issued to foreign customers. The clear-cut advantage of this is that
all the hierarchies are complete and compulsory, but every time you need the data summing
up all the invoicing, you have to issue a drill-across query accessing the overlapped schema.
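The cost of fragmentation shows up when total invoicing is needed. The sketch below is a simplified stand-in for a drill-across query over the two fragmented schemata. The event rows, the drill_across helper, and the assumption that receipts are already expressed in one common currency are all hypothetical.

```python
from collections import defaultdict

# Hypothetical primary events of the two fragmented fact schemata.
national_invoice_lines = [
    {"date": "2024-03-01", "product": "P1", "quantity": 5, "receipts": 50.0},
]
foreign_invoice_lines = [
    {"date": "2024-03-01", "product": "P1", "quantity": 2, "receipts": 30.0, "currency": "USD"},
    {"date": "2024-03-02", "product": "P1", "quantity": 1, "receipts": 15.0, "currency": "USD"},
]


def drill_across(schemata, dimensions=("date", "product"), measures=("quantity", "receipts")):
    """Sum the shared measures of several fact schemata over their shared dimensions."""
    totals = defaultdict(lambda: dict.fromkeys(measures, 0))
    for events in schemata:
        for event in events:
            key = tuple(event[d] for d in dimensions)
            for m in measures:
                totals[key][m] += event[m]
    return dict(totals)


all_invoicing = drill_across([national_invoice_lines, foreign_invoice_lines])
print(all_invoicing)
```

Only the dimensions common to both schemata (here date and product) can be used in the combined result, which is why currency does not appear in it.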

FIGURE 6-31 Fact schema for invoicing lines, without and with fragmentation

We recommend avoiding any abuse of fragmentation because it inevitably leads to a
proliferation of fact schemata. For this reason, you should use fragmentation only after
checking first with users that the percentage of analysis queries addressed only to one of the
fact schemata is much higher than the one of analysis queries addressed to multiple fact
schemata. Moreover, you should not confuse fact schema fragmentation with horizontal fact
table fragmentation despite their obvious similarities. Fact schema fragmentation is
conceptually relevant and can be deployed to improve fact schema simplicity and accuracy.
On the contrary, horizontal fact table fragmentation is part of the logical design phase, which
is strictly workload-dependent and aims at improving analytical query performance.

6.2 Relational Schema-based Design

The relational schema-driven approach to conceptual design is substantially the same as the
approach mentioned in section 6.1. The main difference between both approaches is in the
algorithm used to build attribute trees. Figure 6-32 shows a logical schema for the sales
example that we will use in this section. This logical schema corresponds to the
Entity-Relationship schema of Figure 6-1. Its primary keys are underlined. We specified the
referenced relation of every foreign key. We put the attributes, which are part of composite
foreign keys, in parentheses. We did not add any surrogate key in the interests of simplicity.

FIGURE 6-32 Relational schema for the sales operational database

PRODUCTS(product, weight, size, diet, brand:BRANDS, type:TYPES)
STORES(store, address, telephone, salesManager,
       (districtNum, country):SALES_DISTRICTS, inCity:CITIES)
SALE_RECEIPTS(saleReceiptNum, date, store:STORES)
WAREHOUSES(warehouse, address)
SALES(product:PRODUCTS, saleReceiptNum:SALE_RECEIPTS, quantity, unitPrice)
CITIES(city, state:STATES)
STATES(state, country:COUNTRIES)
COUNTRIES(country)
SALES_DISTRICTS(districtNum, country:COUNTRIES)
PROD_IN_WAREHOUSE(product:PRODUCTS, warehouse:WAREHOUSES)
BRANDS(brand, producedIn:CITIES)
TYPES(type, marketingGroup:MARKET_GROUPS, category:CATEGORIES)
MARKET_GROUPS(marketingGroup, director)
CATEGORIES(category, department:DEPARTMENTS)
DEPARTMENTS(department, departmentHead)

6.2.1 Defining Facts


In a relational schema, a fact corresponds to a relation. In the following sections, we will
study a fact corresponding to the F relation. In the sales example, the most important fact,
a product sale, is represented by the SALES relation.
Different relations can be candidates for expressing an individual fact, as in
Entity-Relationship sources. In section 6.2.2, we will show an example of how your selection can
result in similar attribute trees showing different quantities of information.

6.2.2 Building Attribute Trees


If your design source is the relational schema for your operational database, your attribute
tree will be built as follows:

• Each node corresponds to one or more² schema attributes.
• The root corresponds to the primary key of F.
• For each node v, the corresponding attribute functionally determines all the
  attributes that correspond to the descendants of v.

The tree-building procedure is based on the principle of following functional
dependencies. In a relational schema, functional dependencies link the primary key of each
R relation to all R attributes on the one hand, and each foreign key of R, which references S,
to the primary key of S on the other hand. The first examined relation is the one that you
choose as the F fact. Every time you examine a relation R, you create a new node v in your
tree. The v node corresponds to the primary key of R, with a child node added for each
attribute of R (including each single attribute that makes up the primary key of R if this
primary key is composite, but excluding the single attributes that are part of a composite
foreign key, because those will be dealt with in the next recursion step). You should also add
a child to v every time you find a foreign key c in R, and then recursively repeat the whole
procedure for the relation S referenced by c:

R(k, ..., c:S, ...)
S(c, d, e, ...)

A recursion is also triggered when a one-to-one association links R and S, and the
primary key of S is foreign and references R:

R(k, a, b, ...)
S(k:R, c, d, ...)

In this case, you have to check that this procedure does not return to R after examining S,
because this would result in an infinite loop.
Figure 6-33 shows pseudo-code that defines the basic operation of the algorithm
building an attribute tree rooted in F. The comments on cycles and convergences that we
previously made for Entity-Relationship schema-based design also apply to this case.

FIGURE 6-33 Pseudocode for the attribute tree-building algorithm

root = newNode(pk(F));
// the root is labeled with the primary key
// of the relation chosen as a fact
translate(F, root);

procedure translate(R, v):
// R is the current relation, v is the current node of the tree
{ for each attribute a ∈ R such that a ≠ pk(R) and ∄S: a ∈ fk(R,S)
    // pk(R) and fk(R,S) mark the sets of attributes of R that form,
    // respectively, the primary key of R
    // and a foreign key of R referencing S
    addChild(v, newNode(a));
    // add a child a to the node v
  for each set of attributes C ⊆ R such that
      ∃S s.t. (C = fk(R,S) or C = fk(S,R) = pk(S)) and (parent(v) ≠ S.C)
      // if the parent of v in the tree is C itself from S,
      // the infinite recursion is avoided
  { next = newNode(C);
    // create a new node with the names of the attributes in C
    addChild(v, next);
    translate(S, next);
    // add it to v as a child and trigger the recursion
  }
}

² A node may correspond to multiple attributes only if a primary key or a foreign key of a relation consists of
those attributes.
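The procedure can be sketched in Python as follows. This is one possible reading of the pseudo-code, not the authors' implementation: the schema encoding (a dictionary with attrs, pk, and fks entries), the Node class, and the three relations R, S, and T are assumptions made for this illustration, and the sketch does not handle the cycles and convergences mentioned above.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    label: tuple          # attribute names this node stands for
    relation: str         # relation the attributes come from
    parent: "Node | None" = None
    children: list = field(default_factory=list)


def add_child(v, child):
    child.parent = v
    v.children.append(child)
    return child


def translate(schema, r, v):
    rel = schema[r]
    fk_attrs = {a for attrs, _ in rel["fks"] for a in attrs}
    # Plain attributes: skip a simple primary key and every foreign-key attribute.
    for a in rel["attrs"]:
        if (a,) != rel["pk"] and a not in fk_attrs:
            add_child(v, Node((a,), r))
    # Links to follow: foreign keys of r, plus one-to-one associations where
    # the primary key of another relation is a foreign key referencing r.
    links = list(rel["fks"])
    for s, other in schema.items():
        if s != r and any(attrs == other["pk"] and target == r
                          for attrs, target in other["fks"]):
            links.append((other["pk"], s))
    for c, s in links:
        # Do not walk back to the relation we just came from.
        if v.parent is not None and (v.parent.relation, v.parent.label) == (s, c):
            continue
        translate(schema, s, add_child(v, Node(c, s)))


def build_tree(schema, fact):
    root = Node(schema[fact]["pk"], fact)
    translate(schema, fact, root)
    return root


def outline(v, depth=0):
    lines = ["  " * depth + "+".join(v.label) + " (" + v.relation + ")"]
    for child in v.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical schema: R(k, a, c:S), S(c, d), T(k:R, e).
schema = {
    "R": {"attrs": ["k", "a", "c"], "pk": ("k",), "fks": [(("c",), "S")]},
    "S": {"attrs": ["c", "d"], "pk": ("c",), "fks": []},
    "T": {"attrs": ["k", "e"], "pk": ("k",), "fks": [(("k",), "R")]},
}
tree = build_tree(schema, "R")
print("\n".join(outline(tree)))
```

In this example T is reached through the one-to-one association, and the parent check stops the recursion from returning to R when T's own foreign key is examined.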

Note that you cannot process optionality automatically. This is because no information
in a relational schema clearly specifies whether you can give attributes null values.³ Figure 6-6
shows the attribute tree corresponding to the sales schema, which is the same as the one
obtained from the Entity-Relationship schema.
We will provide another example related to DVD rental and described in the transient
database schema:

CARDS(cardNumber, expiry)
CUSTOMERS(cardNumber:CARDS, name, gender, address,
          telephone, personalDocument)
MOVIES(movieCode, title, category, director, length, mainActor)
COPIES(positionOnShelf, movieCode:MOVIES)
RENTALS(positionOnShelf:COPIES, cardNumber:CARDS, date, time)

Here, RENTALS is the only relevant fact. Figure 6-34 shows the control flow for the translate
procedure and explains how to build a branch of the attribute tree shown in Figure 6-35. Note
that the association linking the card number to the customer number is one-to-one.

FIGURE 6-34 Operating flow of the translate procedure in the DVD rental example

root = newNode(positionOnShelf)

translate(R=RENTALS, v=positionOnShelf):
  addChild(positionOnShelf, date);
  addChild(positionOnShelf, time);
  for S=COPIES:
    addChild(positionOnShelf, positionOnShelf);
    translate(COPIES, positionOnShelf);
  for S=CARDS:
    addChild(positionOnShelf, cardNumber);
    translate(CARDS, cardNumber);

translate(R=COPIES, v=positionOnShelf):
  for S=MOVIES:
    addChild(positionOnShelf, movieCode);
    translate(MOVIES, movieCode);

translate(R=MOVIES, v=movieCode):
  addChild(movieCode, title);
  addChild(movieCode, category);
  addChild(movieCode, director);
  addChild(movieCode, length);
  addChild(movieCode, mainActor);

translate(R=CARDS, v=cardNumber):
  addChild(cardNumber, expiry);
  for S=CUSTOMERS:
    addChild(cardNumber, cardNumber);
    translate(CUSTOMERS, cardNumber);

translate(R=CUSTOMERS, v=cardNumber):
  addChild(cardNumber, name);
  addChild(cardNumber, telephone);
  addChild(cardNumber, gender);
  addChild(cardNumber, address);
  addChild(cardNumber, personalDocument);

³ SQL Data Definition Language allows for a NOT NULL standard clause. In principle, it could be used to extract
optionality. As a matter of fact, this clause is normally used to express entity integrity constraints only. Entity
integrity constraints specify that it is forbidden to give a null value to the attributes of which a key consists. For
this reason, there is no point in inferring that all the attributes not specified as NOT NULL are actually optional.

To conclude this section, the following example demonstrates how to select a fact in
order to model as many concepts as possible. To this end, you should keep in mind how
attribute trees are built. The operational database is shown next:

FLIGHTS(flightNumber, airline, fromAirport:AIRPORTS,
        toAirport:AIRPORTS, departureTime, arrivalTime, carrier)
FLIGHT_INSTANCES(flightNumber:FLIGHTS, date)
AIRPORTS(IATAcode, name, city, country)
TICKETS(ticketNumber, (flightNumber, date):FLIGHT_INSTANCES, seat, fare,
        passengerFirstName, passengerSurname, passengerGender)
CHECK_IN(ticketNumber:TICKETS, checkInTime, numberOfBags)

FIGURE 6-35 Attribute tree for the DVD rental example

The relations that are candidates for expressing facts are FLIGHTS, FLIGHT_INSTANCES,
TICKETS, and CHECK_IN in this example. Figure 6-36 clearly shows that the last two options
are the best, because the existing functional dependencies make it possible to include the
maximum number of attributes in the tree. However, note that the selection of TICKETS
means that you opt for modeling the TICKET ISSUE fact. But if you select CHECK_IN, this
results in the CHECK-IN fact. The difference between both solutions is not merely the name,
because not all the tickets are necessarily checked in. As a result, the CHECK-IN fact primary
events will presumably be a subset of the TICKET ISSUE fact primary events.

FIGURE 6-36 Attribute trees for the flights example, obtained by choosing different relations as facts,
corresponding to the nodes in gray

6.2.3 Other Phases


The other phases in the methodology are substantially identical to those mentioned in
section 6.1 for the Entity-Relationship schema design.
You can choose to retain or graft any nodes corresponding to composite keys. You may
also find it useful to modify, add, or delete a functional dependency, as in the case of the
Entity-Relationship source. You need to add one or more functional dependencies if a
non-normalized relation exists in your source schema. It would be useful, for example, to make
country the child of city in the flights example. Figure 6-37 shows the tree built after
grafting repeatedly and the final lossy-grained ticket issue fact schema. The check-in
attribute is Boolean and we added it to the tree when we grafted the number node. Its value
is TRUE only for those tickets whose passengers have checked in.
Figure 6-38 shows the transformed attribute tree in the DVD rental example of Figure 6-35.
In Figure 6-38, we inverted movieCode and title, and cardNumber(CUSTOMERS) and name
(renamed customer); we grafted positionOnShelf(COPIES) and cardNumber(CARDS);
and we pruned time, expiry, telephone, address, personalDocument, movieCode,
and cardNumber(CUSTOMERS).

FIGURE 6-37 Modified attribute tree and fact schema for the flights example

FIGURE 6-38 Modified attribute tree and fact schema for DVD rentals

In case of relational sources, measure glossaries are typically written in SQL. If schemata
are lossy-grained, SQL queries defining measures will necessarily use the GROUP BY
clause. Figure 6-39 shows the glossaries of the sales, flights, and DVD rentals examples.

FIGURE 6-39 SQL measure glossaries for sales, flights, and DVD rentals. The check-in dimension
was left out of the flights example to avoid making the query too complex.

quantity =
    SELECT SUM(S.quantity)
    FROM SALES S INNER JOIN SALE_RECEIPTS R
      ON R.saleReceiptNum = S.saleReceiptNum
    GROUP BY S.product, R.date, R.store

receipts =
    SELECT SUM(S.quantity * S.unitPrice)
    FROM SALES S INNER JOIN SALE_RECEIPTS R
      ON R.saleReceiptNum = S.saleReceiptNum
    GROUP BY S.product, R.date, R.store

unitPrice =
    SELECT AVG(S.unitPrice)
    FROM SALES S INNER JOIN SALE_RECEIPTS R
      ON R.saleReceiptNum = S.saleReceiptNum
    GROUP BY S.product, R.date, R.store

numOfCustomers =
    SELECT COUNT(*)
    FROM SALES S INNER JOIN SALE_RECEIPTS R
      ON R.saleReceiptNum = S.saleReceiptNum
    GROUP BY S.product, R.date, R.store

numberOfFlights =
    SELECT COUNT(*)
    FROM TICKETS T INNER JOIN FLIGHT_INSTANCES I
      ON T.flightNumber = I.flightNumber AND T.date = I.date
    GROUP BY T.passengerGender, I.date, T.flightNumber

numberOfBags =
    SELECT SUM(C.numberOfBags)
    FROM TICKETS T INNER JOIN FLIGHT_INSTANCES I
      ON T.flightNumber = I.flightNumber AND T.date = I.date
    INNER JOIN CHECK_IN C
      ON T.ticketNumber = C.ticketNumber
    GROUP BY T.passengerGender, I.date, T.flightNumber

receipts =
    SELECT SUM(T.fare)
    FROM TICKETS T INNER JOIN FLIGHT_INSTANCES I
      ON T.flightNumber = I.flightNumber AND T.date = I.date
    GROUP BY T.passengerGender, I.date, T.flightNumber

number =
    SELECT COUNT(*)
    FROM RENTALS R INNER JOIN COPIES C
      ON R.positionOnShelf = C.positionOnShelf
    INNER JOIN MOVIES F
      ON C.movieCode = F.movieCode
    INNER JOIN CUSTOMERS U
      ON R.cardNumber = U.cardNumber
    GROUP BY F.title, R.date, U.name
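The sales glossary can be tried out on a toy database. The following is a minimal sketch, assuming an in-memory SQLite database accessed through Python's standard sqlite3 module. The function name sales_measures, the column types, and the sample rows are hypothetical; only the table and column names come from the queries of Figure 6-39. The four sales measures are computed in a single query because they share the same GROUP BY clause.

```python
import sqlite3

def sales_measures(rows_sales, rows_receipts):
    """Return one row per (product, date, store) with the four sales measures."""
    con = sqlite3.connect(":memory:")
    # Hypothetical column types; the names follow the glossary queries.
    con.execute("CREATE TABLE SALE_RECEIPTS "
                "(saleReceiptNum INTEGER PRIMARY KEY, date TEXT, store TEXT)")
    con.execute("CREATE TABLE SALES "
                "(saleReceiptNum INTEGER, product TEXT, quantity INTEGER, unitPrice REAL)")
    con.executemany("INSERT INTO SALE_RECEIPTS VALUES (?, ?, ?)", rows_receipts)
    con.executemany("INSERT INTO SALES VALUES (?, ?, ?, ?)", rows_sales)
    query = """
        SELECT S.product, R.date, R.store,
               SUM(S.quantity),                -- quantity
               SUM(S.quantity * S.unitPrice),  -- receipts
               AVG(S.unitPrice),               -- unitPrice
               COUNT(*)                        -- numOfCustomers
        FROM SALES S INNER JOIN SALE_RECEIPTS R
          ON R.saleReceiptNum = S.saleReceiptNum
        GROUP BY S.product, R.date, R.store
        ORDER BY S.product, R.date, R.store
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

# Made-up sample data: two receipts issued by the same store on the same day.
receipts = [(1, "2024-01-01", "A"), (2, "2024-01-01", "A")]
sales = [(1, "milk", 2, 1.0), (2, "milk", 3, 2.0), (1, "bread", 1, 3.0)]
events = sales_measures(sales, receipts)
```

Each returned row plays the role of one primary event of the lossy-grained fact: the two milk sales collapse into a single event because they share product, date, and store.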

6.3 XML Schema-based Design


Part of the data used in decision-making processes may be stored in XML form. The structure
of XML consists of nested tags, defined by users. These tags can define the meaning of the
data represented. This makes XML suitable for exchanging data on the Web without losing
semantics. The Internet is evolving into a global platform for e-commerce and information
exchange. Interest in XML is growing in step with this evolution. Now huge amounts of XML
data are available online.
XML can be considered as a special syntax for the exchange of semi-structured data
(Abiteboul et al., 2000). A feature common to all models of semi-structured data is the lack
of a schema, which makes it possible for data to "self-describe." As a matter of fact, XML
documents can be associated with a Document Type Definition (DTD) or an XML Schema.
DTDs and XML schemata are both able to describe and constrain documents and contents.
DTDs are defined as an integral part of the XML 1.0 (World Wide Web Consortium
[W3C], 2000) specification, and XML schemata have recently been recommended by the W3C
(W3C, 2002a). XML schemata are a major extension of DTD features, especially from the
viewpoint of constraints and data standardization. If you use DTDs or XML schemata, data
exchange applications can agree on the meaning of their tags. In this way, XML can release
its full potential.
Now it is becoming vital to be able to integrate XML data into data warehouses because
many companies look to the Web for communications and business support. This is also
confirmed by some commercial tools that already support XML data extraction for data
mart feeding. Nevertheless, they still require designers to define data mart schemata
manually and to make sure of their mappings with source schemata.
Conceptual design of data marts from XML sources presents two basic problems. On the
one hand, there are various approaches to representing associations in DTD and XML
schemata, each with a different expressive power. On the other hand, you cannot be sure
that you can derive all the information required for design because XML models
semi-structured data. In the following sections, we will first discuss the issues involved in
representing XML associations oriented to creating multidimensional schemata. Most topics
are discussed by Abiteboul et al. (2000). Then we will briefly describe a semi-automatic technique
solving the problem of how to infer correct information by querying source XML documents
and taking advantage of designers' skills.

6.3.1 Modeling XML Associations


An XML document consists of element structures nested on the basis of a root structure. Each
element can contain component elements, or sub-elements, and attributes. Both elements and
their attributes can have values. The structure of a document can be nested up to any level of
complexity. Any number of additional elements and textual data may be placed between
the opening and closing tags of an element. Attributes and attribute values are included in
element opening statements. Figure 6-40 shows an XML document containing data on the
traffic on a web site.

FIGURE 6-40 An XML document describing traffic on a web site

<webTraffic>
  <click>
    <host hostId="www.unibo.it">
      <country>italy</country>
    </host>
    <date>23-MAY-2005</date>
    <time>16:43:25</time>
    <url urlId="HSLGQ23">
      <site siteId="www.csb.fr">
        <country>france</country>
      </site>
      <fileType>shtml</fileType>
      <urlCategory>catalogue</urlCategory>
    </url>
  </click>
  <click>
    ...
  </click>
</webTraffic>

An XML document is valid if it has an associated schema (that is, a DTD or an XML
schema) and if it complies with the constraints expressed in that schema. The following
discussion will focus on the ways you can display many-to-one associations in DTDs,
because our conceptual design methodology is based on recognizing those associations.
Similar considerations apply to XML schemata.


A DTD defines (i) the elements and attributes an XML document allows; (ii) element
nesting modes; and (iii) element occurrences. The element type and the attribute list
statements constrain document structures. The element type statements specify which
sub-elements may appear as element children. The attribute list statements specify names,
types, and default values (if necessary) of each attribute associated with a specific element
type. Among the various attribute types, ID, IDREF, and IDREFS are of particular
importance for our approach. ID type defines a unique element identifier. IDREF type
shows that attribute values must correspond to the values of an ID attribute in the current
document. IDREFS type shows that attribute values are lists of ID values.
There are two different ways to specify associations in a DTD: using sub-elements or
IDREF(S).
In the first case, the association cardinality is described by an optional character that
comes after the name of a nested element or a list of elements in the element type statement.
This defines whether elements may appear one or more ("+"), zero or more ("*"), or zero or
one ("?") times. The default cardinality is precisely one. Figure 6-41 shows a DTD that
validates the XML document of Figure 6-40. The webTraffic element is defined as a
document element and becomes the root of XML documents. The webTraffic element
may contain many click elements. But the site sub-element must appear exactly
once in a url element; the fileType sub-element and many urlCategory elements can
come after it. The host element may have either a category or a country element.

FIGURE 6-41 Sub-elements specifying associations in a DTD

<!DOCTYPE webTraffic [
<!ELEMENT webTraffic (click*)>
<!ELEMENT click (host, date, time, url)>
<!ELEMENT host (category | country)>
<!ATTLIST host
    hostId ID #REQUIRED>
<!ELEMENT category (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT time (#PCDATA)>
<!ELEMENT url (site, fileType, urlCategory*)>
<!ATTLIST url
    urlId ID #REQUIRED>
<!ELEMENT site (country)>
<!ATTLIST site
    siteId ID #REQUIRED>
<!ELEMENT country (#PCDATA)>
<!ELEMENT fileType (#PCDATA)>
<!ELEMENT urlCategory (#PCDATA)>
]>

If you need to represent a one-to-one or a one-to-many association in XML, you can use
the sub-elements without any information loss. However, you can follow only one of the two
association directions in a DTD. For example, Figure 6-41 shows how a DTD expresses that
a url element can have many urlCategory sub-elements. But there is no way to infer whether a
URL category can refer to many URLs. You can conclude that this is true only if you already
know the domain described by the DTD.
The other way to specify element associations in DTDs uses ID and IDREF(S) attribute
pairs. These attributes operate in the same way as primary and foreign keys in relational
databases. The fundamental difference that prevents us from using IDREF(S) for our
purposes is that their syntax does not allow for any IDREF(S) attribute to be constrained to
contain identifiers of a specific element type.

6.3.2 Preliminary Phases


Before you can choose a fact and build its specific attribute tree, you need to simplify your
source DTD and create a DTD graph. Sub-elements in DTDs may be stated in a complex and
redundant way. If this is the case, they need simplifying (Shanmugasundaram et al., 1999).
To simplify a DTD, transformations generally involve converting a nested definition into a
"flat" representation. For example, host(category | country) is transformed into
host(category?, country?) in the web traffic example. Moreover, the "+" operators
are transformed into "*" operators.
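The two transformations can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name simplify_content_model is hypothetical, the content model is passed as a string, and nested parentheses or occurrence operators placed after the closing parenthesis are not handled. It is not the full set of simplification rules from the cited literature.

```python
def _relax(item, in_choice):
    """Relax the occurrence operator of a single item of a content model."""
    item = item.strip()
    if item.endswith(("*", "+")):
        # "one or more" is relaxed to "zero or more"
        return item[:-1] + "*"
    if in_choice and not item.endswith("?"):
        # each alternative of a choice becomes optional
        return item + "?"
    return item

def simplify_content_model(model):
    """Flatten a simple DTD content model such as "(category | country)"."""
    inner = model.strip()
    if inner.startswith("(") and inner.endswith(")"):
        inner = inner[1:-1]
    in_choice = "|" in inner
    separator = "|" if in_choice else ","
    items = [_relax(part, in_choice) for part in inner.split(separator)]
    return "(" + ", ".join(items) + ")"

flat_host = simplify_content_model("(category | country)")
flat_url = simplify_content_model("(site, fileType, urlCategory+)")
```

With these inputs, flat_host is "(category?, country?)" and flat_url is "(site, fileType, urlCategory*)". The simplified model accepts more documents than the original one, which is acceptable here because only the cardinality of each association matters for the design.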
After simplifying your DTD, you can create your DTD graph. This defines your
DTD structure, as discussed by Lee and Chu (2000) and Shanmugasundaram et al. (1999).
Your graph nodes will correspond to your DTD elements, attributes, and operators. DTD
graphs do not make any distinction between attributes and sub-elements because we can
consider them as equivalent nesting elements for our purposes. Figure 6-42 shows the
DTD graph of Figure 6-41.

FIGURE 6-42 DTD graph for web traffic analysis

6.3.3 Selecting Facts and Building Attribute Trees


Designers choose one or more DTD graph nodes as facts. Each chosen node becomes the
root of an attribute tree. We will choose click as the only relevant fact in our example.
Figure 6-43 shows the algorithm that builds the tree in pseudocode. The nodes in the
attribute tree are a subset of the DTD graph nodes representing elements and attributes.
We use the node of the fact F as a starting point for our tree. To grow our tree, we recursively
navigate the functional dependencies between the DTD graph nodes. The following steps
show how we expand the tree from each inserted node V (translate procedure):
1. For each node W that is a child of V in the graph:
   When examining the associations in the same direction as the graph, the information
   on cardinality is expressed either explicitly by the "?" and "*" nodes, or implicitly if
   they are not available. If W corresponds to an element or an attribute in the DTD
   graph, you can add it to the tree as a child of V. If W is the "?" operator, you can add
   its child to the tree as a child of V. If W is the "*" operator, you cannot add any node.

FIGURE 6-43 Pseudocode for the attribute tree-building algorithm

root = newNode(F);
// the root is a node labeled with the name
// of the DTD graph node chosen as a fact
translate(F, root);

procedure translate(E, V):
// E is the current node of the DTD graph,
// V is the current node of the attribute tree
{ for each child W of E such that W ≠ parent(V)
  // the condition on the parent of V in the tree avoids the loop
    if W is an element or an attribute
    { next = newNode(W);
      addChild(V, next);
      translate(W, next);
      // adds the child W to the node V and triggers the recursion
    }
    else if W = "?"
      translate(W, V);
      // the nodes "?" are omitted
  for each parent Z of E such that Z ≠ parent(V) and Z is not a document element
  // the condition on the parent of V in the tree avoids the loop
    if Z = "?" or Z = "*"
      translate(Z, V);
      // the nodes "?" and "*" are omitted
    else
      if not toMany(E, Z)
        if askToOne(E, Z)
        { next = newNode(Z);
          addChild(V, next);
          translate(Z, next);
          // if the association is to-one,
          // Z is added as a child of V
        }
}

2. For each node Z that is a parent of V in the graph:
   When examining the associations in this direction, you should skip the nodes
   corresponding to the "*" and the "?" operators because they express cardinality only
   in the opposite direction. You have to query the XML documents that conform to
   your DTD to examine your actual data because DTDs do not provide any further
   information on association cardinality. To do this, you need to use the toMany
   procedure, which counts the number of distinct Z values corresponding to every E
   value. If you find a to-many association, you cannot include Z in your tree. If this is
   not the case, you still cannot be sure that the cardinality of the E-to-Z association is
   to-one. In particular, only designers who are very familiar with their application
   domain can define whether cardinality is actually to-one or to-many (askToOne
   procedure). You can add Z to the tree only if cardinality is to-one. You do
   not need to consider document elements because they have just one instance in XML
   documents: for this reason, they are not relevant to aggregation and you do not
   have to model them in your data mart.
When you pass a "?" node, you should add an optional arc. Moreover, you should add
controls to prevent your algorithm from looping back at the end nodes of one-to-one associations.
This is because you can navigate a DTD graph both bottom-up and top-down.
Uncertain associations are not navigated in our example. We did not add the
urlCategory node to the attribute tree because it is a child of the "*" DTD graph node.
Figure 6-44 shows the resulting tree. Before moving on to the fact schema, we need to apply
some changes. We can apply the switching and grafting procedure mentioned in section 6.1.4
to the host, url, and site nodes. We can replace the time attribute with hour, whose
granularity is coarser. The resulting schema is lossy-grained.
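The translate procedure of Figure 6-43 can be turned into a short runnable sketch. The Python below is one possible reading of the pseudocode, not the authors' implementation, and it rests on several assumptions: the DTD graph is a dictionary mapping each node to its children; operator nodes are given hypothetical names that start with "?" or "*" (for example "?country"); the to_many and ask_to_one callables stand for the toMany and askToOne procedures; and loops are avoided by skipping nodes already on the current navigation path, which is a stricter control than the parent test in the pseudocode. Optional arcs are not recorded.

```python
def build_attribute_tree(graph, fact, document_element, to_many, ask_to_one):
    """Build an attribute tree rooted in `fact` from a DTD graph."""
    parents = {}
    for node, children in graph.items():
        for child in children:
            parents.setdefault(child, []).append(node)

    def is_operator(node):
        return node[0] in "?*"

    def new_node(name):
        return {"name": name, "children": []}

    def translate(e, tree_node, elem, path):
        # e: current graph node; elem: last element or attribute visited
        for w in graph.get(e, []):
            if w in path:
                continue
            if not is_operator(w):
                child = new_node(w)
                tree_node["children"].append(child)
                translate(w, child, w, path | {w})
            elif w.startswith("?"):
                translate(w, tree_node, elem, path | {w})
            # children of "*" operators are to-many and are left out
        for z in parents.get(e, []):
            if z in path or z == document_element:
                continue
            if is_operator(z):
                translate(z, tree_node, elem, path | {z})
            elif not to_many(elem, z) and ask_to_one(elem, z):
                child = new_node(z)
                tree_node["children"].append(child)
                translate(z, child, z, path | {z})

    root = new_node(fact)
    translate(fact, root, fact, frozenset({fact}))
    return root

# Simplified DTD graph of the web traffic example (operator names are made up).
dtd_graph = {
    "webTraffic": ["*click"],
    "*click": ["click"],
    "click": ["host", "date", "time", "url"],
    "host": ["hostId", "?category", "?country"],
    "?category": ["category"],
    "?country": ["country"],
    "url": ["urlId", "site", "fileType", "*urlCategory"],
    "*urlCategory": ["urlCategory"],
    "site": ["siteId", "country"],
}
# Treat every upward association as to-many, so uncertain ones are not navigated.
tree = build_attribute_tree(dtd_graph, "click", "webTraffic",
                            to_many=lambda e, z: True,
                            ask_to_one=lambda e, z: False)
```

With these inputs the tree rooted in click contains host (with hostId, category, and country), date, time, and url (with urlId, site, and fileType), while urlCategory is left out because it hangs from a "*" node.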

FIGURE 6-44 Attribute tree and fact schema for analyzing web traffic

Here are some general remarks on the approach we adopted.
The problem of checking XML documents for cardinality is connected to the problem of
determining functional dependencies in relational databases. This is dealt with at length in
the literature on relational theory and on data mining (Mannila and Raiha, 1994; Savnik and
Flach, 1993). In our case, the conditions are much simpler because no inference procedure is
necessary. This means that we simply have to use an XML query language supporting
aggregation to query our data properly. For example, W3C XQuery (W3C, 2002b) suggests
using the distinct function for this purpose. On the contrary, Deutsch et al. (1999)
recommend using the group-by function. The main issue concerns the number of XML
documents to be examined to reasonably confirm the hypothesis of to-one cardinality.
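The cardinality check itself amounts to counting distinct values. The sketch below illustrates it with Python's standard xml.etree.ElementTree module rather than with the XQuery functions mentioned above, so it is a substitute technique, not the approach of the cited works. The function name to_many, its parameters, and the sample documents are assumptions made for illustration; elements are identified by an attribute such as siteId or urlId.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def to_many(xml_text, parent_tag, parent_key, child_tag, child_key):
    """Return True if some child value is linked to several distinct parent values."""
    root = ET.fromstring(xml_text)
    parents_per_child = defaultdict(set)
    for parent in root.iter(parent_tag):
        # only direct sub-elements of the parent are considered
        for child in parent.findall(child_tag):
            parents_per_child[child.get(child_key)].add(parent.get(parent_key))
    return any(len(values) > 1 for values in parents_per_child.values())

# Two made-up documents: in the first one, site s1 is referenced by two URLs.
shared_site = (
    "<webTraffic>"
    "<click><url urlId='u1'><site siteId='s1'/></url></click>"
    "<click><url urlId='u2'><site siteId='s1'/></url></click>"
    "</webTraffic>"
)
distinct_sites = (
    "<webTraffic>"
    "<click><url urlId='u1'><site siteId='s1'/></url></click>"
    "<click><url urlId='u2'><site siteId='s2'/></url></click>"
    "</webTraffic>"
)
site_to_url_is_to_many = to_many(shared_site, "url", "urlId", "site", "siteId")
```

A True result rules out the association from site to url. A False result, as for the second document, only means that no counterexample was found in the documents examined, which is why the designer is still asked to confirm.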
Clearly, the semi-structured nature of the XML source data increases the level of
uncertainty of the data structure in comparison with the Entity-Relationship sources. This
requires designers' knowledge to be called upon more often. In our algorithm, we chose to
ask designers questions interactively in the tree-building phase to avoid unnecessary
document queries. Alternatively, we could create the tree first and specify uncertain
associations. Then we should give the entire tree to designers so that they can examine it
and, if necessary, delete those associations, together with their sub-trees. This solution
allows designers to have a broader vision of their trees, but it is also a less efficient solution
because a node deleted by the designers at this stage could have been expanded pointlessly
in the previous XML document querying phase.
As a matter of fact, you may also need to infer cardinality of associations when your
design source is a relational schema. If a relation called R includes a foreign key C
referencing the primary key K of a relation S, this implies that C functionally
determines K, and then all the other attributes of S. But it does not provide any
information about the number of distinct tuples in R linked to each tuple in S. In
principle, it would be necessary to query a database to evaluate any uncertain
cardinality, as in the case of an XML source. However, this issue for relational databases
is somewhat less relevant than in the XML case. While XML document designers freely
choose the direction in which they want to represent each link, the need to retain the first
normal form forces relational schema designers to represent each association in a to-one
direction. For this reason, the association from S to R is generally one-to-many and is not
relevant to the purposes of multidimensional modeling. The only relevant case, managed
by the algorithm mentioned in section 6.2.2, is when a designer used the foreign key C to
model a one-to-one association.
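A query of this kind is easy to write. The following sketch, which assumes an SQLite database accessed through Python's sqlite3 module, checks whether any foreign key value occurs in more than one tuple of the referencing relation. The function name foreign_key_is_to_one and the sample rows are hypothetical; the TICKETS and CHECK_IN tables loosely follow the flights example, with simplified columns.

```python
import sqlite3

def foreign_key_is_to_one(con, relation, foreign_key):
    """Return True if no foreign key value occurs in more than one tuple."""
    query = (
        'SELECT COUNT(*) FROM ('
        f'SELECT "{foreign_key}" FROM "{relation}" '
        f'WHERE "{foreign_key}" IS NOT NULL '
        f'GROUP BY "{foreign_key}" HAVING COUNT(*) > 1)'
    )
    (violations,) = con.execute(query).fetchone()
    return violations == 0

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE TICKETS "
            "(ticketNumber INTEGER PRIMARY KEY, flightNumber TEXT)")
con.execute("CREATE TABLE CHECK_IN "
            "(ticketNumber INTEGER REFERENCES TICKETS, numberOfBags INTEGER)")
# Made-up data: two tickets on flight AZ100, one on BA200; two tickets checked in.
con.executemany("INSERT INTO TICKETS VALUES (?, ?)",
                [(1, "AZ100"), (2, "AZ100"), (3, "BA200")])
con.executemany("INSERT INTO CHECK_IN VALUES (?, ?)", [(1, 2), (3, 1)])

check_in_is_to_one = foreign_key_is_to_one(con, "CHECK_IN", "ticketNumber")
flight_is_to_one = foreign_key_is_to_one(con, "TICKETS", "flightNumber")
```

Here check_in_is_to_one is True, because each ticket is checked in at most once, while flight_is_to_one is False, because flight AZ100 has two tickets. As with XML sources, a True result only shows that the current data contain no counterexample.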

FIGURE 6-45 Another possible DTD graph for web traffic

FIGURE 6-46 Attribute tree for the DTD graph of Figure 6-45

You can create many DTDs to represent an individual subject, and the algorithm can
build a different attribute tree for each of those DTDs. For example, if your DTD graph
was the one shown in Figure 6-45, Figure 6-46 shows how your attribute tree would look.
If click is a fact, you have to analyze your data to check for any uncertain associations
to navigate from hostId to host and from urlId to url. Figure 6-44 shows what the
resulting attribute tree would look like after replacing hostId with host and urlId
with url; this is allowed because the elements in each pair are linked by one-to-one
associations.

6.4 Mixed-approach Design


Our methodological framework for mixed-approach design uses the Tropos formalism to
analyze requirements (section 4.3). To discuss conceptual design methods, we will assume
that designers have already prepared the necessary diagrams and, in particular, an
extended rationale diagram for organization and one for decision-making processes for
each of the actors involved. On the whole, organizational diagrams give a broad picture of
source operational data, and decision-making diagrams show preliminary workload.
Moreover, source data analyses and integration phases have resulted in an operational
schema for the reconciled database, in either conceptual or logical form.
In the conceptual design phase, you can pair the requirements derived from
organizational and decision-making modeling with your source operational schema to
generate a conceptual schema for your data mart. You can break down this procedure into
three phases:
1. Requirement mapping phase. The facts, dimensions, and measures found in the
   decision-making modeling phase are associated with entities in the operational schema.
2. Fact schema building phase. After navigating the operational schema, you can create
   a draft conceptual schema.
3. Refinement phase. Draft conceptual schemata are fine-tuned to better meet users'
   expectations.

6.4.1 Mapping Requirements
The goal of the requirement mapping phase is to establish relationships between the facts,
dimensions, and measures found in the decision-making modeling phase and the relations
and attributes in operational schemata. This process is described in detail here:
Decision-making modeling facts are associated with entities or n-ary relationships (in case
of Entity-Relationship schemata) or relations (in case of relational schemata) in source
schemata. If you look at the banking example shown in section 4.3, the transaction
fact is likely to correspond to a table called TRANSACTIONS in the source database.
As far as dimensions and measures are concerned, you can reach your goal if you use the
attributes identified in the organizational modeling phase as a bridge. You can virtually set a
double mapping between organizational modeling attributes and both the attributes in your
operational schema and the dimensions and measures in your decision-making model. Look
at the banking example. The withdrawal card code attribute, which is associated with
the enter withdrawal card code goal of Figure 4-4, corresponds to the card code
dimension, which is associated with the analyze withdrawal amount and analyze
withdrawal number analysis goals of Figure 4-6. The same withdrawal card code
attribute may correspond to a numCard attribute in the WITHDRAWALS operational schema
table. Similarly, the withdrawal amount attribute of the enter withdrawal amount
goal corresponds to the total amount measure of the analyze withdrawal amount
analysis goal and to the amount attribute in the WITHDRAWALS table.
Note that you may partially automate this phase if the names used for operational
schema and rationale diagrams are properly consistent.
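One way to exploit consistent naming is to propose candidate mappings by comparing normalized names. The sketch below is an assumption about how such support could look, not a technique described in this chapter: the function name suggest_mappings, the similarity cutoff, and the sample attribute names are all hypothetical, and the comparison relies on get_close_matches from Python's standard difflib module.

```python
import difflib

def _normalize(name):
    """Lowercase a name and drop spaces and punctuation."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def suggest_mappings(diagram_attributes, schema_attributes, cutoff=0.6):
    """Suggest, for each rationale diagram attribute, the closest schema attribute."""
    normalized = {_normalize(a): a for a in schema_attributes}
    suggestions = {}
    for attribute in diagram_attributes:
        matches = difflib.get_close_matches(
            _normalize(attribute), list(normalized), n=1, cutoff=cutoff)
        suggestions[attribute] = normalized[matches[0]] if matches else None
    return suggestions

# Hypothetical names: the first pair differs only in spacing and capitalization.
suggested = suggest_mappings(["withdrawal amount", "zzz"],
                             ["withdrawalAmount", "numCard"])
```

A name-based suggestion of this kind cannot find a pair such as withdrawal card code and numCard, whose names share too little. Those correspondences still have to be set by the designer, which is why the automation is only partial.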

6.4.2 Building Fact Schemata

This phase implements the data-driven part of a mixed approach. You should navigate the
many-to-one associations expressed by your operational schema for every fact F identified
in your decision-making model and successfully mapped onto your operational schema.
This aims at building hierarchies and a draft fact schema for F.
You can use the algorithms mentioned previously to carry out this navigation automatically
(sections 6.1.2 and 6.2.2). The only difference is that navigation is "blind" in data-driven
design; that is, all the source schema attributes linked to your fact by a many-to-one
association are included in your hierarchies. On the contrary, user requirements actively
guide navigation in the mixed approach. In more detail,
1. Each dimension d that was successfully mapped from an extended rationale
   diagram onto your operational schema is included in your fact schema. The
   navigation algorithm creates the whole hierarchy rooted in d.
2. Each measure m that was successfully mapped from an extended rationale diagram
   onto your operational schema is included in your fact schema. No hierarchy is
   created in this case.
3. Every time you find an organizational model attribute that is not included in your
   decision-making model, you have to decide if its main role is as a dimensional attribute
   or measure. You can add dimensional attributes to your fact schema and label them
   with "offers." The navigation algorithm specifies their positions in your hierarchies.
   Similarly, you can add measures to your fact schema and label them with "offers."
4. You can pick the dimensions and measures in your decision-making model rationale
   diagrams for which you have found no operational schema correspondence, still
   include them in your fact schema, and label them with "requests."
5. Fact schemata do not include those operational schema attributes that rationale
   diagrams cannot map and that you cannot reach when navigating.
As far as points 1 and 3 are concerned, note that sometimes you cannot reach a
dimensional attribute to insert into your hierarchy from your fact if you exclusively
navigate many-to-one associations. Then you may need to navigate many-to-many
associations. This gives you the opportunity to add multiple arcs and cross-dimensional
attributes automatically to your fact schema. On the contrary, you are supposed to carry
out this operation manually in the data-driven approach.
Note that the names used for measures in a decision-making diagram may sometimes
provide designers with valuable information for assessing which aggregation operators to
use. For example, look at Figure 4-6: you can immediately realize that the decision-maker
wishes to aggregate the amount measure both by SUM and by AVG.
Figure 6-47 shows the preliminary fact schema obtained for the banking example.
The withdrawalFee measure is labeled with "request" because it is displayed as a
measure paired with the analyze external withdrawals goal, but it is not in the
organizational rationale diagram. The description dimension is labeled with "offer"
because it is displayed as an attribute in the rationale diagram for organization, but
decision-makers did not classify it as an analysis dimension in their rationale diagram
for decision-making processes.
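The labeling rules can be summarized as simple set tests. The sketch below is a hedged illustration: the function name label_fact_schema and the three input sets are assumptions, and the banking items are reduced to the few attributes named in the text.

```python
def label_fact_schema(decision_items, organizational_items, mapped_to_schema):
    """Label dimensions and measures as None (no label), "request", or "offer"."""
    labels = {}
    for item in decision_items:
        # required by decision-makers: unlabeled if available, else a request
        labels[item] = None if item in mapped_to_schema else "request"
    for item in organizational_items:
        # available but not asked for by decision-makers: an offer
        if item not in decision_items and item in mapped_to_schema:
            labels[item] = "offer"
    return labels

# Reduced banking example; the membership of each set is assumed for illustration.
banking_labels = label_fact_schema(
    decision_items={"amount", "cardCode", "withdrawalFee"},
    organizational_items={"amount", "cardCode", "description"},
    mapped_to_schema={"amount", "cardCode", "description"},
)
```

The result leaves amount and cardCode unlabeled, marks withdrawalFee as a request, and marks description as an offer, in line with the banking example.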


FIGURE 6-47 Preliminary fact schema for TRANSACTION

FIGURE 6-48 Fact schema for TRANSACTION after fine-tuning

After a comparison with the data-driven approach, we can conclude that, in the mixed
approach, (a) initial fact schemata may be considerably smaller and simpler; (b) diagrams for
requirement analyses directly support the classification of facts, dimensions, and measures
so that designers do not need to take any action; (c) modeling particular concepts, such as
multiple arcs, cross-dimensional attributes, and additivity, becomes easier.

6.4.3 Refining
The aim of this final phase is to rearrange fact schemata to make them more suitable for users'
needs. The main operations that you can carry out are those mentioned in section 6.1.3:
pruning and grafting attributes, and adding and deleting functional dependencies.
Because dimensions and measures have been labeled in the previous phase, designers
can now distinguish in fact schemata (i) all the necessary and available information
(non-labeled dimensional attributes and measures); (ii) all the necessary information that is not
currently available in operational schemata (dimensional attributes and measures labeled
with "requests"); and (iii) the available information that is not clearly relevant for analyses
(dimensional attributes and measures labeled with "offers"). The second category can make
designers evaluate the option of adding on to operational schemata or using additional
data sources. The third category may encourage decision-makers to try out different
analysis directions.
Figure 6-48 shows a final fact schema for the TRANSACTION fact in the banking
example mentioned in section 4.3. This schema assumes that (i) users have no interest in
customer granularity; (ii) the data-staging phase calculates the withdrawalFee measure;
and (iii) the description dimension is not considered as relevant to the analysis, but the
destinationC/A dimension is relevant.

6.5 Requirement-driven Approach Design


In the mixed and data-driven approaches, operational schemata are the determining
support factor in conceptual design, as we have previously mentioned. The functional
dependencies expressed in those schemata are particularly useful to discover hierarchies
quickly, in their early-stage forms. On the contrary, all the weight of hierarchy building falls
squarely on designers' shoulders in the requirement-driven approach.
Consistently with the requirement-driven scenario outlined in section 2.4.2, we will
assume that you have already drawn up the extended rationale diagrams required by the
Tropos approach. The extended rationale diagrams you created during the decision-making
modeling phase are those mainly used in the requirement-driven approach.
In both previous approaches, we can find a very precise set of design steps, some of
which can be automated. However, the design process phases may show blurred outlines
in the requirement-driven approach. The main guarantee of a successful outcome depends
on designers' skills, experience, and ability to establish fruitful relationships with users and
experts involved in the application domain. The starting point for requirement-driven
conceptual design is a set of preliminary fact schemata obtained by associating each fact
found in the decision-making rationale diagrams with its measures and dimensions.
In the banking example of Figure 4-6, you can immediately design the preliminary schema
of Figure 6-49. The main points you should take care of in close collaboration with users are
listed next:

1. Identify any functional dependencies between the dimensions and code them in
   hierarchical form (for example, date → month → year).
2. Mark any optional dimensions (for example, cardCode, which takes a value only in
   some types of transaction).
3. Merge those measures that differ only in the aggregation operator used (for example,
   averageAmount and totalAmount).
4. If any dimensions or measures are related to specific primary event subsets, merge
   them or fragment your fact (for example, you can merge withdrawalNumber and
   transactionNumber into transactionNumber because all the withdrawals are
   a particular type of transaction).
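Point 3 can be illustrated with a small sketch. It assumes that each preliminary measure has already been split by the designer into a base measure and an aggregation operator, for example averageAmount into amount and AVG. The function name merge_measures and this input format are hypothetical.

```python
def merge_measures(measures):
    """Group (base measure, aggregation operator) pairs by base measure."""
    merged = {}
    for base, operator in measures:
        operators = merged.setdefault(base, [])
        if operator not in operators:
            operators.append(operator)
    return merged

# averageAmount and totalAmount differ only in the aggregation operator.
preliminary = [("amount", "AVG"), ("amount", "SUM"), ("withdrawalFee", "SUM")]
merged_measures = merge_measures(preliminary)
```

The two amount measures collapse into a single amount measure that keeps both operators, while withdrawalFee is left unchanged.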

Figure 6-50 shows the resulting fact schema after applying the previous criteria.
Now you can assume that your dimensions and measures are properly defined. You still
have to extend and complete hierarchies. To do this, you must first decide which additional
attributes are relevant to your analysis for aggregating and/or event selection purposes.

FIGURE 6-49 Preliminary fact schema for TRANSACTION in the requirement-driven approach
[Legible elements: dimensions date, month, year, valueDate, cardCode; measures averageAmount,
totalAmount, totalTransactionNumber, withdrawalNumber, withdrawalFee]

FIGURE 6-50 Fact schema for TRANSACTION after rearranging dimensions and measures
[Legible elements: dimensions date (with month and year), valueDate, transactionType, cardCode;
measures amount (SUM, AVG), transactionNumber, withdrawalFee]

Then you can interview users in order to understand functional dependencies properly. We
will make no secret of the fact that this is the most challenging stage in this approach, because
users often have only a rough idea of the actual dependencies that link attributes. The
results you achieve depend on your ability to ask users the right questions.
The most complicated condition arises when you have to design a conceptual schema without
any Tropos diagrams on which to base your work. In this case, you have to map
the requirements expressed by users directly onto a fact schema. Section 4.2 showed that
you must eventually summarize those requirements in glossaries. Although these
glossaries may help you better explain dimensions and measures, they provide no
information on hierarchy compositions and structures. Regarding hierarchy compositions,
an in-depth analysis of the reports normally used by the company is required, so that you
can compile a list of the main dimensional and descriptive attributes to include. Regarding
hierarchy structures, the information exchange between designers and application domain
experts is fundamental. If there are any doubts, you can find the necessary answers only by
carefully checking the cardinality constraints on data. As a general rule, we recommend
that you reuse as much as possible those hierarchies, or parts of hierarchies, that are frequently
used in a particular application domain, in order to reduce the complexity of design and maximize
fact schema conformity. If you arrange them suitably in libraries, they become an invaluable
resource for designers because they, or their semi-processed forms, can be
adjusted to users' real-world needs.
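
As an illustration of checking cardinality constraints on data, the sketch below tests whether a candidate functional dependency holds in a data sample, that is, whether each value of the finer attribute is associated with exactly one value of the coarser one. The function names and the sample rows are hypothetical and are not taken from the banking example; a dependency that holds in a sample may still fail on the full data, so a positive result should be confirmed with domain experts.

```python
from collections import defaultdict


def fd_violations(rows, child, parent):
    """Return the child values that map to more than one parent value."""
    parents_by_child = defaultdict(set)
    for row in rows:
        parents_by_child[row[child]].add(row[parent])
    return {
        value: sorted(parents)
        for value, parents in parents_by_child.items()
        if len(parents) > 1
    }


def holds_fd(rows, child, parent):
    """True if child -> parent is a many-to-one association in the sample."""
    return not fd_violations(rows, child, parent)


# Hypothetical sample rows (illustration only).
rows = [
    {"date": "2024-01-15", "month": "2024-01", "monthName": "January", "year": 2024},
    {"date": "2024-01-20", "month": "2024-01", "monthName": "January", "year": 2024},
    {"date": "2025-01-03", "month": "2025-01", "monthName": "January", "year": 2025},
]

print(holds_fd(rows, "date", "month"))          # date -> month
print(holds_fd(rows, "month", "year"))          # month -> year
print(fd_violations(rows, "monthName", "year"))  # monthName does not determine year
```

In this sample, date → month → year holds, whereas the month name alone does not determine the year, so it could not be placed below year in the hierarchy.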
To conclude this chapter, we must not forget that one of the problems with the
requirement-driven approach lies in finding mappings between fact schema attributes
and source data. Those mappings are necessary to implement data staging procedures. For
this reason, right from the start of the conceptual design phase it is vital that you make sure that
fact schemata agree with source schemata, and that fact schemata fully exploit the analysis
potential of sources (Mazon et al., 2007a).
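
A simple way to keep track of these mappings during design is sketched below. It is a minimal example under stated assumptions, not the procedure proposed in the cited work: the check_mappings function and the source column names (TX.tx_date and so on) are hypothetical. The check reports fact schema attributes that have no source mapping, mappings that point to nonexistent source columns, and source columns that no attribute uses, which may indicate unexploited analysis potential.

```python
def check_mappings(fact_attributes, source_columns, mappings):
    """Compare a fact schema with its sources through attribute mappings.

    mappings: fact schema attribute -> source column (hypothetical format).
    """
    # Attributes that data staging could not populate.
    unmapped = sorted(a for a in fact_attributes if a not in mappings)
    # Mappings that refer to columns missing from the source schema.
    dangling = sorted(a for a, col in mappings.items() if col not in source_columns)
    # Source columns not used by any attribute.
    unused = sorted(set(source_columns) - set(mappings.values()))
    return unmapped, dangling, unused


fact_attributes = {"date", "valueDate", "cardCode", "transactionType",
                   "amount", "withdrawalFee"}

# Hypothetical source schema columns (illustration only).
source_columns = {"TX.tx_date", "TX.value_date", "TX.card_code",
                  "TX.tx_type", "TX.amount", "TX.channel"}

mappings = {
    "date": "TX.tx_date",
    "valueDate": "TX.value_date",
    "cardCode": "TX.card_code",
    "transactionType": "TX.tx_type",
    "amount": "TX.amount",
}

unmapped, dangling, unused = check_mappings(fact_attributes, source_columns, mappings)
print("No source mapping:", unmapped)
print("Mapped to missing columns:", dangling)
print("Unused source columns:", unused)
```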
