Bab 5 Golfarelli
Conceptual Modeling
100 Data Warehouse Design: Modern Principles and Methodologies
For all of these reasons, current literature has proposed different original approaches to
multidimensional modeling, some of which are based on ERM extensions, others on Unified
Modeling Language (UML) extensions. Table 5-1 lists the main models proposed so far and
shows whether each one is defined as a conceptual or logical model and whether it is
associated with a design methodology. It also specifies whether the conceptual models are
based on the ERM or UML, or if they are ad hoc models. The remainder of this paragraph
briefly outlines the conceptual models considered the most representative and aims at
highlighting their similarities in expressive power.
Figure 5-1 shows a class diagram for analysis of purchase orders (Lujan-Mora et al., 2006).
It uses an object-based formalism, more precisely a UML extension. Purchase orders play the
role of facts. A UML class, whose attributes are the fact measures, models purchase orders.
The dimensions are the supplier, the date, and the item ordered. Both the dimensions and
the different levels of aggregation describing them are represented as classes. Aggregations
(represented in UML by small diamonds) connect the facts and the dimensions, and many-to-one
associations link the different levels of aggregation.
On the contrary, the model used in Figure 5-2 is an extension of the ERM (Franconi, 1999).
The schema in the example analyzes the calls made through a telephone company. Here, the
fact is called target and the measures properties. You can represent relevant aggregations
(aggregate entities) and use the specialization hierarchies of the ERM to list the values of one
level of aggregation.
Chapter 5: Conceptual Modeling 101
FIGURE 5-1 A UML class diagram for purchase order analysis (Lujan-Mora, 2006)
Finally, Figure 5-3 shows a fact schema for the analysis of checking accounts (Hüsemann
et al., 2000). In addition to facts, dimensions, and measures, attributes that cannot be used
for aggregation (property attributes), optional attributes, and alternative aggregation paths are
represented. Specifically, the extensional meaning of the alternative aggregation path in
Figure 5-3 is that either a value of profession or a value of branch is linked with a value
of customerId.
The remainder of the chapter describes in depth the Dimensional Fact Model proposed
by Golfarelli, Maio, and Rizzi in 1998 and constantly enriched and refined during the
following ten years to optimally fit the variety of modeling situations that may be faced in
real projects. We will present basic concepts and more advanced constructs in sections 5.1
and 5.2, respectively. Section 5.3 deals with extensional properties and defines aggregation
semantics. Section 5.4 tackles specific features tied to the representation of time. Section 5.5
shows how to overlap fact schemata. Section 5.6 proposes a formalization of the intensional
and extensional properties of the model.
FIGURE 5-2 An ERM-based conceptual schema for phone call analysis (Franconi and Sattler, 1999)
FIGURE 5-3 A fact schema for the analysis of checking accounts (Hüsemann et al., 2000)
These characteristics make the DFM an optimal candidate for use in real application
contexts. In particular, we prefer this model to others presented in the preceding paragraph
because we consider its graphic formalism particularly simple, expressive, and appropriate for
communication between designers and end users. Moreover, the availability of a semiautomatic
technique to obtain conceptual schemata from operational source schemata facilitates the
designer's work, especially when conceptual design is implemented within a design tool.
The conceptual representation generated by the DFM consists of a set of fact schemata.
Fact schemata basically model facts, measures, dimensions, and hierarchies.
Fact
A fact is a concept relevant to decision-making processes. It typically models a set of
events taking place within a company.
Examples of facts in the commercial domain are sales, shipments, purchases, and
complaints. In the healthcare industry, some interesting examples are admissions,
discharges, transfers, surgeries, and access to emergency services. In the financial industries,
stock exchange transactions, checking account and credit card balances, contract creation,
loan disbursement, and the like are considered facts, as are flights, car rentals, and nights'
stay in the tourism industry.
It is essential that a fact have dynamic properties or evolve in some way over time.
Indeed, few of the concepts represented in a database are completely static. Even the
association between cities and states can change if state boundaries are modified. For this
reason, the distinction between facts and other concepts must be based on the average
frequency of change or on the specific interests of users. For example, the assignment of a
new sales manager to a department occurs less frequently than the promotion of a product.
While the association between promotions and products is a good candidate to be modeled
as a fact, the association between sales managers and departments generally is not, unless
users are interested in monitoring the transfers of sales managers to find out the correlations
between department managers and how much that department sells. See section 5.4.3 for
a more in-depth discussion on the dynamic properties of fact schemata.
Measure
A measure is a numerical property of a fact and describes a quantitative fact aspect that
is relevant to analysis.
For example, each sale is measured by the number of units sold, the unit price, and the
total receipts. The reason why measures must preferably be numeric is that they are
generally used to make calculations. A fact can also have no measures, as in the case when
you might be interested in recording only the occurrence of an event. In this case, the fact
schema is said to be empty. Section 5.3.5 discusses some specific features of empty schemata.
Dimension
A dimension is a fact property with a finite domain and describes an analysis coordinate
of the fact.
A fact generally has several dimensions that define its minimum representation
granularity. Typical dimensions for the sales fact are products, stores, and dates. In this
case, the basic information that can be represented is product sales in one store in one day.
At this level of granularity, it is not possible to distinguish between sales made by different
employees or at different times of day. Because facts are generally dynamic, a fact schema
will almost certainly have at least one temporal dimension whose granularity can vary
from the minute to the month (more probably, the day or week).
The connection between measures and dimensions is expressed at the extensional level
(that is, at a data level rather than at a schema level) by the event concept we informally
define here, while referring you to its formal properties in section 5.6.3.
Primary Event
A primary event is a particular occurrence of a fact, identified by one n-ple made up
of a value for each dimension. A value for each measure is associated with each
primary event.
In reference to sales, for example, a possible primary event records that 10 packages of
Shiny detergent were sold for total sales of $25 on 10/10/2008 in the SmartMart store. As
this example shows, dimensions are normally used to identify and select primary events.
On the basis of the concepts introduced so far, you can design a simple fact schema for
sales in this chain of stores. Figure 5-4 shows that a fact is represented by a box that displays
the fact name along with the measure names. Small circles represent the dimensions, which
are linked to the fact by straight lines.
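The concepts introduced so far can be sketched in a few lines of Python. This is only an illustration, not part of the DFM: the FactSchema class and its add_event method are hypothetical names, and only two of the measures of the sales fact are used to keep the example short. Each primary event is stored under the n-ple of its dimension values, with one value per measure.

```python
from dataclasses import dataclass, field


@dataclass
class FactSchema:
    """A fact with its dimensions and measures (hypothetical helper class)."""
    name: str
    dimensions: tuple
    measures: tuple
    # Primary events: one n-ple of dimension values -> one value per measure.
    events: dict = field(default_factory=dict)

    def add_event(self, coordinates: dict, values: dict) -> None:
        # The n-ple must supply exactly one value for each dimension.
        if set(coordinates) != set(self.dimensions):
            raise ValueError("one value per dimension is required")
        # A value for each measure is associated with the primary event.
        if set(values) != set(self.measures):
            raise ValueError("one value per measure is required")
        key = tuple(coordinates[d] for d in self.dimensions)
        self.events[key] = dict(values)


sale = FactSchema(
    name="SALE",
    dimensions=("product", "store", "date"),
    measures=("quantity", "receipts"),
)

# The primary event of the running example: 10 packages of Shiny sold for $25.
sale.add_event(
    {"product": "Shiny", "store": "SmartMart", "date": "10/10/2008"},
    {"quantity": 10, "receipts": 25},
)
```

Because the dimension values form the dictionary key, selecting a primary event amounts to supplying one value per dimension, which mirrors how dimensions identify events.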
A fact expresses a many-to-many association between dimensions. For this reason, the
Entity-Relationship schema corresponding to a fact schema consists mainly of an n-ary
relationship, which models the fact, among entities that model dimensions. The measures
are attributes of this relationship. Figure 5-5 shows the Entity-Relationship schema
corresponding to the fact schema of Figure 5-4. Clearly, though the ERM is expressive
enough to show facts, dimensions, and measures, it does not represent these concepts as
first-class citizens.
FIGURE 5-4 A simple fact schema for sales
FIGURE 5-5 The Entity-Relationship schema corresponding to the fact schema of Figure 5-4
Note that some multidimensional models in the literature are focused on the
symmetrical treatment of dimensions and measures (Agrawal et al., 1995; Gyssens and
Lakshmanan, 1997). This is an important result from the viewpoint of uniformity in the
logical model and flexibility of online analytical processing (OLAP) operators. Despite that,
we believe that you should distinguish between measures and dimensions at the conceptual
level, because this enables logical design to be aimed more at reaching the efficiency
required by data warehouse applications.
Before defining what a hierarchy means, we should introduce the concept of
dimensional attribute.
Dimensional Attribute
The general term dimensional attributes stands for the dimensions and other possible
attributes, always with discrete values, that describe them.
For example, a product is described by its type, by the category to which it belongs, by its
brand, and by the department in which it is sold. Then product, type, category, brand,
and department will be dimensional attributes. The relationships among the dimensional
attributes are expressed by hierarchies.
Hierarchy
A hierarchy is a directed tree¹ whose nodes are dimensional attributes and whose arcs
model many-to-one associations between dimensional attribute pairs. It includes a
dimension, positioned at the tree's root, and all of the dimensional attributes that
describe it.
¹ Graph theory reminds us that a tree is an acyclic connected graph (Berge, 1985). A directed tree is a tree with a
root, or a node called r, from which you can reach all the other nodes via directed paths. Within a directed tree,
only one directed path connects the root to each of the other nodes. Given a node called v into which an arc
called a enters and from which b, c, d, ... arcs exit, we will call the node from which a exits the parent of v and the
nodes into which b, c, d, ... enter the children of v. In addition to its parent, the predecessors of v are the parents of
its parent and so on. In addition to its children, the descendants of v are the children of its children and so on.
Do not confuse the term hierarchy used in this context with the identical term used in
Entity-Relationship modeling, where it refers to specialization links between entities (IS-A
hierarchies). In the multidimensional modeling context, hierarchy refers instead to
associative links of different kinds in a way that is not dissimilar to aggregation hierarchies
in object-oriented models. For example, in the product dimension hierarchy, you will have
an arc from product to type to express the type of each product, an arc from product to
brand to express its brand, an arc from type to category to express the fact that all the
drink type products belong to the food category, an arc from category to department to
express the fact that all of the food category products are sold in the food department, and
so on. In relational terminology, each arc in a hierarchy models a functional dependency
between two attributes:
product → type, product → brand,
type → category, category → department
Because the transitive property applies to functional dependencies, each directed path
inside a hierarchy represents in turn a functional dependency between the start and end
attributes. For example, product → type and type → category imply product → category.
Figure 5-6 shows how you may add hierarchies built on dimensions to enhance the fact
schema of Figure 5-4. Dimensional attributes are represented by circles and are connected
by lines that mark the hierarchy arcs and express functional dependencies. For example, the
city where a store is located defines the state to which that store belongs. Hierarchies are
structured like trees with their roots in dimensions. For this reason, you should not
explicitly show arc directions, as each one of them is implicitly oriented in a direction
moving away from the root.
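The transitivity argument can be checked mechanically. The following sketch stores the arcs of the product hierarchy as a dictionary and derives every functional dependency implied by directed paths. The names ARCS, determined_by, and implies are hypothetical and chosen only for this illustration.

```python
# Hierarchy arcs of the product dimension, read as functional dependencies.
ARCS = {
    "product": ["type", "brand"],
    "type": ["category"],
    "category": ["department"],
}


def determined_by(attribute, arcs=ARCS):
    """Return every attribute reachable from `attribute` along directed paths."""
    reached, stack = set(), [attribute]
    while stack:
        for child in arcs.get(stack.pop(), []):
            if child not in reached:
                reached.add(child)
                stack.append(child)
    return reached


def implies(a, b, arcs=ARCS):
    """True when a -> b follows from the arcs by transitivity."""
    return b in determined_by(a, arcs)
```

For instance, implies("product", "category") holds because of the path through type, whereas implies("category", "product") does not, since arcs are oriented away from the root.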
Figure 5-6 shows a typical temporal hierarchy that ranges from date to year. A fact
can include more than one temporal hierarchy modeling different dynamic properties.
For example, a shipments fact schema may include a hierarchy built on the shipping date
and one built on the order date. Other frequently used hierarchies are geographical hierarchies,
hierarchies related to company organization charts, and part-component hierarchies. Figure 5-6
shows an example of geographical hierarchy as one built on the store dimension.
[Figure 5-6: the sales fact schema of Figure 5-4 enriched with hierarchies]
NOTE All of the attributes and measures within a fact schema must have different names. You can
differentiate similar names if you qualify them with the name of the dimensional attribute that
comes before them in hierarchies (for example, storeCity and brandCity).
The convention proposed by Kimball et al. (1998) provides for each name to be built
from three components: an object (client, product, city, ...), a classification (average, total, date, ...), ...
Secondary Events
Given a set of dimensional attributes (generally belonging to separate hierarchies), each
n-ple of their values identifies a secondary event that aggregates all of the corresponding
primary events. Each secondary event is associated with a value for each measure that
sums up all the values of the same measure in the corresponding primary events.
For example, sales can be grouped according to the category of products sold, to the month
when sales were made, to the city in which stores are located, or according to any combination
of those. Let's choose storeCity, product, and month as dimensional attributes for our
aggregation. The n-ple (storeCity: 'Miami', product: 'Shiny', month: '10/2008') identifies
a secondary event that aggregates all of the Shiny product sales in October 2008 in Miami
stores. In other words, it aggregates all of the primary events corresponding to the n-ples
where the product value is Shiny, the value of store is any store in Miami, and the value of
date ranges from 10/01/2008 to 10/31/2008. The value of the receipts measure in this
secondary event will be expressed as total receipts related to the sales it aggregates. See
section 5.3 for an in-depth discussion on complex problems connected to aggregations.
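A small sketch may help to see how primary events collapse into secondary events. Everything below is illustrative: the stores other than SmartMart, the receipts values, and the helper names (STORE_CITY, month_of, secondary_events) are assumptions made for the example, and dates are assumed to be written as MM/DD/YYYY.

```python
from collections import defaultdict

# Hypothetical lookup for the hierarchy arc store -> storeCity.
STORE_CITY = {"SmartMart": "Miami", "EverMore": "Miami", "Profit": "Orlando"}


def month_of(date):
    """Map a date written MM/DD/YYYY to its month MM/YYYY (arc date -> month)."""
    mm, _dd, yyyy = date.split("/")
    return f"{mm}/{yyyy}"


# Primary events: (product, store, date) -> receipts.
PRIMARY = {
    ("Shiny", "SmartMart", "10/10/2008"): 25,
    ("Shiny", "EverMore", "10/21/2008"): 40,
    ("Shiny", "Profit", "10/10/2008"): 15,
    ("Shiny", "SmartMart", "11/02/2008"): 30,
}


def secondary_events(primary):
    """Sum receipts over the dimensional attributes (storeCity, product, month)."""
    totals = defaultdict(int)
    for (product, store, date), receipts in primary.items():
        totals[(STORE_CITY[store], product, month_of(date))] += receipts
    return dict(totals)


SECONDARY = secondary_events(PRIMARY)
```

Here the n-ple ("Miami", "Shiny", "10/2008") aggregates the two October events recorded in Miami stores, while the Orlando event and the November event end up in secondary events of their own.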
Now we can make a quick comparison with the Entity-Relationship schema corresponding
to the fact schema in Figure 5-6, which is shown in Figure 5-7. Note that each dimensional
attribute (dimensions included) corresponds to an entity, which has that attribute as an
identifier. Also note that a many-to-one relationship represents each arc in the hierarchies.
From this viewpoint, the hierarchy notation adopted in fact schemata can be interpreted as
a simplification of the Entity-Relationship notation, where the representation of relationships
is simplified (their names and their multiplicity, which is always many-to-one, are not
shown) and where just an identifier is shown for each entity.
[Figure 5-7: the Entity-Relationship schema corresponding to the fact schema of Figure 5-6]
very useful to best express the multitude of conceptual nuances that characterize actual
scenarios. In particular, we will show that the introduction of some constructs generally
makes hierarchies no longer mere trees, but graphs.
[Figure 5-8: the sales fact schema, with callouts including cross-dimensional attribute, non-additivity, and descriptive attribute]
FIGURE 5-9 The Entity-Relationship schema corresponding to the fact schema of Figure 5-8
A descriptive attribute can also be directly connected to a fact if it describes primary events,
but it is neither possible nor interesting to use it to identify single events or even to make
calculations (otherwise, it would be a dimension or a measure, respectively). For example,
consider Figure 5-10, which shows the fact schema for shipments. The order, shipping, and
receipt dates must be represented. While it is useful to define the order and shipping dates
as dimensions and build two different temporal hierarchies on them, you should probably
not represent the receipt date as a dimension as well. However, representing a date as a
measure causes problems because the only applicable aggregation operators would be MAX
and MIN. For this reason, you can accurately represent the receipt date as a descriptive
attribute of the SHIPMENT fact. Keep in mind that in order to correctly link a descriptive
attribute directly to the fact, you will have to give it a single value for each primary event.
In the case of shipments, the representation adopted for receiptDate would not be
correct if separate shipments received by a customer on different dates could be made for
the same order and the same product.
[Figure 5-10: the fact schema for shipments]
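The single-value requirement can be verified on sample data. The sketch below is a hedged illustration: the rows, the choice of order, product, and shippingDate as the event identifier, and the function name single_valued are all assumptions, not taken from the shipments schema itself.

```python
def single_valued(rows, key_fields, attribute):
    """Check that `attribute` takes one value for each combination of key_fields."""
    seen = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        # setdefault stores the first value met for this key and returns it afterwards.
        if seen.setdefault(key, row[attribute]) != row[attribute]:
            return False
    return True


KEY = ("order", "product", "shippingDate")  # hypothetical event identifier

ok_rows = [
    {"order": 1, "product": "Shiny", "shippingDate": "10/10/2008", "receiptDate": "10/14/2008"},
    {"order": 2, "product": "Shiny", "shippingDate": "10/10/2008", "receiptDate": "10/15/2008"},
]

# Same order, product, and shipping date, but received on two different dates.
bad_rows = ok_rows + [
    {"order": 1, "product": "Shiny", "shippingDate": "10/10/2008", "receiptDate": "10/16/2008"},
]
```

With ok_rows the check succeeds, so receiptDate can be linked to the fact as a descriptive attribute. With bad_rows it fails, which corresponds to the situation in which the representation would not be correct.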
5.2.3 Convergence
The concept of convergence deals with the structure of hierarchies. In particular, hierarchies
may not be real trees because two or more distinct directed paths may connect two specific
dimensional attributes on the condition that each one of them still represents a functional
dependency. Look at the example of the store geographical hierarchy in Figure 5-8. Stores
are grouped into cities, which in turn are grouped into states belonging to countries.
Assume that stores are also grouped into sales districts and that no inclusive relationship
exists between districts and states. Assume also that each district is nevertheless part
of exactly one country, the same country to which the store city belongs. If this is the case,
each store belongs to one country alone, independent of the path followed:
store → storeCity → state → country or
store → salesDistrict → country
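The condition that both paths lead to the same country can be tested on instance data. In this sketch the lookup tables and their values (including the district name) are invented for illustration, and the function names are hypothetical.

```python
# Hypothetical instance data for the two directed paths of the store hierarchy.
STORE_CITY = {"SmartMart": "Miami", "Profit": "Orlando"}
CITY_STATE = {"Miami": "Florida", "Orlando": "Florida"}
STATE_COUNTRY = {"Florida": "USA"}
STORE_DISTRICT = {"SmartMart": "South-East", "Profit": "South-East"}
DISTRICT_COUNTRY = {"South-East": "USA"}


def country_via_city(store):
    """Follow store -> storeCity -> state -> country."""
    return STATE_COUNTRY[CITY_STATE[STORE_CITY[store]]]


def country_via_district(store):
    """Follow store -> salesDistrict -> country."""
    return DISTRICT_COUNTRY[STORE_DISTRICT[store]]


def converges(stores):
    """True when both directed paths lead every store to the same country."""
    return all(country_via_city(s) == country_via_district(s) for s in stores)
```

If a district were assigned to a country different from that of its stores' cities, converges would return False and the two paths could not be drawn as a convergence.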
Two or more arcs belonging to the same hierarchy and ending at the same dimensional
attribute mark convergences in fact schemata. If a convergence exists, a tree structure can
no longer define arc directions uniquely. Figure 5-8 shows that you have to add arrows
to converging arcs. On the corresponding Entity-Relationship schema of Figure 5-9, the
convergence could be modeled by adding an explicit constraint stating that the association
cycle between the STORE, STATE, CITY, COUNTRY, and SALES DISTRICT entities
is redundant.
Apparently similar attributes do not always result in a convergence.
For example, look at the brandCity attribute in the product dimension, which represents
the city where products in one brand are manufactured, and at the storeCity attribute
in the store dimension. The two city attributes obviously have different meanings and must
be represented separately because a product manufactured in one city may also be sold in
other cities.
FIGURE 5-12 Redundant convergence (left) and its correct representation (right)
Finally, look at Figure 5-12. Note that a hierarchy like this, in which one of the
alternate paths does not include intermediate attributes, does not have a reason to exist.
The convergence is completely obvious by virtue of the transitive property holding for
functional dependencies.
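Redundant arcs of this kind can be detected automatically: an arc is redundant when its end attribute can still be reached from its start attribute after the arc itself is ignored. The sketch below assumes a hierarchy stored as a dictionary of arcs. The attribute names are borrowed from the store hierarchy for illustration and are not necessarily those of the figure.

```python
def redundant_arcs(arcs):
    """Return the arcs (a, b) for which another directed path from a to b exists."""
    def reachable(start, skip):
        # Depth-first search that ignores the single arc `skip`.
        reached, stack = set(), [start]
        while stack:
            node = stack.pop()
            for child in arcs.get(node, []):
                if (node, child) != skip and child not in reached:
                    reached.add(child)
                    stack.append(child)
        return reached

    return {
        (a, b)
        for a, children in arcs.items()
        for b in children
        if b in reachable(a, skip=(a, b))
    }


# Hypothetical hierarchy: the direct arc store -> country duplicates the longer path.
ARCS = {
    "store": ["storeCity", "country"],
    "storeCity": ["state"],
    "state": ["country"],
}

# A genuine convergence: both paths include intermediate attributes.
CONVERGENT = {
    "store": ["storeCity", "salesDistrict"],
    "storeCity": ["state"],
    "state": ["country"],
    "salesDistrict": ["country"],
}
```

Only the direct arc is reported for ARCS, whereas CONVERGENT yields no redundant arc, because each of its paths carries information the other one lacks.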
[Figure 5-13: the CALL fact schema with a shared hierarchy and an equivalent schema without shared hierarchies]
FIGURE 5-14 Shared hierarchies and roles in the shipments fact schema
telephone calls made with the number calling, number called, date, and hour of the call as
dimensions. A double circle represents and emphasizes the first attribute to be shared
(telNumber). It should always be implicitly clear that all descendants of shared attributes
are shared, too. If one or more descendants of the shared attributes should not be shared in
turn, you would then need to represent those hierarchies separately.
When hierarchies are shared starting from their dimensions, you need to add a role
that specifies its significance to each incoming arc (calling and called in Figure 5-13).
Figure 5-14 shows the case of city. Here, you can instead omit the role as its parents
implicitly define it (the warehouse city, the customer city).
FIGURE 5-15 Fact schema for book sales
FIGURE 5-16 Fact schema for hospital admissions
But multiple diagnoses often correspond to a single admission, each one belonging to a
category. Figure 5-16 shows that you can turn the start arc in the diagnosis dimension
into a multiple arc to model this scenario in that fact schema. As discussed in section 5.3.4,
when a multiple arc enters a dimension, the semantics of aggregation becomes more complex.
or products (food, clothing, and household), coverage proves to be partial and disjoint
because expirationDate and size are defined only for food and clothing, respectively.
FIGURE 5-20 Two equivalent modeling options for a hierarchy
FIGURE 5-21 Role hierarchy in an unbalanced company organization chart
FIGURE 5-22 Recursive hierarchy for a company organization chart
5.2.9 Additivity
Aggregation requires the definition of a suitable operator to compose the measure values
that mark primary events into values to be assigned to secondary events. From this
viewpoint, measures can be classified into three categories (Lenz and Shoshani, 1997):
• Flow measures refer to a timeframe, at the end of which they are evaluated
cumulatively. Some examples are the number of products sold in a day, monthly
receipts, and yearly number of births.
• Level measures are evaluated at particular times. Some examples are the number of
products in inventory and the number of inhabitants in a city.
• Unit measures are evaluated at particular times but are expressed in relative terms.
Some examples are product unit price, discount percentage, and currency exchange.
The usable operators for aggregating different types of measures along temporal and
nontemporal hierarchies are shown in Table 5-2.
Additivity
A measure is called additive along a dimension when you can use the SUM operator
to aggregate its values along the dimension hierarchy. If this is not the case, it is called
non-additive. A non-additive measure is non-aggregable when you can use no aggregation
operator for it.
Table 5-2 clearly shows a general rule: flow measures are additive along all the
dimensions, level measures are non-additive along temporal dimensions, and unit measures
are non-additive along all the dimensions.
An example of an additive flow measure in the sales schema is quantity. The quantity
sold in a month is the sum of quantities sold every day in a month. Figure 5-23 shows an
example of a non-additive level measure. The inventory level is not additive along date,
but it is additive along the other dimensions. An example of a non-additive unit measure is
unitPrice in the sale schema (Figure 5-8). In any case, you can use other operators, such
as AVG, MAX, and MIN, to aggregate those non-additive measures. For example, it makes
sense to average unitPrice for more than one product, store, and date.
[Figure 5-23: a fact schema with a non-additive level measure (inventory level)]
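The general rule and the inventory example can be sketched as follows. The inventory figures, the second product name, and the function names are assumptions made for illustration, and the inventory events are reduced to two dimensions to keep the example short.

```python
from statistics import mean

# Hypothetical inventory events: (product, date) -> inventory level.
INVENTORY = {
    ("Shiny", "10/10/2008"): 100,
    ("Shiny", "10/11/2008"): 80,
    ("Bleachy", "10/10/2008"): 50,
    ("Bleachy", "10/11/2008"): 70,
}


def level_on(date):
    """SUM along product: total stock on one date (a non-temporal direction)."""
    return sum(v for (_, d), v in INVENTORY.items() if d == date)


def average_level(product):
    """AVG along date: summing levels over days would count the same stock repeatedly."""
    return mean(v for (p, _), v in INVENTORY.items() if p == product)


def is_additive(kind, temporal):
    """General rule stated in the text for flow, level, and unit measures."""
    if kind == "flow":
        return True
    if kind == "level":
        return not temporal
    if kind == "unit":
        return False
    raise ValueError(f"unknown measure kind: {kind}")
```

Summing the two daily levels of Shiny would give 180 packages, a quantity that never existed in stock, which is why an operator such as AVG is used along date instead.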
On the contrary, you cannot intrinsically aggregate non-aggregable measures for
conceptual reasons. For example, look at the numberOfCustomers measure in the sales
fact schema (Figure 5-8). This measure is estimated for a certain product, store, and day,
counting the number of sale receipts issued on that day in that store, which refer to that
product. Totaling or averaging the number of customers for two or more products would
lead to an inconsistent result, because the same sales receipt may also include other
products. Then numberOfCustomers is non-aggregable along the product dimension
(while it is additive along date and store). In this case, it is non-aggregable because the
association between sales receipts and products is many-to-many instead of many-to-one.
You cannot aggregate the numberOfCustomers measure consistently along the product
dimension, no matter the aggregation operator adopted, unless event granularity is set to
a finer level.
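A tiny numeric example shows the inconsistency. The receipts below are invented, and the names RECEIPTS and customers_per_product are hypothetical. Each receipt stands for one customer and may include several products.

```python
# Hypothetical receipts issued on one day in one store: receipt id -> products bought.
RECEIPTS = {
    "r1": {"Shiny", "Bleachy"},
    "r2": {"Shiny"},
    "r3": {"Bleachy"},
}


def customers_per_product(receipts):
    """numberOfCustomers at the (product, store, date) grain."""
    counts = {}
    for products in receipts.values():
        for product in products:
            counts[product] = counts.get(product, 0) + 1
    return counts


PER_PRODUCT = customers_per_product(RECEIPTS)
summed = sum(PER_PRODUCT.values())  # what SUM along product would report
actual = len(RECEIPTS)              # receipts that really were issued
```

Here summed is 4 while only 3 receipts exist, because receipt r1 is counted once for each product it includes. Neither the average nor any other operator applied to the per-product counts can recover the true figure, which only the individual receipts provide.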
Additivity is the most frequent case. So in order to simplify graphic notation in the
DFM, you should explicitly represent only the exceptions. Figure 5-8 shows that a measure
connects to the dimensions along which it is non-additive via a dashed line labeled with
the usable aggregation operators. If a measure shows the same type of additivity on all of
the dimensions, you can mark the aggregation operator on its side. If the total amount of
non-additivity were such that it reduced schema readability, you should adopt a matrix
representation. Table 5-3 shows the equivalent additivity matrix for the fact schema of
Figure 5-8.
TABLE 5-3 Additivity Matrix of the Sales Fact Schema
Generally speaking, the meaning of missing events depends on the existing relationship
between primary events and transactions in operational database sources. Typically, the
information modeled in a fact schema sums up data in operational databases. This means
that each primary event summarizes one or more transactions that actually occurred in an
application domain in a unit of time. If this is the case, that fact schema is said to be a
lossy-grained fact schema. The sales schema is an example of a lossy-grained schema, because each
sales event modeled in that schema sums up all the sales of one product on the same day at
the same store. If this is not the case, event representation granularity in fact schemata
coincides with that of operational databases. We would then say that this kind of fact
schema is a lossless-grained schema. An example of a lossless-grained schema is the one that
exploits the date, department, and patient dimensions to model admissions to a
hospital, if we assume that the same patient cannot be admitted twice on the same day into
the same department.
Based on this distinction, we can say the following:
• In a lossless-grained fact schema, an event is missing because it did not occur in the
application domain. For instance, this applies to admissions and shipments.
After a careful examination, we can conclude that some relevant pieces of information
may not be represented in lossy-grained schemata. Look at the sales, for example. If a
missing event shows that a product is still unsold, how can you represent that a product is
or is not for sale in a store on a certain day? From the conceptual viewpoint, the best thing
to do is to place a new empty coverage schema with product, date, and store dimensions
next to the sales schema. Each coverage schema event should represent that a specific
product was actually for sale at a specific store on a certain date. This empty schema
corresponds to the coverage table defined by Kimball (1996). To compare it with the sales
schema, see section 5.5 on overlapping procedures that allow you to answer queries such as
which products for sale are still unsold?
As a matter of fact, you could also use only the sales schema if you created an event
that corresponds to each product for sale on one day in one store and used those events
whose measures are null to explicitly represent the products still unsold. It is clear that
this solution leads to a remarkable amount of wasted space if the number of unsold
products is high.
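The comparison between a coverage schema and the sales schema can be sketched with plain sets. This is only an illustration under assumed data: the product names other than Shiny and the helper names unsold and not_for_sale are hypothetical, and the overlapping procedure itself is the subject of section 5.5.

```python
# Hypothetical events of the empty coverage schema, keyed by (product, store, date).
COVERAGE = {
    ("Shiny", "SmartMart", "10/10/2008"),
    ("Bleachy", "SmartMart", "10/10/2008"),
    ("Scent", "SmartMart", "10/10/2008"),
}

# Events of the sales schema: the same key plus measure values.
SALES = {
    ("Shiny", "SmartMart", "10/10/2008"): {"quantity": 10, "receipts": 25},
}


def unsold(coverage, sales):
    """Products for sale (coverage event present) with no matching sales event."""
    return coverage - set(sales)


def not_for_sale(product, store, date, coverage):
    """A missing coverage event means the product was not offered at all."""
    return (product, store, date) not in coverage
```

The coverage schema stores no measures, so it distinguishes "for sale but unsold" from "not for sale" without creating sales events filled with null values.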
As we mentioned, often you cannot analyze data at the maximum level of detail, and
this results in the need for primary events to be aggregated along different abstraction levels.
According to OLAP terminology, aggregation is called roll-up, and the group-by set is the key
concept used to define it.
Group-by Set
A group-by set is any subset of dimensional attributes in a fact schema that does not
contain two attributes related by a functional dependency. The group-by set that
includes only and all of the dimensions of a fact is said to be its primary group-by set.
All of the others are called secondary group-by sets and identify potential ways to
aggregate primary events.
The primary group-by set for the sales fact schema of Figure 5-6 is G0 = {product,
store, date}. Some examples of secondary group-by sets are shown here:
G1 = {product, state, quarter}
G2 = {type, brand, store, month, day}
G3 = {country, date}
G4 = {year}
G5 = {}
Note that no attributes appear for one or more hierarchies in some of these group-by
sets. For example, no attributes appear for the product hierarchy in G3, and no attributes are
present for the product and store hierarchies in G4. The G5 group-by set is an extreme case.
It has no hierarchies, and we can call it the empty group-by set. The {product, category,
date} attribute set is not a group-by set, since product → category.
It is fundamental to note that all possible group-by sets in a fact schema are
linked by a partial order relationship called roll-up.
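The group-by set definition can be checked mechanically. The sketch below assumes a simplified fragment of the sales hierarchies in which every attribute has at most one parent (the PARENT mapping is an assumption made for illustration and omits attributes such as brand and day).

```python
# Assumed fragment of the sales hierarchies: child attribute -> parent attribute.
# Each arrow stands for a functional dependency child -> parent.
PARENT = {
    "product": "type", "type": "category",
    "store": "city", "city": "state", "state": "country",
    "date": "month", "month": "quarter", "quarter": "year",
}

def ancestors(attr):
    """All attributes functionally determined by attr (attr itself excluded)."""
    result = set()
    while attr in PARENT:
        attr = PARENT[attr]
        result.add(attr)
    return result

def is_group_by_set(attrs):
    """True if no attribute in attrs functionally determines another one in attrs."""
    attrs = set(attrs)
    return all(ancestors(a).isdisjoint(attrs) for a in attrs)

print(is_group_by_set({"product", "state", "quarter"}))   # a valid group-by set
print(is_group_by_set({"product", "category", "date"}))   # product -> category
```

The empty set passes the check trivially, which matches the empty group-by set described above.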
Roll-up Order
Given two group-by sets called Gi and Gj and belonging to the set of all the possible
group-by sets in a fact schema, we say that Gi ≤ Gj when Gj → Gi, that is, when a
b attribute exists in Gj for each a attribute of Gi, such that b belongs to the same
hierarchy as a and b → a.
For example, the following relationships between group-by sets hold:
{year} ≤ {type, quarter} ≤ {product, quarter} ≤ {product, store, date}
{state} ≤ {type, brand, store, month, day} ≤ {product, store, date}
A maximum element, corresponding to the primary group-by set of the fact schema, and a
minimum element, corresponding to the empty group-by set, always exist in roll-up orders.
Additionally, for each pair of group-by sets, both a superior (sup) and an inferior (inf) group-by
set always exist in roll-up orders. From the algebraic viewpoint, roll-up characterizes a lattice.
Figure 5-24 shows a small fragment of the roll-up lattice for the sales schema of Figure 5-6. In
real-world applications, it is easy to imagine that the entire lattice can be huge.
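The roll-up order lends itself to a short test function. As a sketch only, the code below assumes the same simplified single-parent hierarchy fragment (the PARENT mapping is hypothetical) and reads Gi ≤ Gj as "every attribute of Gi is determined by some attribute of Gj".

```python
# Assumed fragment of the sales hierarchies: child attribute -> parent attribute.
PARENT = {
    "product": "type", "type": "category",
    "store": "city", "city": "state", "state": "country",
    "date": "month", "month": "quarter", "quarter": "year",
}

def determines(b, a):
    """True if b -> a, that is, a is b itself or an ancestor of b."""
    while b != a and b in PARENT:
        b = PARENT[b]
    return b == a

def rolls_up_to(g_i, g_j):
    """True if g_i <= g_j in the roll-up order."""
    return all(any(determines(b, a) for b in g_j) for a in g_i)

primary = {"product", "store", "date"}
print(rolls_up_to({"year"}, {"type", "quarter"}))
print(rolls_up_to({"product", "quarter"}, primary))
print(rolls_up_to(set(), {"year"}))          # the empty set is the minimum element
print(rolls_up_to({"product"}, {"type"}))    # type does not determine product
```

Because the check is only a partial order, two group-by sets may be incomparable, for example {type} and {state}.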
FIGURE 5-24 A part of the roll-up lattice for the sales fact schema (nodes include
{product, quarter}, {type, year}, {category, year}, and {})
A coordinate of a group-by set G is an n-ple of values of the dimensional attributes in G.
We already showed an example of a coordinate α of the sales primary group-by set G0.
Some examples of coordinates of the G1 and G3 secondary group-by sets are, respectively:
α' = (product: 'CleanHand', state: 'Texas', quarter: 'II, 2007')
α'' = (country: 'France', date: '5/5/2008')
We recall that each dimension value exactly determines one value for each attribute in
its hierarchy by means of functional dependencies. In this way, each coordinate of the
primary group-by set determines one coordinate of each secondary group-by set, and vice
versa, one coordinate of a secondary group-by set corresponds to a set of coordinates of the
primary group-by set.
Each coordinate α of a given secondary group-by set uniquely identifies a secondary
event that aggregates all the primary events identified by the coordinates of the primary
group-by set corresponding to α. For example, the secondary event identified by the α''
coordinate of the G3 group-by set aggregates the primary events related to the sales of all
products on May 5, 2008, in the French stores. The empty group-by set G5 has one single
secondary event that aggregates all the primary events.
Each secondary event then expresses a value for each measure. In order to define how to
calculate the value of each measure for secondary events, we need to specify aggregation
semantics. We provide a gradual solution for this requirement in the following paragraphs.
We begin with the simplest case and then introduce new situations to outline the global
framework for advanced constructs of the DFM.
TABLE 5-7 The {category, year} Group-by Set Secondary Events
category          year 2007   year 2008
House cleaning    870         760
Food              1030        975
primary events to calculate the measure values for each secondary event. However, we
should add that not all aggregation operators show the same useful associative property
holding for sums. From this viewpoint, Gray and others (1997) classified aggregation
operators into three groups:
• Distributive Calculating aggregates from partial aggregates
• Algebraic Requiring the usage of additional information in the form of a finite
number of support measures to correctly calculate aggregates from partial aggregates
• Holistic Calculating aggregates from partial aggregates only via an infinite
number of support measures
Some examples of distributive operators are the previously mentioned SUM operator, as
well as the MIN and MAX operators; the considerations made in the preceding paragraph
apply to these operators. Some examples of algebraic operators are the average, the standard
deviation, and the barycenter operators. If you add all the required support measures to
every secondary event, you can still calculate secondary events from other more fine-grained
secondary events.
Let's look at the case of the AVG operator. Generally, it is clear that the average of partial
averages of a set of values is different from the average of the same set. Tables 5-8, 5-9, and
5-10 show examples of the unitPrice measure of SALE. The AVG operator can be used to
aggregate unitPrice along all the dimensions. It is apparent that there is no way to obtain
TABLE 5-8 Primary Events of the Sales Cube: A Dash Stands for the Unsold Items in a Quarter
category   type   product   I'09   II'09   III'09   IV'09   (year 2009)
TABLE 5-9 The {type, quarter} Group-by Set Secondary Events
category         type      I'09   II'09   III'09   IV'09   (year 2009)
House cleaning   Cleaner   1.75   2.17    2.40     2.67
                 Soap      1.25   1.35    1.75     1.50
Average:                   1.50   1.70    2.08     2.09
TABLE 5-10 The {category, quarter} Group-by Set Secondary Events
category         I'09   II'09   III'09   IV'09   (year 2009)
House cleaning   1.50   1.84    2.14     2.30
the proper aggregation by category and quarter (Table 5-10) from the aggregation by
type and quarter (Table 5-9) unless you add a new measure that counts the number of
primary events that make up each secondary event. COUNT is the support measure for
AVG. See section 9.1.9 for more details of the impacts on logical design.
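The role of COUNT as a support measure for AVG can be seen in a few lines. The unit prices below are hypothetical and are not the values of Tables 5-8 to 5-10; the sketch only contrasts the average of partial averages with the count-weighted roll-up.

```python
# Hypothetical unit prices of primary events, grouped by product type.
prices_by_type = {
    "Cleaner": [2.0, 3.0, 4.0],
    "Soap": [1.0],
}

# Secondary events at the type level: AVG stored together with its COUNT.
partial = {t: (sum(p) / len(p), len(p)) for t, p in prices_by_type.items()}

# Naive roll-up to category: average of the partial averages (generally wrong).
naive_avg = sum(avg for avg, _ in partial.values()) / len(partial)

# Roll-up using the support measure: weight each partial average by its count.
total_count = sum(n for _, n in partial.values())
correct_avg = sum(avg * n for avg, n in partial.values()) / total_count

# Reference value computed directly from the primary events.
all_prices = [p for ps in prices_by_type.values() for p in ps]
direct_avg = sum(all_prices) / len(all_prices)

print(naive_avg, correct_avg, direct_avg)
```

The two values differ whenever the partial aggregates summarize different numbers of primary events, which is why the count has to be carried along with each secondary event.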
Some examples of holistic operators are median and rank. Secondary events must be
calculated from primary events because they cannot use a finite number of support
measures to calculate aggregates from partial aggregates.
Things get more complex when different operators are used to aggregate a measure
along different dimensions. For example, Figure 5-23 shows the level measure in the
INVENTORY fact schema. The MIN operator is used to aggregate level along the date
dimension, and the SUM operator is used to aggregate it along the warehouse dimension.
Table 5-11 shows some primary events with reference to a single product. Tables 5-12 and
5-13 show the secondary events obtained by aggregating along only one dimension. If you
instead have to aggregate simultaneously along both dimensions (for example, by month
Defense    10  10   8   4  20  20  15  15  12
Elysée      5   4   4   4   2   2   2  10  10
Reuilly    14  14  14  12  20  20  20  20  16
Villette    4   2   2   2  10  10  10   S   3
Lyon
Ainay       4  20  20  15  15  12  12  10   9
TABLE 5-13 The {month, warehouse} Group-by Set Secondary Events
city      warehouse   March 1999 (month)
Chicago   Defense     4
          Elysée      2
          Reuilly     12
          Villette    2
Boston    Ainay       4
and city), this causes a problem. Will each secondary event have to be calculated as the
minimum of the sums or as the sum of the minimums? It is obvious that the results
obtained in both cases are generally different! In practice, you must choose a priority order
for dimensions to decide which aggregation should come first. Note that in the specific
example of Table 5-11, the result of the aggregation by month and city would have
seemed to be independent of the application order of operators if we had used the AVG
operator instead of MIN to aggregate along date. As a matter of fact, this applies only
because no primary events are missing. If this is not the case, results still depend on the
operator priority order.
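A tiny numeric sketch makes the order dependence concrete. The stock levels below are hypothetical (two warehouses of one city, two days of one month) and are not taken from Table 5-11.

```python
# Hypothetical stock levels: level[warehouse][day].
level = {
    "W1": {"day1": 10, "day2": 4},
    "W2": {"day1": 2, "day2": 8},
}
days = ["day1", "day2"]

# SUM along warehouse first, then MIN along date.
min_of_sums = min(sum(level[w][d] for w in level) for d in days)

# MIN along date first, then SUM along warehouse.
sum_of_mins = sum(min(level[w][d] for d in days) for w in level)

print(min_of_sums, sum_of_mins)
```

Here the city never held fewer than 12 items on any single day, while adding up each warehouse's own monthly minimum gives 6; which of the two answers is wanted is exactly the priority decision discussed above.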
To conclude this section, we would like to consider the case of derived measures, whose
values can be calculated from other measures in the same schema. For example, the
receipts = unitPrice × quantity measure is derived from the unitPrice and
quantity measures. If a derived measure is additive, as in the case of receipts, you can
calculate the aggregation from partial aggregations (the annual receipts are the sum of
monthly receipts). However, you cannot calculate the aggregation from aggregations of its
component measures. For instance, you cannot multiply the amount of items sold in a year
by the average unit price during the same year to calculate annual receipts accurately!
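The warning about derived measures can be verified with two hypothetical monthly events; the prices and quantities below are made up for illustration.

```python
# Hypothetical monthly primary events: (unit_price, quantity).
monthly = [(2.0, 100), (4.0, 10)]

# Aggregating the derived measure itself: receipts are additive.
annual_receipts = sum(price * qty for price, qty in monthly)

# Combining aggregates of the component measures instead (not equivalent).
total_quantity = sum(qty for _, qty in monthly)
average_price = sum(price for price, _ in monthly) / len(monthly)
wrong_receipts = total_quantity * average_price

print(annual_receipts, wrong_receipts)
```

The discrepancy arises because the unweighted average price ignores how many items were sold at each price.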
If your fact schema includes a cross-dimensional attribute, you should use similar
reasoning to check for aggregation semantics. Figure 5-25 shows a simplified sales schema.
Here, the category and country parents jointly define the VAT cross-dimensional
attribute. Each primary event is associated with a product and a store, and, consequently,
a category and a country. For this reason, the secondary events of the group-by sets that
include VAT are uniquely defined because the VAT value for each primary event is clearly
defined. Figure 5-26 sums up the substantial difference between convergence and
cross-dimensional attributes from the viewpoint of roll-up relationships between group-by sets,
FIGURE 5-25 Cross-dimensional attribute in the sales fact schema (SALE fact with the
quantity measure; product, category, store, country, date, and VAT attributes)
FIGURE 5-26 Roll-up lattices with convergence and cross-dimensional attribute
arc then results in a set of fictitious secondary events that group all the primary events
excluded from the normal roll-up, that is, those primary events for which b equals 'No Value'.
For example, look at the diet attribute in the sales schema (Figure 5-8), linked to
product via an optional arc. Tables 5-14 and 5-15 show an example of aggregation where
the dash means 'No Value'.
Things get more difficult for a fact schema that includes a multiple arc. Look at Figure 5-15,
which shows the book sales schema in which a multiple arc links book and author. In
particular, relevant queries are those that group sales by book or author. Table 5-16
shows the number measure of some primary events related to a certain date, and Table 5-17
shows the corresponding secondary events grouped by date and author. Note immediately
that totaling all partials of the various authors would lead to an inaccurate result (an estimated
total of 37 books sold against the 30 actually sold) when you calculate secondary events
grouped by date. For this reason, it is not correct here to exploit the distributive property of
the SUM operator, even though the number measure is additive along the book dimension.
TABLE 5-14 Primary Events in the Sales Schema
diet           type            product        1/2/09 (date)
-              Dairy product   Slurp Milk     30
Cholesterol    Dairy product   Slim Yogurt    15
-              Dairy product   Fatty Yogurt   20
Cholesterol    Sweets          Strict Cake    10
Macrobiotics   Sweets          Gnandi Cake    5
Now let's associate each book-author pair with a weight ranging from 0 to 1 that
expresses the relevance of that pair. Table 5-18 shows a potential weight distribution,
assuming that the contribution of different authors to each book they write is equal. If you
total a weighted sum of the sales contributions for every single book to calculate the
secondary events grouped by {author, date}, you will obtain the results shown in Table 5-19.
From those results, you can easily infer that the use of weights restores the distributive
property of the SUM operator if the weights are normalized for every book, that is, if
the sum of the weights for each book is equal to 1.
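The effect of normalized weights can be reproduced with the five books of the book sales example. In the sketch below, the equal split between co-authors follows the assumption stated in the text; the function and variable names are ours and purely illustrative.

```python
from collections import defaultdict

# Books sold on one date, as listed in the book sales example.
sold = {
    "Facts & Crimes": 3,
    "Sounds Logical": 5,
    "The Right Measure": 10,
    "Facts How and Why": 4,
    "The 4th Dimension": 8,
}
authors = {
    "Facts & Crimes": ["Golfarelli", "Rizzi"],
    "Sounds Logical": ["Golfarelli"],
    "The Right Measure": ["Rizzi"],
    "Facts How and Why": ["Golfarelli", "Rizzi"],
    "The 4th Dimension": ["Golfarelli"],
}

def sales_by_author(weighted):
    """Aggregate sales per author, with or without normalized weights."""
    totals = defaultdict(float)
    for book, number in sold.items():
        for author in authors[book]:
            # Equal-contribution weights sum to 1 for every book.
            weight = 1 / len(authors[book]) if weighted else 1
            totals[author] += weight * number
    return dict(totals)

unweighted = sales_by_author(weighted=False)
weighted = sales_by_author(weighted=True)
print(sum(unweighted.values()), sum(weighted.values()))
```

Without weights, co-authored books are counted once per author and the grand total is inflated; with weights that sum to 1 per book, the per-author partials add up to the books actually sold.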
Now let's examine the case in which the multiple arc enters a dimension, as shown in the
hospital admissions schema of Figure 5-16. Here, more than one diagnosis is associated with
                  author              book                1/2/03 (date)
Technical         Golfarelli, Rizzi   Facts & Crimes      3
                  Golfarelli          Sounds Logical      5
                  Rizzi               The Right Measure   10
Current affairs   Golfarelli, Rizzi   Facts How and Why   4
Science Fiction   Golfarelli          The 4th Dimension   8
Total:                                                    30
each admission. In particular, the aggregation semantics seems to be rather confusing. More
than one coordinate (each one with a different diagnosis) corresponds to one primary event
(an admission). And, vice versa, a coordinate of the {diagnosis, date, department,
patientSSN} primary group-by set does not define a primary event, but a fraction of it
(the "contribution" a single diagnosis makes to the admission). To evaluate the aggregation
semantics, we suggest that you first transform the fact schema into an equivalent schema. To
do this, you can stick to two procedures described in the following paragraphs. Section 9.1.4
shows how both procedures generate just as many logical design solutions.
Figure 5-27 shows the first equivalent schema. Here, we introduce the fictitious
diagnosisGroup dimension with a domain of all the diagnosis combinations linked to
admissions. This transformed schema is largely similar to that of book sales, where the
multiple arc enters a dimensional attribute. Using a multiple arc weight allows the
admission cost to be distributed over the different diagnoses, as required by the application
domain, so that significant aggregates can be obtained for single diagnoses and for single
categories. Table 5-20 shows the cost measure of a set of primary events related to one
department and one date, and Table 5-21 shows the weights of the multiple arc.
The second schema of Figure 5-28 has been obtained from the original one of Figure 5-16
by using the ADMISSION PER DIAGNOSIS fact rather than the arc entering diagnosis
in order to model the multiplicity of the association between admissions and diagnoses.
FIGURE 5-27 Equivalent fact schema for admissions
TABLE 5-21 Multiple Arc Weights Between diagnosisGroup and diagnosis
               G1    G2
Cardiopathy    0.5   0
Hypertension   0.3   0
Asthma         0.2   0.3
Fracture       0     0.7
To achieve this result, we reduced the fact granularity. The diagnosis dimension is now
a normal "single" dimension, and each primary event no longer corresponds to an entire
admission, but to the "portion" of an admission that you can assign to a single diagnosis.
We will briefly describe this operation as the diagnosis attribute push-down. Table 5-22
shows the costPerDiagnosis measure of the primary events that instance this schema.
Note that the admission costs have been weighted according to the weights listed in
Table 5-21 to calculate the partial costs in primary events.
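The push-down can be sketched as a simple redistribution of each admission cost over its diagnoses. The weights below are those of Table 5-21; the two admissions and their costs are hypothetical, since the values of Tables 5-20 and 5-22 are not reproduced here.

```python
# Weights of the multiple arc between diagnosisGroup and diagnosis (Table 5-21).
weights = {
    "G1": {"Cardiopathy": 0.5, "Hypertension": 0.3, "Asthma": 0.2},
    "G2": {"Asthma": 0.3, "Fracture": 0.7},
}

# Hypothetical admissions: (patient, diagnosis group, cost).
admissions = [("patient_1", "G1", 1000.0), ("patient_2", "G2", 500.0)]

def push_down(admissions, weights):
    """Split each admission into one finer event per diagnosis, weighting the cost."""
    events = []
    for patient, group, cost in admissions:
        for diagnosis, weight in weights[group].items():
            events.append((patient, diagnosis, cost * weight))
    return events

events = push_down(admissions, weights)

# costPerDiagnosis is now additive along the diagnosis dimension.
cost_by_diagnosis = {}
for _, diagnosis, partial_cost in events:
    cost_by_diagnosis[diagnosis] = cost_by_diagnosis.get(diagnosis, 0.0) + partial_cost

print(cost_by_diagnosis)
```

Because the weights of each diagnosis group sum to 1, the total cost over all finer events equals the total cost of the original admissions.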
FIGURE 5-28 Equivalent fact schema for admissions resulting from the diagnosis attribute push-down
The COUNT aggregation operator is distributive, even if the operator that enables the calculation of a COUNT
aggregate from the partial aggregates is not COUNT itself but rather SUM.
[Figure: ATTENDANCE fact schema with the (COUNT) measure; attribute labels: student, name,
address, nationality, age, gender, course, area, school]
implicit integer measure that equals 1 if the event occurred or 0 otherwise, and that the
SUM operator aggregates the events. Then, the (course: 'Database Design', gender: 'F')
coordinate defines the secondary event that shows the total number of female students that
took the Database Design course.
An additional approach to aggregation is actually also possible. In this approach, the
information each secondary event carries is linked to the existence of the corresponding
primary events. In order to explain this concept, we can assume that you have an implicit
Boolean measure, which is TRUE if an event has occurred or FALSE if the event has not
occurred. Then you may use both the AND and OR operators for aggregation with universal
and existential semantics, respectively. The secondary event defined by the (student: 'Will
Smith', area: 'Databases', school: 'Engineering') coordinate can then mark that Smith took
at least one course on databases in the School of Engineering (OR operator), or alternatively
that he took all of the courses on databases in the School of Engineering (AND operator).
Table 5-24 compares the different aggregation options based on the simple example of
Table 5-23.
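The three aggregation options for an empty fact schema can be compared side by side. The attendance data below is hypothetical (the gray cells of Table 5-23 are not reproduced), and the aggregate function is only an illustrative helper.

```python
# Hypothetical data: courses offered in the Databases area and who attended them.
courses_in_area = ["Database Design", "Information Systems", "Data Structures"]
attended = {
    "Will Smith": {"Database Design", "Information Systems", "Data Structures"},
    "Peter Johnson": {"Database Design"},
    "Paul Berry": set(),
}

def aggregate(student):
    """Secondary event for (student, area) under COUNT, OR, and AND semantics."""
    # Implicit Boolean measure: True if the (student, course) event occurred.
    flags = [course in attended[student] for course in courses_in_area]
    return {"COUNT": sum(flags), "OR": any(flags), "AND": all(flags)}

for student in attended:
    print(student, aggregate(student))
```

COUNT reports how many events exist, OR reports whether at least one exists (existential semantics), and AND reports whether all possible events exist (universal semantics).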
TABLE 5-23 Primary Events (in Gray) in the Course Attendance Schema for Engineering
area: Databases
course: Database Design, Information Systems, Data Structures, Advanced I.S.
student: Will Smith, Peter Johnson, Pau Berry, Fran* Armstrong, Mari* Stephens
TABLE 5-24 The {student, area} Group-by Set Secondary Events Using the COUNT (Left), OR
(Middle), AND (Right) Operators for Aggregation
coordinates of some pairs of group-by sets is one-to-one rather than many-to-one. In the
case of sales, this happens for all group-by sets that include date, store, and product.
You may note the effects of these conditions when OLAP operators query that schema. If
events are being analyzed by date, store, and product, the application of roll-up and
drill-down operators along the promotion hierarchy will not modify the query result,
because one promotion value will be determined at most and you will not be able to
execute any significant aggregation.
To conclude, you should note that a fact schema where the a_1, ..., a_m dimensions
determine the a_{m+1}, ..., a_n dimensions is equivalent to a schema where a_1, ..., a_m are the only
dimensions and a_{m+1}, ..., a_n are included as cross-dimensional attributes jointly determined
by the dimensions. It is nevertheless more useful to represent a_{m+1}, ..., a_n as dimensions, if
[Figure: geographic hierarchy (country, state, county, city) with sample values such as
U.S.A., U.K., Vatican City, California, Colorado, Norfolk, Monterey, Norwich, Sheringham]
We first discuss the case of incomplete hierarchies and make reference to the geographic
hierarchy example shown in Figure 5-30. If a fact expresses the number of inhabitants per
city as a result of a census, Table 5-25 shows some examples of specific primary events.
Because states are not defined for the United Kingdom and for Vatican City, primary
events can be aggregated by state after balancing hierarchies in three different ways,
depending on user preferences and decision-making process features. More specifically,
data about the cities that cannot be grouped to states can be
1. completely summed up in a single value labeled 'Other', as shown in Table 5-26
(balancing by exclusion);
2. shown at the coarsest aggregation level among those finer than state, for which
a value is defined, county in the example of Table 5-27 (downward balancing);
3. shown at the finest aggregation level, among those coarser than county, for which
We deal with logical design of different types of balancing in section 9.1.6. Here, we
would like to draw your attention to two of their features:
• Balancing by exclusion fails to meet the classic roll-up semantics, which requires
that each progressive step in aggregation causes more than one group to collapse
into a single group. When you roll up from state to country in the geographic
hierarchy, the group labeled with 'Other' is broken down into two groups: UK and
other hierarchies, you are not supposed to use a fixed number of distinguishable
aggregation levels to characterize them. Look at the example of Figure 5-22. The example
fact represents the activities carried out around projects. Table 5-29 shows some primary
TABLE 5-29 Primary Events in the Activity Schema
           employee   hours
           Mark       2
Andrea     Laura      5
Paul       John       4
           Laurence   3
George     Tom        2
           Anna       4
           Andrea     1
           George     2
           Paul       1
events related to a specific project and a specific activity type. As you can see, each
employee works a specific number of hours on a specific day. Additionally, reporting
relationships of varying length exist between employees, as the recursive hierarchy models.
When there is no explicit subdivision in levels of aggregation with specific semantics,
the only way to aggregate primary events to secondary events is recursively. Table 5-30
shows secondary events at the first level of recursion. The total number of hours worked
links to each employee and to possible other employees that report directly to each one of
them (the "children" in the hierarchy). For example, we allocate 6 hours to Paul, calculated
by totaling 1 hour he worked, 1 hour Andrea worked, and 4 hours John worked. At the
second level of recursion (Table 5-31 shows its secondary events), we calculate instead the
total hours per employee, their children, and their "grandchildren." The hours worked and
assigned to Paul then become 13. The result at subsequent levels of recursion remains
unchanged because the hierarchies in the example have a maximum length of 3.
See section 9.1.7 for an explanation on logical design solutions for recursive hierarchies.
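Recursive aggregation can be sketched as a depth-limited traversal of the reporting hierarchy. The hours below are those of the example; the reporting relationships (who reports to whom) are an assumption inferred from the totals quoted in the text, because the hierarchy figure is not reproduced here.

```python
# Hours worked on one date (primary events of the example).
hours = {"Mark": 2, "Laura": 5, "John": 4, "Laurence": 3, "Tom": 2,
         "Anna": 4, "Andrea": 1, "George": 2, "Paul": 1}

# Assumed reporting relationships: manager -> direct reports.
reports = {
    "Paul": ["Andrea", "John"],
    "Andrea": ["Mark", "Laura"],
    "George": ["Laurence", "Tom"],
}

def rolled_up_hours(employee, depth):
    """Own hours plus the hours of subordinates up to `depth` levels below."""
    total = hours[employee]
    if depth > 0:
        for child in reports.get(employee, []):
            total += rolled_up_hours(child, depth - 1)
    return total

for level in (1, 2, 3):
    print(level, {e: rolled_up_hours(e, level) for e in ("Andrea", "George", "Paul")})
```

Under these assumptions the first level gives 6 hours for Paul and the second level gives 13, after which the totals no longer change because no reporting chain is longer than three employees.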
TABLE 5-30 Secondary Events in the Activity Schema at the First Level of Recursion
employee   5/5/2008 (date)
Mark       2
Laura      5
John       4
Laurence   3
Tom        2
Anna       4
Andrea     8
George     7
Paul       6
Laurence 3
Tom 2
Anna 4
Andrea 8
George 7
Paul 13
5.4 Time
Time is commonly understood as a key factor in data warehousing systems, since the
decision process often relies on computing historical trends and on comparing snapshots
of the enterprise taken at different moments. In the following discussion, we collect some
notes regarding temporal dimensions in fact schemata and their semantics.
[Figure: fact schema with basin, river, distanceFromMouth, and waterGauge attributes and
the level measure]
• A snapshot fact schema is one whose events correspond to periodical snapshots of the
fact. Its measures are mostly stock measures (see section 5.2.9), that is, they refer to
an instant in time and are evaluated at that instant, so they are non-additive along
temporal dimensions (that is, their values cannot be summed when aggregating
along time, while, for instance, they can be averaged).
For a flow fact, a transactional fact schema is typically the most natural choice. For the
sales fact, for instance, a subset of events of the transactional fact schema in Figure 5-31
might be those reported in Table 5-32, each representing the total quantity of a product sold
in a store during a week.
On the other hand, for some flow facts, both transactional and snapshot schemata can
be reasonably used. This is true, for instance, for the stock inventory fact, for which a
transactional fact schema and a snapshot schema are depicted in Figure 5-32. The two
schemata are clearly identical except for the meaning of their measures. A sample set of
TABLE 5-32 Events for the SALE Transactional Fact Schema
store       week        product   quantity
SmartMart   10/1/2008   CD-R      100
SmartMart   10/1/2008   DVD+R     20
SmartMart   10/1/2008   CD-RW     50
SmartMart   10/8/2008   DVD+R     25
FIGURE 5-32 Transactional (top) and snapshot (bottom) fact schemata for the stock inventory
fact (both with product, category, warehouse, city, country, week, and year attributes; the
measure is flow in the transactional schema and stock in the snapshot schema)
inventory events for the transactional solution is shown in Table 5-33; each event records
the net flow of items of one product over a week within one warehouse. Table 5-34 shows
how the same events could be represented within a snapshot solution; here each event
records, on a specific date, the total number of items of a product available in a warehouse.
The choice of one solution or another for a flow fact depends first of all on the expected
workload, and in particular on the relative weight of queries asking for flow and stock
information, respectively. For instance, the current inventory level for LCD TVs in Austin,
Texas, can be obtained in the transactional solution by summing up all pertinent events,
which may be costly, while in the snapshot solution it is sufficient to read a single event (the
most recent one). On the other hand, consider a query asking for the net flow of LCD TVs in
TABLE 5-33 Events for the Transactional INVENTORY Fact Schema
week         product   warehouse   flow
10/1/2008    LCD TV    Austin      +20
10/8/2008    LCD TV    Austin      -5
10/15/2008   LCD TV    Austin      +3
TABLE 5 34 week product warehouse stock
Events for the I 10 1 2008 LCD TV Austin 20
Snapshot / /
INVENTORY Fact 10/8/ 2008 LCD TV Austin 15
Schema
10/15/ 2008 LCD TV Austin 18
TABLE 5-35 Events for the Snapshot FLOW Fact Schema
hour                        waterGauge           level
9:00 a.m., Jan. 7, 2008     Westminster Bridge   8.0
10:00 a.m., Jan. 7, 2008    Westminster Bridge   8.4
11:00 a.m., Jan. 7, 2008    Westminster Bridge   7.9
Austin during a specific week. While in the transactional solution this query is answered by
reading one event, in the snapshot solution the result must be computed as the difference
between the values of stock registered in two consecutive weeks.
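The relationship between the two solutions is a running sum in one direction and a difference in the other. The sketch below uses the weekly LCD TV figures of the two inventory tables and assumes, for illustration, that the stock before the first week was zero.

```python
from itertools import accumulate

# Weekly net flows of LCD TVs in the Austin warehouse (transactional events).
weeks = ["10/1/2008", "10/8/2008", "10/15/2008"]
flow = [20, -5, 3]

# Snapshot view: the stock at each week is the running sum of all flows so far.
stock = list(accumulate(flow))

# Back to the transactional view: each flow is the difference between the stock
# of a week and the stock of the previous week (assumed to be 0 before week one).
recovered_flow = [current - previous
                  for previous, current in zip([0] + stock[:-1], stock)]

# Current inventory level: sum of all events (transactional) or last event (snapshot).
current_from_flow = sum(flow)
current_from_stock = stock[-1]

print(dict(zip(weeks, stock)), recovered_flow)
```

Reading the current level costs a full scan of the flow events but a single lookup in the snapshot events, while the weekly net flow is a single lookup in the flow events and a subtraction in the snapshot events.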
Differently from flow facts, stock facts naturally conform to the snapshot solution; for
instance, a sample set of events for the river flow fact in Figure 5-31 is reported in Table 5-35:
like in the transactional solution applied to a flow fact, each event records the exact value of
the measurement. In principle, adopting a transactional solution for a stock fact is still
possible, although not recommended. In fact, it would require disaggregating the (stock)
measurements made in the application domain into a net flow to be registered, which
implies that, before each new event can be registered, the current stock level must be
computed by aggregating all previous events.
In conclusion, a transactional fact schema is the best solution if, in the application domain,
events are measured as in- and out-flows. It should not be adopted when events are measured in
the form of stock levels. A snapshot fact schema is the best solution if, in the application domain,
events are measured as stock levels, but it can also be adopted when events are measured as
flows. In general, the best choice also depends on the core workload expected for the fact.
The meaning commonly given to the time dimension in fact schemata is the so-called valid
time (Tansel et al., 1993), which stands for the moment when an event occurs in the business
world. Transaction time is the time at which a database stores an event; it is not typically
given importance in data marts because it is not considered as important for decision
support. However, this is not always true, as you will see in the following sections.
One of the underlying assumptions in data marts is that, once an event has been
registered, it is never modified, so that the only possible writing operation consists in
appending new events as they occur. While this is acceptable for a wide variety of domains,
some applications call for a different behavior. In particular, the values of one or more
measures for a specific event may change over a period of time, longer than the refresh
interval, to be finally consolidated only after the event has been registered for the first time
in the data mart. This typically happens when the early measurements made for events may
be subject to errors or when events inherently evolve over time.
As an example for this discussion, consider an educational data mart in which a fact
models the enrollment of students to university courses. Figure 5-33 shows a possible fact
schema. Each primary event records the number of students in a specific city that registered
for a specific course for a specific academic year on a specific date. An enrollment is
completed and sent to the education secretary, and then recorded as an event in the data
mart, only when the enrollment fee is paid. If you think of the delays tied to bank payment
management and transmissions, it is not at all uncommon that the secretary receives
payment notification more than a month after the actual enrollment date.
In this context, if you still wish to draw decision-makers' attention to the actual scenario
of the current phenomenon at the right time, you should update past events with each
FIGURE 5-33 Fact schema for student enrollments
population cycle to reflect new data entered. We will call this type of operation late update,
as it may imply registration of more than one measure for the same event. See section 10.4
for the implications of the population procedure.
The usage of valid time only is no longer enough to attain full query expressivity and
flexibility when you have to deal with late updates:
1. It is the decision-maker's responsibility to justify his or her decisions. This requires
the ability to trace precise information available at the time when decisions are
made. If old events are replaced by their new versions, past decisions can no longer
be justified.
2. In some scenarios, accessing only information from current versions is not enough
to ensure the accuracy of analyses. A typical case comes from those queries that
compare the advancement status of a phenomenon in progress and past statuses of
that phenomenon. Because the data recorded for the phenomenon in progress are
not yet consolidated, their comparison with past, already consolidated data is
conceptually inaccurate.
Now look at the enrollments example. Figure 5-34 shows a possible flow of data the
secretary received over an interval of six days for a specific combination of cities, academic
years, and courses. We have distinguished two temporal coordinates: the registration date,
[Figure 5-34: matrix of enrollment counts for one course and academic year, with
enrollmentDate along the bottom axis and registrationDate along the right axis]
the date when the secretary receives notification of payment, and the enrollment date, the
date to which a payment refers. Each matrix cell records the number of enrollments received
on a specific date related to the date in which they were actually completed.
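The matrix of registration dates versus enrollment dates can be mimicked with a list of records. The sketch below uses hypothetical counts (the cell values of the figure are not reproduced here) and hypothetical helper names; it totals the enrollments along either temporal coordinate and restricts the data to the notifications received up to a given registration date.

```python
from collections import defaultdict

# Hypothetical late-update records: (registration_date, enrollment_date, enrollments).
records = [
    ("day1", "day1", 5),
    ("day2", "day1", 2),   # notification received one day after the enrollment
    ("day2", "day2", 4),
    ("day3", "day2", 1),
]

def totals_by(records, axis):
    """Sum enrollments along one coordinate (0 = registration date, 1 = enrollment date)."""
    totals = defaultdict(int)
    for record in records:
        totals[record[axis]] += record[2]
    return dict(totals)

def known_as_of(records, registration_date):
    """Enrollments per enrollment date, using only notifications received so far."""
    visible = [r for r in records if r[0] <= registration_date]
    return totals_by(visible, axis=1)

print(totals_by(records, axis=1))      # most recent picture, per enrollment date
print(totals_by(records, axis=0))      # what arrived on each registration date
print(known_as_of(records, "day2"))    # the picture available at the end of day2
```

Keeping both coordinates on every record is what makes the third function possible; with the enrollment date alone, the state of knowledge at an earlier moment could not be reconstructed.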
This example helps us to understand that two different temporal coordinates can be
distinguished for all of the facts potentially affected by late updates. The first temporal
coordinate refers to the time when each event actually took place and coincides with valid
time, as we previously mentioned. The second temporal coordinate refers instead to the time
when a measurement for an event was received and recorded to a data mart; this coordinate
coincides with the transaction time. As in the case of operational databases, you do not
necessarily have to maintain both coordinates, because this mainly depends on the
workload type. In particular, we can distinguish three types of queries:
* Up-to-date queries Only require currently valid measurements for the events. For
example, look at the ENROLLMENT fact. We can ask: In which months do students
tend by preference to enroll in a certain course? To answer this query accurately, you
must use the most updated data available on the number of enrollments per date of
enrollment. This means the data shown along the enrollmentDate axis at the
bottom of Figure 5-34. You do not have to record any transaction time to answer
up-to-date queries because they just use the valid time.
* Rollback queries Require, for each event, the measurement that was valid at a
time t. For example, you may find it interesting to examine the current trend of total
number of enrollments, per faculty, compared with that of the previous year. If you
answer this query on the basis of enrollment date when you compare the current
data (still partial) with past data (already consolidated), you could erroneously infer
that enrollments are declining this year. Instead, if the average delay of payment
receipt shows no change from previous years, you can base an accurate comparison
on the population date (the registrationDate axis on the right in Figure 5-34). It
is clear that the valid time alone is no longer enough to support this type of query
and it becomes essential to model transaction time as well.
* Historical queries Require more than one measurement be made at different times
for each event. An example of a historical query is a query that defines the daily
distribution of the number of enrollments received on a certain enrollment date. As
in the previous case, you should also explicitly represent the transaction time here.
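The three query types above can be sketched in a few lines of Python. This is only an illustrative sketch, not the book's design: the record layout (enrollment date, registration date, count), the sample values, and the function names `up_to_date`, `rollback`, and `historical` are all assumptions made for the example.

```python
from datetime import date

# Hypothetical bitemporal measurements: (enrollment_date, registration_date, count).
# enrollment_date plays the role of valid time; registration_date of transaction time.
RECORDS = [
    (date(2008, 10, 1), date(2008, 10, 1), 5),
    (date(2008, 10, 1), date(2008, 10, 3), 2),  # late update for Oct 1
    (date(2008, 10, 2), date(2008, 10, 2), 4),
    (date(2008, 10, 2), date(2008, 10, 4), 1),  # late update for Oct 2
]


def up_to_date(records):
    """Current total per enrollment date; transaction time is ignored."""
    totals = {}
    for enrolled, _registered, count in records:
        totals[enrolled] = totals.get(enrolled, 0) + count
    return totals


def rollback(records, t):
    """Totals per enrollment date as they were known at transaction time t."""
    return up_to_date([r for r in records if r[1] <= t])


def historical(records, enrolled):
    """Distribution over registration dates of the counts for one enrollment date."""
    by_registration = {}
    for e, registered, count in records:
        if e == enrolled:
            by_registration[registered] = by_registration.get(registered, 0) + count
    return by_registration
```

Only `up_to_date` can be answered when the registration date is not stored, which is the case the monotemporal solution covers.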
Depending on the workload, two main types of conceptual design solutions can be envisaged
for a fact: monotemporal, where only valid time is modeled as a dimension, and bitemporal,
where both valid and transaction time are modeled as dimensions.
Monotemporal solutions are commonly implemented for facts that either are not subject
to late updates or are only required to support up-to-date queries. They are the simplest
solutions: updates are done by physically overwriting the measurements taken at previous
times for the same event, so that one single measurement (the most recent one) is kept in the
database for each event. The transaction times of measurements are not represented and no
trace is left of past measurements, so only up-to-date queries are supported, and, in case of
late updates, accountability is not guaranteed. For instance, the schema of the monotemporal
solution for the enrollment facts is exactly the fact schema for student enrollments shown
earlier, where the only temporal dimension is enrollmentDate (valid time).
FIGURE 5-35 Bitemporal fact schema for enrollments
Conversely, bitemporal solutions are the most comprehensive solutions that can be
adopted when late updates might occur, and they allow all three types of queries to be
accurately answered. For each population cycle, new update measurements for previous
events may be added, and their transaction time is traced; no overwriting of previous
measurements is carried out, so nothing is lost. A bitemporal fact schema for enrollments
is shown in Figure 5-35; the shared temporal hierarchy enables the modeling of valid
time (enrollmentDate) and transaction time (in the form of a currency interval
[currencyStart, currencyEnd]). A sample set of events is depicted in Table 5-36.
The interested reader is referred to Golfarelli and Rizzi (2007) for a deeper analysis of
the issues arising in the presence of late updates and a more detailed description of the
possible conceptual design solutions for transactional and snapshot fact schemata.
5.4.3 Dynamic Hierarchies
Up to this point, we have hypothesized that the only dynamic component described in a
fact schema may be the fact itself and its events. We have attributed an exclusively static
nature to hierarchies. This is evidently not completely true. Slowly, sales managers rotate
among various departments. Each month, new products are added to those already for sale,
product categories change, and their attribution to products changes. Sales districts may be
modified or a store may be moved from one district to another. We should clarify that only
the dynamic properties at the extensional level (that is, the dynamic properties of hierarchy
instances) will be considered in this paragraph. We will not be concerned about possible
modifications that alter the structure of hierarchies, such as the addition of a new attribute
to a hierarchy or the addition of a new dimension, which are not considered routine
phenomena, but rather extraordinary events connected to data mart maintenance.
TABLE 5-36 Events for the ENROLLMENT Schema with Late Updates in Italics
On the conceptual level, the representation of dynamic properties in hierarchies is
strictly bound to their impact on queries. When you use a dynamic hierarchy, you can
actually distinguish four different temporal scenarios in the event analysis, as proposed first
by SAP Business Warehouse. To discuss those scenarios, we will make reference to the
following example. Assume that on 1/1/08, the sales manager for the EverMore store
changed from Smith to Johnson, and that a new store, EverMore2, opened with Smith as
manager. Consider the possibilities:
* Today-for-yesterday All the events are analyzed according to the hierarchies'
current configuration. In the example, we attribute EverMore store sales before 2008
as well as the more recent ones to Johnson, while we attribute EverMore2 store sales
to Smith. This approach proves interesting if Johnson also becomes responsible for
past sales.
* Yesterday-for-today All of the events are analyzed according to the configuration
the hierarchies had at a previous time. In the example, we attribute all EverMore
store sales to Smith and we do not consider EverMore2 store sales.
* Today-or-yesterday Each event is analyzed according to the configuration the
hierarchies had at the time when the event occurred. For this reason, we attribute
the EverMore store sales prior to 2008 and all of the EverMore2 store sales to Smith,
and we attribute EverMore store sales from 2008 onward to Johnson.
* Today-and-yesterday Only the events referring to the hierarchy instances that
remain unchanged are considered. Sales in neither of the two stores are then
considered.
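The four scenarios can be made concrete with a small Python sketch of the store and manager example. It is a sketch under stated assumptions: the validity-interval layout of `HISTORY`, the start date of Smith's assignment, the sale amounts, and all function names are hypothetical and chosen only for illustration.

```python
from datetime import date

FAR_FUTURE = date(9999, 12, 31)

# Hypothetical history of the store -> salesManager arc: (store, manager, from, to).
HISTORY = [
    ("EverMore", "Smith", date(2000, 1, 1), date(2007, 12, 31)),
    ("EverMore", "Johnson", date(2008, 1, 1), FAR_FUTURE),
    ("EverMore2", "Smith", date(2008, 1, 1), FAR_FUTURE),
]

# Hypothetical sale events: (store, sale date, amount).
SALES = [
    ("EverMore", date(2007, 6, 1), 100),
    ("EverMore", date(2008, 6, 1), 200),
    ("EverMore2", date(2008, 6, 1), 50),
]


def manager_at(store, day):
    """Manager of the store on the given day, or None if the store did not exist."""
    for s, manager, start, end in HISTORY:
        if s == store and start <= day <= end:
            return manager
    return None


def totals(sales, manager_of):
    """Sum amounts per manager; events with no manager are left out."""
    result = {}
    for store, day, amount in sales:
        manager = manager_of(store, day)
        if manager is not None:
            result[manager] = result.get(manager, 0) + amount
    return result


def today_for_yesterday(sales, today):
    return totals(sales, lambda store, _day: manager_at(store, today))


def yesterday_for_today(sales, past):
    return totals(sales, lambda store, _day: manager_at(store, past))


def today_or_yesterday(sales):
    return totals(sales, manager_at)


def today_and_yesterday(sales, past, today):
    def unchanged(store, _day):
        before, now = manager_at(store, past), manager_at(store, today)
        return before if before is not None and before == now else None
    return totals(sales, unchanged)
```

With a past reference date in 2007 and a current date in 2008, the four functions reproduce the attributions described in the list: all EverMore sales to Johnson, all to Smith, a split at 2008, and no sales at all.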
In the example, the dynamic properties involve an arc in a hierarchy (the arc expressing
the association between store and salesManager) because instances of the
corresponding association vary. However, dynamic attributes are very frequent as well. For
example, the name of a product category or store may change over time. Let's assume then
that the EverMore store changes its name to EvenMore on 1/1/09. We may then describe
the four scenarios as follows:
* Today-for-yesterday We attribute all of the store's sales (even those prior to 2009)
to EvenMore.
* Yesterday-for-today We attribute all store sales (even those from 2009 forward)
to EverMore.
* Today-or-yesterday We attribute the sales prior to 2009 to EverMore and those
from 2009 forward to EvenMore.
* Today-and-yesterday The store's sales are not considered.
From the viewpoint of conceptual modeling, it is important to note that you should not
also consider as dynamic the creation of a new value in the attribute domain (for example,
a new product for sale or a newly opened store) or the consequent creation of a new
instance for all of the associations regarding it (the new product must be associated with a
type and brand, the new store to a city and to a sales district). All logical design solutions
Store - x >:
store - salesDistrict x
type - marketingGroup x x x
category - department x
Obviously, two schemata are comparable when they share at least one dimensional
attribute. Some examples of comparable schemata are SALE (Figure 5-8), SHIPMENT
(Figure 5-10), and INVENTORY (Figure 5-23). They have the majority of dimensional
Reducing Hierarchies
Given a hierarchy h and a subset R of its dimensional attributes, the reduction of h to R is
a set of hierarchies that include all and only the attributes in R linked to each other via
functional dependencies transitively derived from those in h.
Figure 5-36 shows an example of reduction. On the left is a hierarchy with the a
dimension. On the right are two hierarchies obtained after reducing to the c, e, g, i, and l
attributes. For example, note that you can transitively derive the c→l functional
dependency in the result from c→h and h→l in the initial hierarchy.
FIGURE 5-36 Reducing a hierarchy to a subset of attributes
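As a rough illustration of reduction, the sketch below links each kept attribute to the nearest kept attributes it determines, skipping over dropped ones. The hierarchy in `ARCS` is hypothetical (it is not claimed to be the one in Figure 5-36, though it does contain c→h and h→l), and `reduce_hierarchy` and `roots` are names invented for this example.

```python
# Hypothetical hierarchy: each attribute maps to the attributes it directly determines.
ARCS = {
    "a": {"b", "c"},
    "b": {"d", "e"},
    "c": {"g", "h"},
    "e": {"i"},
    "h": {"l"},
}


def reduce_hierarchy(arcs, keep):
    """Return, for each kept attribute, the nearest kept attributes it determines."""
    reduced = {attribute: set() for attribute in keep}
    for start in keep:
        stack = list(arcs.get(start, ()))
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node in keep:
                reduced[start].add(node)  # derived dependency start -> node
            else:
                stack.extend(arcs.get(node, ()))  # skip a dropped attribute
    return reduced


def roots(reduced):
    """Attributes not determined by any other kept attribute: one per hierarchy."""
    determined = set().union(*reduced.values())
    return sorted(a for a in reduced if a not in determined)


REDUCED = reduce_hierarchy(ARCS, {"c", "e", "g", "i", "l"})
```

Here `REDUCED` contains the derived dependency from c to l (through the dropped attribute h), and `roots(REDUCED)` yields two roots, c and e, so the reduction produces two hierarchies.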
Compatibility between two fact schemata implies that the functional dependencies they
contain are all mutually consistent. It is easy to verify that the SALE, INVENTORY, and
SHIPMENT schemata are compatible. For example, two fact schemata are incompatible if
one of them contains the a1→a2 dependency and the other has the a2→a1 dependency. As a
matter of fact, both functional dependencies could theoretically coexist. This implies that a
one-to-one association exists between a1 and a2. In our context, though, the fact that the same
attributes are in separate hierarchies in inverse order probably means that a design error
occurred in one of the two schemata (Golfarelli et al., 1998).
Figure 5-37 shows how to overlap INVENTORY and SHIPMENT. You can use the
resulting overlapped schema to compare quantities shipped and those warehoused for each
product. See section 7.1.2 for a detailed discussion on drill-across queries on overlapped
schemata.
We repeat that hierarchies are widely shared between more than one fact schema in
the majority of cases. And in many cases, they will even be identical (conformed
hierarchies, for example the product hierarchy). In other cases, a hierarchy dimension
will coincide with a dimensional attribute from another hierarchy (for example, a hierarchy
rooted in the customer dimension in the complaints fact schema will probably coincide
with the subhierarchy rooted in customer from the order hierarchy in the shipments
schema). Figure 5-38 shows an example of conformed hierarchies.
FIGURE 5-37 Overlapping the INVENTORY and SHIPMENT schemata
[FIGURE 5-38: conformed hierarchies shared by the INVENTORY, COMPLAINT, SHIPMENT, and SALE
fact schemata]
5.6.1 Metamodel
Figure 5-39 shows the UML metamodel of the DFM. We suggest you read one of the many
texts on this subject (for example, Arlow and Neustadt, 2005) for an explanation of the
FIGURE 5-39 UML metamodel of the DFM
Hierarchy
A hierarchy h is a pair (A, <h) where
Fact Schema
We define a fact schema F as a pair (H, M) in which
* H is a finite set of hierarchies
* M is a finite (possibly empty) set of measures
Group-by Set
Let F be a fact schema. We define a group-by set of F as any subset of attributes, G ⊆ Attr(F),
such that for each pair of attributes and a contained in G from the same hierarchy hf
f
The G = {a1, ..., an} group-by set including all of the dimensions of F is called the primary
group-by set of F and marked with Gbs(F). All of the others are called secondary group-by sets
and identify possible ways to aggregate events, as we mentioned in section 5.5.
Roll-up Order
Given fact schema F, we define a partial roll-up order on the set of all possible group-by
sets of F as follows: G_i ≤ G_j if and only if G_i → G_j.
According to the roll -up order definition ,, it is G . G when, for each attribute a from
hierarchy k contained in G ^ either G includes an attribute at such that at <h or no attribute
from ft is part of G ,.
FIGURE 5-41
Hierarchy Instance
Given a hierarchy h = (A, <h), an instance of h is a pair (D, L) where
* D is the set of domains Dom(a) of the attributes in A; each domain is the
You should actually define a roll-up function only for the immediately adjacent pairs of
attributes in the partial order. You may transitively compose the function for the other pairs.
Given the a_i, a_j, and a_k attributes such that a_i <h a_j <h a_k, it would actually be

$$up_{a_i}^{a_k}(\cdot) = up_{a_j}^{a_k}\left(up_{a_i}^{a_j}(\cdot)\right)$$

If two or more separate paths exist in the <h order that join two attributes a_i and a_j, we
assume that the up roll-up functions obtained from composing roll-up functions along the
different paths are identical to enable an accurate aggregation. In the stores hierarchy in
Figure 5-41, country and store are connected by two paths that include salesDistrict
on the one side and city and state on the other side. For this reason, it should result
in this:

$$up_{store}^{country}(\cdot) = up_{salesDistrict}^{country}\left(up_{store}^{salesDistrict}(\cdot)\right) = up_{state}^{country}\left(up_{city}^{state}\left(up_{store}^{city}(\cdot)\right)\right)$$
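The requirement that every path yields the same result can be checked mechanically on a hierarchy instance. The sketch below is an assumption-laden illustration: the `UP` tables hold hypothetical instance data for the stores hierarchy (placing the EverMore store in Chicago is invented for the example), and `roll_up` and `paths_agree` are made-up helper names.

```python
# Hypothetical roll-up functions between adjacent attributes of the stores hierarchy.
UP = {
    ("store", "city"): {"EverMore": "Chicago"},
    ("city", "state"): {"Chicago": "Illinois"},
    ("state", "country"): {"Illinois": "USA"},
    ("store", "salesDistrict"): {"EverMore": "NorthIllinois"},
    ("salesDistrict", "country"): {"NorthIllinois": "USA"},
}


def roll_up(value, path):
    """Compose the adjacent roll-up functions along a path of attribute names."""
    for lower, upper in zip(path, path[1:]):
        value = UP[(lower, upper)][value]
    return value


def paths_agree(value, *paths):
    """True when every path maps the value to the same ancestor value."""
    return len({roll_up(value, path) for path in paths}) == 1


VIA_CITY = ["store", "city", "state", "country"]
VIA_DISTRICT = ["store", "salesDistrict", "country"]
```

Calling `paths_agree("EverMore", VIA_CITY, VIA_DISTRICT)` confirms that both paths lead to the same country; a hierarchy instance where the check fails could not be aggregated consistently.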
Coordinate
Given a fact schema F = (H, M), an instance of each hierarchy in H, and a group-by set
G = {a_1, ..., a_n}, a coordinate of G is a function that maps each attribute a_i of G into a
value of Dom(a_i). We will mark with Dom(G) the set of possible coordinates of G:
Dom(G) = Dom(a_1) × ... × Dom(a_n)
Now you can define an instance of a fact as a function that associates the coordinates of
the primary group-by set with primary events.
Each specific n-ple of measure values determined is called a primary event.
A primary event is a basic cell of information defined by a coordinate of the primary
group-by set. A value for each measure in the fact schema is associated with each primary
event. Let a_1, ..., a_n be the dimensions of fact schema F and c be a cube instance of F. We will
mark the primary event corresponding to the α ∈ Dom(a_1) × ... × Dom(a_n) coordinate
with c(α).
In the sales example, a cube may be then defined as follows:
c(product:'Shiny', store:'EverMore', date:'4/5/07') = (quantity:10, receipts:25)
c(product:'Shiny', store:'EverMore', date:'4/7/07') = (quantity:20, receipts:50)
Given two group-by sets G_i and G_j such that G_i ≤ G_j, the functional dependency between
G_i and G_j and the roll-up functions in F allow each coordinate α_i of G_i to be mapped into
exactly one coordinate α_j of G_j. Each coordinate of G_j corresponds to a set of coordinates of
G_i. You may then extend the roll-up function definition to map from one group-by set
domain to another. For example,

$$up_{(type,\, city)}^{(category,\, state)}(\text{'Drink'}, \text{'Los Angeles'}) = (\text{'Food'}, \text{'California'})$$

(We have omitted the attribute names that define the coordinates to simplify notation, as the
subscript and superscript group-by sets of the roll-up function already express them.)
If G includes more attributes from the same hierarchy, the constraint on unique
compositions of roll-up functions among attributes ensures a clear definition of roll-up
functions among coordinates; here's an example:

$$up_{(salesDistrict,\, city)}^{(country)}(\text{'NorthIllinois'}, \text{'Chicago'}) = (\text{'USA'})$$

Here we use the $up_{(product,\, state,\, quarter)}^{(type,\, year)}$ roll-up function to map all
the coordinates of the {product, state, quarter} group-by set related to soap-type products, to
the four quarters of 2007, and to all the stores into the (type:'Soap', year:'2007') coordinate.
As a last example, think of the specific case of a roll-up function that maps into the
empty group-by set. Strictly speaking, the coordinates of the empty group-by set do not exist
because no attributes are included in the empty group-by set. Conventionally, you may assume
that one fictitious coordinate exists, to which all of the coordinates of any group-by set G
correspond via the up roll-up function from G to the empty group-by set.
Now let there be a secondary group-by set G of F. The roll-up function that
associates a set of coordinates of Gbs(F) with each coordinate of G enables the aggregation of
primary events of cube c by group-by set G. You then obtain an aggregation c* where the
measure values for each coordinate sum up those in the coordinates of c corresponding via
the roll-up function.
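Aggregation along a roll-up function can be sketched as grouping primary events by the coordinate they map to and summing the measures. The sketch reuses the two primary events of the sales example; the `PRODUCT_TO_TYPE` mapping, the assumption that the last two digits of the date give the year, and the function names are hypothetical, and plain summation presumes additive measures.

```python
from collections import defaultdict

# Primary events of the sales cube: (product, store, date) -> measure values.
CUBE = {
    ("Shiny", "EverMore", "4/5/07"): {"quantity": 10, "receipts": 25},
    ("Shiny", "EverMore", "4/7/07"): {"quantity": 20, "receipts": 50},
}

# Hypothetical roll-up function from product to type.
PRODUCT_TO_TYPE = {"Shiny": "Soap"}


def up_to_type_year(coordinate):
    """Map a primary coordinate to the (type, year) group-by set; store is dropped."""
    product, _store, day = coordinate
    return (PRODUCT_TO_TYPE[product], "20" + day.rsplit("/", 1)[1])


def up_to_empty(_coordinate):
    """Map every coordinate to the single fictitious coordinate of the empty set."""
    return ()


def aggregate(cube, up):
    """Sum the measures of all primary events that roll up to the same coordinate."""
    result = defaultdict(lambda: defaultdict(int))
    for coordinate, measures in cube.items():
        target = up(coordinate)
        for name, value in measures.items():
            result[target][name] += value
    return {target: dict(values) for target, values in result.items()}
```

Passing `up_to_empty` instead of `up_to_type_year` collapses the whole cube into one grand total, which mirrors the convention of a single fictitious coordinate for the empty group-by set.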
CHAPTER 6
Conceptual Design
Chapter 5 described a conceptual model for data marts, which should be used to
create a set of fact schemata. In this chapter, we discuss how to build a conceptual
schema for a data mart in order to meet user requirements and be consistent with
operational source schemata.
Chapter 2 already pointed out that three fundamental methodological approaches can
be taken to data mart design: the data-driven approach, the requirement-driven approach,
and the mixed approach. Their differences are based on the relevance given to the source
database analysis and the end-user requirement analysis phases. Opting for one of those
approaches dramatically affects the way conceptual design will be carried out.
* The data-driven approach defines the conceptual schema for a data mart in relation to
the structure of an operational data source (typically, your reconciled database).
This helps you skip the complex task of linking them at a later stage. Moreover, you
can almost automatically derive a preliminary conceptual schema for your data
mart from your data source schema. When you analyze user requirements, this
gives you the immediate opportunity to have a precise idea of the maximum
performance potential of your data mart, and then to discuss specifications more
realistically.
* In a requirement-driven approach, there is more room for a user requirement analysis
because no detailed information on data sources is available or sources are too
complex to be investigated. The resulting specifications become the foundations for
a conceptual schema of a data mart. Virtually, this means that designers have to be
able to manipulate their interviews with users to extract (i) very precise instructions
about facts to be represented; (ii) measures defining those facts; and (iii) hierarchies
for those facts to be usefully aggregated. Designers will deal with linking the resulting
conceptual schemata to operational data sources only at a later stage.
* In a mixed approach, user requirements play an active role in imposing limits on the
complexity of data source analysis. The in-depth requirements analysis is formally
performed as in the requirement-based approach. The results of requirement
analysis guide the algorithm that obtains a draft structure of fact schemata from
data source schemata as in the data-driven approach.
In this chapter, we will examine conceptual design with reference to these three
approaches.
In the data-driven approach, it is extremely interesting to study how you can derive
your conceptual schema from those schemata that define a relational database. This is
because the majority of the operational databases created over the last decade are based on
the relational model. Our methodology may be applied to conceptual Entity-Relationship
schemata (see section 6.1) or relational logical schemata (see section 6.2) along with minor
changes. Naturally, conceptual Entity-Relationship schemata are more expressive than
relational logical schemata. For this reason, they are generally considered as a better design
resource. However, companies often provide Entity-Relationship schemata that are
incomplete and inaccurate. Very often, the only accessible documentation consists of logical
schemata of databases if no careful investigation is carried out.
Companies are increasingly considering the Internet as an integral part of their business
and communications plans. Moreover, a large part of web data is stored in eXtensible
Markup Language (XML) format. For this reason, it is important that you can integrate
XML data into your data warehousing systems. Section 6.3 shows how a data mart
conceptual schema can be derived from schemata defining XML sources. The material
presented is discussed in Golfarelli and others (2001).
NOTE See Atzeni et al., 1999 for a more detailed discussion on the Entity-Relationship model
(ERM). See Cabibbo and Torlone, 1998; Hüsemann et al., 2000; and Song et al., 2007 for other
approaches to conceptual design from the Entity-Relationship source schemata. Jones and Song
(2005) discuss an interesting approach to conceptual design based on the application of relevant
design patterns. Other work relevant to XML sources and semantic web contexts is presented in
Vrdoljak et al., 2003; Jensen et al., 2001; and Romero and Abelló, 2007.
Section 6.4 gives a detailed description of the mixed approach to conceptual design that
is only partially different from the data-driven approach. Finally, section 6.5 provides more
details on the main instructions for the requirement-driven approach to conceptual design.
First, select relevant facts from your source schema (step 1). Then, create your attribute tree
in semi-automatic mode (step 2.a). This is a transitional structure that is useful for delimiting
the area relevant to your fact schema in order to eliminate all the irrelevant attributes and
modify dependencies that link these (step 2.b), and to define measures and dimensions
(steps 2.c and 2.d). The attribute tree is also very important because it links your data mart
and source schema. This link acts as a key for the data-staging process. Then it is relatively
simple to translate your attribute tree into a fact schema (step 2.e). Note that step 2.a is based
on the application of an algorithm; steps 2.c, 2.d, and 2.e are based on the objective properties
of attributes; and steps 1 and 2.b require an in-depth knowledge of the corporate business
model. For this reason, designers must pay much greater attention to the last two steps.
Every step in this method will be described with reference to the sales example. Figure 6-1
shows a simplified Entity-Relationship schema for this example. Each instance of the SALE
relationship represents the sale of a specific product listed on a specific sale receipt. The
unitPrice attribute relates to SALE rather than PRODUCT because product prices can vary
over time. The identifier of SALES DISTRICT is the pair consisting of the districtNum
attribute and the COUNTRY entity identifier. Note that there is a redundant cycle involving the
STORE, CITY, STATE, COUNTRY, and SALES DISTRICT entities. However, the cycle
involving PRODUCT and CITY through STORE and BRAND is not redundant because the city
where a product of a particular brand is produced is usually different from the cities where
those products are sold.
FIGURE 6-1 Conceptual Entity-Relationship schema for the sales operational database
simplicity, you can transform R into an entity (reification process). To do this, add a new
entity called F and replace each branch of R with a binary relationship (R_i) between F and E_i.
If you mark the minimum and maximum cardinality level at which an entity E participates
in a relationship A with min(E, A) and max(E, A), respectively, you get this:
TIP Those entities that represent frequently updated archives, such as SALE, are good candidates for
fact definition, but those entities that represent structural domain properties corresponding to
almost static archives, such as STORE and CITY, are not.
As a matter of fact, this common-sense rule must not be applied too literally. This is
because the borderline between what should be a fact and what should not depends largely
on the application domain and the type of analysis users wish to carry out. Consider the
example in which shop assistants in a supermarket are assigned to different departments.
If each assistant is assigned only to one department at any time, is it correct to classify this
relationship as a fact? If users want to pay attention to the sales made by each assistant, both
assistants and departments will be modeled as hierarchy attributes. A functional dependency
will model shop assistant assignments to a specific supermarket department. On the contrary,
if users are more interested in studying assistants transferred from a supermarket department
to another one, then you should introduce a new fact to emphasize dynamic properties of the
assignments to specific departments. Note that the choice between one or the other design
solution does not necessarily depend on how often assistant assignments change.
FIGURE 6-2 Reification of the R relationship
FIGURE 6-3 Reification of the SALE relationship
Each fact identified in a source schema becomes the root of a different fact schema. In
the following sections, we will focus on one single fact that corresponds to an entity
possibly deriving from a reification process. The most relevant fact for the analysis is the
sales of a product in the sales example. At a conceptual level, the sales fact is represented by
the SALE relationship, reified into the SALE entity, as shown in Figure 6-3.
It is important to note that different entities can sometimes be candidates for expressing
an individual fact. If this is the case, we suggest you should always choose as a fact the
entity to build an attribute tree that includes as many attributes as possible. Section 6.2.2
will give an example to clarify this concept.
Attribute Trees
Given a relevant part of an Entity-Relationship source schema and one of its entities
F classified as a fact, an attribute tree is the tree that meets the following requirements:
* Each node corresponds to a source schema attribute (simple or composite).
* The root corresponds to the identifier of the F entity.
* For each node v, the corresponding attribute functionally determines all the
attributes corresponding to the descendants of v.
You can automatically build the attribute tree corresponding to F if you run an
algorithm that recursively navigates functional dependencies expressed by identifiers and
to-one relationships in your source schema. Each time you examine an entity E, you should
create a new node v corresponding to the E identifier, and then you should add a child
node to v for every attribute of E. These also include all the individual attributes that
make up the E identifier, if it is composite. Every time you find a relationship R with a
maximum cardinality of 1 from the E entity to another entity G, you should add a child for
the G identifier and all the children for the attributes of R to the v node. Then you should
repeat this procedure for the G entity. The entity from which this process is triggered is the
one you chose as the fact, F.
Figure 6-4 shows the basic features of the algorithm that builds the attribute tree in
pseudo-code. Figure 6-5 shows an example of the control flow for the translate
procedure and it gives a step-by-step explanation on how to build an attribute tree branch
for the sales example. Figure 6-6 shows the resulting tree.
FIGURE 6-4 Pseudocode for the attribute tree building algorithm
root = newNode(ident(F)); // ident(F) is the identifier of F
                          // the tree root is labeled with the identifier
                          // of the entity set as a fact
translate(F, root);
procedure translate(E, v):
// E is the current entity of the source schema,
// v is the current node of the tree
{ for each attribute a ∈ E such that a ≠ ident(E)
    addChild(v, newNode(a));
    // adds the a child to the v node
  for each entity G linked to E by a relationship R such that max(E, R) = 1
  { for each attribute b of R
      addChild(v, newNode(b));
      // adds the attributes of R as children of v
    next = newNode(ident(G));
    addChild(v, next);
    translate(G, next);
    // repeats the procedure for the G entity
  }
}
FIGURE 6-5 Operating flow of the translate procedure for the sales example
translate(E = SALE RECEIPT, v = saleReceiptNum):
  addChild(saleReceiptNum, date);
  for G = STORE:
    addChild(saleReceiptNum, store);
    translate(STORE, store);
translate(E = STORE, v = store):
  addChild(store, address);
  ...
translate(E = SALES DISTRICT, v = districtNum+country):
  addChild(districtNum+country, districtNum);
  for G = COUNTRY:
    addChild(districtNum+country, country);
    translate(E = COUNTRY, v = country);
Naturally, the need to duly pay attention to many specific elements, for which you must
check your source schema , makes your algorithm operations more complex than this. In the
following paragraphs we will make a few remarks and give you some guidelines to manage
those elements.
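The translate procedure of Figure 6-4 can be sketched in Python as follows. This is a minimal sketch under assumptions, not the book's implementation: the dictionary layout of `SCHEMA` (identifier, attributes, to-one links), the simplified fragment of the sales schema, and the `max_visits` guard that bounds how often one entity may recur along a single path are all hypothetical choices made for illustration.

```python
class Node:
    """A node of the attribute tree, labeled with a source schema attribute."""

    def __init__(self, label):
        self.label = label
        self.children = []


def translate(schema, entity, node, max_visits=1, visits=None):
    """Expand node (the identifier of entity) with attributes and to-one links."""
    visits = dict(visits or {})  # per-path copy, so sibling branches stay independent
    visits[entity] = visits.get(entity, 0) + 1
    info = schema[entity]
    for attribute in info["attrs"]:
        node.children.append(Node(attribute))
    for relationship_attrs, target in info["to_one"]:
        if visits.get(target, 0) >= max_visits:
            continue  # assumed guard: stop instead of building an infinite branch
        for attribute in relationship_attrs:
            node.children.append(Node(attribute))
        child = Node(schema[target]["id"])
        node.children.append(child)
        translate(schema, target, child, max_visits, visits)


def build_attribute_tree(schema, fact, max_visits=1):
    root = Node(schema[fact]["id"])
    translate(schema, fact, root, max_visits)
    return root


def as_tuple(node):
    """Nested (label, children) form, convenient for printing and comparison."""
    return (node.label, [as_tuple(child) for child in node.children])


# Hypothetical, simplified fragment of the sales source schema.
SCHEMA = {
    "SALE RECEIPT": {"id": "saleReceiptNum", "attrs": ["date"],
                     "to_one": [([], "STORE")]},
    "STORE": {"id": "store", "attrs": ["address"],
              "to_one": [([], "CITY"), ([], "SALES DISTRICT")]},
    "CITY": {"id": "city", "attrs": [], "to_one": [([], "STATE")]},
    "STATE": {"id": "state", "attrs": [], "to_one": [([], "COUNTRY")]},
    "SALES DISTRICT": {"id": "districtNum+country", "attrs": ["districtNum"],
                       "to_one": [([], "COUNTRY")]},
    "COUNTRY": {"id": "country", "attrs": [], "to_one": []},
}
```

Because the visit counter is copied per path, COUNTRY is reached twice (through STATE and through SALES DISTRICT) and appears as two separate nodes, while a self-referencing entity such as a part made of parts is cut off after `max_visits` occurrences.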
Your source schema may contain a many-to-one relationship cycle. For example, one
part may be made up of many other parts. If this is the case, the translate procedure
loops forever and attempts to build an infinite branch in the attribute tree. Two solutions are
possible at the conceptual level. The first solution involves the use of a recursive hierarchy
to show that the height of the hierarchy is generally unlimited. With this solution you can
have instances of hierarchies of different heights, but bear in mind that the logical modeling
of recursive hierarchies inevitably involves complexity and performance problems. See
sections 5.2.8 and 9.1.7 for more details on conceptual and logical modeling of recursive
hierarchies. The second solution is very often preferred. This solution stops the cycle
after a specific number of iterations determined by the relationship relevance in your
application domain. Figure 6-7 shows a simplified schema for the transfer of personnel
within a company. There is a cycle of many-to-one relationships in this schema involving the
EMPLOYEE, the DEPARTMENT, and the DIVISION entities. There are three cycles because
you can reach DEPARTMENT twice directly from TRANSFER, chosen as a fact, and a third
time from EMPLOYEE. Figure 6-7 shows two possible attribute trees. In the first tree, we
used three recursive hierarchies for modeling. In the second tree, for each department v', we
chose to stop the cycle at the v'' department to which the v' manager belongs, because we
assumed that the v'' division coincides with the v' division.

FIGURE 6-6 Attribute tree for the sales schema; the root is highlighted in gray.

FIGURE 6-7 Entity-Relationship schema for the personnel transfers and two possible corresponding
attribute trees
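The second solution (stopping the cycle after a fixed number of iterations) can be sketched as follows. This is only an illustration under stated assumptions: the `SCHEMA` dictionary, the `max_visits` parameter, and the path-like node names are hypothetical and are not part of the original translate procedure.

```python
from collections import defaultdict

# Hypothetical source schema: entity -> identifier, attributes, many-to-one links.
# PART has a cyclic many-to-one link to itself (a part belongs to a larger part).
SCHEMA = {
    "PART": {"id": "partCode", "attrs": ["weight"], "to_one": ["PART"]},
}

def translate(entity, node, tree, visits, max_visits):
    """Add the attributes of `entity` under `node`, following many-to-one links."""
    visits[entity] += 1
    for attr in SCHEMA[entity]["attrs"]:
        tree[node].append(f"{node}/{attr}")
    for target in SCHEMA[entity]["to_one"]:
        if visits[target] >= max_visits:
            continue  # stop unrolling the cycle along this path
        child = f"{node}/{SCHEMA[target]['id']}"
        tree[node].append(child)
        translate(target, child, tree, visits, max_visits)
    visits[entity] -= 1  # the count only tracks the current path

def build_tree(fact, max_visits=2):
    """Build the attribute tree rooted in the identifier of `fact`."""
    tree = defaultdict(list)
    root = SCHEMA[fact]["id"]
    translate(fact, root, tree, defaultdict(int), max_visits)
    return root, dict(tree)

root, tree = build_tree("PART", max_visits=2)
```

With `max_visits=2` the PART entity is unrolled twice along the branch and then the recursion stops, so the tree stays finite. The right bound depends on how relevant the relationship is in the application domain.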
While you are exploring a cyclic schema, you may happen to reach an individual E
entity twice while you go down different paths. In this way, you generate two homologous
nodes called v' and v'' in your tree. If each instance of F precisely determines one instance of
E, regardless of the path followed, you can have the v' and v'' nodes coinciding in a single
node called v. In this way, you create a convergence. The same applies to each pair of
homologous children of v' and v''. If this is not the case, v' and v'' must be left separate. After
creating the fact schema, you can opt for a shared hierarchy construct. Figure 6-7 shows an
example of the inappropriate conditions for a convergence because the origin and
destination departments must be separate. The sales example shows the correct conditions
for convergence instead, because the country attribute can be reached either from
districtNum or from state.
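The convergence condition can be checked on sample data. The sketch below is hypothetical throughout: the instance data, the two path functions, and the `can_converge` helper are illustrative names, and the check only tests whether both paths reach the same instance for every starting instance.

```python
# Hypothetical instance data: how each store reaches a country along two paths.
store_district = {"s1": "d1+Italy", "s2": "d7+France"}
district_country = {"d1+Italy": "Italy", "d7+France": "France"}
store_city = {"s1": "Bologna", "s2": "Lyon"}
city_state = {"Bologna": "Emilia-Romagna", "Lyon": "Rhone"}
state_country = {"Emilia-Romagna": "Italy", "Rhone": "France"}

def via_district(store):
    """Path store -> sales district -> country."""
    return district_country[store_district[store]]

def via_state(store):
    """Path store -> city -> state -> country."""
    return state_country[city_state[store_city[store]]]

def can_converge(instances, path_a, path_b):
    """True if both paths determine the same instance for every starting instance."""
    return all(path_a(x) == path_b(x) for x in instances)

# Sales example: both paths reach the same country, so the nodes may coincide.
country_converges = can_converge(store_district, via_district, via_state)

# Transfer example: origin and destination departments differ, so no convergence.
transfers = {"t1": {"origin": "Sales", "destination": "Research"}}
department_converges = can_converge(
    transfers,
    lambda t: transfers[t]["origin"],
    lambda t: transfers[t]["destination"],
)
```

A check on sample data can only disprove convergence. Whether the condition holds in general remains a property of the application domain.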
To-many relationships (max(E, R) > 1) and multiple attributes in the source Entity-
Relationship schema are not automatically inserted in trees. These may give rise to cross-
dimensional attributes or multiple arcs that designers will frame manually in fact schemata
at the end of the conceptual design phase. See section 6.1.5 for more details on cross-
dimensional attributes or multiple arcs.
It is important to show whether an optional link exists between attributes in one
hierarchy. To do this, you can use a hyphen in those arcs that correspond to optional
relationships (min(E, R) = 0) or optional attributes in your Entity-Relationship schema. In
the sales example, this applies to the diet attribute.
If you carry out a reification process (section 6.1.1), you can turn an n-ary relationship in
your Entity-Relationship schema into n binary relationships. Most n-ary relationships have a
maximum multiplicity that is greater than 1 for all their branches. If this is the case, they define
n one-to-many binary relationships that you cannot insert into attribute trees. Figure 6-8 gives
an example of this point. On the contrary, a branch of an n-ary relationship whose maximum
multiplicity is equal to 1 defines a one-to-one relationship in a reified schema. In this case, you
can insert it in your tree. Figure 6-9 shows an example of this point. You can reach TEAM twice
from MATCH. Remember that you can always replace an n-ary relationship with a maximum
multiplicity of 1 for one branch in your Entity-Relationship schema with n - 1 equivalent binary
relationships without the need for reification.
Specialization Entity-Relationship hierarchies are equivalent to optional one-to-one
relationships between super-entities and sub-entities, and they may be treated as such in the
algorithm. Alternatively, if a hierarchy exists between the E super-entity and the E1, ..., En
sub-entities, you can merely add a child to the node corresponding to the E identifier. This
child allows you to discriminate between different sub-entities, and it actually corresponds
to an attribute with n possible values. Figure 6-10 shows both solutions and makes reference
to the order line example. Note that the specialization attributes of the sub-entities are
optional in the second solution that includes the productType discriminator.
FIGURE 6-8 Entity-Relationship schema for hospital admissions and the corresponding attribute tree
A composite attribute called c of the Entity-Relationship schema consists of simple
attributes, such as a1, ..., am. It is inserted in the attribute tree as a c node with children
a1, ..., am. See the address example in Figure 6-8. Then you can graft c or prune its children, as
mentioned in section 6.1.3. Figure 6-11 shows that composite identifiers can also be treated
in the same way.
FIGURE 6-9 Entity-Relationship schema for suspensions in football matches and the corresponding
attribute tree

FIGURE 6-10 Entity-Relationship schema for the order line example and two solutions for the attribute tree

FIGURE 6-11 Schema with the CALL and TELEPHONE entities and the corresponding attribute tree

FIGURE 6-12 Attribute tree (top), pruning (left), and grafting
To prune node v, you should remove the entire subtree rooted in v. The attributes that
you remove will not be included in your fact schema, so they cannot be used for data
aggregation. Grafting is used when you need to retain the descendants of a node in your
tree, although that node shows information that is not relevant. To graft node v, whose
parent is called v', you should link all the children of v directly to v' and then remove v. As a
result, the aggregation level corresponding to v will be lost, but the levels corresponding to
its descendants will be retained.
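Both operations can be illustrated on a tree stored as a dictionary that maps each node to the list of its children. This is a minimal sketch: the `prune` and `graft` functions and the simplified sales fragment are hypothetical, not taken from the original text.

```python
def prune(tree, v):
    """Remove node v together with the entire subtree rooted in v."""
    for child in tree.pop(v, []):
        prune(tree, child)
    for children in tree.values():
        if v in children:
            children.remove(v)

def graft(tree, v):
    """Remove node v but link all of its children directly to the parent of v."""
    parent = next(p for p, children in tree.items() if v in children)
    position = tree[parent].index(v)
    tree[parent][position:position + 1] = tree.pop(v, [])

# Hypothetical, simplified fragment of the sales attribute tree.
tree = {
    "sale": ["saleReceiptNum", "product"],
    "saleReceiptNum": ["date", "store"],
    "store": ["address", "storeCity"],
    "storeCity": ["state"],
}

graft(tree, "saleReceiptNum")   # date and store become children of the root
prune(tree, "address")          # address can no longer be used for aggregation
```

After these two calls the root has the children date, store, and product, and store keeps only storeCity. The receipt level is lost while its descendants survive.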
FIGURE 6-13 The attribute tree for sales after pruning and grafting

FIGURE 6-14 Entity-Relationship schema for airline ticketing and the attribute tree before and after
pruning the optional ticketNumber (TICKETS) attribute

FIGURE 6-15 Managing composite identifiers: source schema, attribute tree, pruning (left), and
grafting (right)
You can achieve two possible results. If you have to leave the granularity of E unchanged
in your fact schema, you can retain the c node and prune one or more of its children. For
example, you can retain the districtNum+country node because you need to aggregate
by individual sales districts, but you can prune districtNum because this is not a
significant aggregation. Otherwise, you can graft c to remove it and retain all or some of its
children if the aggregation expressed by c is too fine-grained.
To conclude this section, we must emphasize that you may need to manipulate your
attribute trees further. In particular, you may need to make radical modifications to the
structure of your trees by replacing the parent of a specific node. This is the same as
adding or deleting a functional dependency, depending on whether the new parent is a
descendant or a predecessor of the original parent, respectively. Figure 6-16 shows how the
structure of a hierarchy changes when you modify functional dependencies. The left-hand
tree shows the a→b, b→c, and a→c functional dependencies (you can use transitivity
to infer the third functional dependency from a→b and b→c). The right-hand tree shows
only a→b and a→c.
In practice, you should add a functional dependency if this is not displayed in the
source schema but is inherent to the application domain. On the contrary, you could also
find it useful to delete a functional dependency if you want an attribute that is not a direct
root child to become a dimension or measure. However, you should remember that this
particular operation introduces a functional dependency between dimensions. For example,
if you do not want to sacrifice the information from the sale receipt but you also want to
have a time dimension in the sales fact, you can delete the functional dependency between
saleReceiptNum and date. Figure 6-17 shows the tree you can obtain. In this tree, you
can choose as dimensions product, saleReceiptNum, and date; the resulting fact
schema will clearly have a functional dependency between the saleReceiptNum and the
date dimensions.
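Replacing the parent of a node can be sketched on the same kind of dictionary-based tree. The `change_parent` helper below is hypothetical, and the two trees are simplified illustrations of the cases discussed above.

```python
def change_parent(tree, node, new_parent):
    """Detach node from its current parent and attach it to new_parent."""
    for children in tree.values():
        if node in children:
            children.remove(node)
    tree.setdefault(new_parent, []).append(node)

# Deleting b -> c: c moves up to a predecessor of its original parent.
abc = {"a": ["b"], "b": ["c"]}
change_parent(abc, "c", "a")

# Sales example (simplified, hypothetical tree): deleting the dependency
# between saleReceiptNum and date makes date a direct child of the root.
sales = {"sale": ["saleReceiptNum", "product"], "saleReceiptNum": ["date", "store"]}
change_parent(sales, "date", "sale")
```

In the second tree, date can now be chosen as a dimension next to saleReceiptNum and product, at the price of a functional dependency between two dimensions.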
FIGURE 6-16 Changes applied to an attribute tree after deleting or adding a functional dependency

FIGURE 6-17 An alternative sales attribute tree
When a one-to-one association exists, you can sometimes find it useful to invert both its
end nodes. This can be helpful when (i) a relation key is not a mnemonic code so you do not
need to include it in your data mart, and (ii) one of the relation attributes is a description
field that can act as an alternate key. In this case, you can switch both attributes and then
prune the code attribute. Figure 6-18 shows an example of this point.
FIGURE 6-18 Switching and pruning attributes in one-to-one associations
Selecting dimensions is crucial for design because it defines the granularity of primary
events. Each primary event "sums up" all the instances of F corresponding to a combination
of dimension values. If all the attributes making up the identifier of F are chosen as
dimensions, each primary event will correspond to an instance of F (lossless-grained fact
schema). Otherwise, you can prune or graft one or more attributes identifying F so that each
primary event can correspond to multiple instances (lossy-grained fact schema).
In the sales example, we have chosen the date, store, and product attributes as
dimensions. The granularity of primary events is coarser than that of the SALE source
schema entity because we grafted the saleReceiptNum node, a child of the root.
Figure 6-19 shows an interesting example of how to choose granularity. It shows part of
the Entity-Relationship schema relating to the admissions of a patient and the attribute tree
associated with admissions. Data warehousing in the health industry faces the classic
dilemma of keeping or losing patient granularity. The left-hand tree in the figure shows that
we retained the patientSSN node. For this reason you may use it as a dimension, together
with fromDate, toDate, and department. On the contrary, we grafted the patientSSN
node and pruned firstName and surname to sacrifice the granularity of individual
patients in the right-hand tree. Now the dimensions become fromDate, toDate,
department, gender, userSegment, city, and yearOfBirth. The last dimension is of
course obtained from dateOfBirth. Figure 6-20 shows the fact schemata that you can
obtain for both cases after selecting your measures. The first schema is lossless-grained. The
second schema is lossy-grained. The second schema might seem to have a higher number of
primary events than the first one because it has more dimensions. But the opposite is true.
When the department and admission date values are equal, all the patients born in the same
year, of the same gender, resident in the same city, and belonging to the same user segment
are classified into a single event in the second schema. The first schema, instead, classifies
them as different events.
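The effect of the two dimension choices on the number of primary events can be checked on a handful of rows. The admission records below are invented for illustration, and `primary_events` is a hypothetical helper that simply collects the distinct combinations of dimension values.

```python
# Invented admission records: A1 and B2 share every attribute except the SSN.
base = {
    "fromDate": "2024-03-01", "toDate": "2024-03-05", "department": "Cardiology",
    "gender": "F", "userSegment": "standard", "city": "Bologna", "yearOfBirth": 1960,
}
admissions = [
    dict(base, patientSSN="A1"),
    dict(base, patientSSN="B2"),
    dict(base, patientSSN="C3", gender="M"),
]

LOSSLESS_DIMS = ["patientSSN", "fromDate", "toDate", "department"]
LOSSY_DIMS = ["fromDate", "toDate", "department", "gender",
              "userSegment", "city", "yearOfBirth"]

def primary_events(rows, dims):
    """Return the distinct combinations of dimension values found in rows."""
    return {tuple(row[d] for d in dims) for row in rows}

lossless_count = len(primary_events(admissions, LOSSLESS_DIMS))
lossy_count = len(primary_events(admissions, LOSSY_DIMS))
```

Here the lossless-grained choice yields three primary events, one per admission. The lossy-grained choice yields two, because the first two patients fall into the same event despite the larger number of dimensions.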
FIGURE 6-19 Hospital admission schema and two alternative attribute trees

FIGURE 6-20 Fact schemata corresponding to the trees in Figure 6-19

One particular feature of the choice of dimensions requires a separate discussion. Let us
consider the airline ticketing case of Figure 6-14. Here, the identifier of the F entity chosen as
a fact is simple, that is, it consists of a single attribute k. If the granularity of each single
instance of F is not relevant, you can choose the dimensions among the children of the root,
as we previously mentioned. Figure 6-37, later in this chapter, shows an example of a ticketing
fact schema with the flightNumber, flightDate, check-in, and passengerGender
dimensions. However, users can sometimes request the top granularity level defined by the
k identifier. If this is the case, you could perform analyses at the individual ticket level in the
ticketing example. This request mostly conceals that users may (improperly) wish to use their
data marts for operational queries linked to everyday life operations, which would be far
better addressed by an operational database. However, we still find it useful to show designers
how they can proceed if they are asked to retain maximum granularity:
1. Duplicate the root node of your attribute tree. This means that you have to add a new
node that becomes the new root. This new root is also labeled with the name of the
identifying fact. To connect it to the previous root, you can use an arc showing a
FIGURE 6-21 Duplicating the root and deleting functional dependencies from the attribute tree
in the air ticketing example

FIGURE 6-22 Transient database schema for online auctions and its attribute tree

FIGURE 6-23 Temporal database schema for online auctions
order line Entity-Relationship schema. Figure 6-24 shows the original attribute tree (left)
and the tree obtained by deleting the functional dependency between orderCode and
date (right), where date is a candidate for becoming a dimension.
See sections 5.4.1 and 5.4.2 for more details on the choice between a representation based
on a transactional or a snapshot fact schema and for issues raised if updates are delayed.
FIGURE 6-24 Attribute tree for order lines with/without the functional dependency between orderCode
and date
If the fact granularity is different from that of the source schemata, you may find it
useful to define a number of measures that use different operators to aggregate the same
attribute. For example, suppose that each event corresponds to a set of airline tickets sold
for a specific flight on a specific date, and that the fare attribute belongs to the TICKET
entity in the source schema. As a result, you can define one or more of the following
measures in the fact schema: averageFare, calculated by using the AVG operator;
maximumFare, calculated by using the MAX operator; minimumFare, calculated by using
the MIN operator; and receipts, calculated by adding the fares for individual tickets
together. Of course, the only additive measure will be receipts. To aggregate the other
measures you should use the same operator that defines them.
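As an illustration, the four measures can be computed from ticket-level data grouped by flight and date. The ticket rows are invented and the `fare_measures` function is a hypothetical name. The sketch only shows how several measures derive from the same fare attribute.

```python
from collections import defaultdict
from statistics import mean

# Invented source rows: (flightNumber, flightDate, fare) for individual tickets.
tickets = [
    ("AZ100", "2024-05-01", 120.0),
    ("AZ100", "2024-05-01", 80.0),
    ("AZ200", "2024-05-01", 200.0),
]

def fare_measures(rows):
    """Group fares by (flight, date) and derive one value per measure."""
    fares = defaultdict(list)
    for flight, date, fare in rows:
        fares[(flight, date)].append(fare)
    return {
        key: {
            "averageFare": mean(values),   # AVG operator
            "maximumFare": max(values),    # MAX operator
            "minimumFare": min(values),    # MIN operator
            "receipts": sum(values),       # the only additive measure
        }
        for key, values in fares.items()
    }

events = fare_measures(tickets)
```

Each key of `events` plays the role of one primary event, and all four measures are derived from the single fare attribute of the source schema.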
In this phase, you can prune and graft to eliminate any irrelevant details, but you can
also add new aggregation levels (typically, to the time dimensions) and define appropriate
ranges of numeric attributes.
You can mark attributes as descriptive if they are not used for aggregation but only for
information. These attributes also generally include attributes determined by one-to-one
associations with no descendants. As far as the attributes that are root children but are
not chosen as dimensions or measures are concerned, the following points apply:
• If a fact schema is lossless-grained (that is, the granularity of the primary events is
the same as the one of the F entity), those attributes can be represented as descriptive
attributes directly linked to the fact. A descriptive attribute linked to a fact takes a
value for each single primary event.
• If a fact schema is lossy-grained (that is, both granularity values are different),
those attributes absolutely need pruning; otherwise multiple values of these
attributes would be associated with each primary event.
Figure 6-26 shows the lossless-grained fact schema for the order line example. The
attribute tree is the one on the right of Figure 6-10. The granularity value is the same as the
one of the source schema because all the attributes (orderCode and productCode) that
identify the fact entity were maintained as dimensions. For this reason, we can represent the
line number as a descriptive attribute number linked to the fact. In addition to the price
measure, which is non-additive because it shows the unit price of a product, we defined a
derived measure called receipts, which shows the mathematical product of price and
quantity for each order line. However, if we grafted the orderCode node and turned
date into a dimension together with productCode to create a lossy-grained fact schema,
we would need to prune the number attribute because it cannot be unequivocally defined
in the order line set for an individual product issued on the same day.
If you find a shared hierarchy while you are building an attribute tree, you can choose
to highlight it to simplify your fact schema.
In this phase, you can also highlight any cross-dimensional attributes and multiple arcs.
It is very difficult to use source schemata as a starting point to identify these types of
attributes because you would have to navigate to-many relationships, too. However, if
to-many relationships could be navigated as well, this would extend navigation to entire
source schemata, give rise to an extremely large number of possible paths, and make cycle
management very difficult. Because cross-dimensional attributes and multiple arcs occur
rarely in normal fact schemata, we suggest that you define them on
the basis of your user requirements. Then you should represent them in fact schemata at a
later stage. For this purpose, remember the following:

FIGURE 6-26 Fact schema for the order line example
• A cross-dimensional attribute generally corresponds to an attribute of a many-to-
many relationship called R in the source Entity-Relationship schema. Cross-
dimensional attribute parents in fact schemata then correspond to the identifiers of
the entities involved in R. See Figure 6-27 for an example.
• A multiple arc corresponds to a to-many relationship called R from an entity called
E to an entity called G. It will link the E identifier or your fact with an attribute of R
is as follows:
If the (m1, ..., mk) values are given to the m measure in the k primary events corresponding
to k different values from the d domain, and to a preset value for each remaining
n - 1 dimension, which aggregation operators does it make sense to use to mark all the
k events with a single value for m?
FIGURE 6-27 Entity-Relationship schema for sales and the corresponding fact schema

FIGURE 6-28 Entity-Relationship schema for hospital admissions and the corresponding
fact schema
• Given a store and a product, if 10 items were sold yesterday and 3 items have been
sold today, the total amount of items sold in both days would be 13. Then the
quantity measure is additive along the date dimension.
• Given a day and a product, if 7 items were sold in the Columbus store and 5 items
were sold in the Cleveland store, the total amount of items sold in Ohio stores would
be 12. Then the quantity measure is also additive along the store dimension.
• Given a day and a store, if 4 boxes of Slurp milk and 3 boxes of Yum milk were sold,
the total amount of boxes of milk sold would be 7. Then the quantity measure is
also additive along the product dimension.
The INVENTORY schema with the product and date dimensions and the level
measure supports the following remarks instead.
• Given an item, if there were 100 items in a warehouse yesterday and there were
95 items today, how many items would the warehouse total over both days? The
answer 195 is obviously wrong, so level is non-additive along date. However,
you can reasonably aggregate on the basis of AVG, MIN, or MAX.
• Given a day, if there were 40 boxes of Slurp milk and 30 boxes of Yum milk in a
warehouse, the total amount of boxes of milk would be 70. Then the level measure
is additive along the product dimension.
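The arithmetic behind these remarks can be replayed directly. The variable names below are hypothetical. The figures are the ones used in the examples above.

```python
from statistics import mean

# quantity (SALE): a flow measure, additive along date.
quantity_by_day = {"yesterday": 10, "today": 3}
total_quantity = sum(quantity_by_day.values())

# level (INVENTORY): a stock measure, not additive along date.
level_by_day = {"yesterday": 100, "today": 95}
misleading_total = sum(level_by_day.values())   # 195 describes no real stock
average_level = mean(level_by_day.values())     # AVG is a meaningful roll-up
lowest_level = min(level_by_day.values())       # so are MIN and MAX

# level is additive along product: stock of different products on the same day.
level_by_product = {"Slurp": 40, "Yum": 30}
total_milk = sum(level_by_product.values())
```

The same measure (level) is summed along one dimension and averaged along another, which is why additivity has to be stated per measure and per dimension.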
FIGURE 6-29 Fact schema for the sales example
Figure 6-29 shows a lossy-grained fact schema obtained from the source schema in
Figure 6-1. We inserted the month, quarter, and so on, attributes and added them to the
time dimension. As far as additivity is concerned, unitPrice is non-additive, but you can
use the AVG operator to aggregate it; numOfCustomers cannot be aggregated.
Figure 6-30 shows two possible fact schemata that can be derived from the attribute tree
of Figure 6-7. In the first schema, which is lossless-grained, we left the employee
aggregation level unchanged and the fact is empty. In the second schema, we pruned the
empCode node so that we turned this schema into a lossy-grained one. In this way, we
managed to add the number measure, which counts the number of transfers made on each
day between two departments. Note the shared hierarchy in department.
To conclude this section, we should comment that designers can sometimes "split" a
single fact schema into two or more schemata to standardize their hierarchies. This is more
properly referred to as fact schema fragmentation. From a logical viewpoint, the result obtained
is very similar to the one obtained from horizontal fragmentation (section 9.3.2). To illustrate
this, consider an example related to invoicing in a large retail chain. Assume that a company
sells products in Italy and abroad. A hierarchy for foreign customers is less informative than
the one for Italian customers. Moreover, the invoice line shows the currency dimension only
for foreign customers. Figure 6-31 shows a comparison of two conceptual design alternatives.
In the first conceptual design, the specific properties of the application domain cause the
geographical hierarchy for customers to be incomplete, because information on the foreign
customers' city and regions is missing. This also creates an optional dimension (currency).
In the second conceptual design, the decision was made to create two separate fact
schemata. The first fact schema models invoices issued to Italian customers. The second fact
schema models invoices issued to foreign customers. The clear-cut advantage of this is that
all the hierarchies are complete and compulsory, but every time you need the data summing
up all the invoicing, you have to issue a drill-across query accessing the overlapped schema.
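The cost of fragmentation can be made concrete with a small sketch of such a drill-across query. Everything here is hypothetical: the two lists stand in for the national and foreign fact schemata, and `drill_across` combines them on the shared product attribute.

```python
from collections import Counter

# Invented invoice lines for the two fragments; only foreign lines carry a currency.
national_lines = [
    {"product": "p1", "date": "2024-01-10", "quantity": 5},
    {"product": "p2", "date": "2024-01-10", "quantity": 1},
]
foreign_lines = [
    {"product": "p1", "date": "2024-01-11", "quantity": 2, "currency": "USD"},
]

def quantity_by_product(lines):
    """Aggregate the quantity measure of one fragment by product."""
    totals = Counter()
    for line in lines:
        totals[line["product"]] += line["quantity"]
    return totals

def drill_across(*fragments):
    """Combine the per-fragment aggregates on the shared product attribute."""
    combined = Counter()
    for fragment in fragments:
        combined.update(quantity_by_product(fragment))
    return combined

overall = drill_across(national_lines, foreign_lines)
```

The sketch aggregates quantity rather than receipts because quantity does not depend on the currency. Queries restricted to Italian or to foreign customers touch a single fragment and need no combination step.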
FIGURE 6-31 Fact schema for invoicing lines, without and with fragmentation
WAREHOUSES(warehouse, address)
SALES(product:PRODUCTS, saleReceiptNum:SALE_RECEIPTS, quantity, unitPrice)
CITIES(city, state:STATES)
STATES(state, country:COUNTRIES)
COUNTRIES(country)
SALES_DISTRICTS(districtNum, country:COUNTRIES)
PROD_IN_WAREHOUSE(product:PRODUCTS, warehouse:WAREHOUSES)
BRANDS(brand, producedIn:CITIES)
TYPES(type, marketingGroup:MARK_GROUPS, category:CATEGORIES)
MARK_GROUPS(marketingGroup, director)
CATEGORIES(category, department:DEPARTMENTS)
DEPARTMENTS(department, departmentHead)
The tree-building procedure is based on the principle of following functional
dependencies. In a relational schema, functional dependencies link the primary key of each
R relation to all R attributes on the one hand, and each foreign key of R, which references S,
to the primary key of S on the other hand. The first examined relation is the one that you
choose as the F fact. Every time you examine a relation R, you create a new node v in your
tree. The v node corresponds to the primary key of R, with a child node added for each
attribute of R (including each single attribute that makes up the primary key of R if this
primary key is composite, but excluding the single attributes that are part of a composite
foreign key, because those will be dealt with in the next recursion step). You should also add
a child to v every time you find a foreign key c in R, and then recursively repeat the whole
procedure for the relation S referenced by c:

    R(k, ..., c:S, ...)
    S(c, d, e, ...)

A recursion is also triggered when a one-to-one association links R and S, and the
primary key of S is foreign and references R:

    R(k, a, b, ...)
    S(k:R, c, d, ...)

In this case, you have to check that this procedure does not return to R after examining S,
because this would result in an infinite loop.
Figure 6-33 shows pseudo-code that defines the basic operation of the algorithm
building an attribute tree rooted in F. The comments on cycles and convergences that we
previously made for Entity-Relationship schema-based design also apply to this case.
root = newNode(pk(F));
// the root is labeled with the primary key
// of the relation chosen as a fact
translate(F, root);

procedure translate(R, v):
// R is the current relation, v is the current node of the tree
{
    for each attribute a ∈ R such that a ∉ pk(R) and ∄S: a ∈ fk(R, S)
    // pk(R) and fk(R, S) mark the sets of attributes of R that form, ...
    ...
}

² A node may correspond to multiple attributes only if a primary key or a foreign key of a relation consists of
those attributes.
Note that you cannot process optionality automatically. This is because no information
in a relational schema clearly specifies whether you can give attributes null values.³ Figure 6-6
shows the attribute tree corresponding to the sales schema, which is the same as the one
obtained from the Entity-Relationship schema.
We will provide another example related to DVD rental and described in the transient
database schema:
CARDS(cardNumber, expiry)
CUSTOMERS(cardNumber:CARDS, name, gender, address,
          telephone, personalDocument)
MOVIES(movieCode, title, category, director, length, mainActor)
COPIES(positionOnShelf, movieCode:MOVIES)
RENTALS(positionOnShelf:COPIES, cardNumber:CARDS, date, time)
Here, RENTALS is the only relevant fact. Figure 6-34 shows the control flow for the translate
procedure and explains how to build a branch of the attribute tree shown in Figure 6-35. Note
that the association linking the card number to the customer number is one-to-one.
FJGLRE fi-34 Operating flow of tne translate procedure in the DVD rental example
root = newNode(positionOnShelf);
translate(R = RENTALS, v = positionOnShelf):
    addChild(positionOnShelf, date);
    addChild(positionOnShelf, time);
    for S = COPIES:
³ The SQL Data Definition Language allows for a NOT NULL standard clause. In principle, it could be used to extract
optionality. As a matter of fact, this clause is normally used to express entity integrity constraints only. Entity
integrity constraints specify that it is forbidden to give a null value to the attributes of which a key consists. For
this reason, there is no point in inferring that all the attributes not specified as NOT NULL are actually optional.
        addChild(positionOnShelf, positionOnShelf);
        translate(COPIES, positionOnShelf);
    for S = CARDS:
        addChild(positionOnShelf, cardNumber);
        translate(CARDS, cardNumber);
translate(R = COPIES, v = positionOnShelf):
    for S = MOVIES:
        addChild(positionOnShelf, movieCode);
        translate(MOVIES, movieCode);
translate(R = MOVIES, v = movieCode):
    addChild(movieCode, title);
    addChild(movieCode, category);
    addChild(movieCode, director);
    addChild(movieCode, length);
    addChild(movieCode, mainActor);
translate(R = CARDS, v = cardNumber):
    addChild(cardNumber, expiry);
    for S = CUSTOMERS:
        addChild(cardNumber, cardNumber);
        translate(CUSTOMERS, cardNumber);
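The recursive expansion traced in Figure 6-34 can be sketched in Python. This is only an illustrative sketch, not the book's algorithm verbatim: the Relation and Node classes, the build_tree and labels helpers, and the choice of keys for each relation (in particular, positionOnShelf, date, and time as the key of RENTALS) are assumptions made for this example. One-to-one links such as CARDS and CUSTOMERS are handled by hanging the attributes of the second relation on the same node.

```python
from dataclasses import dataclass, field


@dataclass
class Relation:
    name: str
    pk: tuple                     # primary-key attributes
    attrs: tuple = ()             # attributes that are neither key nor foreign key
    fks: dict = field(default_factory=dict)  # foreign-key attributes -> referenced relation


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def translate(schema, rel_name, node, visited):
    """Hang below `node` everything functionally determined by relation `rel_name`."""
    rel = schema[rel_name]
    visited = visited | {rel_name}
    for attr in rel.attrs:
        node.children.append(Node(attr))
    for fk, target in rel.fks.items():
        if target in visited:     # never walk back to a relation already on this branch
            continue
        child = Node("+".join(fk))
        node.children.append(child)
        translate(schema, target, child, visited)
    # One-to-one case: another relation whose primary key is also a foreign key to `rel`.
    for other in schema.values():
        if other.name not in visited and other.fks.get(other.pk) == rel_name:
            translate(schema, other.name, node, visited)


def build_tree(schema, fact):
    """The root is labeled with the primary key of the relation chosen as a fact."""
    root = Node("+".join(schema[fact].pk))
    translate(schema, fact, root, frozenset())
    return root


def labels(node):
    """Nested (label, [children]) view of a tree, handy for inspection."""
    return (node.label, [labels(c) for c in node.children])


# DVD rental schema; the keys are assumed for illustration.
DVD = {r.name: r for r in [
    Relation("CARDS", ("cardNumber",), ("expiry",)),
    Relation("CUSTOMERS", ("cardNumber",),
             ("name", "gender", "address", "telephone", "personalDocument"),
             {("cardNumber",): "CARDS"}),
    Relation("MOVIES", ("movieCode",),
             ("title", "category", "director", "length", "mainActor")),
    Relation("COPIES", ("positionOnShelf",), (), {("movieCode",): "MOVIES"}),
    Relation("RENTALS", ("positionOnShelf", "date", "time"), (),
             {("positionOnShelf",): "COPIES", ("cardNumber",): "CARDS"}),
]}

tree = build_tree(DVD, "RENTALS")
```

Calling labels(tree) returns the nested structure of the tree. The visited set plays the role of the check that prevents the procedure from returning to a relation it has already examined.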
To conclude this section, the following example demonstrates how to select a fact in
order to model as many concepts as possible. To this end, you should keep in mind how
attribute trees are built. The operational database is shown next:
FIGURE 6-35 Attribute tree for the DVD rental example
The relations that are candidates for expressing facts are FLIGHTS, FLIGHT_INSTANCES,
TICKETS, and CHECK-IN in this example. Figure 6-36 clearly shows that the last two options
are the best, because the existing functional dependencies make it possible to include the
maximum number of attributes in the tree. However, note that the selection of TICKETS
means that you opt for modeling the TICKET ISSUE fact. But if you select CHECK-IN, this
results in the CHECK-IN fact. The difference between both solutions is not merely the name,
because not all the tickets are necessarily checked in. As a result, the CHECK-IN fact primary
events will presumably be a subset of the TICKET ISSUE fact primary events.
FIGURE 6-36 Attribute trees for the flights example, obtained by choosing different relations as facts,
corresponding to the nodes in gray
FIGURE 6-37 Modified attribute tree and fact schema for the flights example
[Figure: fact schema for the RENTAL fact (DVD rental example)]
In case of relational sources, measure glossaries are typically written in SQL. If schemata
are lossy-grained, SQL queries defining measures will necessarily use the GROUP BY
clause. Figure 6-39 shows the glossaries of the sales, flights, and DVD rentals examples.
FIGURE 6-39 SQL measure glossaries for sales, flights, and DVD rentals. The check-in dimension
was left out of the flights example to avoid making the query too complex.
quantity =
    SELECT SUM(S.quantity)
    FROM SALES S INNER JOIN SALE_RECEIPTS R
      ON R.saleReceiptNum = S.saleReceiptNum
    GROUP BY S.product, R.date, R.store

receipts =
    SELECT SUM(S.quantity * S.unitPrice)
    FROM SALES S INNER JOIN SALE_RECEIPTS R
      ON R.saleReceiptNum = S.saleReceiptNum
    GROUP BY S.product, R.date, R.store

unitPrice =
    SELECT AVG(S.unitPrice)
    FROM SALES S INNER JOIN SALE_RECEIPTS R
      ON R.saleReceiptNum = S.saleReceiptNum
    GROUP BY S.product, R.date, R.store

numOfCustomers =
    SELECT COUNT(*)
    FROM SALES S INNER JOIN SALE_RECEIPTS R
      ON R.saleReceiptNum = S.saleReceiptNum
    GROUP BY S.product, R.date, R.store

numberOfFlights =
    SELECT COUNT(*)
    FROM TICKETS T INNER JOIN FLIGHT_INSTANCES I
      ON T.flightNumber = I.flightNumber AND T.date = I.date
    GROUP BY T.passengerGender, I.date, T.flightNumber, ...

numberOfBags =
    SELECT SUM(C.numberOfBags)
    ...
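To see how such a glossary entry behaves on lossy-grained data, the following sketch runs the quantity and receipts definitions against a tiny in-memory SQLite database. The table layout follows the queries of Figure 6-39, but the column types and the sample rows are assumptions invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SALE_RECEIPTS (saleReceiptNum INTEGER PRIMARY KEY, "date" TEXT, store TEXT);
CREATE TABLE SALES (saleReceiptNum INTEGER, product TEXT, quantity INTEGER, unitPrice REAL);
-- hypothetical sample data: two receipts issued by the same store on the same day
INSERT INTO SALE_RECEIPTS VALUES (1, '2024-01-01', 'S1'), (2, '2024-01-01', 'S1');
INSERT INTO SALES VALUES (1, 'milk', 2, 1.5), (2, 'milk', 3, 1.5), (2, 'bread', 1, 2.0);
""")

# Both measures are aggregated to the (product, date, store) granularity.
rows = conn.execute("""
SELECT S.product, R."date", R.store,
       SUM(S.quantity)               AS quantity,
       SUM(S.quantity * S.unitPrice) AS receipts
FROM SALES S INNER JOIN SALE_RECEIPTS R
  ON R.saleReceiptNum = S.saleReceiptNum
GROUP BY S.product, R."date", R.store
ORDER BY S.product
""").fetchall()
```

The two milk sales come from different receipts but share product, date, and store, so they collapse into a single primary event with quantity 5.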
complexity. Any number of additional elements and textual data may be placed between
the opening and closing tags of an element. Attributes and attribute values are included in
element opening statements. Figure 6-40 shows an XML document containing data on the
traffic on a web site.
FIGURE 6-40 An XML document describing traffic on a web site
    </host>
    <date>23-MAY-2005</date>
    <time>16:43:25</time>
    <url urlID="HSLGQ23">
      <site siteID="www.csb.fr">
        <country>france</country>
      </site>
      <fileType>shtml</fileType>
      <urlCategory>catalogue</urlCategory>
    </url>
  </click>
  <click>
    ...
  </click>
An XML document is valid if it has an associated schema (that is, a DTD or an XML
schema) and if it complies with the constraints expressed in that schema. The following
discussion will focus on the ways you can display many-to-one associations in DTDs,
because our conceptual design methodology is based on recognizing those associations.
A cardinality operator comes after the name of a nested element or a list of elements in the
element type statement. This defines whether elements may appear one or more ("+"), zero or
more ("*"), or zero or one ("?") times. The default cardinality is precisely one. Figure 6-41
shows a DTD that validates the XML document of Figure 6-40. The webTraffic element is defined
as a document element and so becomes the root of XML documents. The webTraffic element
may contain many click elements. But the site sub-element can be displayed exactly
once in a url element; the fileType sub-element and many urlCategory elements can
come after it. The host element may have either a category or a country element.
If you need to represent a one-to-one or a one-to-many association in XML, you can use
the sub-elements without any information loss. However, you can follow only one of the two
association directions in a DTD. For example, Figure 6-41 shows how a DTD expresses that
a url element can have many urlCategory sub-elements. But there is no way to infer if a
URL category can refer to many URLs. You can conclude that this is true only if you already
know the domain defined by the DTD.
The other way to specify element associations in DTDs uses ID and IDREF(S) attribute
pairs. These attributes operate in the same way as primary and foreign keys in relational
databases. The fundamental difference that prevents us from using IDREF(S) for our
purposes is that their syntax does not allow for any IDREF(S) attribute to be constrained to
contain identifiers of a specific element type.
source DTD and create a DTD graph. Sub-elements in DTDs may be stated in a complex and
redundant way. If this is the case, they need simplifying (Shanmugasundaram et al., 1999).
To simplify a DTD, transformations generally involve converting a nested definition into a
"flat" representation. For example, host(category|country) is transformed into
host(category?, country?) in the web traffic example. Moreover, the "+" operators
are transformed into "*" operators.
After simplifying your DTD, you can create your DTD graph. This defines your
DTD structure, as discussed by Lee and Chu (2000) and Shanmugasundaram et al. (1999).
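The two simplification rules quoted above (flattening a choice into optional sub-elements and turning "+" into "*") can be illustrated with a short Python sketch. It is a deliberately minimal assumption-laden example: the simplify function is hypothetical, works on the text of a single content model, and only handles flat, non-nested choice groups; it is not the full set of transformations described by Shanmugasundaram et al.

```python
import re


def simplify(content_model: str) -> str:
    """Apply two simplification rules to a flat DTD content model.

    1. "+" becomes "*".
    2. A flat choice group (a | b) becomes (a?, b?).
    """
    def flatten_choice(match):
        names = [n.strip() for n in match.group(1).split("|")]
        # An alternative that is already optional or repeatable keeps its operator.
        flat = [n if n.endswith(("?", "*")) else n + "?" for n in names]
        return "(" + ", ".join(flat) + ")"

    model = content_model.replace("+", "*")
    # Only groups without nested parentheses are rewritten.
    return re.sub(r"\(([^()]*\|[^()]*)\)", flatten_choice, model)


host_model = simplify("(category | country)")
url_model = simplify("(site, fileType, urlCategory+)")
```

Here host_model reproduces the host(category?, country?) transformation of the web traffic example.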
FIGURE 6-42 DTD graph for web traffic analysis
Your graph nodes will correspond to your DTD elements, attributes, and operators. DTD
graphs do not make any distinction between attributes and sub-elements because we can
consider them as equivalent nesting elements for our purposes. Figure 6-42 shows the
DTD graph of Figure 6-41.
root = newNode(F);
// the root is a node labeled with the name
// of the DTD graph node chosen as a fact
translate(F, root);

procedure translate(E, V):
// E is the current node of the DTD graph,
// V is the current node of the attribute tree
{ for each child W of E such that W ≠ parent(V)
  // the condition on the parent of V in the tree avoids the loop
    if W is an element or an attribute
    { next = newNode(W);
      addChild(V, next);
      translate(W, next);
    }
    else
      translate(W, V);
      // the nodes "*" and "?" are omitted
  for each parent Z of E
    if not toMany(E, Z)
      if askToOne(E, Z)
      { next = newNode(Z);
        addChild(V, next);
        translate(Z, next);
        // if the association is to-one,
        // Z is added as a child of V
      }
}
corresponding to the "*" and the "?" operators because they express cardinality only
in the opposite direction. You have to query the XML documents that conform to
your DTD to examine your actual data, because DTDs do not provide any further
information on association cardinality. To do this, you need to use the toMany
procedure, which counts the number of distinct Z values corresponding to every E
value. If you find a to-many association, you cannot include Z in your tree. If this is
not the case, you still cannot be sure that the cardinality of the E-to-Z association is
to-one. In particular, only designers who are very familiar with their application
domain can define whether cardinality is actually to-one or to-many (askToOne
procedure). You can add Z to the tree only if cardinality is to-one. You do
not need to use any document elements because they have just one instance in XML
documents: for this reason, they are not relevant to aggregation and you do not
have to model them in your data mart.
When you pass a "?" node, you should add an optional arc. Moreover, you should add
controls to prevent your algorithm from looping back at the end nodes of one-to-one associations.
This is because you can navigate a DTD graph both bottom-up and top-down.
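A check in the spirit of the toMany procedure can be sketched with Python's standard xml.etree.ElementTree module. This is only a sketch under stated assumptions: the to_many function, its parameters, and the sample document are hypothetical, and elements are assumed to be identified by an ID-like attribute (urlID, siteID) as in the web traffic example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def to_many(root, e_tag, e_key, z_tag, z_key):
    """Return True if some E value is linked to more than one distinct Z value.

    E is a direct sub-element of Z in the documents; both are identified
    by the attributes named e_key and z_key.
    """
    z_values = defaultdict(set)
    for z in root.iter(z_tag):
        for e in z.findall(e_tag):
            z_values[e.get(e_key)].add(z.get(z_key))
    return any(len(values) > 1 for values in z_values.values())


# Hypothetical sample: the same site is reached through two different URLs.
SAMPLE = """
<webTraffic>
  <click><url urlID="u1"><site siteID="www.csb.fr"/></url></click>
  <click><url urlID="u2"><site siteID="www.csb.fr"/></url></click>
  <click><url urlID="u1"><site siteID="www.csb.fr"/></url></click>
</webTraffic>
"""
root = ET.fromstring(SAMPLE)
site_to_url_is_many = to_many(root, "site", "siteID", "url", "urlID")
```

A result of True rules the association out. A result of False only means that no counterexample was found in the documents examined, which is why the designer is still asked to confirm that the association is to-one.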
Uncertain associations are not navigated in our example. We did not add the
urlCategory node to the attribute tree because it is a child of the "*" DTD graph node.
Figure 6-44 shows the resulting tree. Before moving on to the fact schema, we need to apply
some changes. We can apply the switching and grafting procedure mentioned in section 6.1.4
to the host, url, and site nodes. We can replace the time attribute with hour, whose
granularity is coarser. The resulting schema is lossy-grained.
recommend using the group-by function. The main issue concerns the number of XML
documents to be examined to reasonably confirm the hypothesis of to-one cardinality.
Clearly, the semi-structured nature of the XML source data increases the level of
uncertainty of the data structure in comparison with the Entity-Relationship sources. This
requires designers' knowledge to be called upon more often. In our algorithm, we chose to
ask designers questions interactively in the tree-building phase to avoid unnecessary
document queries. Alternatively, we could create the tree first and specify uncertain
associations. Then we should give the entire tree to designers so that they can examine it
and, if necessary, delete those associations, together with their sub-trees. This solution
allows designers to have a broader vision of their trees, but it is also a less efficient solution
because a node deleted by the designers at this stage could have been expanded pointlessly
in the previous XML document querying phase.
As a matter of fact, you may also need to infer cardinality of associations when your
design source is a relational schema. If a relation called R includes a C foreign key
referencing the K primary key of an S relation, this implies that C functionally
determines K, and then all the other attributes of S. But it does not provide any
information about the number of distinct tuples in R linked to each tuple in S. In
principle, it would be necessary to query a database to evaluate any uncertain
cardinality, as in the case of an XML source. However, this issue for relational databases
is somewhat less relevant than in the XML case. While XML document designers freely
choose the direction in which they want to represent each link, the need to retain the first
normal form forces relational schema designers to represent each association in a to-one
direction. For this reason, the association from S to R is generally one-to-many and is not
relevant to the purposes of multidimensional modeling. The only relevant case, managed
by the algorithm mentioned in section 6.2.2, is when a designer used the C foreign key to
model a one-to-one association.
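Querying the database for such a cardinality can be sketched as follows, using Python's sqlite3 module. The fk_is_one_to_one helper, the table definitions, and the sample rows are assumptions made for illustration; the check only reports what the current data show and cannot prove that the constraint holds in general.

```python
import sqlite3


def fk_is_one_to_one(conn, relation, fk_column):
    """True if no value of fk_column occurs in more than one tuple of relation."""
    # Identifiers are interpolated directly into the query: pass trusted names only.
    row = conn.execute(
        f'SELECT MAX(n) FROM (SELECT COUNT(*) AS n FROM "{relation}" '
        f'WHERE "{fk_column}" IS NOT NULL GROUP BY "{fk_column}")'
    ).fetchone()
    return row[0] is None or row[0] <= 1


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CUSTOMERS (cardNumber INTEGER, name TEXT);
CREATE TABLE RENTALS (positionOnShelf TEXT, cardNumber INTEGER);
-- hypothetical sample data
INSERT INTO CUSTOMERS VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO RENTALS VALUES ('A1', 1), ('A2', 1), ('B1', 2);
""")

customers_one_to_one = fk_is_one_to_one(conn, "CUSTOMERS", "cardNumber")
rentals_one_to_one = fk_is_one_to_one(conn, "RENTALS", "cardNumber")
```

In this sample each card belongs to exactly one customer, so the CUSTOMERS foreign key behaves as one-to-one, whereas the same card appears in several rentals.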
FIGURE 6-46 Attribute tree for the DTD graph of Figure 6-45
You can create many DTDs to represent an individual subject, and the algorithm can
build a different attribute tree for each of those DTDs. For example, if your DTD graph
was the one shown in Figure 6-45, Figure 6-46 shows how your attribute tree would look.
If click is a fact, you have to analyze your data to check for any uncertain associations
to navigate from hostId to host and from urlId to url. Figure 6-44 shows what the
resulting attribute tree would look like after replacing hostId with host and urlId
with url; this is allowed because these elements are linked in pairs by one-to-one
associations.
analyze requirements (section 4.3). To discuss conceptual design methods, we will assume
that designers have already prepared the necessary diagrams and, in particular, an
extended rationale diagram for organization and one for decision-making processes for
each of the actors involved. On the whole, organizational diagrams give a broad picture of
source operational data, and decision-making diagrams show preliminary workload.
Moreover, source data analyses and integration phases have resulted in an operational
schema for the reconciled database, in either conceptual or logical form.
In the conceptual design phase, you can pair the requirements derived from
organizational and decision-making modeling with your source operational schema to
generate a conceptual schema for your data mart. You can break down this procedure into
three phases:
1. Requirement mapping phase. The facts, dimensions, and measures found in the decision-
making modeling phase are associated with entities in the operational schema.
2. Fact schema building phase. After navigating the operational schema, you can create
a draft conceptual schema.
3. Refinement phase. Draft conceptual schemata are fine-tuned to better meet users'
expectations.
6.4.1 Mapping Requirements
The goal of the requirement mapping phase is to establish relationships between the facts,
dimensions, and measures found in the decision-making modeling phase and the relations
and attributes in operational schemata. This process is described in detail here:
Decision-making modeling facts are associated with entities or n-ary relationships (in case
of Entity-Relationship schemata) or relations (in case of relational schemata) in source
schemata. Now if you look at the banking example shown in section 4.3, the transaction
fact is likely to correspond to a table called TRANSACTIONS in the source database.
As far as dimensions and measures are concerned, you can reach your goal if you use the
attributes identified in the organizational modeling phase as a bridge. You can virtually set a
double mapping between organizational modeling attributes and both the attributes in your
operational schema and the dimensions and measures in your decision-making model. Look
at the banking example. The withdrawal card code attribute, which is associated with
the enter withdrawal card code goal of Figure 4-4, corresponds to the card code
dimension, which is associated with the analyze withdrawal amount and analyze
withdrawal number analysis goals of Figure 4-6. The same withdrawal card code
attribute may correspond to a numCard attribute in the WITHDRAWALS operational schema
table. Similarly, the withdrawal amount attribute of the enter withdrawal amount
goal corresponds to the total amount measure of the analyze withdrawal amount
analysis goal and to the amount attribute in the WITHDRAWALS table.
Note that you may partially automate this phase if the names used for operational
schema and rationale diagrams are properly consistent.
diagram onto your operational schema is included in your fact schema. The
navigation algorithm creates the whole hierarchy rooted in d.
3. Every time you find an organizational model attribute that is not included in your
decision-making model, you have to decide if its main role is as a dimensional attribute
or measure. You can add dimensional attributes to your fact schema and label them
with "offers." The navigation algorithm specifies their positions in your hierarchies.
Similarly, you can add measures to your fact schema and label them with "offers."
4. You can pick the dimensions and measures in your decision-making model rationale
diagrams for which you have found no operational schema correspondence, still
include them in your fact schema, and label them with "requests."
5. Fact schemata do not include those operational schema attributes that rationale
diagrams cannot map and that you cannot reach when navigating.
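The labeling rules in points 3 and 4 amount to a comparison between two sets of names, which a short Python sketch can make explicit. The label_attributes function and the two sets it takes are assumptions introduced for this example: requested stands for the dimensions and measures of the decision-making model, and available stands for the organizational model attributes that were mapped onto the operational schema.

```python
def label_attributes(requested, available):
    """Label fact schema items as described in points 3 and 4.

    None      -> requested by decision-makers and available in the sources
    "request" -> requested, but with no operational schema correspondence
    "offer"   -> available, but not part of the decision-making model
    """
    labels = {}
    for name in sorted(requested):
        labels[name] = None if name in available else "request"
    for name in sorted(available - requested):
        labels[name] = "offer"
    return labels


# Names taken from the banking example; the exact sets are illustrative.
banking = label_attributes(
    requested={"cardCode", "amount", "withdrawalFee"},
    available={"cardCode", "amount", "description"},
)
```

With these sets, withdrawalFee comes out as a request and description as an offer, while cardCode and amount stay unlabeled.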
As far as points 1 and 3 are concerned, note that sometimes you cannot reach a
dimensional attribute to insert into your hierarchy from your fact if you exclusively
navigate many-to-one associations. Then you may need to navigate many-to-many
associations. This gives you the opportunity to add multiple arcs and cross-dimensional
attributes automatically to your fact schema. On the contrary, you are supposed to carry
out this operation manually in the data-driven approach.
Note that the names used for measures in a decision-making diagram may sometimes
provide designers with valuable information for assessing which aggregation operators to
use. For example, look at Figure 4-6: you can immediately realize that the decision-maker
wishes to aggregate the amount measure both by SUM and by AVG.
Figure 6-47 shows the preliminary fact schema obtained for the banking example.
The withdrawalFee measure is labeled with "request" because it is displayed as a
measure paired with the analyze external withdrawals goal, but it is not in the
organizational rationale diagram. The description dimension is labeled with "offer"
because it is displayed as an attribute in the rationale diagram for organization, but
decision-makers did not classify it as an analysis dimension in their rationale diagram
for decision-making processes.
FIGURE 6-47 Preliminary fact schema for TRANSACTION
FIGURE 6-48 Fact schema for TRANSACTION after fine-tuning
After a comparison with the data-driven approach, we can conclude that, in the mixed
approach, (a) initial fact schemata may be considerably smaller and simpler; (b) diagrams for
requirement analyses directly support the classification of facts, dimensions, and measures
so that designers do not need to take any action; (c) modeling particular concepts, such as
multiple arcs, cross-dimensional attributes, and additivity, becomes easier.
6.4.3 Refining
The aim of this final phase is to rearrange fact schemata to make them more suitable for users'
needs. The main operations that you can carry out are those mentioned in section 6.1.3:
pruning and grafting attributes, and adding and deleting functional dependencies.
Because dimensions and measures have been labeled in the previous phase, designers
can now distinguish in fact schemata (i) all the necessary and available information (non-
labeled dimensional attributes and measures); (ii) all the necessary information that is not
currently available in operational schemata (dimensional attributes and measures labeled
with "requests"); and (iii) the available information that is not clearly relevant for analyses
(dimensional attributes and measures labeled with "offers"). The second category can make
designers evaluate the option of adding on to operational schemata or using additional
data sources. The third category may encourage decision-makers to try out different
analysis directions.
Figure 6-48 shows a final fact schema for the TRANSACTION fact in the banking
example mentioned in section 4.3. This schema assumes that (i) users have no interest in
customer granularity; (ii) the data-staging phase calculates the withdrawalFee measure;
and (iii) the description dimension is not considered as relevant to the analysis, but the
destinationC/A dimension is relevant.
quickly, in their early-stage forms. On the contrary, all the weight of hierarchy building falls
[...] experts involved in the application domain. The starting point for requirement-driven
conceptual design is a set of preliminary fact schemata obtained by associating each fact
found in the decision-making rationale diagrams with its measures and dimensions.
In the banking example of Figure 4-6, you can immediately design the preliminary schema
of Figure 6-49. The main points you should take care of in close collaboration with users are
listed next.
1. Identify any functional dependencies between the dimensions and code them in
hierarchical form (for example, date → month → year).
2. Mark any optional dimensions (for example, cardCode, which takes a value only in
some types of transaction).
3. Merge those measures that differ only in the aggregation operator used (for example,
averageAmount and totalAmount).
4. If any dimensions or measures are related to specific primary event subsets, merge
them or fragment your fact (for example, you can merge withdrawalNumber and
transactionNumber into transactionNumber because all the withdrawals are
a particular type of transaction).
Figure 6-50 shows the resulting fact schema after applying the previous criteria.
Now you can assume that your dimensions and measures are properly defined. You still
have to extend and complete hierarchies. To do this, you must first decide which additional
attributes are relevant to your analysis for aggregating and/or event selection purposes.
Then you can interview users in order to understand functional dependencies properly. We
will make no secret of the fact that this is the most challenging stage in this approach, because
the users often have only a rough idea of the actual dependencies that link attributes. If you
are able to ask users all the right questions, you can achieve successful results.
The most complicated condition arises when you have to design a conceptual schema without
any Tropos diagrams on which to base your work. This means that you have to map
the requirements expressed by users directly onto a fact schema. Section 4.2 showed that
you must eventually briefly sum up those requirements in glossaries. Although these
glossaries may help you better explain dimensions and measures, they can provide no
information on hierarchy compositions and structures. Regarding hierarchy compositions,
an in-depth analysis of the reports normally used by the company is required, so that you
can compile a list of the main dimensional and descriptive attributes to include. Regarding
hierarchy structures, the information exchange between designers and application domain
experts is fundamental. If there are any doubts, you can find the necessary answers only if
you carefully check the cardinality constraints on data. As a general rule, we recommend
that you reuse as much as possible those hierarchies or parts of hierarchies that are frequently
used in a particular application domain to reduce the complexity of design and maximize
fact schema conformity. If you arrange them suitably in libraries, they become an invaluable
and irreplaceable resource for designers because they or their semi-processed forms can be
adjusted to users' real-world needs.
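Checking the cardinality constraints on data, as suggested above, can be done with a simple scan that looks for counterexamples to a candidate functional dependency. The sketch below is illustrative: the violations function and the sample rows are assumptions, and an empty result only shows that the sampled data are consistent with the dependency.

```python
def violations(rows, lhs, rhs):
    """Return the lhs values that are associated with more than one rhs value."""
    first_seen = {}
    bad = set()
    for row in rows:
        key, value = row[lhs], row[rhs]
        # setdefault stores the first rhs value seen for this key and returns it.
        if first_seen.setdefault(key, value) != value:
            bad.add(key)
    return bad


# Hypothetical sample data for the date -> month dependency.
rows = [
    {"date": "2024-01-15", "month": "2024-01"},
    {"date": "2024-01-16", "month": "2024-01"},
    {"date": "2024-01-15", "month": "2024-01"},
]
date_to_month = violations(rows, "date", "month")
month_to_date = violations(rows, "month", "date")
```

Here date determines month, so date_to_month is empty, while the opposite direction is violated because one month covers several dates.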
To conclude this chapter, we must not forget that one of the problems with the
requirement-driven approach is rooted in finding mappings between fact schema attributes
and source data. Those mappings are necessary to implement data-staging procedures. To
do this, right from the start of the conceptual design phase it is vital that you make sure that
fact schemata agree with source schemata, and that fact schemata fully seize the analysis
potential of sources (Mazón et al., 2007a).