Implementation: Data Warehouse
Introduction
Data warehouses contain huge volumes of data, and OLAP servers demand that decision support queries be answered on the order of seconds.
Therefore, it is crucial for data warehouse systems to support highly efficient cube computation techniques, access methods, and query processing techniques.
Efficient Computation of Data Cubes
At the core of multidimensional data analysis is the efficient computation of aggregations across many sets of dimensions.
In SQL terms, these aggregations are referred to as group-by's.
Each group-by can be represented by a cuboid, where the set of group-by's forms a lattice of cuboids defining a data cube.
A data cube can thus be viewed as a lattice of cuboids.
The bottom-most cuboid is the base cuboid; the top-most (apex) cuboid contains only one cell.
[Figure: lattice of cuboids; the apex cuboid, denoted (), is the 0-D cuboid.]
Example: a data cube for sales data with the dimensions city, item, and year, and the measure sales_in_dollars.
Query examples:
• Compute the sum of total sales
• Compute the sum of sales, grouped by city
• Compute the sum of sales, grouped by item and year
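In SQL, against a hypothetical fact table sales(city, item, year, sales_in_dollars), these three queries are ordinary aggregations; a minimal sketch:

-- 0-D apex cuboid ()
SELECT SUM(sales_in_dollars) FROM sales;
-- 1-D cuboid (city)
SELECT city, SUM(sales_in_dollars) FROM sales GROUP BY city;
-- 2-D cuboid (item, year)
SELECT item, year, SUM(sales_in_dollars) FROM sales GROUP BY item, year;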
Compute Cube Operator
A SQL query such as "compute the sum of total sales" is a zero-dimensional group-by, while "compute the sum of sales, group by city" is a one-dimensional group-by.
The compute cube operator computes aggregates over all subsets of the dimensions specified, i.e., the entire lattice of cuboids.
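Several relational systems (e.g., Oracle, SQL Server, PostgreSQL) expose this operator as GROUP BY CUBE; a minimal sketch against the hypothetical sales table above:

-- Computes all 2^3 = 8 cuboids of (city, item, year) in one statement
SELECT city, item, year, SUM(sales_in_dollars) AS total_sales
FROM sales
GROUP BY CUBE (city, item, year);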
How many cuboids are there in an n-dimensional cube where dimension i has L_i levels?
T = ∏_{i=1}^{n} (L_i + 1)
Here L_i excludes the virtual top level all; the +1 accounts for it. For example, a cube with three dimensions of 4, 3, and 2 levels contains T = (4+1)(3+1)(2+1) = 60 cuboids.
Data Cube Materialization
No materialization:
Do not precompute any of the "nonbase" cuboids.
This leads to computing expensive multidimensional aggregates on the fly, which can be extremely slow.
Full materialization:
Precompute all of the cuboids.
The resulting lattice of computed cuboids is referred to as the full cube.
This choice typically requires huge amounts of memory space in order to store all of the precomputed cuboids.
Partial materialization:
Selectively compute a proper subset of the whole set of possible cuboids.
Alternatively, we may compute a subset of the cube that contains only those cells satisfying some user-specified criterion; this is also referred to as a subcube.
Partial materialization represents an interesting trade-off between storage space and response time.
It involves three factors:
identify the subset of cuboids or subcubes to materialize
exploit the materialized cuboids or subcubes during query processing
efficiently update the materialized cuboids or subcubes during load and refresh
The selection of the subset of cuboids or subcubes to materialize should take into account:
the queries in the workload
their frequencies
their accessing costs
workload characteristics
the cost of incremental updates
the total storage requirements
The selection must also consider the broad context of physical database design, such as the generation and selection of indices.
A popular approach is to materialize the set of cuboids on which other frequently referenced cuboids are based.
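In a relational warehouse, a materialized cuboid is often implemented as a materialized view; a minimal sketch (PostgreSQL and Oracle support this syntax), assuming the hypothetical sales fact table from earlier:

-- Materialize the (city, year) cuboid; frequently asked group-bys such as
-- (city) or (year) can then be answered by re-aggregating this smaller view
CREATE MATERIALIZED VIEW sales_by_city_year AS
SELECT city, year, SUM(sales_in_dollars) AS total_sales
FROM sales
GROUP BY city, year;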
Alternatively, we can compute an iceberg cube, which is a data cube that stores only those cube cells whose aggregate value (e.g., count) is above some minimum support threshold (see the sketch below).
Another common strategy is to materialize a shell cube.
This involves precomputing the cuboids for only a small number of dimensions (e.g., three to five) of a data cube.
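An iceberg condition can be expressed in SQL with a HAVING clause; a minimal sketch, assuming the hypothetical sales table and a minimum support of 100 base tuples:

-- Iceberg cuboid on (city, item): keep only cells supported by at least 100 tuples
SELECT city, item, COUNT(*) AS cnt, SUM(sales_in_dollars) AS total_sales
FROM sales
GROUP BY city, item
HAVING COUNT(*) >= 100;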
Indexing OLAP Data: Bitmap Index
Bitmap indexing is popular because it allows quick searching.
The bitmap index is an alternative representation of the record ID (RID) list.
In the bitmap index for a given attribute, there is a distinct bit vector, Bv, for each value v in the attribute's domain.
If a given attribute's domain consists of n values, then n bits are needed for each entry in the bitmap index (i.e., there are n bit vectors).
If the attribute has the value v for a given row in the data table, the bit representing that value is set to 1 in the corresponding row of the bitmap index, and all other bits in that row are set to 0.
Example

Base (customer) table:
Cust   Region    Type
C1     Asia      Retail
C2     Europe    Dealer
C3     Asia      Dealer
C4     America   Retail
C5     Europe    Dealer

Bitmap index on Region:
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0
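In Oracle, one of the systems that supports bitmap indexes natively, such an index can be created along these lines; the table and index names here are hypothetical:

-- Bitmap index on the low-cardinality Region column of the customer table
CREATE BITMAP INDEX customer_region_bix ON customer (region);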
Indexing OLAP Data: Join Indices
The join indexing method gained popularity from its use in relational database query processing.
Traditional indexing maps the value in a given column to a list of rows having that value.
In contrast, join indexing registers the joinable rows of two relations from a relational database.
Join indexing is especially useful for maintaining the relationship between a foreign key and its matching primary keys, from the joinable relation.
The star schema model of data warehouses makes join indexing attractive for cross-table search.
The linkage between a fact table and its corresponding dimension tables comprises the fact table's foreign key and the dimension table's primary key.
Join indexing maintains relationships between attribute values of a dimension (e.g., within a dimension table) and the corresponding rows in the fact table.
Join indices may span multiple dimensions to form composite join indices.
We can use join indices to identify subcubes that are of interest.
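A join index can be approximated as a precomputed table relating dimension attribute values to the fact rows they join with; a minimal sketch, assuming hypothetical sales fact and location dimension tables with columns sales_id and location_key:

-- Join index: for each city, record the identifiers of the joinable sales rows
CREATE TABLE city_sales_jix AS
SELECT l.city, s.sales_id AS sales_rid
FROM location l
JOIN sales s ON s.location_key = l.location_key;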
Efficient Processing of OLAP Queries
Determine which operations should be performed on the available cuboids
Transform drill-down, roll-up, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
Determine which materialized cuboid(s) should be selected for the OLAP operation
Example:
sales_cube [time, item, location]: sum (sales_in_dollars)
Let the query to be processed be on {brand, province_or_state} with the
condition “year = 2004”, and there are 4 materialized cuboids available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
Which cuboid should be selected to process the query?
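For illustration only (the slide leaves the choice as an exercise): cuboids 3 and 4 can both answer the query, since each covers brand and province_or_state at or below the requested granularity. A sketch of rewriting the query against cuboid 3, assuming it is stored as a hypothetical table cuboid_year_brand_province with a sum_sales column:

-- Re-aggregate the materialized cuboid {year, brand, province_or_state}
-- to answer {brand, province_or_state} with the condition year = 2004
SELECT brand, province_or_state, SUM(sum_sales) AS sum_sales
FROM cuboid_year_brand_province
WHERE year = 2004
GROUP BY brand, province_or_state;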
Practice Question:
Suppose that a data warehouse for Big University consists of the four dimensions student, course, semester, and instructor, and two measures count and avg_grade. At the lowest conceptual level (e.g., for a given student, course, semester, and instructor combination), the avg_grade measure stores the actual course grade of the student. At higher conceptual levels, avg_grade stores the average grade for the given combination.
Draw a snowflake schema diagram for the data warehouse.
If each dimension has five levels (including all), how many cuboids will this cube contain (including the base and apex cuboids)?
Data Generalization
Data generalization summarizes data by replacing relatively low-level values with higher-level concepts.
For example, the variable age can be generalized as young, middle-aged, or senior.
It can also be done by reducing the number of dimensions; for example, telephone number and date of birth can be removed when collecting data about student behavior.
Concept description generates descriptions for data characterization and comparison.
It is sometimes called class description when the concept to be described refers to a class of objects.
Characterization provides a concise and succinct summarization of the given data collection, while concept or class comparison (also known as discrimination) provides descriptions comparing two or more data collections.
Attribute-Oriented Induction
Proposed in 1989.
Basically, a query-oriented, generalization-based, online data analysis technique.
General idea:
Collect the task-relevant data (initial relation) using a relational database query.
Perform generalization based on the examination of the number of each attribute's distinct values in the relevant data set.
Apply aggregation by merging identical, generalized tuples and accumulating their respective counts; this reduces the size of the generalized data set.
Interact with users for knowledge presentation in different forms.
Basic Principles: Steps
Data focusing: This should be performed before attribute-oriented induction. This step collects task-relevant data based on the information provided in the data mining query; the result is called the initial working relation.
Data generalization: This is performed in one of two ways:
Attribute removal: Remove attribute A if there is a large set of distinct values for A but either
Case 1 - there is no generalization operator on A (no concept hierarchy), or
Case 2 - A's higher-level concepts are expressed in terms of other attributes.
Attribute generalization: If there is a large set of distinct values for an attribute in the initial working relation, and there exists a set of generalization operators on the attribute, then a generalization operator should be selected and applied to the attribute.
Attribute generalization threshold control:
Either set one generalization threshold for all of the attributes, or set one threshold for each attribute.
Data mining systems typically have a default attribute threshold value, generally ranging from 2 to 8.
If a user feels that the generalization reaches too high a level for a particular attribute, the threshold can be increased; this corresponds to drilling down along the attribute.
Also, to further generalize a relation, the user can reduce an attribute's threshold, which corresponds to rolling up along the attribute.
Generalized relation threshold control:
This sets a threshold for the generalized relation.
If the number of (distinct) tuples in the generalized relation is greater than the threshold, further generalization should be performed; otherwise, no further generalization should be performed.
Such a threshold may also be preset in the data mining system (usually within a range of 10 to 30).
For example, if a user feels that the generalized relation is too small, he or she can increase the threshold, which implies drilling down.
Otherwise, to further generalize a relation, the threshold can be reduced, which implies rolling up.
These two techniques can be applied in sequence: first apply the attribute threshold control technique to generalize each attribute, and then apply relation threshold control to further reduce the size of the generalized relation.
The aggregate function count() is associated with each database tuple and is initialized to 1.
Through attribute removal and attribute generalization, tuples within the initial working relation may be generalized, resulting in groups of identical tuples.
In this case, all of the identical tuples forming a group should be merged into one tuple.
The count of this new, generalized tuple is set to the total number of tuples from the initial working relation that are represented by (i.e., merged into) the new generalized tuple.
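A minimal SQL sketch of attribute removal, attribute generalization, and count accumulation, assuming a hypothetical graduate_students working relation and an inline concept hierarchy for gpa (a real system would look generalizations up in concept-hierarchy tables):

-- name and phone# are removed (simply not selected: no useful generalization exists);
-- gpa is generalized to a range; identical generalized tuples are merged and counted
SELECT gender,
       major,
       CASE WHEN gpa >= 3.5 THEN 'excellent'
            WHEN gpa >= 3.0 THEN 'very_good'
            ELSE 'good' END AS gpa_range,
       COUNT(*) AS cnt          -- the accumulated count measure
FROM graduate_students
GROUP BY gender,
         major,
         CASE WHEN gpa >= 3.5 THEN 'excellent'
              WHEN gpa >= 3.0 THEN 'very_good'
              ELSE 'good' END;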
Algorithm for Attribute-Oriented Induction
Initial_Relation, W: Query processing of the task-relevant data, deriving the initial relation.
W ← get_task_relevant_data
Prepare_for_Generalization(W): Scan W and, based on the analysis of the number of distinct values in each attribute, determine a generalization plan for each attribute: removal, generalization, and attribute threshold value.
Prime_Generalized_Relation, P: Based on the plan from the previous step, perform generalization to the right level to derive a "prime generalized relation".
P ← generalization(W)
For each generalized tuple, insert the tuple into a sorted prime relation; if the tuple is already in P, increase its count.
Presentation: User interaction; adjust levels by drilling down or rolling up, and present the results in different forms (e.g., generalized relations, cross-tabulations, rules, or visualizations).
Example
Query: Describe general characteristics of graduate
students in a University database
use University_DB
mine characteristics as “Science”, “Arts”, “Engg”, etc.
in relevance to name, gender, major, birth_place, birth_date,
residence, phone#, gpa
from student
where status in “graduate”
Example
Suppose that you would like to compare the general properties of the graduate and undergraduate students at Big University, given the attributes name, gender, major, birth_place, birth_date, residence, phone#, and gpa.
use Big_University_DB
mine comparison as "grad vs undergrad students"
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
for "graduate students"
where status in "graduate"
versus "undergraduate students"
where status in "undergraduate"
analyze count%
from student
[Table: initial working relation for the target class (graduate students).]
[Table: prime generalized relation for the target class.]
The discriminative features of the target and contrasting classes of a comparison description can be described quantitatively by a quantitative discriminant rule.
This associates a statistical interestingness measure, the d-weight, with each generalized tuple in the description.
Let qa be a generalized tuple and Cj be the target class, where qa covers some tuples of the target class.
The d-weight for qa is the ratio of the number of tuples from the initial target-class working relation that are covered by qa to the total number of tuples in both the initial target-class and contrasting-class working relations that are covered by qa.
In symbols, d-weight = count(qa ∈ Cj) / Σ_{i=1..m} count(qa ∈ Ci), where m is the total number of target and contrasting classes.
Example
If a generalized tuple covers 90 tuples of the target class and 210 tuples of the contrasting class:
• d-weight for the target class = 90/(90+210) = 30%
• d-weight for the contrasting class = 210/(90+210) = 70%
Complex Aggregation at Multiple Granularities
Multi-feature cubes enable more in-depth analysis.
They can compute complex queries whose measures depend on groupings of multiple aggregates at varying granularity levels.
The queries posed can be much more elaborate and task-specific than traditional queries.
Many complex data mining queries can be answered by multi-feature cubes without a significant increase in computational cost, in comparison to cube computation for simple queries with traditional data cubes.
Example:
Query: Grouping by all subsets of {item, region, month}, find the maximum price in 2010 for each group and the total sales among all maximum-price tuples.
select item, region, month, max(price), sum(R.sales)
from Purchases
where year = 2010
cube by item, region, month: R
such that R.price = max(price)
The resulting cube is a multi-feature cube in that it supports complex data mining queries for which multiple dependent aggregates are computed at a variety of granularities.
The sum of sales returned in this query depends on the set of maximum-price tuples for each group.
In general, multi-feature cubes give users the flexibility to define sophisticated, task-specific cubes on which multidimensional aggregation and OLAP-based mining can be performed.
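Note that the "cube by ... such that" notation above is the extended multi-feature-cube syntax, not standard SQL. For a single grouping (item, region, month), a roughly equivalent standard-SQL sketch (table and column names taken from the example, otherwise hypothetical) could use a window function:

-- Total sales restricted to the maximum-price tuples within each (item, region, month) group
SELECT item, region, month, MAX(price) AS max_price, SUM(sales) AS sales_at_max_price
FROM (
  SELECT item, region, month, price, sales,
         MAX(price) OVER (PARTITION BY item, region, month) AS grp_max_price
  FROM Purchases
  WHERE year = 2010
) t
WHERE price = grp_max_price
GROUP BY item, region, month;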
Data Warehouse Backend Tools
Data warehouse systems use back-end tools and utilities to populate and refresh their data.
These tools and utilities include the following functions:
Data extraction, which typically gathers data from multiple, heterogeneous, and external sources
Data cleaning, which detects errors in the data and rectifies them when possible
Data transformation, which converts data from legacy or host format to warehouse format
Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions
Refresh, which propagates the updates from the data sources to the warehouse
Tuning Data Warehouse
Tuning a data warehouse means enhancing its performance in terms of:
Average query response time
Scan rates
I/O throughput rates
Time used per query
Memory usage per process
It is necessary to specify these measures in the service level agreement (SLA).
There is no use in trying to tune response times if they are already better than those required.
It is essential to have realistic expectations while making a performance assessment, and it is essential that the users have feasible expectations.
It is also possible that a user will write a query you had not tuned for.
Tuning Data Load
Data load is a critical part, as it is the entry point into the system.
If there is a delay in transferring the data, or in the arrival of data, the entire system is affected badly; therefore it is very important to tune the data load first.
There are various approaches to tuning the data load, discussed below (a sketch of the fourth approach follows this list):
The most common approach is to insert data using the SQL layer.
In this approach, normal checks and constraints need to be performed.
When the data is inserted into the table, code runs to check for enough space to insert the data; if sufficient space is not available, more space may have to be allocated to the table.
The second approach is to bypass all these checks and constraints and place the data directly into preformatted blocks.
The third approach is that, while loading data into a table that already contains data, we can maintain the indexes.
The fourth approach is that, to load data into tables that already contain data, we drop the indexes and recreate them when the data load is complete.
The choice between the third and the fourth approach depends on how much data is already loaded and how many indexes need to be rebuilt.
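A minimal sketch of the fourth approach, with hypothetical table, index, and staging names; the bulk INSERT ... SELECT stands in for whatever bulk-load utility the platform provides, and exact DROP INDEX syntax varies slightly by platform:

-- 1. Drop indexes so the bulk load does not pay per-row index maintenance
DROP INDEX sales_fact_date_idx;
DROP INDEX sales_fact_item_idx;
-- 2. Bulk-load the new batch into the already-populated fact table
INSERT INTO sales_fact
SELECT * FROM sales_staging;
-- 3. Recreate the indexes once the load is complete
CREATE INDEX sales_fact_date_idx ON sales_fact (date_key);
CREATE INDEX sales_fact_item_idx ON sales_fact (item_key);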
Integrity Checks
Integrity checking highly affects the performance of the load.
Integrity checks need to be limited because they require heavy processing power.
Integrity checks should be applied on the source system where possible.
Tuning Queries
Fixed Queries
Fixed queries are well defined, for example:
Regular reports
Canned queries
Common aggregations
Tuning fixed queries in a data warehouse is the same as in a relational database system; the only difference is that the amount of data to be queried may be different.
It is good to store the most successful execution plan while testing fixed queries.
Storing these execution plans allows us to spot changing data sizes and data skew, as these will cause the execution plan to change.
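A minimal sketch of capturing a plan for a fixed report query, using PostgreSQL's EXPLAIN (Oracle's equivalent is EXPLAIN PLAN FOR ...); the table is the hypothetical sales fact table used earlier:

-- Show the plan the optimizer chooses for a canned aggregation;
-- save the output alongside the query to compare against future runs
EXPLAIN
SELECT city, year, SUM(sales_in_dollars) AS total_sales
FROM sales
GROUP BY city, year;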
Ad hoc Queries
To understand ad hoc queries, it is important to know the ad hoc users of the data warehouse, including:
The number of users in the group
Whether they use ad hoc queries at regular intervals, at unknown intervals, or frequently
The maximum size of query they tend to run
The average size of query they tend to run
Whether they require drill-down access to the base data
The elapsed login time per day
The peak time of daily usage
The number of queries they run per peak hour
Difficulties in Data Warehouse Tuning
Tuning a data warehouse is a difficult procedure for the following reasons:
The data warehouse is dynamic; it never remains constant
It is very difficult to predict what query a user is going to pose in the future
Business requirements change with time
Users and their profiles keep changing
Users can switch from one group to another
The data load on the warehouse also changes with time
Testing the Data Warehouse
The logistics of testing a data warehouse include:
Scheduling software
Day-to-day operational procedures
Backup recovery strategy
Management and scheduling tools
Overnight processing
Query performance
There are three basic levels of testing performed on a data warehouse:
Unit testing
Integration testing
System testing
Unit Testing
In unit testing, each component is tested separately.
Each module (i.e., procedure, program, SQL script, or Unix shell script) is tested.
This test is performed by the developer.
Integration Testing
In integration testing, the various modules of the application are brought together and then tested against a number of inputs.
It is performed to test whether the various components work well together after integration.
System Testing
In system testing, the whole data warehouse application is tested together.
The purpose of system testing is to check whether the entire system works correctly together or not.
System testing is performed by the testing team.
Since the size of the whole data warehouse is very large, it is usually only possible to perform minimal system testing before the full test plan can be enacted.
Developing a Test Plan
Test Schedule
In the test schedule, we predict the estimated time required for testing the entire data warehouse.
Because the data warehouse system is evolving in nature, one may face the following issues while creating a test schedule:
A simple problem may involve a large query that takes a day or more to complete, i.e., the query does not complete in the desired time scale
There may be hardware failures, such as losing a disk, or human errors, such as accidentally deleting a table or overwriting a large table
Testing Operational Environment
Security − A separate security document is required for security testing. This document contains a list of disallowed operations and the tests devised for them.
Scheduler − Scheduling software is required to control the daily operations of a data warehouse. The scheduling software requires an interface with the data warehouse that allows the scheduler to control overnight processing and the management of aggregations.
Disk Configuration − The disk configuration also needs to be tested to identify I/O bottlenecks. The test should be performed multiple times with different settings.
Management Tools − All the management tools need to be tested during system testing.
Testing Backup Recovery
The backup recovery strategy should be tested against scenarios such as:
Media failure
Loss or damage of tablespace, data file, log file, control file, or archive file
Instance failure
Loss or damage of a table
Failure during data movement
Testing Database Performance
There are sets of fixed queries that need to be run regularly, and they should be tested.
To test ad hoc queries, one should go through the user requirement document and understand the business completely.
Take time to test the most awkward queries that the business is likely to ask against different index and aggregation strategies.
Testing the Application
All the modules should be integrated correctly and work together, in order to ensure that the end-to-end load, index, aggregation, and queries work as expected.
Each function of each manager should work correctly.
It is also necessary to test the application over a period of time.
Thank you!