
Implementation

Data Warehouse
Introduction
 Data warehouses contain huge volumes of data
 OLAP servers demand that decision support
queries be answered in the order of seconds
 Therefore, it is crucial for data warehouse systems
to support highly efficient cube computation
techniques, access methods, and query
processing techniques

Efficient Computation of Data Cubes
 At the core of multidimensional data analysis is the
efficient computation of aggregations across many
sets of dimensions
 In SQL terms, these aggregations are referred to
as group-by’s
 Each group-by can be represented by a cuboid,
where the set of group-by’s forms a lattice of
cuboids defining a data cube
 Data cube can be viewed as a lattice of cuboids
 The bottom-most cuboid is the base cuboid
 The top-most cuboid (apex) contains only one
cell
Lattice of cuboids for the dimensions product, date, country:
 0-D (apex) cuboid: ()
 1-D cuboids: (product), (date), (country)
 2-D cuboids: (product, date), (product, country), (date, country)
 3-D (base) cuboid: (product, date, country)
Example: Data cube for sales data that contains
the following: city, item, year and sales_in_dollars
()

(city) (item) (year)

(city, item) (city, year) (item, year)

(city, item, year)

Query Examples:
• Compute the sum of total sales
• Compute the sum of sales, group by city
• Compute the sum of sales, group by item,
year
Compute Cube Operator
 A SQL Query such as
 “compute the sum of total sales” is zero dimensional
 “compute the sum of sales, group by city” is one
dimensional

 Similar to the SQL syntax, the data cube could be defined as


 define cube sales_cube [city, item, year]: sum(sales_in_dollars)
 This statement calculates the sum depending upon the group-by’s
 The statement “compute cube sales_cube” explicitly instructs the system to compute the sales aggregates for all subsets of the cube’s dimensions
 For a cube with n dimensions, there are a total of 2^n cuboids, including the base cuboid
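To make the 2^n group-by's concrete, here is a minimal Python sketch (my illustration, not part of the slides) that enumerates every cuboid of the sales_cube dimensions:

```python
from itertools import combinations

def enumerate_cuboids(dimensions):
    """Yield every group-by (cuboid) of the given dimensions,
    from the 0-D apex cuboid () up to the n-D base cuboid."""
    for k in range(len(dimensions) + 1):
        for cuboid in combinations(dimensions, k):
            yield cuboid

cuboids = list(enumerate_cuboids(["city", "item", "year"]))
print(len(cuboids))  # 2**3 = 8
for c in cuboids:
    print(c)         # (), ('city',), ('item',), ..., ('city', 'item', 'year')
```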
 How many cuboids are there in an n-dimensional cube with L levels per dimension?

T = ∏_{i=1}^{n} (L_i + 1)

where L_i is the number of levels associated with dimension i
 one is added to Li to include the virtual top level,
all
 For example:
 time dimension (day, month, quarter, year) has
4 conceptual levels, or 5 if we include the virtual
level all
 If the cube has 10 dimensions and each dimension has 5 levels (including all), the total number of cuboids that can be generated is 5^10 ≈ 9.8 × 10^6
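A small sketch (mine, simply applying the formula above) that counts cuboids for dimensions with concept hierarchies:

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """T = product over dimensions of (L_i + 1), where the +1 accounts
    for the virtual top level 'all'."""
    return prod(L + 1 for L in levels_per_dimension)

print(total_cuboids([4]))        # time(day, month, quarter, year): 5 choices per group-by
print(total_cuboids([4] * 10))   # 10 dimensions with 4 levels each: 5**10 = 9765625
```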
Curse of Dimensionality
 Online analytical processing may need to access different
cuboids for different queries
 Therefore, it may seem like a good idea to compute in
advance all or at least some of the cuboids in a data cube
 Precomputation leads to fast response time and avoids
some redundant computation
 A major challenge is that the required storage space may
explode if all the cuboids in a data cube are
precomputed, especially when the cube has many
dimensions
 The storage requirements are even more excessive when
many of the dimensions have associated concept
hierarchies, each with multiple levels

Data Cube Materialization
 No materialization:
 Do not precompute any of the “nonbase”
cuboids
 This leads to computing expensive
multidimensional aggregates on-the-fly
 Which can be extremely slow

 Full materialization:
 Precompute all of the cuboids
 The resulting lattice of computed cuboids is
referred to as the full cube
 This choice typically requires huge amounts of memory space in order to store all of the precomputed cuboids
 Partial materialization:
 Selectively compute a proper subset of the whole set
of possible cuboids
 Alternatively, we may compute a subset of the cube,
which contains only those cells that satisfy some
user-specified criterion; also referred to as subcube
 Partial materialization represents an interesting
trade-off between storage space and response time
 Three factors:
 identify the subset of cuboids or subcubes to materialize
 exploit the materialized cuboids or subcubes during query processing
 efficiently update the materialized cuboids or subcubes during load and refresh
 The selection of the subset of cuboids or subcubes
should take into account
 the queries in the workload
 their frequencies
 their accessing costs
 workload characteristics
 the cost for incremental updates
 the total storage requirements
 The selection must also consider the broad context of
physical database design such as the generation and
selection of indices
 A popular approach is to materialize the set of cuboids on which other frequently referenced cuboids are based
 Alternatively, we can compute an iceberg cube
 which is a data cube that stores only those cube
cells with an aggregate value (e.g., count) that
is above some minimum support threshold
 Another common strategy is to materialize a shell
cube
 This involves precomputing the cuboids for only
a small number of dimensions (e.g., three to
five) of a data cube

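As a rough illustration of the iceberg idea (my sketch, with made-up rows), the following Python computes one cuboid and keeps only the cells whose count meets a minimum support threshold:

```python
from collections import Counter

def iceberg_cuboid(rows, group_by, min_support):
    """Aggregate counts for one group-by and keep only the cells whose
    count is at least min_support (the iceberg condition)."""
    counts = Counter(tuple(row[d] for d in group_by) for row in rows)
    return {cell: c for cell, c in counts.items() if c >= min_support}

rows = [
    {"city": "Delhi", "item": "pen", "year": 2019},
    {"city": "Delhi", "item": "pen", "year": 2019},
    {"city": "Mumbai", "item": "ink", "year": 2019},
]
print(iceberg_cuboid(rows, ("city", "item"), min_support=2))
# {('Delhi', 'pen'): 2} -- the Mumbai cell falls below the threshold and is not stored
```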
Indexing OLAP Data: Bitmap Index
 Bitmap indexing is popular because it allows quick searching
 The bitmap index is an alternative representation of
the record ID (RID) list
 In the bitmap index for a given attribute, there is a
distinct bit
vector, Bv, for each value v in the attribute’s
domain
 If a given attribute’s domain consists of n values,
then n bits are needed for each entry in the bitmap
index (i.e., there are n bit vectors)
 If the attribute has the value v for a given row in the data table, the bit representing that value is set to 1 in the corresponding row of the bitmap index, and all other bits in that row are set to 0
Example
Base table:
Cust   Region    Type
C1     Asia      Retail
C2     Europe    Dealer
C3     Asia      Dealer
C4     America   Retail
C5     Europe    Dealer

Bitmap index on Region:
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Bitmap index on Type:
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
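A minimal Python sketch (not from the slides) of how such a bitmap index could be built for one attribute:

```python
def build_bitmap_index(rows, attribute):
    """Return one bit vector per distinct attribute value;
    bit i is 1 iff row i has that value, 0 otherwise."""
    values = sorted({row[attribute] for row in rows})
    return {v: [1 if row[attribute] == v else 0 for row in rows] for v in values}

rows = [
    {"Cust": "C1", "Region": "Asia",    "Type": "Retail"},
    {"Cust": "C2", "Region": "Europe",  "Type": "Dealer"},
    {"Cust": "C3", "Region": "Asia",    "Type": "Dealer"},
    {"Cust": "C4", "Region": "America", "Type": "Retail"},
    {"Cust": "C5", "Region": "Europe",  "Type": "Dealer"},
]
print(build_bitmap_index(rows, "Region")["Asia"])   # [1, 0, 1, 0, 0]
print(build_bitmap_index(rows, "Type")["Dealer"])   # [0, 1, 1, 0, 1]
```

A query such as "Asian dealers" then reduces to a bitwise AND of the two vectors.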
Indexing OLAP Data: Join Indices
 The join indexing method gained popularity from
its use in relational database query processing
 Traditional indexing maps the value in a given
column to a list of rows having that value
 In contrast, join indexing registers the joinable rows of two relations from a relational database
 Join indexing is especially useful for maintaining the relationship between a foreign key and its matching primary keys from the joinable relation
 The star schema model of data warehouses
makes join indexing attractive for crosstable
search

 The linkage between a fact table and its corresponding dimension tables comprises the fact table’s foreign key and the dimension table’s primary key
 Join indexing maintains relationships between
attribute values of a dimension (e.g., within a
dimension table) and the corresponding rows in
the fact table
 Join indices may span multiple dimensions to form
composite join indices
 We can use join indices to identify subcubes that
are of interest

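A rough Python sketch (my illustration, assuming a simple star schema with hypothetical location and sales tables) of a join index that maps dimension attribute values to the fact-table rows they join with:

```python
def build_join_index(fact_rows, dim_rows, fk, pk, dim_attr):
    """Map each value of dim_attr to the ids of the fact-table rows
    whose foreign key joins to a dimension row carrying that value."""
    attr_by_pk = {d[pk]: d[dim_attr] for d in dim_rows}
    index = {}
    for rid, fact in enumerate(fact_rows):
        index.setdefault(attr_by_pk[fact[fk]], []).append(rid)
    return index

location = [{"loc_key": 1, "city": "Chicago"}, {"loc_key": 2, "city": "Toronto"}]
sales = [
    {"loc_key": 1, "item_key": 10, "dollars": 120},
    {"loc_key": 2, "item_key": 10, "dollars": 80},
    {"loc_key": 1, "item_key": 20, "dollars": 45},
]
print(build_join_index(sales, location, fk="loc_key", pk="loc_key", dim_attr="city"))
# {'Chicago': [0, 2], 'Toronto': [1]}
```

A composite join index would key on combinations of values from several dimensions instead of a single attribute.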
Efficient Processing OLAP Queries
 Determine which operations should be performed on the available cuboids
 Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g.,
dice = selection + projection
 Determine which materialized cuboid(s) should be selected for OLAP op.
 Example:
 sales_cube [time, item, location]: sum (sales_in_dollars)
 Let the query to be processed be on {brand, province_or_state} with the
condition “year = 2004”, and there are 4 materialized cuboids available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
Which cuboid should be selected to process the query?

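A hedged sketch of the selection logic (mine, not the slides' answer; the concept hierarchies below are assumptions): a materialized cuboid can serve the query only if every queried level can be derived by rolling up from a level kept in that cuboid.

```python
# Assumed concept hierarchies, listed from finest to coarsest level
HIERARCHIES = {
    "item":     ["item_name", "brand"],
    "location": ["city", "province_or_state", "country"],
    "time":     ["day", "month", "year"],
}

def finer_or_equal(kept, wanted):
    """True if the kept level can be rolled up to the wanted level."""
    for levels in HIERARCHIES.values():
        if kept in levels and wanted in levels:
            return levels.index(kept) <= levels.index(wanted)
    return kept == wanted

def can_answer(cuboid_levels, query_levels):
    """Every queried level must be derivable from some level in the cuboid."""
    return all(any(finer_or_equal(k, w) for k in cuboid_levels) for w in query_levels)

query = ["brand", "province_or_state", "year"]   # year is needed for the year = 2004 selection
cuboids = {
    1: ["year", "item_name", "city"],
    2: ["year", "brand", "country"],
    3: ["year", "brand", "province_or_state"],
    4: ["item_name", "province_or_state", "year"],  # cuboid 4 is already restricted to year = 2004
}
print([cid for cid, c in cuboids.items() if can_answer(c, query)])  # [1, 3, 4]
```

Among the usable cuboids, the one that needs the least further aggregation (and the fewest cells to scan, given the available indexes) would normally be preferred.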
Practice Question:
 Suppose that a data warehouse for Big University
consists of the four dimensions student, course,
semester, and instructor, and two measures count
and avg grade. At the lowest conceptual level (e.g.,
for a given student, course, semester, and instructor
combination), the avg grade measure stores the
actual course grade of the student. At higher
conceptual levels, avg grade stores the average
grade for the given combination.
 Draw a snowflake schema diagram for the data
warehouse
 If each dimension has five levels (including all),
how many cuboids will this cube contain (including
the base and apex cuboids)?
Data Generalization
 Data Generalization summarizes data by replacing relatively low-
level values with higher level concepts
For example: Variable age can be generalized as young, middle-
aged, senior
 It can also be done by reducing the number of dimensions
For example: Telephone number, date of birth can be removed
when collecting data for student behavior
 Concept description generates descriptions for data
characterization and comparison
 It is sometimes called class description when the concept to be
described refers to a class of objects
 Characterization provides a concise and succinct summarization of the given data collection, while concept or class comparison (also known as discrimination) provides descriptions comparing two or more data collections
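A tiny sketch (my own, with illustrative cut-off ages) of generalizing a low-level value to a higher-level concept:

```python
def generalize_age(age):
    """Replace a numeric age with a higher-level concept;
    the boundaries used here are purely illustrative."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle_aged"
    return "senior"

print([generalize_age(a) for a in [21, 45, 67]])  # ['young', 'middle_aged', 'senior']
```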
Attribute Oriented Induction
 Proposed in 1989
 Basically, a query-oriented, generalization-based,
online data analysis technique.
 General Idea:
 Collect the task-relevant data (initial relation) using a
relational database query
 Perform generalization based on the examination of
the number of each attribute’s distinct values in the
relevant data set
 Apply aggregation by merging identical, generalized
tuples and accumulating their respective counts. This
reduces the size of the generalized data set
 Interaction with users for knowledge presentation in
different forms
Basic Principles: Steps
 Data focusing: It should be performed before attribute
oriented induction. This step collects task-relevant data
based on information provided in data mining query and the
result is called the initial working relation.
 Data Generalization: It is performed by two ways:
 Attribute removal: Remove attribute A if there is a large set of distinct values for A and either
 Case 1 – there is no generalization operator on A (no concept hierarchy), or
 Case 2 – A’s higher-level concepts are expressed in terms of other attributes
 Attribute generalization: If there is a large set of
distinct values for an attribute in the initial working
relation, and there exists a set of generalization operators
on the attribute, then a generalization operator should be
selected and applied to the attribute
 Attribute generalization threshold control:
 Either sets one generalization threshold for all of the
attributes, or
 Sets one threshold for each attribute
 Data mining systems typically have a default attribute
threshold value generally ranging from 2 to 8
 If a user feels that the generalization reaches too high a
level for a particular attribute, the threshold can be
increased. This corresponds to drilling down along the
attribute.
 Also, to further generalize a relation, the user can reduce
an attribute’s threshold, which corresponds to rolling up
along the attribute.
 Generalized relation threshold control:
 It sets a threshold for the generalized relation
 If the number of (distinct) tuples in the generalized relation
is greater than the threshold, further generalization should
be performed.
 Otherwise, no further generalization should be performed.
 Such a threshold may also be preset in the data mining
system (usually within a range of 10 to 30),
 For example, if a user feels that the generalized relation is
too
small, he or she can increase the threshold, which implies
drilling down.
 Otherwise, to further generalize a relation, the threshold can
be reduced, which implies rolling up.
 These two techniques can be applied in sequence: first apply the attribute threshold control technique to generalize each attribute, and then apply relation threshold control to further reduce the size of the generalized relation
 The aggregate function, count(), is associated with
each database tuple and is initialized to 1
 Through attribute removal and attribute
generalization, tuples within the initial working
relation may be generalized, resulting in groups of
identical tuples.
 In this case, all of the identical tuples forming a
group should be merged into one tuple.
 The count of this new, generalized tuple is set to the
total number of tuples from the initial working
relation that are represented by (i.e., merged into)
the new generalized tuple.

Algorithm for Attribute Oriented
Induction
 Initial_Relation, W: Query processing of task-relevant
data, deriving the initial relation.
W← get_task_relevant_data.
 Prepare_for_Generalization (W): Scan W, based on the
analysis of the number of distinct values in each attribute,
determine generalization plan for each attribute: removal,
generalization, attribute threshold value.
 Prime_Generalized_Relation, P: Based on the plan in
previous step, perform generalization to the right level to
derive a “prime generalized relation”.
P← generalization (W)
For each generalized tuple, insert the tuple into a sorted
prime relation. If tuple is already in P then increase its
count.
 Presentation: User interaction: adjust levels by drilling or rolling up, and present the results as generalized relations, cross tabulations, or visualizations
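A compact Python sketch of the core generalize-and-merge step (my illustration; the attribute names and concept hierarchies are assumptions):

```python
from collections import Counter

# Assumed concept hierarchies: low-level value -> higher-level concept
HIERARCHY = {
    "major": {"Physics": "Science", "Chemistry": "Science", "Painting": "Arts"},
    "gpa":   {"3.9": "excellent", "3.7": "excellent", "3.1": "good"},
}

def generalize(initial_relation, keep_attrs):
    """Drop unlisted attributes, generalize the kept ones via the hierarchy,
    then merge identical generalized tuples and accumulate their counts."""
    prime_relation = Counter()
    for t in initial_relation:
        g = tuple(HIERARCHY.get(a, {}).get(t[a], t[a]) for a in keep_attrs)
        prime_relation[g] += 1
    return prime_relation

initial_relation = [
    {"name": "Ann", "major": "Physics",   "gpa": "3.9"},
    {"name": "Bob", "major": "Chemistry", "gpa": "3.7"},
    {"name": "Cid", "major": "Painting",  "gpa": "3.1"},
]
# 'name' is removed (many distinct values, no concept hierarchy); the rest are generalized
print(generalize(initial_relation, ["major", "gpa"]))
# Counter({('Science', 'excellent'): 2, ('Arts', 'good'): 1})
```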
Example
Query: Describe general characteristics of graduate
students in a University database

use University_DB
mine characteristics as “Science”, “Arts”, “Engg”, etc.
in relevance to name, gender, major, birth_place, birth_date,
residence, phone#, gpa
from student
where status in “graduate”

Corresponding SQL statement:

Select name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in {“Msc”, “MBA”, “PhD”}
Initial Working Relation

Prime Generalized Relation


Presentation of Generalized Results
• Generalized relation:
– Relations where some or all attributes are generalized, with
counts or other aggregation values accumulated.
• Presentation:
– Mapping results into cross tabulation form (similar to
contingency tables).
– Visualization techniques:
– Pie charts, bar charts, curves, cubes, and other visual
forms.
Gender \ Birth_Region   Canada   Foreign   Total
M                       16       14        30
F                       10       22        32
Total                   26       36        62
Mining Class Comparisons
 Comparison: Comparing two or more classes
 Method:
 Partition the set of relevant data into the target class and
the contrasting class
 Generalize both classes to the same high level concepts
 Compare tuples with the same high level descriptions
 Present for every tuple its description and two measures
 support - distribution within single class
 comparison - distribution between classes
 Highlight the tuples with strong discriminant features
 Relevance Analysis:
 Find attributes (features) which best distinguish different
classes
How is class comparison performed?
 Data collection:
 The set of relevant data in the database is collected by
query processing
 Partitioned respectively into a target class and one or a
set of contrasting classes
 Dimension relevance analysis:
 Select only the highly relevant dimensions for further
analysis
 Synchronous generalization:
 Generalization is performed on the target class to the
level controlled by a user- or expert-specified dimension
threshold, which results in a prime target class
relation
 The concepts in the contrasting class(es) are generalized to the same level as those in the prime target class relation, forming the prime contrasting class(es) relation
 Presentation of the derived comparison:
 The resulting class comparison description can be
visualized in the form of tables, graphs, and rules
 This presentation usually includes a “contrasting”
measure such as count% (percentage count) that
reflects the comparison between the target and
contrasting classes

Example
 Suppose that you would like to compare the general
properties of the graduate and undergraduate students at
Big University, given the attributes name, gender, major,
birth place, birth date, residence, phone#, and gpa
 use Big University DB
mine comparison as “grad vs undergrad students”
in relevance to name, gender, major, birth place, birth date,
residence,
phone#, gpa
for “graduate students”
where status in “graduate”
versus “undergraduate students”
where status in “undergraduate”
analyze count%
from student

Initial Working Relation: target class (graduate students)

Initial Working Relation: comparison class (undergraduate students)
Prime Generalized Relation for the Target Class

Prime Generalized Relation for the Comparison Class
 The discriminative features of the target and contrasting
classes of a comparison description can be described
quantitatively by a quantitative discriminant rule
 Associates a statistical interestingness measure, d-
weight, with each generalized tuple in the description
 Let qa be a generalized tuple, and Cj be the target class,
where qa covers some tuples of the target class
 The d-weight for qa is the ratio of the number of tuples
from the initial target class working relation that are
covered by qa to the total number of tuples in both the
initial target class and contrasting class working relations
that are covered by qa

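Written as a formula (my notation, following the definition above): for a generalized tuple q_a, target class C_j, and m classes in total,

d\text{-}weight(q_a) = \frac{count(q_a \in C_j)}{\sum_{i=1}^{m} count(q_a \in C_i)}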
Example

• d-weight = 90 / (90 + 210) = 30%
• d-weight = 210 / (90 + 210) = 70%
Complex Aggregation at Multiple
Granularities
 Multi-feature cubes enable more in-depth analysis
 They can compute more complex queries of which the
measures depend on groupings of multiple aggregates
at varying granularity levels
 The queries posed can be much more elaborate and
task-specific than traditional queries
 Many complex data mining queries can be answered
by multi-feature cubes without significant increase in
computational cost, in comparison to cube
computation for simple queries with traditional data
cubes

Example:
 Query: Grouping by all subsets of item, region, month, find the
maximum price in 2010 for each group and the total sales
among all maximum price tuples
 select item, region, month, max(price), sum(R.sales)
from Purchases
where year = 2010
cube by item, region, month: R
such that R.price = max(price)
 The resulting cube is a multi-feature cube in that it supports
complex data mining queries for which multiple dependent
aggregates are computed at a variety of granularities
 The sum of sales returned in this query is dependent on the set
of maximum price tuples for each group
 In general, multi-feature cubes give users the flexibility to define
sophisticated, task-specific cubes on which multidimensional
aggregation and OLAP-based mining can be performed

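As a rough sketch of what the dependent aggregate means (my Python, for a single grouping rather than the full cube by over all subsets):

```python
from collections import defaultdict

def max_price_then_sales(purchases, group_by):
    """For each group, find the maximum price, then sum sales only over
    the tuples that attain that maximum price."""
    groups = defaultdict(list)
    for p in purchases:
        groups[tuple(p[g] for g in group_by)].append(p)
    result = {}
    for key, rows in groups.items():
        top = max(r["price"] for r in rows)
        result[key] = (top, sum(r["sales"] for r in rows if r["price"] == top))
    return result

purchases = [  # hypothetical tuples already restricted to year = 2010
    {"item": "TV", "region": "East", "month": 1, "price": 500, "sales": 3},
    {"item": "TV", "region": "East", "month": 1, "price": 500, "sales": 2},
    {"item": "TV", "region": "East", "month": 1, "price": 450, "sales": 9},
]
print(max_price_then_sales(purchases, ("item", "region", "month")))
# {('TV', 'East', 1): (500, 5)} -- sales summed over the max-price tuples only
```

The full multi-feature cube repeats this for every subset of {item, region, month}, as the cube by clause specifies.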
Data Warehouse Backend Tools
 Data warehouse systems use back-end tools and
utilities to populate and refresh their data
 These tools and utilities include the following
functions:
 Data extraction, which typically gathers data from
multiple, heterogeneous, and external sources
 Data cleaning, which detects errors in the data and
rectifies them when possible
 Data transformation, which converts data from
legacy or host format to warehouse format
 Load, which sorts, summarizes, consolidates,
computes views, checks integrity, and builds
indices and partitions

 Refresh, which propagates the updates from the data sources to the warehouse
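A highly simplified sketch (mine, with made-up source data and field names) of how these back-end steps chain together:

```python
def extract(sources):
    """Gather rows from multiple (here: in-memory) sources."""
    return [row for source in sources for row in source]

def clean(rows):
    """Drop rows with a missing key; a real cleaner would also rectify errors."""
    return [r for r in rows if r.get("cust_id") is not None]

def transform(rows):
    """Convert legacy field names and formats to the warehouse schema."""
    return [{"customer_key": r["cust_id"], "amount": float(r["amt"])} for r in rows]

def load(rows, warehouse):
    """Append rows; a real loader would also sort, summarize, and build indexes."""
    warehouse.extend(rows)

warehouse = []
sources = [
    [{"cust_id": 1, "amt": "10.5"}, {"cust_id": None, "amt": "3"}],
    [{"cust_id": 2, "amt": "7.25"}],
]
load(transform(clean(extract(sources))), warehouse)
print(warehouse)  # [{'customer_key': 1, 'amount': 10.5}, {'customer_key': 2, 'amount': 7.25}]
```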
Tuning Data Warehouse
 Tuning a data warehouse means to enhance performance in
terms of
 Average query response time
 Scan rates
 I/O throughput rates
 Time used per query
 Memory usage per process
 It is necessary to specify the measures in the service level agreement (SLA)
 It is of no use trying to tune response times if they are already better than those required
 It is essential to have realistic expectations while making
performance assessment
 It is also essential that the users have feasible expectations
 It is also possible that a user will write a query you had not tuned for
Tuning Data Load
 Data load is a critical part as this is the entry point into the
system
 If there is a delay in transferring the data, or in arrival of data
then the entire system is affected badly. Therefore it is very
important to tune the data load first.
 There are various approaches of tuning data load that are
discussed below:
 The very common approach is to insert data using the SQL
Layer
 In this approach, normal checks and constraints need to be
performed
 When the data is inserted into the table, the code will run to
check for enough space to insert the data
 If sufficient space is not available, then more space may have
to be allocated to these tables
 The second approach is to bypass all these checks and constraints and place the data directly into the preformatted blocks
 The third approach is that, while loading data into a table that already contains data, we can maintain the indexes
 The fourth approach says that to load the data in
tables that already contain data, drop the indexes &
recreate them when the data load is complete
 The choice between the third and the fourth
approach depends on how much data is already
loaded and how many indexes need to be rebuilt
 Integrity Checks
 Integrity checking highly affects the performance of
the load
 Integrity checks need to be limited because they
require heavy processing power
 Integrity checks should be applied on the source system
Tuning Queries
 Fixed Queries
 Fixed queries are well defined
 Regular reports
 Canned queries
 Common aggregations
 Tuning the fixed queries in a data warehouse is the same as in a relational database system
 The only difference is that the amount of data to be queried may be different
 It is good to store the most successful execution plan while testing fixed queries
 Storing these execution plans will allow us to spot changes in data size and data skew, as these will cause the execution plan to change
 Ad hoc Queries
 To understand ad hoc queries, it is important to know
the ad hoc users of the data warehouse
 The number of users in the group
 Whether they use ad hoc queries at regular intervals
of time or unknown intervals or frequently
 The maximum size of query they tend to run
 The average size of query they tend to run
 Whether they require drill-down access to the base
data
 The elapsed login time per day
 The peak time of daily usage
 The number of queries they run per peak hour

Difficulties in Data Warehouse Tuning
 Tuning a data warehouse is a difficult procedure due
to following reasons:
 Data warehouse is dynamic; it never remains
constant
 It is very difficult to predict what query the user is
going to post in the future
 Business requirements change with time
 Users and their profiles keep changing
 The user can switch from one group to another
 The data load on the warehouse also changes with
time

Testing Data Warehouse
 Logistics of testing a data warehouse include:
 Scheduling software
 Day-to-day operational procedures
 Backup recovery strategy
 Management and scheduling tools
 Overnight processing
 Query performance
 There are three basic levels of testing performed
on a data warehouse
 Unit testing
 Integration testing
 System testing

 Unit Testing
 In unit testing, each component is separately
tested
 Each module, i.e., each procedure, program, SQL script, or Unix shell script, is tested
 This test is performed by the developer

 Integration Testing
 In integration testing, the various modules of the application are brought together and then tested against a number of inputs
 It is performed to test whether the various
components do well after integration
 System Testing
 In system testing, the whole data warehouse
application is tested together
 The purpose of system testing is to check whether
the entire system works correctly together or not
 System testing is performed by the testing team
 Since the size of the whole data warehouse is very
large, it is usually possible to perform minimal
system testing before the test plan can be
enacted

Developing Test Plan
 Test Schedule
 In this schedule, we predict the estimated time
required for the testing of the entire data warehouse
 Also the data warehouse system is evolving in
nature. One may face the following issues while
creating a test schedule:
 A simple problem may involve a very large query that takes a day or more to complete, i.e., the query does not complete in the desired time scale
 There may be hardware failures such as losing a disk
or human errors such as accidentally deleting a table
or overwriting a large table

 Testing Operational Environment
 Security − A separate security document is required for security testing. This document contains a list of disallowed operations and is used for devising tests for each of them
 Scheduler − Scheduling software is required to
control the daily operations of a data warehouse. The
scheduling software requires an interface with the
data warehouse, which will need the scheduler to
control overnight processing and the management of
aggregations
 Disk Configuration − Disk configuration also needs to be tested to identify I/O bottlenecks. The test should be performed multiple times with different settings
 Management Tools − It is required to test all the management tools during system testing
 Testing Backup Recovery
 Media failure
 Loss or damage of table space, data file, log file, control file, or archive file
 Instance failure
 Loss or damage of table
 Failure during data movement

 Testing the Database
 Querying in parallel
 Creating indexes in parallel
 Loading data in parallel

 Testing database performance
 There are sets of fixed queries that need to be run
regularly and they should be tested
 To test ad hoc queries, one should go through the user
requirement document and understand the business
completely
 Take time to test the most awkward queries that the
business is likely to ask against different index and
aggregation strategies
 Testing the Application
 All the modules should be integrated correctly and work in
order to ensure that the end-to-end load, index, aggregate
and queries work as per the expectations
 Each function of each manager should work correctly
 It is also necessary to test the application over a period of time
Thank you!!!
