0% found this document useful (0 votes)
45 views34 pages

Lecture 03

Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
45 views34 pages

Lecture 03

Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 34

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

TIES443

Lecture 3

Data Warehousing
Mykola Pechenizkiy
Course webpage: http://www.cs.jyu.fi/~mpechen/TIES443
November 3, 2006
Department of Mathematical Information Technology University of Jyvskyl
TIES443: Introduction to DM Lecture 3: Data Warehousing

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Topics for today


What is a data warehouse? Data warehouse architectures
Conceptual DW Modelling Physical DW Modelling

A multi-dimensional data model


Data Cubes

OLAP
12 Codds rules for OLAP Main OLAP operations
New buzzwords

Data warehouse implementation and maintenance

TIES443: Introduction to DM

Lecture 3: Data Warehousing

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Data Warehouse
A decision support DB that is maintained separately from the organizations operational databases. Why Separate Data Warehouse?
High performance for both systems
DBMS tuned for OLTP
access methods, indexing, concurrency control, recovery

Warehousetuned for OLAP


complex OLAP queries, multidimensional view, consolidation.

Different functions and different data


Missing data: Decision support requires historical data which operational DBs do not typically maintain Data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
TIES443: Introduction to DM Lecture 3: Data Warehousing

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Three-Tier Architecture
other Metadata

sources
Operational Extract Transform Load Refresh

Monitor & Integrator

OLAP Server

Analysis

Query/Reporting

DBs

Data Warehouse

Serve

Data Mining

Data Marts

ROLAP Server

Data Sources
TIES443: Introduction to DM

Data Storage
Lecture 3: Data Warehousing

OLAP Engine Front-End Tools


4

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Three-Tier Architecture
Warehouse database server
Almost always a relational DBMS, rarely flat files Schema design Specialized scan, indexing and join techniques Handling of aggregate views (querying and materialization) Supporting query language extensions beyond SQL Complex query processing and optimization Data partitioning and parallelism

OLAP servers
Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational operators Multidimensional OLAP (MOLAP): special-purpose server that directly implements multidimensional data and operations Hybrid OLAP (HOLAP): user flexibility, e.g., low level: relational, high-level: array Specialized SQL servers: specialized support for SQL queries over star/snowflake schemas

Clients
Query and reporting tools Analysis tools Data mining tools
TIES443: Introduction to DM Lecture 3: Data Warehousing

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Warehouse Physical Architectures


Client Client Client

Central Data Warehouse

Source

Source Logical Data Warehouse Physical Data Warehouse

Centralized architecture

Source

Source Source Source

Federated architecture
TIES443: Introduction to DM Lecture 3: Data Warehousing

Tiered architecture
6

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Three Data Warehouse Models


Enterprise warehouse: collects all information about subjects (customers,products,sales,assets, personnel) that span the entire organization
Requires extensive business modeling (may take years to design and build)

Data Marts: Departmental subsets that focus on selected subjects


Marketing data mart: customer, product, sales Faster roll out, but complex integration in the long run

Virtual warehouse: views over operational DBs


Materialize selective summary views for efficient query processing Easy to build but require excess capability on operat. db servers

TIES443: Introduction to DM

Lecture 3: Data Warehousing

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

TIES443: Introduction to DM

Lecture 3: Data Warehousing

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Data Warehouse
A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making

TIES443: Introduction to DM

Lecture 3: Data Warehousing

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Data WarehouseSubject-Oriented
Organized around major subjects, such as customer, product, sales Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

TIES443: Introduction to DM

Lecture 3: Data Warehousing

10

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Data WarehouseIntegrated
Constructed by integrating multiple, heterogeneous data sources
relational databases, flat files, on-line transaction records

Data cleaning and data integration techniques are applied


Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
E.g., Hotel price: currency, tax, breakfast covered, etc.

When data is moved to the warehouse, it is converted

TIES443: Introduction to DM

Lecture 3: Data Warehousing

11

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Data WarehouseTime Variant


The time horizon for the data warehouse is significantly longer than that of operational systems
Operational database: current value data Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)

Every key structure in the data warehouse


Contains an element of time, explicitly or implicitly But the key of operational data may or may not contain time element

TIES443: Introduction to DM

Lecture 3: Data Warehousing

12

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Data WarehouseNon-Volatile
A physically separate store of data transformed from the operational environment Operational update of data does not occur in the data warehouse environment
Does not require transaction processing, recovery, and concurrency control mechanisms Requires only two operations in data accessing: initial loading of data and access of data

TIES443: Introduction to DM

Lecture 3: Data Warehousing

13

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Data Warehouse vs. Heterogeneous DBMS


Traditional heterogeneous DB integration
Build wrappers/mediators on top of heterogeneous databases Query driven approach When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites involved, and the results are integrated into a global answer set Complex information filtering, compete for resources

Data warehouse
update-driven, high performance
Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis

TIES443: Introduction to DM

Lecture 3: Data Warehousing

14

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Data Warehouse vs. Operational DBMS


OLTP (On-Line Transaction Processing)
Major task of traditional relational DBMS Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

OLAP (On-Line Analytical Processing)


Major task of data warehouse system Data analysis and decision making

Distinct features (OLTP vs. OLAP):


User and system orientation: customer vs. market Data contents: current, detailed vs. historical, consolidated Database design: ER + application vs. star + subject View: current, local vs. evolutionary, integrated Access patterns: update vs. read-only but complex queries
TIES443: Introduction to DM Lecture 3: Data Warehousing

15

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Conceptual Modeling of Data Warehouses

ER design techniques not appropriate - design should reflect multidimensional view


Star Schema
A fact table in the middle connected to a set of dimension tables A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

Snowflake Schema

Fact Constellation Schema

TIES443: Introduction to DM

Lecture 3: Data Warehousing

16

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Example of a Star Schema


Order Product ProductNO ProdName Fact Table ProdDescr Category CategoryDescription UnitPrice Date DateKey Date City CityName State Country

Order No Order Date


Customer Customer No Customer Name Customer Address City Salesperson
SalespersonID SalespersonName City Quota

OrderNO SalespersonID CustomerNO ProdNo DateKey CityName Quantity Total Price

TIES443: Introduction to DM

Lecture 3: Data Warehousing

17

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Star Schema
A single fact table and a single table for each dimension Every fact points to one tuple in each of the dimensions and has additional attributes Does not capture hierarchies directly Generated keys are used for performance and maintenance reasons Fact constellation: Multiple Fact tables that share many dimension tables
Example: Projected expense and the actual expense may share dimensional tables

TIES443: Introduction to DM

Lecture 3: Data Warehousing

18

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Some Terms
Relation, which relates the dimensions to the measure of interest, is called the fact table (e.g. sale) Information about dimensions can be represented as a collection of relations called the dimension tables (product, customer, store) Each dimension can have a set of associated attributes For each dimension, the set of associated attributes can be structured as a hierarchy

TIES443: Introduction to DM

Lecture 3: Data Warehousing

19

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

A Concept Hierarchy: Dimension (location)


all region North_America all ... Europe

country

Canada

...

Mexico

Ireland

...

France

city office
TIES443: Introduction to DM

Toronto

...

Dublin Belfield ...

...

Belfast

Blackrock
20

Lecture 3: Data Warehousing

10

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Example of a Snowflake Schema


Order

Order No Order Date


Fact Table Customer Customer No Customer Name Customer Address City Salesperson
SalespersonID SalespersonName City Quota

Product ProductNO ProdName ProdDescr Category Category UnitPrice Date DateKey Date Month City CityName State Country

Category CategoryName CategoryDescr

OrderNO SalespersonID CustomerNO ProdNo DateKey CityName Quantity Total Price

Month Month Year State StateName Country Year Year

TIES443: Introduction to DM

Lecture 3: Data Warehousing

21

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Example of Fact Constellation


Multiple fact tables share dimension tables
Shipping Fact Table Time time_key day day_of_the_week month quarter year Branch branch_key branch_name branch_type Time_key

Sales Fact Table


Time_key Item_key Branch_key Location_key Unit_sold Euros_sold Avg_sales

Item item_key item_name brand type supplier_key Location location_key street city Province/street country

Item_key shipper_key from_location to_location Euros_sold unit_shipped

shipper
shipper_key shipper_name location_key shipper_type
22

Measures
TIES443: Introduction to DM Lecture 3: Data Warehousing

11

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Multidimensional Data Model


Fact relation
sale Product Client p1 c1 p2 c1 p1 c3 p2 c2 Amt 12 11 50 8

Two-dimensional cube

p1 p2

c1 12 11

c2 8

c3 50

Fact relation
sale Product Client p1 c1 p2 c1 p1 c3 p2 c2 p1 c1 p1 c2 Date 1 1 1 1 2 2 Amt 12 11 50 8 44 4

3-dimensional cube

day 2 day 1

c1 c2 c3 p1 44 4 p2 c1 c2 c3 p1 12 50 p2 11 8
23

TIES443: Introduction to DM

Lecture 3: Data Warehousing

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Multidimensional Data
Sales volume as a function of product, month, and region Dimensions: Product, Location, Time
Re gi on
Hierarchical summarization paths Industry Region Year

Product

Category Country Quarter Product City Office Month Day Week

Month
TIES443: Introduction to DM Lecture 3: Data Warehousing

24

12

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

From Tables to Data Cubes


A data warehouse is based on
multidimensional data model which views data in the form of a data cube

A data cube allows data to be modeled and viewed in multiple dimensions (such as sales)
Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables

Definitions
an n-Dimensional base cube is called a base cuboid The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid The lattice of cuboids forms a data cube
TIES443: Introduction to DM Lecture 3: Data Warehousing

25

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Cube: A Lattice of Cuboids


all time item location supplier 1-D cuboids
time,item time,location item,location location,supplier

0-D(apex) cuboid

time,supplier time,item,location

item,supplier

2-D cuboids

time,location,supplier

3-D cuboids
item,location,supplier

time,item,supplier

4-D(base) cuboid
time, item, location, supplier
TIES443: Introduction to DM Lecture 3: Data Warehousing

26

13

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

A Sample Data Cube


Date
3Qtr 4Qtr sum Total annual sales of TV in Ireland Ireland France Germany sum

TIES443: Introduction to DM

Lecture 3: Data Warehousing

Country

TV PC VCR sum

Pr od u

1Qtr

2Qtr

ct

27

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Typical OLAP Operations


Roll up (drill-up): summarize data
by climbing up hierarchy or by dimension reduction

Drill down (roll down): reverse of roll-up


from higher level summary to lower level summary or detailed data, or introducing new dimensions

Slice and dice


project and select

Pivot (rotate)
reorient the cube, visualization, 3D to series of 2D planes.

Other operations
drill across: involving (across) more than one fact table drill through: through the bottom level of the cube to its back-end relational tables (using SQL) rankings time functions: e.g. time avg.
TIES443: Introduction to DM Lecture 3: Data Warehousing

28

14

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Typical OLAP Operations: Drill & Roll


Drill Down

Total Sales Total Sales per city Total Sales per city per store Total Sales per city per store per month

Drill Up

Drill Down

Total Sales Total Sales per city Total Sales per city by category
Drill Across

Drill Up

TIES443: Introduction to DM

Lecture 3: Data Warehousing

29

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

OLAP Queries: Rollup & Drill-Down

day 2

day 1

c1 c2 c3 p1 44 4 p2 c1 c2 c3 p1 12 50 p2 11 8

p1 p2

c1 56 11

c2 4 8

c3 50

sum

c1 67

c2 12

c3 50

129
p1 p2 sum 110 19

rollup drill-down
TIES443: Introduction to DM Lecture 3: Data Warehousing

30

15

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Typical OLAP Operations: Pivoting


Pivoting can be combined with aggregation
sale prodId clientid p1 c1 p2 c1 p1 c3 p2 c2 p1 c1 p1 c2 date 1 1 1 1 2 2 amt 12 11 50 8 44 4

day 2 day 1

c1 c2 c3 p1 44 4 p2 c1 c2 c3 p1 12 50 p2 11 8

1 2 Sum

c1 23 44 67

c2 8 4 12

c3 50 50

Sum 81 48 129

p1 p2 Sum

c1 56 11 67

c2 4 8 12

c3 50 50

Sum 110 19 129

The result of pivoting is called a cross-tabulation


TIES443: Introduction to DM Lecture 3: Data Warehousing

31

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Browsing a Data Cube


Visualization OLAP capabilities Interactive manipulation

TIES443: Introduction to DM

Lecture 3: Data Warehousing

32

16

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

12 Codds Rules for OLAP


1. Multi-Dimensional Concept View
The user should be able to see the data as being multidimensional insofar as it should be easy to 'pivot' or 'slice and dice. (See later.)

2. Transparency
The OLAP functionality should be provided behind the user's existing software without adversely affecting the functionality of the 'host, i.e. OLAP server should shield the user for the complexity of the data and application

3. Accessibility
OLAP should allow the user to access diverse data stores (relational, nonrelational and legacy systems) but see the data within a common 'schema provided by the OLAP tool, i.e. Users shouldnt have to know the location, type or layout of the data to access it. OLAP server should automate the mapping of the logical schema to the physical data

4. Consistent Reporting Performance


There should not be significant degradation in performance with large numbers of dimensions or large quantities of data.
TIES443: Introduction to DM Lecture 3: Data Warehousing

33

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

12 Codds Rules for OLAP


5. Client-Server Architecture
Since much of the data is on mainframes, and the users work on PCs, the OLAP tool must be able to bring the two together Different clients can be used Data sources must be transparently supported by the OLAP server

6. Generic Dimensionality
Data dimensions must all be treated equally. Functions available for one dimension must be available for others.

7. Dynamic Sparse Matrix Handling


The OLAP tool should be able to work out for itself the most efficient way to store sparse matrix data.

8. Multi User Support


access, integrity, security

TIES443: Introduction to DM

Lecture 3: Data Warehousing

34

17

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

12 Codds Rules for OLAP


9. Unrestricted Cross-Dimensional Operations
e.g., individual office overheads are allocated according to total corporate overheads divided in proportion to individual office sales. Non-additive formulas cause the problems
Contribution = Revenue - Total Costs Margin Percentage = Margin / Revenue

10. Intuitive Data Manipulation


Navigation should be done by operations on individual cells rather than menus. Dimensions defined should allow automatic reorientation, drill-down, zoom-out, etc Interface must be intuitive

11. Flexible Reporting


Row and column headings must be capable of more than one dimension each, and of displaying subsets of any dimension.

12. Unlimited Dimensions and Aggregation Levels


15 - 20 dimensions are required in modelling, and within each there may be many hierarchical levels, i.e. unlimited aggregation Rare in reporting to go beyond 12 dimensions, 6-7 is usual
TIES443: Introduction to DM Lecture 3: Data Warehousing

35

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

DW Back-End Tools and Utilities


Data extraction:
get data from multiple, heterogeneous, and external sources

Data cleaning:
detect errors in the data and rectify them when possible

Data transformation:
convert data from legacy or host format to warehouse format: different data formats, languages, etc.

Load:
sort, summarize, consolidate, compute views, check integrity, and build indicies and partitions

Refresh
propagate the updates from the data sources to the warehouse
TIES443: Introduction to DM Lecture 3: Data Warehousing

36

18

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

DW Information Flows
INFLOW - Processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse. UPFLOW - Processes associated with adding value to the data in the warehouse through summarizing, packaging, and distribution of the data. DOWNFLOW - Processes associated with archiving and backing-up/recovery of data in the warehouse. OUTFLOW - Processes associated with making the data available to the end-users. METAFLOW - Processes associated with the management of the metadata.
TIES443: Introduction to DM Lecture 3: Data Warehousing

37

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Data Cleaning
why?
Data warehouse contains data that is analyzed for business decisions More data and multiple sources could mean more errors in the data and harder to trace such errors Results in incorrect analysis

finding and resolving inconsistency in the source data detecting data anomalies and rectifying them early has huge payoffs Important to identify tools that work together well Long Term Solution
Change business practices and data entry tools Repository for meta-data
TIES443: Introduction to DM Lecture 3: Data Warehousing

38

19

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Data Cleaning Techniques


Transformation Rules
Example: translate gender to sex

Uses domain-specific knowledge to do scrubbing Parsing and fuzzy matching


Multiple data sources (can designate a preferred source)

Discover facts that flag unusual patterns (auditing)


Some dealer has never received a single complaint

TIES443: Introduction to DM

Lecture 3: Data Warehousing

39

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Load
Issues:
huge volumes of data to be loaded small time window (usually at night) when the warehouse can be taken off-line When to build indexes and summary tables allow system administrator to monitor status, cancel suspend, resume load, or change load rate restart after failure with no loss of data integrity

Techniques:
batch load utility: sort input records on clustering key and use sequential I/O; build indexes and derived tables sequential loads still too long (~100 days for TB) use parallelism and incremental techniques

TIES443: Introduction to DM

Lecture 3: Data Warehousing

40

20

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Refresh
when to refresh
on every update: too expensive, only necessary if OLAP queries need current data (e.g., up-the-minute stock quotes) periodically (e.g., every 24 hours, every week) or after significant events refresh policy set by administrator based on user needs and traffic possibly different policies for different sources

how to refresh
Full extract from base tables
read entire source table or database: expensive

Incremental techniques
detect & propagate changes on base tables: replication servers logical correctness transactional correctness: incremental load

TIES443: Introduction to DM

Lecture 3: Data Warehousing

41

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Metadata Repository
Administrative metadata
source databases and their contents warehouse schema, view & derived data definitions dimensions, hierarchies pre-defined queries and reports data mart locations and contents data partitions data extraction, cleansing, transformation rules, defaults data refresh and purging rules user profiles, user groups security: user authorization, access control

Business data
business terms and definitions ownership of data charging policies

Operational metadata
data lineage: history of migrated data and sequence of transf-s applied currency of data: active, archived, purged monitoring information: warehouse usage statistics, error reports, audit trails.
TIES443: Introduction to DM Lecture 3: Data Warehousing

42

21

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Design of a DW: A Business Analysis Framework


Four views regarding the design of a data warehouse
Top-down view: allows selection of the relevant information necessary for the data warehouse (mature) Bottom-up: Starts with experiments and prototypes (rapid) Data source view
exposes the information being captured, stored, and managed by operational systems

Data warehouse view


consists of fact tables and dimension tables

Business query view


sees the perspectives of data in the warehouse from the view of end-user

TIES443: Introduction to DM

Lecture 3: Data Warehousing

43

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Data Warehouse Design Process


From software engineering point of view
Waterfall: structured and systematic analysis at each step before proceeding to the next Spiral: rapid generation of increasingly functional systems, short turn around time, quick turn around

Typical data warehouse design process


Choose a business process to model, e.g., orders, invoices, etc. Choose the grain (atomic level of data) of the business process Choose the dimensions that will apply to each fact table record Choose the measure that will populate each fact table record

TIES443: Introduction to DM

Lecture 3: Data Warehousing

44

22

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

DW Design: Issues to Consider


What data is needed? Where does it come from? How to clean data? How to represent in warehouse (schema)? What to summarize? What to materialize? What to index?

TIES443: Introduction to DM

Lecture 3: Data Warehousing

45

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

DW Data Management: Issues to Consider


Meta-data Data sourcing Data quality Data security Granularity History- how long and how much? Performance

TIES443: Introduction to DM

Lecture 3: Data Warehousing

46

23

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Common DW Problems
Underestimation of resources for data loading Hidden problems with source systems Required data not captured Increased end-user demands Data homogenization High demand for resources Data ownership High maintenance Long duration projects Complexity of integration

TIES443: Introduction to DM

Lecture 3: Data Warehousing

47

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Research Issues
Data cleaning
focus on data inconsistencies, not schema differences data mining techniques

Physical Design
design of summary tables, partitions, indexes tradeoffs in use of different indexes

Query processing
selecting appropriate summary tables dynamic optimization with feedback query optimization: cost estimation, use of transformations, search strategies partitioning query processing between OLAP server and backend server.

Warehouse Management
incremental refresh techniques computing summary tables during load failure recovery during load and refresh process management: scheduling queries, load and refresh use of workflow technology for process management
Lecture 3: Data Warehousing

TIES443: Introduction to DM

48

24

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

OLAP Mining: An Integration of DM and DW


Data mining systems, DBMS, Data warehouse systems coupling
No coupling, loose-coupling, semi-tight-coupling, tight-coupling

On-line analytical mining data


integration of mining and OLAP technologies

Interactive mining multi-level knowledge


Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.

Integration of multiple mining functions


Characterized classification, first clustering and then association
TIES443: Introduction to DM Lecture 3: Data Warehousing

49

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Summary
What is a data warehouse Data warehouse architectures
Conceptual DW Modelling Physical DW Modelling

A multi-dimensional data model


Data Cubes Main OLAP operations

Data warehouse implementation and maintenance

What else did you get from this lecture?


TIES443: Introduction to DM Lecture 3: Data Warehousing

50

25

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Additional Slides

TIES443: Introduction to DM

Lecture 3: Data Warehousing

51

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Some Critics for Data Cubes


Jargon for IT professionals Metaphor for end-users
Not useful beyond introduction Users expect to see one Rubics cube
Easy to manipulate but difficult to solve

Alternatives are better


Cognos ways, Thomsens multi-dimensional domain diagrams, Buloss Adapt diagrams or simply well designed interactive reports

TIES443: Introduction to DM

Lecture 3: Data Warehousing

52

26

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

DW Development: A Recommended Approach


Multi-Tier Data Warehouse Distributed Data Marts

Data Mart

Data Mart

Enterprise Data Warehouse

Model refinement

Model refinement

Define a high-level corporate data model


TIES443: Introduction to DM Lecture 3: Data Warehousing

53

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

DMQL: Language Primitives


Nice presentation on Data Mining Query Languages can be found here: http://www.cs.wisc.edu/EDAM/slides/Data%20Mining%20Query%20Languages.ppt

Cube Definition (Fact Table)


define cube <cube_name> [<dimension_list>]: <measure_list>

Dimension Definition (Dimension Table)


define dimension <dimension_name> as (<attribute_or_subdimension_list>)

Special Case (Shared Dimension Tables)


First time as cube definition define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
TIES443: Introduction to DM Lecture 3: Data Warehousing

54

27

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Defining a Star Schema in DMQL


define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as


(time_key, day, day_of_week, month, quarter, year)

define dimension item as


(item_key, item_name, brand, type, supplier_type)

define dimension branch as


(branch_key, branch_name, branch_type)

define dimension location as


(location_key, street, city, province_or_state, country)

TIES443: Introduction to DM

Lecture 3: Data Warehousing

55

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Defining a Snowflake Schema in DMQL


define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as


(time_key, day, day_of_week, month, quarter, year)

define dimension item as


(item_key, item_name, brand, type, supplier(supplier_key, supplier_type))

define dimension branch as


(branch_key, branch_name, branch_type)

define dimension location as


(location_key, street, city(city_key, province_or_state, country))

TIES443: Introduction to DM

Lecture 3: Data Warehousing

56

28

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Defining a Fact Constellation in DMQL


define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, province_or_state, country) define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)

define dimension time as time in cube sales define dimension item as item in cube sales define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type) define dimension from_location as location in cube sales define dimension to_location as location in cube sales
TIES443: Introduction to DM Lecture 3: Data Warehousing

57

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

A Star-Net Query Model


Shipping Method Customer Orders Customer CONTRACTS AIR-EXPRESS TRUCK Time ANNUALY QTRLY DAILY CITY SALES PERSON COUNTRY DISTRICT REGION DIVISION Location ORDER PRODUCT LINE Product PRODUCT ITEM PRODUCT GROUP

Each circle is called a footprint


Lecture 3: Data Warehousing

Promotion

Organization
58

TIES443: Introduction to DM

29

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Index Structures
This and the following slides on Indexing for DW are adopted with minor modifications from: http://infolab.stanford.edu/~hector/cs245/Notes12.ppt

Indexing principle:
mapping key values to records for associative direct access

Most popular indexing techniques in relational database: B+-trees For multi-dimensional data, a large number of indexing techniques have been developed: R-trees Index structures applied in warehouses
inverted lists bit map indexes join indexes text indexes
Lecture 3: Data Warehousing

TIES443: Introduction to DM

59

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Inverted Lists
18 19
r4 r18 r34 r35 r5 r19 r37 r40

20 23

20 21 22

age index Query:


Get people with age = 20 and name = fred
TIES443: Introduction to DM Lecture 3: Data Warehousing

inverted lists

data records

List for age = 20: r4, r18, r34, r35 List for name = fred: r18, r52 Answer is intersection: r18
60

...

23 25 26

rId r4 r18 r19 r34 r35 r36 r5 r41

name joe fred sally nancy tom pat dave jeff

age 20 20 21 20 20 25 21 26

30

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Bitmap Indexes
Bitmap index: An indexing technique that has attracted attention in multi-dimensional database implementation table
Customer c1 c2 c3 c4 c5 c6 City Detroit Chicago Detroit Poznan Paris Paris Car Ford Honda Honda Ford BMW Nissan

TIES443: Introduction to DM

Lecture 3: Data Warehousing

61

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Bitmap Indexes
The index consists of bitmaps:
Index on City:
ec1 1 2 3 4 5 6 Chicago Detroit 0 1 1 0 0 1 0 0 0 0 0 0 Paris 0 0 0 0 1 1 Poznan 0 0 0 1 0 0

Index on Car:
ec1 1 2 3 4 5 6 BMW 0 1 0 0 1 0 Ford 1 0 0 1 0 0 Honda 0 1 1 0 0 0 Nissan 0 0 0 0 0 1

bitmaps

bitmaps

Index on a particular column Index consists of a number of bit vectors - bitmaps Each value in the indexed column has a bit vector (bitmaps) The length of the bit vector is the number of records in the base table The i-th bit is set if the i-th row of the base table has the value for the indexed column
TIES443: Introduction to DM Lecture 3: Data Warehousing

62

31

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Bitmap Indexes
Index on a particular column Index consists of a number of bit vectors - bitmaps Each value in the indexed column has a bit vector (bitmaps) The length of the bit vector is the number of records in the base table The i-th bit is set if the i-th row of the base table has the value for the indexed column

TIES443: Introduction to DM

Lecture 3: Data Warehousing

63

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Bitmap Index
18 19

20 23

20 21 22

1 1 0 1 1 0 0 0 0

23 25 26

0 0 1 0 0 0 1 0 1 1

id 1 2 3 4 5 6 7 8

name age joe 20 fred 20 sally 21 nancy 20 tom 20 pat 25 dave 21 jeff 26

Query: Get people with age = 20 and name = fred List for age = 20: 1101100000 List for name = fred: 0100000001 Answer is intersection: 0100000000 Suited well for domains with small cardinality

data records

age index
TIES443: Introduction to DM

bit maps
Lecture 3: Data Warehousing

...

64

32

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Bitmap Index Summary


With efficient hardware support for bitmap operations (AND, OR, XOR, NOT), bitmap index offers better access methods for certain queries e.g., selection on two attributes Some commercial products have implemented bitmap index Works poorly for high cardinality domains since the number of bitmaps increases Difficult to maintain - need reorganization when relation sizes change (new bitmaps)

TIES443: Introduction to DM

Lecture 3: Data Warehousing

65

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Join
Combine SALE, PRODUCT relations In SQL: SELECT * FROM SALE, PRODUCT
sale prodId storeId p1 c1 p2 c1 p1 c3 p2 c2 p1 c1 p1 c2 date 1 1 1 1 2 2 amt 12 11 50 8 44 4
product id p1 p2 name price bolt 10 nut 5

joinTb

prodId p1 p2 p1 p2 p1 p1

name bolt nut bolt nut bolt bolt

price 10 5 10 5 10 10

storeId c1 c1 c3 c2 c1 c2

date 1 1 1 1 2 2

amt 12 11 50 8 44 4

TIES443: Introduction to DM

Lecture 3: Data Warehousing

66

33

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Join Indexes
join index
product id p1 p2 name price bolt 10 nut 5 jIndex r1,r3,r5,r6 r2,r4

sale

rId r1 r2 r3 r4 r5 r6

prodId p1 p2 p1 p2 p1 p1

storeId c1 c1 c3 c2 c1 c2

date 1 1 1 1 2 2

amt 12 11 50 8 44 4

TIES443: Introduction to DM

Lecture 3: Data Warehousing

67

UNIVERSITY OF JYVSKYL

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

Join Indexes
Traditional indexes map the value to a list of record ids. Join indexes map the tuples in the join result of two relations to the source tables. In data warehouse cases, join indexes relate the values of the dimensions of a star schema to rows in the fact table.
For a warehouse with a Sales fact table and dimension city, a join index on city maintains for each distinct city a list of RIDs of the tuples recording the sales in the city

Join indexes can span multiple dimensions

TIES443: Introduction to DM

Lecture 3: Data Warehousing

68

34

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy