Data Warehousing 2

A data warehouse is a subject-oriented, integrated, time-variant collection of data that supports management decision making. It involves constructing and using such repositories of data. Common operations on data warehouses include OLAP (online analytical processing) which allows interactive analysis of multidimensional data. Effective indexing techniques such as bitmap indexes and bit-sliced indexes allow counting and aggregation operations to be performed efficiently on data warehouses.

Uploaded by

Asfoury Mouhcin

What Is a Data Warehouse?

•  A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile
collection of data in support of
management's decision-making process.
– W. H. Inmon
•  Data warehousing: the process of
constructing and using data warehouses

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2) 1


https://classconnection.s3.amazonaws.com/922/flashcards/636922/png/image061351738272741.png

OLAP Operations

Data Warehouse Schema Design
•  Query answering efficiency
–  Subject orientation
–  Integration
•  Tradeoff between time and space
–  Universal table versus fully normalized schema

Star Schema
•  Sales Fact Table
–  Dimension keys: time_key, item_key, branch_key, location_key
–  Measures: units_sold, dollars_sold, avg_sales
•  Dimension tables
–  time: time_key, day, day_of_the_week, month, quarter, year
–  item: item_key, item_name, brand, type, supplier_type
–  branch: branch_key, branch_name, branch_type
–  location: location_key, street, city, state_or_province, country

Snowflake Schema
•  Sales Fact Table
–  Dimension keys: time_key, item_key, branch_key, location_key
–  Measures: units_sold, dollars_sold, avg_sales
•  Dimension tables
–  time: time_key, day, day_of_the_week, month, quarter, year
–  item: item_key, item_name, brand, type, supplier_key
–  supplier: supplier_key, supplier_type
–  branch: branch_key, branch_name, branch_type
–  location: location_key, street, city_key
–  city: city_key, city, state_or_province, country

Fact Constellation
•  Sales Fact Table
–  Dimension keys: time_key, item_key, branch_key, location_key
–  Measures: units_sold, dollars_sold, avg_sales
•  Shipping Fact Table
–  Dimension keys: time_key, item_key, shipper_key, from_location, to_location
–  Measures: dollars_cost, units_shipped
•  Dimension tables
–  time: time_key, day, day_of_the_week, month, quarter, year
–  item: item_key, item_name, brand, type, supplier_type
–  branch: branch_key, branch_name, branch_type
–  location: location_key, street, city, province_or_state, country
–  shipper: shipper_key, shipper_name, location_key, shipper_type

(Good) Aggregate Functions
•  Distributive: there is a function G() such that
F({X_{i,j}}) = G({F({X_{i,j} | i = 1, ..., I}) | j = 1, ..., n})
–  Examples: COUNT(), MIN(), MAX(), SUM()
–  G = F for MIN(), MAX(), and SUM(); G = SUM() for COUNT()
•  Algebraic: there is an M-tuple valued function G()
and a function H() such that
F({X_{i,j}}) = H({G({X_{i,j} | i = 1, ..., I}) | j = 1, ..., n})
–  Examples: AVG(), standard deviation, MaxN(), MinN()
–  For AVG(), G() records the sum and count of each subset; H() adds up
these two components across subsets and divides the total sum by the
total count to produce the global average
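As a rough illustration of the algebraic case, the sketch below computes AVG() from per-partition sub-aggregates. The names AvgPartial, avg_partial, and avg_combine are hypothetical and do not come from the slides; avg_partial plays the role of G() with M = 2, and avg_combine plays the role of H().

```c
#include <assert.h>
#include <stddef.h>

/* Sub-aggregate for AVG(): a fixed-size 2-tuple (M = 2). Hypothetical name. */
typedef struct {
    double sum;
    size_t count;
} AvgPartial;

/* G(): summarize one partition as (sum, count). */
AvgPartial avg_partial(const double *x, size_t n) {
    AvgPartial p = {0.0, 0};
    for (size_t i = 0; i < n; i++) {
        p.sum += x[i];
        p.count++;
    }
    return p;
}

/* H(): add up the sums and the counts of all partitions, then divide. */
double avg_combine(const AvgPartial *parts, size_t nparts) {
    double sum = 0.0;
    size_t count = 0;
    for (size_t j = 0; j < nparts; j++) {
        sum += parts[j].sum;
        count += parts[j].count;
    }
    assert(count > 0); /* AVG() of an empty set is undefined */
    return sum / (double)count;
}
```

Averaging the per-partition averages directly would be wrong when partitions differ in size, which is why the sub-aggregate carries the count as well as the sum.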
Holistic Aggregate Functions
•  There is no constant bound on the size of
the storage needed to describe a sub-aggregate.
–  There is no constant M such that an M-tuple
characterizes the computation
F({X_{i,j} | i = 1, ..., I}).
•  Examples: Median(), MostFrequent() (also
called Mode()), and Rank()
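
To see why Median() resists a fixed-size sub-aggregate, the sketch below compares the median of per-partition medians with the true median. The helper median() is a hypothetical name, and taking the lower median for even n is an assumption made only to keep the example simple.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Three-way comparison of two ints for qsort(). */
static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Median of n values (lower median when n is even); sorts a private copy. */
int median(const int *x, size_t n) {
    assert(n > 0);
    int *tmp = malloc(n * sizeof *tmp);
    assert(tmp != NULL);
    memcpy(tmp, x, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_int);
    int m = tmp[(n - 1) / 2];
    free(tmp);
    return m;
}
```

With partitions {1, 2, 3, 4, 5}, {10}, and {20}, the per-partition medians are 3, 10, and 20, whose median is 10, while the median of all seven values is 4. A single number per partition is therefore not enough to recover the global result.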

Index Requirements in OLAP
•  Data is read only
–  (Almost) no insertion or deletion
•  Query types
–  Point query: looking up one specific tuple (rare)
–  Range query: returning the aggregate of a
(large) set of tuples, with group by
–  Complex queries: need specific algorithms and
index structures, will be discussed later

OLAP Query Example
•  In table (cust, gender, …), find the total
number of male customers
•  Method 1: scan the table once
•  Method 2: build a B+ tree index on attribute
gender, still need to access all tuples of
male customers
•  Can we get the count without scanning
many tuples, not even all the tuples of male
customers?
Bitmap Index
•  For n tuples, a bitmap index has n bits and
can be packed into ⌈n/8⌉ bytes or ⌈n/32⌉
32-bit words
•  From a bit to the row-id: the j-th bit of the p-th
byte → row-id = p*8 + j

cust    gender  …    bitmap for gender = M
Jack    M       …    1
Cathy   F       …    0
…       …       …    …
Nancy   F       …    0
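
A minimal sketch of building such a bitmap from a gender column follows. The function names are hypothetical, and numbering bit j from the least-significant bit of each byte is an assumption; the slides only fix the relation row-id = p*8 + j.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes needed for an n-bit bitmap: ceil(n / 8). */
size_t bitmap_bytes(size_t n) { return (n + 7) / 8; }

/* Build the bitmap for "gender == value". Bit j of byte p stands for
 * row-id p*8 + j (assumption: j = 0 is the least-significant bit).
 * `bits` must hold bitmap_bytes(n) bytes. */
void bitmap_build(const char *gender, size_t n, char value, uint8_t *bits) {
    memset(bits, 0, bitmap_bytes(n));
    for (size_t row = 0; row < n; row++)
        if (gender[row] == value)
            bits[row / 8] |= (uint8_t)(1u << (row % 8));
}

/* Map (byte p, bit j) back to a row-id. */
size_t bitmap_row_id(size_t p, unsigned j) {
    assert(j < 8);
    return p * 8 + j;
}

/* Return 1 if the bit for `row` is set, else 0. */
int bitmap_test(const uint8_t *bits, size_t row) {
    return (bits[row / 8] >> (row % 8)) & 1;
}
```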
Using Bitmap to Count
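•  shcount[] is a lookup table: shcount[v] holds the
number of 1 bits in the value v
–  Example: shcount[01100101] = 4
•  B[] is the bitmap, viewed as an array of SHNUM fixed-width chunks

/* Sum the precomputed 1-bit counts of all chunks of the bitmap. */
count = 0;
for (i = 0; i < SHNUM; i++)
    count += shcount[B[i]];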
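
The loop above can be fleshed out into a self-contained sketch. It assumes 16-bit chunks, so the table has 65,536 entries; the chunk width and the names shcount_init and bitmap_count are assumptions, not something the slides specify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t shcount[65536]; /* shcount[v] = number of 1 bits in v */

/* Fill the table once: popcount(v) = popcount(v >> 1) + lowest bit of v. */
void shcount_init(void) {
    shcount[0] = 0;
    for (uint32_t v = 1; v < 65536; v++)
        shcount[v] = (uint8_t)(shcount[v >> 1] + (v & 1));
}

/* Count the 1 bits in a bitmap stored as shnum 16-bit chunks. */
size_t bitmap_count(const uint16_t *B, size_t shnum) {
    assert(B != NULL || shnum == 0);
    size_t count = 0;
    for (size_t i = 0; i < shnum; i++)
        count += shcount[B[i]];
    return count;
}
```

The table costs 64 KB and is built once; after that, counting touches one table entry per 16 bits of the bitmap rather than one tuple per bit.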

Advantages of Bitmap Index
•  Efficient in space
•  Ready for logic composition
–  C = C1 AND C2
–  Bitwise operations on the bitmaps can be used
•  Limitation: a bitmap index only works for categorical data
with low cardinality
–  Naively, we need 50 bits per entry to represent
the state of a customer in the US
–  How to represent a sale in dollars?
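
The logic composition C = C1 AND C2 can be sketched as a word-by-word bitwise AND followed by a bit count. The 32-bit word size and all function names here are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* out = c1 AND c2, word by word; all three bitmaps have nwords words. */
void bitmap_and(const uint32_t *c1, const uint32_t *c2,
                uint32_t *out, size_t nwords) {
    assert(c1 != NULL && c2 != NULL && out != NULL);
    for (size_t i = 0; i < nwords; i++)
        out[i] = c1[i] & c2[i];
}

/* Count the 1 bits in a word by clearing the lowest set bit each pass. */
static unsigned popcount32(uint32_t w) {
    unsigned c = 0;
    while (w) {
        w &= w - 1;
        c++;
    }
    return c;
}

/* Number of tuples selected by a bitmap of nwords words. */
size_t bitmap_count32(const uint32_t *b, size_t nwords) {
    size_t count = 0;
    for (size_t i = 0; i < nwords; i++)
        count += popcount32(b[i]);
    return count;
}
```

A conjunctive condition over n tuples is thus answered with about n/32 AND operations and no access to the base table.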
Bit-Sliced Index
•  A sale amount can be written as an integer
number of pennies, and then be
represented as a binary number of N bits
–  24 bits are enough for amounts up to $167,772.15,
appropriate for many stores
•  A bit-sliced index is N bitmaps
–  Tuple j's bit is set in bitmap k if the k-th bit in its binary
representation is on
–  The space cost of a bit-sliced index is the same
as storing the data directly
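
One way to build such an index is sketched below. The flat layout slices[k * nwords + w], the 32-bit words, and the names bsi_build and bsi_get are assumptions chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NBITS 24 /* number of slices; covers up to $167,772.15 in pennies */

/* slices holds NBITS bitmaps of nwords 32-bit words each, laid out as
 * slices[k * nwords + w]. Tuple j's bit is set in slice k if and only if
 * bit k of pennies[j] is on. */
void bsi_build(const uint32_t *pennies, size_t n,
               uint32_t *slices, size_t nwords) {
    assert(nwords * 32 >= n);
    memset(slices, 0, (size_t)NBITS * nwords * sizeof *slices);
    for (size_t j = 0; j < n; j++) {
        assert(pennies[j] < (1u << NBITS));
        for (unsigned k = 0; k < NBITS; k++)
            if ((pennies[j] >> k) & 1u)
                slices[k * nwords + j / 32] |= 1u << (j % 32);
    }
}

/* Read tuple j's value back by collecting its bit from every slice. */
uint32_t bsi_get(const uint32_t *slices, size_t nwords, size_t j) {
    uint32_t v = 0;
    for (unsigned k = 0; k < NBITS; k++)
        if ((slices[k * nwords + j / 32] >> (j % 32)) & 1u)
            v |= 1u << k;
    return v;
}
```

Since every value can be read back from the slices, the index holds the same N bits per tuple as the raw column, only transposed.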

Using Indexes
SELECT SUM(sales) FROM Sales WHERE C;
–  Tuples satisfying C are identified by a bitmap B
•  Direct access to rows to calculate SUM:
scan the whole table once
•  B+ tree: find the tuples from the tree
•  Projection index: only scan attribute sales
•  Bit-sliced index: get the sum from
Σ_k COUNT(B AND B_k) * 2^k, where B_k is the k-th slice
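
The bit-sliced sum can be sketched as follows: for each slice, count the selected tuples that have that bit on, and weight the count by the bit's place value. The slice layout slices[k * nwords + w], the 32-bit words, and the name bsi_sum are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the 1 bits in a word by clearing the lowest set bit each pass. */
static unsigned popcount32(uint32_t w) {
    unsigned c = 0;
    while (w) {
        w &= w - 1;
        c++;
    }
    return c;
}

/* SUM over the tuples selected by bitmap B:
 *   sum = sum over k of COUNT(B AND B_k) * 2^k
 * slices[k * nwords + w] is word w of slice B_k. */
uint64_t bsi_sum(const uint32_t *slices, unsigned nbits,
                 const uint32_t *B, size_t nwords) {
    assert(nbits <= 32);
    uint64_t sum = 0;
    for (unsigned k = 0; k < nbits; k++) {
        uint64_t cnt = 0;
        for (size_t w = 0; w < nwords; w++)
            cnt += popcount32(B[w] & slices[k * nwords + w]);
        sum += cnt << k; /* cnt * 2^k */
    }
    return sum;
}
```

For example, with the values 5, 3, 6, 7 stored in three slices and B selecting tuples 0, 2, and 3, the per-slice counts are 2, 2, and 3, giving 2*1 + 2*2 + 3*4 = 18 = 5 + 6 + 7.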

Cost Comparison
•  Traditional value-list index (B+ tree) is costly
in both I/O and CPU time
–  Not good for OLAP
•  Bit-sliced index is efficient in I/O
•  Other case studies in [O'Neil and Quass,
SIGMOD'97]

Horizontal or Vertical Storage
•  A fact table for data warehousing is often fat
–  Tens or even hundreds of dimensions/attributes
•  A query is often about only a few attributes
•  Horizontal storage: tuples are stored one by one
•  Vertical storage: tuples are stored by attributes

Example table (stored row by row in horizontal storage, column by column in vertical storage):

A1   A2   …   A100
x1   x2   …   x100
…    …    …   …
z1   z2   …   z100

Horizontal Versus Vertical
•  Find the information of tuple t
–  Typical in OLTP
–  Horizontal storage: get the whole tuple in one search
–  Vertical storage: search 100 lists
•  Find SUM(a100) GROUP BY {a22, a83}
–  Typical in OLAP
–  Horizontal storage (no index): search all tuples O(100n),
where n is the number of tuples
–  Vertical storage: search 3 lists O(3n), 3% of the
cost of the horizontal storage method
•  Projection index: vertical storage
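
The difference in access pattern can be sketched for a plain SUM over one attribute; the GROUP BY is left out to keep the example short. The row-major array layout and the function names are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define NATTR 100 /* attributes per tuple, as in the example above */

/* Horizontal storage: row-major, tuple t's attribute a is
 * rows[t * NATTR + a]. Summing one attribute strides over the other 99. */
long sum_horizontal(const int *rows, size_t n, size_t a) {
    assert(a < NATTR);
    long s = 0;
    for (size_t t = 0; t < n; t++)
        s += rows[t * NATTR + a];
    return s;
}

/* Vertical storage: each attribute is its own contiguous list, so only
 * the n values of the requested attribute are read. */
long sum_vertical(const int *column, size_t n) {
    long s = 0;
    for (size_t t = 0; t < n; t++)
        s += column[t];
    return s;
}
```

Both functions return the same result; the horizontal version pulls all 100 attributes of each tuple through memory or disk pages, while the vertical version reads only the list it needs.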
