
Unit #1 - Data Warehouse and Data Mining

Data Warehouse and Data Mining Slides (Prof. Dr. M. S. Memon , Department of CSE , QUEST Nawabshah)

Data Warehouse and Data Mining
Prof. Dr. M. S. Memon
sulleman@quest.edu.pk

M.S. Memon
05/20/23 Department of CSE, QUEST 1
Outline
• Introduction to Data Warehouse

• Data Warehouse versus Operational Database

• OLTP vs. DW

• Applications of DW

Source: www.stonebridgegroup.com
Data Warehouse
• Purpose of the Data Warehouse
– Realize the value of the data
• Data / Information is an asset
• Data / Information can be sold
• Methods to realize the value – Reporting, Analysis, Data Mining, etc.
• Make better decisions
– Turn data into information
– Create competitive advantages
– Methods to support the decision-making process – DSS, etc.

Why data warehouse?
• Bad decisions can lead to disasters
– Data Warehousing is at the base of decision support
systems
• Data warehousing is a data-driven decision-support system
• Data warehousing helps to
– Understand the information hidden within the
organization’s data
• See data from different angles: product, client, time,
geographical area
• Get a glimpse of the future.
Why data warehouse?
• DBMS Approach
– List of all items that were sold last month?
– List of all items purchased by Naeem?
– The total sales of the last month grouped by branch?
– How many sales transactions occurred during the month of January?
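Questions of this kind map directly onto plain SQL. The following is a minimal sketch using Python's built-in sqlite3 module; the `sales` table, its columns and the sample rows are assumptions made for illustration and do not come from the slides.

```python
import sqlite3

# Hypothetical single-table schema; the slides do not define one.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (item TEXT, customer TEXT, branch TEXT, "
    "sale_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        ("pen", "Naeem", "North", "2023-01-05", 2.0),
        ("book", "Naeem", "South", "2023-01-17", 15.0),
        ("lamp", "Sara", "North", "2023-01-20", 30.0),
        ("pen", "Sara", "South", "2023-02-02", 2.0),
    ],
)

# List of all items purchased by Naeem.
naeem_items = [row[0] for row in conn.execute(
    "SELECT DISTINCT item FROM sales WHERE customer = 'Naeem' ORDER BY item"
)]

# Total sales of January grouped by branch.
january_by_branch = dict(conn.execute(
    "SELECT branch, SUM(amount) FROM sales "
    "WHERE sale_date BETWEEN '2023-01-01' AND '2023-01-31' "
    "GROUP BY branch"
))

# Number of sales transactions that occurred in January.
january_count = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE strftime('%m', sale_date) = '01'"
).fetchone()[0]
```

Each query touches one table with a simple filter or grouping, which is the kind of work an operational DBMS handles well.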

Why data warehouse?
• Intelligent Enterprise
– Which items sell together? Which items to stock?
– Where and how to place the items? What discounts to offer?
– How best to target customers to increase sales at a branch?
– Which customers are most likely to respond to the next promotional campaign, and why?
Why data warehouse?
• Businesses want much more …
– What happened?
– Why did it happen?
– What will happen?
– What is happening?
– What do you want to happen?

What is Data warehouse?
• Basically a very large database…
– Not all very large databases are data warehouses, but all data warehouses are pretty large databases
– Nowadays a warehouse is considered to start at around 800 GB and goes up to several TB
– It spans over several servers and needs an impressive amount of computing power

What is Data warehouse?
• More specifically, a collective data repository
– Containing snapshots of the operational data (history)
– Obtained through data cleansing and ETL (Extract-Transform-Load)
– Useful for analytics

What is Data warehouse?
• Compared to other solutions it…
– Is suitable for tactical/strategic focus
– Implies a small number of transactions
– Implies large transactions spanning over a long period of time

Definition
• Ralph Kimball: “a copy of transaction data
specifically structured for query and analysis”

Data Warehouse (definitions)
• Used for decision making, Duplicates existing
data, Combination of hardware, specialized
software and data – Dyche
• A copy of transaction data specifically structured
for query and analysis – Kimball
• A single, complete and consistent store of data
obtained from a variety of different sources made
available to end users in a way that can be
understood and used in business context – Barry
Devlin
Data Warehouse (definitions)
• A data warehouse is a database where data is
collected for the purpose of being analyzed

• A data warehouse is used to help people make better decisions

• A data warehouse is defined by the use to which it is put, not its underlying architecture
Attributes of DWH
• Bill Inmon (father of data warehousing, in
1993):
A Data Warehouse is a:
• subject oriented
• integrated
• non-volatile
• time-variant
collection of data in support of management’s decisions

Data Warehouse
• Subject oriented: Data is arranged by subject
area rather than by application. Data is
organized so that all the data elements relating
to the same real-world event or object are
linked together

– Typical subject areas in DWs are Customer, Product, Order, Claim, Account,…

Data Warehouse
• Subject oriented:
– Example: customer as subject in a DW
• DW is organized in this case by the customer
• It may consist of 10, 100 or more physical tables, all
related

Data Warehouse
• Integrated: Data is collected from multiple, diverse sources across an organization's operational systems and stored in a consistent form
– E.g. gender, measurement, conflicting keys, consistency

Data Warehouse
• Non-volatile: Data in the data warehouse is never
over-written or deleted - once committed, the
data is static, read-only, and retained for future
reporting. Data is loaded, but not updated
– When subsequent changes occur, a new snapshot record
is written.

Data Warehouse
• Time-variant: The changes to the data in the
data warehouse are tracked and recorded so
that reports can be produced showing changes
over time.
– Different environments have different time horizons associated with them
• While for operational systems a 60-to-90 day time horizon is normal, a data warehouse has a 5-to-10 year horizon
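The non-volatile and time-variant properties can be illustrated together. The following is a minimal sketch, assuming a hypothetical customer history kept as an append-only list of snapshot records; the field names and sample values are illustrative only.

```python
from datetime import date

# Append-only history: a change never overwrites a row, it adds a snapshot.
customer_history = []


def record_snapshot(customer_id, city, as_of):
    """Append a new snapshot record; existing records are never modified."""
    customer_history.append(
        {"customer_id": customer_id, "city": city, "as_of": as_of}
    )


def city_as_of(customer_id, day):
    """Return the city valid on `day`, or None if no snapshot exists yet."""
    rows = [
        r for r in customer_history
        if r["customer_id"] == customer_id and r["as_of"] <= day
    ]
    return max(rows, key=lambda r: r["as_of"])["city"] if rows else None


record_snapshot(1, "Nawabshah", date(2020, 1, 1))
record_snapshot(1, "Karachi", date(2022, 6, 1))  # customer moved: new row
```

Because old snapshots are retained, a report can ask what was true at any past date, which an operational table holding only the current state cannot answer.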

General Definition
• More generally, a DW is a
– Repository of an organization’s electronically stored data
– Designed to facilitate reporting and analysis

General Definition
A complete repository of historical corporate data
extracted from transaction systems that is available
for ad-hoc access by knowledge workers
• Transaction Systems
– Management Information System (MIS)
• Ad-hoc access
– Does not have a certain access pattern
– Queries not known in advance
– Difficult to write SQL in advance
• Knowledge workers
– Typically NOT IT literate (Executives, Analysts, Managers)
Data Warehousing
• A paradigm specifically designed for
strategic business information or decision
making

• Data warehousing is a data-driven decision-support system

Typical Features
• DW typically…
– Reside on computers dedicated to this function
– Run on DBMS such as Oracle, IBM DB2, Teradata or
Microsoft SQL Server
– Retain data for long periods of time
– Consolidate data obtained from a variety of sources
– Are built around their own carefully designed data
model

What can be warehoused?
• Customer records
• Customer purchases
• Click stream, web traffic
• Product records
• Product purchase records
• Inventory movement

How does it work?
[Figure: request cycle between business users and IT people. Business user needs info → user requests go to IT people → IT people do system analysis and design → IT people create reports → IT people send reports to business user → business user may get answers → answers result in more questions]

Data Warehouse vs. Operational Database

Data Warehouse          Operational Database
• Subject oriented      • Application oriented
• Integrated            • Multiple diverse sources
• Non-volatile          • Updateable
• Time-variant          • Real-time, current
On Line Transaction Processing
• OLTP (OnLine Transaction Processing):
– Also referred to as operational data, it represents day-to-day operational business activities:
• Purchasing, sales, production, distribution, …
– Typically for data entry and retrieval transaction
processing
– Reflects only the current state of the data

On Line Analytical Processing
• OLAP (OnLine Analytical Processing):
– Represents front-end analytics based on a DW
repository
– It provides information for activities like:
• Resource planning, capital budgeting, marketing initiatives,...
– It is decision oriented

OLTP vs. DW
• Properties
Operational DB             DW
Mostly updates             Mostly reads
Many small transactions    Queries long, complex
MB-TB of data              GB-PB of data
Raw data                   Summarized data
Clerical users             Decision makers
Up-to-date data            May be slightly outdated

OLTP vs. DW
                     OLTP                              Data Warehouse
users                clerk, IT professional            knowledge worker
function             day to day operations             decision support
DB design            application-oriented              subject-oriented
data                 current, up-to-date, detailed,    historical, summarized,
                     flat relational, isolated         multidimensional, integrated,
                                                       consolidated
usage                repetitive                        ad-hoc
access               read/write,                       lots of scans
                     index/hash on prim. key
unit of work         short, simple transaction         complex query
# records accessed   tens                              millions
# users              thousands                         hundreds
DB size              100MB-GB                          100GB-TB
metric               transaction throughput            query throughput, response

OLTP vs. DW
• Consider a normalized database for a store; the tables would look like:

OLTP vs. DW
• DW for that store would start by building the
following star schema:
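The schema diagram from the slide is not reproduced in this text. As a purely hypothetical illustration of what a star schema for a store might look like, the sketch below builds one fact table surrounded by three dimension tables in an in-memory SQLite database; every table name, column and row is an assumption, not the author's design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the "who / what / where / when".
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

-- The central fact table holds measures plus one key per dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    quantity   INTEGER,
    amount     REAL
);

INSERT INTO dim_product VALUES (1, 'pen', 'stationery'), (2, 'lamp', 'home');
INSERT INTO dim_store   VALUES (1, 'Nawabshah'), (2, 'Karachi');
INSERT INTO dim_date    VALUES (1, 2023, 1), (2, 2023, 2);
INSERT INTO fact_sales  VALUES (1, 1, 1, 10, 20.0), (2, 1, 1, 1, 30.0),
                               (1, 2, 2, 5, 10.0);
""")

# A typical analytical query: join the fact table to one dimension and
# aggregate a measure.
revenue_by_category = dict(conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
"""))
```

Every analytical query follows the same shape (fact table joined to the dimensions of interest), which is what makes the star layout convenient for ad-hoc analysis.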

OLTP vs. DW
• Basic insights from comparing OLTP and DWs
– A DW is a separate (RDBMS) installation that contains
copies of data from on-line systems
• Physically separate hardware may not be absolutely necessary
if one has lots of extra computing power, but it is
recommended
– With an optimistic locking DBMS one might even be
able to get away for a while with keeping just one copy
of its data

OLTP vs. DW
• There is an essentially different pattern of
hardware utilization between on-line and
analytical processing

Applications of DW
• Typical questions which can be answered with
DW & OLAP
– How much did sales unit A earn in January?
– How much did sales unit B earn in February?
– What was their combined sales amount for the first quarter?
• Answering these questions with SQL-queries is
difficult
– Complex query formulation necessary
– Process is likely to be slow due to complex joins and
multiple scans

Applications of DW
• Why can such questions be answered better with a DW?
– Because in a DW tables are rearranged and pre-aggregated (known as computing cubes)
• The table arrangement is subject oriented, usually a star schema
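The effect of pre-aggregation can be sketched in a few lines. The example below assumes hypothetical detail rows of the form (sales unit, month, amount); it aggregates them once, as would happen at load time, and then answers the sales-unit questions from the small aggregate instead of rescanning the detail data.

```python
from collections import defaultdict

# Hypothetical detail rows: (sales_unit, month, amount).
sales = [
    ("A", 1, 100.0), ("A", 1, 50.0), ("B", 2, 70.0),
    ("A", 3, 30.0), ("B", 3, 20.0), ("A", 4, 999.0),
]

# Pre-aggregate once per (unit, month); in a DW this happens at load time.
monthly = defaultdict(float)
for unit, month, amount in sales:
    monthly[(unit, month)] += amount


def quarter_total(units, quarter):
    """Combined sales of the given units for one quarter (1 to 4)."""
    months = range(3 * (quarter - 1) + 1, 3 * quarter + 1)
    return sum(monthly.get((u, m), 0.0) for u in units for m in months)


unit_a_january = monthly[("A", 1)]
unit_b_february = monthly[("B", 2)]
combined_q1 = quarter_total(["A", "B"], 1)
```

The queries become dictionary lookups over at most a handful of cells, at the cost of keeping the aggregate up to date whenever new data is loaded.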

Applications of DW
• A DW is the base repository for front-end analytics
– OLAP
– KDD
– Data visualization
– Reporting

KDD (Knowledge Discovery in Databases): a data mining process

Applications of DW
• OLAP is a form of information processing and thus
needs to provide timely, accurate and
understandable information
– Timely is, however, a relative term:
• In OLTP one expects an update to go through in a matter of
seconds
• In OLAP the time to answer a query can take minutes, hours
or even longer
• There are many flavors of OLAP
– ROLAP, DOLAP, MOLAP, WOLAP, HOLAP,…

Applications of DW
– Data mining might return the following set of rules for
customers spending more than €100:
• IF AGE > 35 AND CAR = ‘MINIVAN’ THEN TOTAL SPENT > €100
• IF SEX = ‘M’ AND ZIP = 38106 THEN TOTAL SPENT > €100
– It answers questions like
• Which products or customers are more profitable
• Which outlets have sold the least this year
– In consequence it motivates decisions like
• Which products should have their production increased
• Which customers should be targeted for special promotions
• Which outlets should be closed
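A rule such as the first one above can be checked against customer records. The sketch below is only an illustration: the customer rows are invented, and the support and confidence measures it computes are the usual association-rule metrics, which the slides themselves do not name.

```python
# Hypothetical customer records; field names are assumptions.
customers = [
    {"age": 40, "car": "MINIVAN", "total_spent": 150},
    {"age": 50, "car": "MINIVAN", "total_spent": 120},
    {"age": 45, "car": "MINIVAN", "total_spent": 80},
    {"age": 30, "car": "SEDAN", "total_spent": 200},
]


def rule_stats(rows, condition, conclusion):
    """Return (support, confidence) of the rule IF condition THEN conclusion."""
    matching = [r for r in rows if condition(r)]
    hits = [r for r in matching if conclusion(r)]
    support = len(hits) / len(rows)
    confidence = len(hits) / len(matching) if matching else 0.0
    return support, confidence


# IF AGE > 35 AND CAR = 'MINIVAN' THEN TOTAL SPENT > 100
support, confidence = rule_stats(
    customers,
    condition=lambda r: r["age"] > 35 and r["car"] == "MINIVAN",
    conclusion=lambda r: r["total_spent"] > 100,
)
```

Here the rule holds for two of the three customers it applies to, so a mining algorithm would report it only if that confidence clears whatever threshold the analyst has set.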
DW User
• Users of DW are called DSS analysts and usually
are business persons
– Their primary job is to define and discover information
used in corporate decision-making
– The way they think
• “Give me what I say I want, and then I can tell you what I really
want”
• They work in explorative manner

DW User
– Typical explorative line of work
• “Ah! Now that I see what the possibilities are, I can tell what I
really want to see. But until I know what the possibilities are, I
cannot describe exactly what I want...”

– This usage has a profound effect on the way a DW is developed
• The classical system development life cycle assumes that the
requirements are known at the start of design
• The DSS analyst starts with existing requirements, but
factoring in new requirements is almost impossible

Lifecycle of Data warehouse

Outline
• Lifecycle of DW

• Classical SDLC vs. DW SDLC

• Operating DW

Lifecycle of DW

DW System Development Life Cycle (SDLC)
• Design
– End-user interview cycles
– Source system cataloging
– Definition of key performance indicators
– Mapping of decision-making processes underlying
information needs
– Logical and physical schema design

Lifecycle of DW
• Prototype
– Objective is to constrain and in some cases reframe
end-user requirements
• Deployment
– Development of documentation
– Training
– Operations and management processes
• Operation
– Day-to-day maintenance of the DW needs good management of the ongoing Extraction, Transformation and Loading (ETL) process
Lifecycle of DW
• Enhancement needs the modification of
– HW - physical components
– Operations and management processes
– Logical schema designs

Lifecycle of DW

• Classical SDLC vs. DW SDLC

• DW SDLC is almost the opposite of classical SDLC
Lifecycle of DW
• Classical SDLC vs. DW SDLC

• Because it is the opposite of SDLC, DW SDLC is also called CLDS
Lifecycle of DW
• CLDS is a data driven development life cycle
• It starts with data
– Once data is at hand it is integrated and tested against
bias
– Programs are written against the data and the results
are analyzed and finally the requirements of the
system are understood
– Once requirements are understood, adjustments are
made to the design and the cycle starts all over
• “spiral development methodology”
Operating a DW
• In Operating a DW the following phases can be
identified
– Monitoring
– Extraction
– Transforming
– Loading
– Analyzing

Operating a DW: Monitoring
• Monitoring
– Surveillance of the data sources
– Identification of data modification which is relevant to the
DW
– Monitoring has an important role over the whole process
deciding on which data the next steps will be applied on
• Monitoring techniques
– Active mechanisms - Event Condition Action (ECA)
rules:

Operating a DW: Monitoring
• Monitoring techniques
– Replication mechanisms
• Snapshot:
– Local copy of data, similar to a View
– Used by Oracle 9i
• Data replication
– Replicates and maintains data in destination tables through data
propagation processes
– Used by IBM

Operating a DW: Monitoring
• Monitoring techniques
– Protocol based mechanisms
• Since DBMS write protocol data for transaction management,
the protocol can be used also for monitoring
• Difficult due to the fact that the protocol format is proprietary and
subject to change
– Application managed mechanisms
• Hard to implement for legacy systems
• Based on time stamping or data comparison
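The two application-managed techniques named above, time stamping and data comparison, can be sketched as follows. This is a minimal illustration assuming source rows are available as Python dictionaries; the `modified_at` field and the keyed snapshot layout are hypothetical.

```python
from datetime import datetime


def changed_since(rows, last_run):
    """Time stamping: keep rows modified after the previous monitoring run."""
    return [r for r in rows if r["modified_at"] > last_run]


def diff_snapshots(old, new):
    """Data comparison: classify changes between two {key: row} snapshots."""
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {
        k: new[k] for k in new.keys() & old.keys() if new[k] != old[k]
    }
    return inserted, updated, deleted


rows = [
    {"id": 1, "modified_at": datetime(2023, 5, 19, 8, 0)},
    {"id": 2, "modified_at": datetime(2023, 5, 20, 9, 30)},
]
recent = changed_since(rows, last_run=datetime(2023, 5, 20, 0, 0))

inserted, updated, deleted = diff_snapshots(
    old={1: "a", 2: "b", 3: "c"},
    new={1: "a", 2: "B", 4: "d"},
)
```

Time stamping needs the source application to maintain a modification time, which is why it is hard to retrofit onto legacy systems; snapshot comparison needs no source changes but must read the full data set on every run.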

Operating a DW: Extraction
• Extraction
– Reads the data which was selected throughout the
monitoring phase and inserts it in the data structures of
the workplace
– Due to large data volume, compression can be used
– The time-point for performing extraction can be:
• Periodical:
– Weather or stock market information can be updated several times a day, while product specifications can be updated over a longer period of time
• On request:
– For example when a new item is added to a product group

Operating a DW: Extraction
• Extraction
– The time-point for performing extraction can be:
• Event driven:
– Event-driven extraction is helpful in scenarios where a point in time, or the number of modifications exceeding a specified threshold, triggers the extraction. For example, an extraction is performed each night at 03:00 or each time 50 new modifications have taken place
• Immediate:
– In some special cases like the stock market it can be necessary
that the changes propagate immediately to the warehouse
– The extraction largely depends on the hardware and the software used for the DW and the data source
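The threshold variant of event-driven extraction can be sketched as a small counter that fires a callback. The class name, the callback signature and the way modifications are reported are all assumptions made for this illustration.

```python
class ChangeCounterTrigger:
    """Run an extraction each time `threshold` modifications have accumulated."""

    def __init__(self, threshold, extract):
        self.threshold = threshold
        self.extract = extract  # hypothetical callback doing the real work
        self.pending = 0

    def record_modification(self):
        """Call once per relevant source modification."""
        self.pending += 1
        if self.pending >= self.threshold:
            self.extract(self.pending)
            self.pending = 0


extraction_log = []
trigger = ChangeCounterTrigger(threshold=50, extract=extraction_log.append)

# Simulate 120 modifications arriving from the monitored source.
for _ in range(120):
    trigger.record_modification()
```

After 120 modifications the extraction has run twice and 20 modifications are still waiting; a time-based trigger (for example nightly at 03:00) would normally be left to an external scheduler rather than coded this way.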
Operating a DW: Transforming
• Transforming
– Implies adapting data, schema as well as data quality
to the application requirements
– Data integration:
• Transformation in de-normalized data structures
• Handling of key attributes
• Adaptation of different types of the same data
• Conversion of encoding:
– “Buy”, “Sell” → 1, 2 vs. B, S → 1, 2
• Normalization:
– “Michael Schumacher” → “Michael, Schumacher” vs. “Schumacher Michael” → “Michael, Schumacher”
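The encoding conversion and name normalization above can be sketched as two small functions. This is a simplified illustration: the mapping table and the `family_name_first` flag are assumptions, and the name handling only covers two-part names like the one on the slide.

```python
# Two source systems encode the same attribute differently; both are
# mapped onto one warehouse encoding.
ENCODINGS = {"Buy": 1, "Sell": 2, "B": 1, "S": 2}


def unify_code(value):
    """Map a source-specific transaction code to the warehouse code."""
    return ENCODINGS[value]


def normalize_name(raw, family_name_first=False):
    """Bring a two-part name into the 'Given, Family' form used on the slide."""
    parts = raw.split()
    if family_name_first:
        # Rotate 'Family Given' into 'Given Family'.
        parts = parts[1:] + parts[:1]
    return ", ".join(parts)


codes = [unify_code(v) for v in ["Buy", "S", "B", "Sell"]]
name_a = normalize_name("Michael Schumacher")
name_b = normalize_name("Schumacher Michael", family_name_first=True)
```

The source system has to be known for each record, since nothing in the string itself says which name order it uses.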
Operating a DW: Transforming
• Transforming
– Data integration:
• Date handling:
– “MM-DD-YYYY” → “MM.DD.YYYY”
• Measurement units and scaling:
– 10 inch → 25.4 cm
– 30 mph → 48.28 km/h
• Save calculated values
– Price_incl_VAT = Price_excl_VAT * 1.19
• Aggregation
– Daily sums can be added into weekly ones
– Different levels of granularity can be used
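The transformations listed on this slide can be sketched with the Python standard library. The function names and the sample daily figures are assumptions; the 19% VAT rate is taken from the factor 1.19 on the slide, and weeks are assumed to be ISO calendar weeks.

```python
from collections import defaultdict
from datetime import date, datetime


def reformat_date(text):
    """Date handling: 'MM-DD-YYYY' to 'MM.DD.YYYY'."""
    return datetime.strptime(text, "%m-%d-%Y").strftime("%m.%d.%Y")


def inch_to_cm(inches):
    return inches * 2.54


def mph_to_kmh(mph):
    return mph * 1.609344


def price_incl_vat(price_excl_vat, rate=0.19):
    """Calculated value stored with the record, rounded to cents."""
    return round(price_excl_vat * (1 + rate), 2)


def weekly_sums(daily):
    """Aggregation: add daily sums into (ISO year, ISO week) buckets."""
    weekly = defaultdict(float)
    for day, amount in daily.items():
        year, week, _ = day.isocalendar()
        weekly[(year, week)] += amount
    return dict(weekly)


weekly = weekly_sums({
    date(2023, 5, 15): 10.0,  # Monday
    date(2023, 5, 16): 5.0,   # Tuesday, same week
    date(2023, 5, 22): 7.0,   # Monday of the following week
})
```

Storing the weekly sums is a choice of granularity: coarser levels answer summary queries faster but cannot be drilled back down to days unless the daily values are kept as well.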

Operating a DW: Transforming
• Transforming
– Data cleaning:
• Consistency check
– E.g. flag records where Delivery_date < Order_date
• Completeness
– Management of missing values as well as NULL values
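Both cleaning checks can be sketched as a validation function that returns a list of problems per record. The record layout and the set of required fields are assumptions for this illustration.

```python
from datetime import date

REQUIRED_FIELDS = ("order_id", "order_date", "delivery_date")  # assumed


def check_order(order):
    """Return a list of data-quality problems found in one order record."""
    problems = []

    # Completeness: required fields must be present and not None (NULL).
    missing = [f for f in REQUIRED_FIELDS if order.get(f) is None]
    problems.extend(f"missing {f}" for f in missing)

    # Consistency: a delivery cannot precede its order.
    if not missing and order["delivery_date"] < order["order_date"]:
        problems.append("delivery_date before order_date")

    return problems


good = {"order_id": 1, "order_date": date(2023, 5, 1),
        "delivery_date": date(2023, 5, 4)}
inconsistent = {"order_id": 2, "order_date": date(2023, 5, 10),
                "delivery_date": date(2023, 5, 8)}
incomplete = {"order_id": 3, "order_date": date(2023, 5, 10),
              "delivery_date": None}
```

What happens to a flagged record (reject, repair, or load with a quality marker) is a policy decision that the slides leave open.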

Operating a DW: Loading
• Loading
– Loading usually takes place during weekends or nights
when the system is not under user stress
– Split between initial load to initialize the DW and the
periodical load to keep the DW updated
– Initial loading
• Implies big volumes of data and for this reason a bulk loader is
used
– Usually performed by partitioning, parallelization and incremental updates
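The split between an initial load and a periodic load can be sketched with sqlite3. This is a toy stand-in for a real bulk loader: the `fact_sales` table, the batch size and the rule "load rows with a higher sale_id than already stored" are assumptions.

```python
import sqlite3


def bulk_load(conn, rows, batch_size=2):
    """Insert rows in batches, committing once per batch."""
    for start in range(0, len(rows), batch_size):
        with conn:  # commits the batch, or rolls it back on error
            conn.executemany(
                "INSERT INTO fact_sales (sale_id, amount) VALUES (?, ?)",
                rows[start:start + batch_size],
            )


def incremental_load(conn, source_rows):
    """Periodic load: add only rows newer than what is already stored."""
    last_id = conn.execute(
        "SELECT COALESCE(MAX(sale_id), 0) FROM fact_sales"
    ).fetchone()[0]
    new_rows = [r for r in source_rows if r[0] > last_id]
    bulk_load(conn, new_rows)
    return len(new_rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, amount REAL)")

source = [(1, 10.0), (2, 20.0), (3, 30.0)]
bulk_load(conn, source)          # initial load
source.append((4, 40.0))         # the source grows afterwards
added = incremental_load(conn, source)
total_rows = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
```

Batching keeps each transaction short; a production loader would typically also load partitions in parallel, which this single-connection sketch does not attempt.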

Operating a DW: Analyzing
• Analyze
– Data access
• Useful for extracting goal oriented information:
– How many iPhone 3G units were sold in the Braunschweig stores of T-Mobile in the last 3 calendar weeks of 2008?
– Although it is a common OLTP query, it might be too complex for the operational environment to handle
– OLAP
• Often mistakenly equated with the DW itself, because it is used to analyze the data contained in a DW
• Used to answer requests like:
– In which district does a product group register the highest profit?
– How did the profit change in comparison to the previous month?
Operating a DW: Analyzing
• Analyze
– OLAP
• Mostly known as being organized on a multidimensional data model
• Common analysis operations are:
– Pivoting/Rotation
– Roll-up, Drill-down and Drill-across
– Slice and Dice
– Data mining
• Useful for identifying hidden patterns
• Refers to two separate processes:
– KDD (Knowledge Discovery in Databases)
– Prediction
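The OLAP operations named on this slide can be illustrated on a tiny cube held in a dictionary. This is a conceptual sketch only: the three dimensions, the cell values and the function names are assumptions, and real OLAP engines use far more elaborate storage.

```python
from collections import defaultdict

# Hypothetical cube cells: (product, city, month) -> sales amount.
DIMS = ("product", "city", "month")
cube = {
    ("pen", "Karachi", "Jan"): 10,
    ("pen", "Karachi", "Feb"): 20,
    ("pen", "Nawabshah", "Jan"): 5,
    ("lamp", "Karachi", "Jan"): 30,
}


def roll_up(cells, keep):
    """Roll-up: aggregate away every dimension not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in cells.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)


def slice_cube(cells, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cells.items() if k[i] == value}


def dice(cells, **allowed):
    """Dice: restrict several dimensions to sets of allowed values."""
    return {
        k: v for k, v in cells.items()
        if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())
    }


by_product = roll_up(cube, ["product"])
january = slice_cube(cube, "month", "Jan")
pens_in_karachi = dice(cube, product={"pen"}, city={"Karachi"})
```

Drill-down is the reverse of roll-up (calling `roll_up` with more dimensions kept), and pivoting only changes which dimensions are shown as rows and columns, not the cell values.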
Operating a DW: Analyzing
• Analyze
– Data mining
• Useful for answering questions like:
– How did the sales of this product group evolve?
• Methods and procedures for data mining
– Clustering, Classification, Regression, Association rule learning
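Of the methods listed, regression is the simplest to sketch for the question of how sales evolve. The example below fits a straight trend line by ordinary least squares using only the standard library; the monthly figures are invented, and a linear trend is an assumption that real sales data may not satisfy.

```python
def fit_trend(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ...

    Returns (slope, intercept); needs at least two values.
    """
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical monthly sales of one product group.
monthly_sales = [10.0, 12.0, 14.0, 16.0]
slope, intercept = fit_trend(monthly_sales)

# Extrapolate one month ahead along the fitted line.
forecast_next = intercept + slope * len(monthly_sales)
```

The slope summarizes how the sales evolved per month, and the extrapolated value is the prediction side of data mining mentioned earlier.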

