Data Mining and Data Warehousing
© Prentice Hall 1
Introduction
• Data is growing at a phenomenal rate
• Users expect more sophisticated information
• How?
Data Mining Definition
• Finding hidden information in a database
• Fit data to a model: descriptive or predictive
• Similar terms
• Exploratory data analysis
• Data driven discovery
• Deductive learning
Data Mining Algorithm
• Objective: Fit Data to a Model
• Descriptive
• Predictive
• Preference – Technique to choose the best model
• Search – Technique to search the data
• “Query”
Database Processing vs. Data Mining
             Database Processing     Data Mining
• Query      Well defined            Poorly defined
             SQL                     No precise query language
• Data       Operational data        Not operational data
• Output     Precise                 Fuzzy
             Subset of database      Not a subset of database
Query Examples
• Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than $10,000 in the last
month.
– Find all customers who have purchased milk.
• Data Mining
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk. (association
rules)
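The contrast between the two kinds of query can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the `purchases` records and both function names are hypothetical. The database-style query applies an exact predicate and returns a subset of the stored rows; the mining-style question returns co-occurrence counts that are not stored anywhere in the data.

```python
from collections import Counter

# Hypothetical purchase records: (customer, set of items bought together).
purchases = [
    ("Smith", {"milk", "bread", "eggs"}),
    ("Jones", {"milk", "bread"}),
    ("Lee", {"beer", "chips"}),
    ("Patel", {"milk", "cereal", "bread"}),
]

def customers_who_bought(item, records):
    """Database-style query: an exact predicate selects a subset of the rows."""
    return [customer for customer, basket in records if item in basket]

def items_bought_with(item, records):
    """Mining-style question: count how often other items co-occur with `item`."""
    counts = Counter()
    for _, basket in records:
        if item in basket:
            counts.update(basket - {item})
    return counts.most_common()

milk_buyers = customers_who_bought("milk", purchases)
milk_companions = items_bought_with("milk", purchases)
```

Here `milk_buyers` is a list of customers taken directly from the records, while `milk_companions` ranks items by how often they appear alongside milk, which is a simple first step toward association rules.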
Basic Data Mining Tasks
• Classification maps data into predefined groups or
classes
• Supervised learning
• Pattern recognition
• Regression
• Prediction
• Clustering groups similar data together into
clusters.
• Unsupervised learning
• Segmentation
• Partitioning
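The difference between classification (predefined classes, supervised) and clustering (no labels, unsupervised) can be sketched as follows. Both techniques are chosen here only for illustration: a nearest-centroid classifier and a one-dimensional two-means clustering. The slides do not name a specific algorithm, and the data values and labels are hypothetical.

```python
def nearest_centroid_classify(x, labeled):
    """Classification: assign x to the predefined class whose mean is closest."""
    by_class = {}
    for value, label in labeled:
        by_class.setdefault(label, []).append(value)
    centroids = {label: sum(vs) / len(vs) for label, vs in by_class.items()}
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def two_means(values, iterations=10):
    """Clustering: split unlabeled values into two groups (1-D k-means, k=2)."""
    low, high = min(values), max(values)
    a, b = list(values), []
    for _ in range(iterations):
        a = [v for v in values if abs(v - low) <= abs(v - high)]
        b = [v for v in values if abs(v - low) > abs(v - high)]
        if not b:  # all values identical: nothing to split
            break
        low, high = sum(a) / len(a), sum(b) / len(b)
    return a, b

# Hypothetical labeled training data (supervised): (income in $1000s, class).
training = [(20, "poor risk"), (25, "poor risk"), (70, "good risk"), (80, "good risk")]
predicted = nearest_centroid_classify(30, training)

# Hypothetical unlabeled data (unsupervised): the groups are discovered, not given.
clusters = two_means([1, 2, 3, 10, 11, 12])
```

The classifier needs the class labels in advance, whereas `two_means` receives only raw values and produces the groups itself.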
Basic Data Mining Tasks (cont’d)
• Summarization maps data into subsets with associated simple
descriptions.
• Characterization
• Generalization
• Link Analysis uncovers relationships among data.
• Affinity Analysis
• Association Rules
• Sequential Analysis determines sequential patterns.
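As an illustration of link analysis, the sketch below scores an association rule with support and confidence. These are the standard measures for association rules, although the slide itself does not define them. The transactions and function names are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that also
    contain the consequent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    return support(antecedent | consequent, transactions) / base

# Hypothetical market-basket transactions.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
]

rule_support = support({"milk", "bread"}, transactions)
rule_confidence = confidence({"milk"}, {"bread"}, transactions)
```

For the rule milk => bread, two of the four transactions contain both items, and two of the three transactions containing milk also contain bread.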
Ex: Time Series Analysis
• Example: Stock Market
• Predict future values
• Determine similar patterns over time
• Classify behavior
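Two of these time series tasks can be sketched briefly. The example below assumes a moving average as the predictor and Euclidean distance as the similarity measure. Both are simple stand-ins chosen for illustration, since the slide does not prescribe a method, and the price values are hypothetical.

```python
def moving_average_forecast(series, window=3):
    """Predict the next value as the mean of the last `window` observations."""
    recent = series[-window:]
    return sum(recent) / len(recent)

def euclidean_distance(a, b):
    """Smaller distance means the two equal-length series are more similar."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Hypothetical daily closing prices.
prices = [10.0, 11.0, 12.0, 13.0]
next_price = moving_average_forecast(prices)

# Hypothetical price patterns for two stocks over the same three days.
pattern_distance = euclidean_distance([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```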
Data Mining vs. KDD
• Knowledge Discovery in Databases (KDD): process of finding useful
information and patterns in data.
• Data Mining: Use of algorithms to extract the information and
patterns derived by the KDD process.
KDD Process
Data Mining Metrics
• Usefulness
• Return on Investment (ROI)
• Accuracy
• Space/Time
IR Query Result Measures and Classification
[Figure: two panels, titled IR and Classification]
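The figure for this slide did not survive, so the following is an assumption about what it covers: the usual information retrieval (IR) measures for a query result are precision and recall, and the same counts can be used to judge a classifier. A minimal sketch, with hypothetical document identifiers:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved items that are relevant.
    Recall: fraction of relevant items that were retrieved."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result and ground truth.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d5"}
precision, recall = precision_recall(retrieved, relevant)
```

Applied to classification, `retrieved` would be the items assigned to a class and `relevant` the items that truly belong to it.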
Dimensional Modeling
• View data hierarchically, much as business
executives might
• Useful in decision support systems and mining
• Dimension: collection of logically related attributes;
axis for modeling data.
• Facts: the data values stored for each combination of dimensions
• Ex: Dimensions – products, locations, date
Facts – quantity, unit price
Relational View of Data
Dimensional Modeling Queries
• Roll Up: more general dimension
• Drill Down: more specific dimension
• Dimension (Aggregation) Hierarchy
• SQL uses aggregation
• Decision Support Systems (DSS): Computer systems
and tools to assist managers in making decisions and
solving problems.
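Roll up and drill down along a dimension hierarchy map naturally onto SQL aggregation. The sketch below uses Python's built-in `sqlite3` module. The `sales` table, its columns, and the city-to-state hierarchy are hypothetical and serve only to illustrate the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (city TEXT, state TEXT, product TEXT, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Dallas", "TX", "milk", 10),
        ("Austin", "TX", "milk", 5),
        ("Dallas", "TX", "bread", 7),
        ("Miami", "FL", "milk", 3),
    ],
)

# Drill down: aggregate at the more specific level of the hierarchy (city).
by_city = conn.execute(
    "SELECT state, city, SUM(quantity) FROM sales "
    "GROUP BY state, city ORDER BY state, city"
).fetchall()

# Roll up: aggregate at the more general level of the hierarchy (state).
by_state = conn.execute(
    "SELECT state, SUM(quantity) FROM sales GROUP BY state ORDER BY state"
).fetchall()
```

Moving from `by_city` to `by_state` is a roll up, and the reverse is a drill down. Only the `GROUP BY` columns change.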
Cube view of Data
Star Schema
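The figure for this slide is missing. As a rough sketch of the general idea, a star schema places one central fact table (holding the facts, such as quantity and unit price) at the hub and links it by keys to one table per dimension (products, locations, date). All table and column names below are hypothetical.

```python
import sqlite3

star = sqlite3.connect(":memory:")
star.executescript("""
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE location (location_id INTEGER PRIMARY KEY, city TEXT, state TEXT);
CREATE TABLE date_dim (date_id     INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);

-- The fact table holds the measures plus one key per dimension.
CREATE TABLE sales_fact (
    product_id  INTEGER REFERENCES product(product_id),
    location_id INTEGER REFERENCES location(location_id),
    date_id     INTEGER REFERENCES date_dim(date_id),
    quantity    INTEGER,
    unit_price  REAL
);

INSERT INTO product  VALUES (1, 'milk', 'dairy'), (2, 'bread', 'bakery');
INSERT INTO location VALUES (1, 'Dallas', 'TX'), (2, 'Miami', 'FL');
INSERT INTO date_dim VALUES (1, '2024-01-05', '2024-01', 2024);
INSERT INTO sales_fact VALUES (1, 1, 1, 10, 2.0), (2, 1, 1, 4, 3.0), (1, 2, 1, 5, 2.0);
""")

# A typical query joins the fact table to a dimension and aggregates the facts.
revenue_by_state = star.execute("""
    SELECT l.state, SUM(f.quantity * f.unit_price)
    FROM sales_fact AS f
    JOIN location AS l ON l.location_id = f.location_id
    GROUP BY l.state
    ORDER BY l.state
""").fetchall()
```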
Data Warehousing
• “Subject-oriented, integrated, time-variant, nonvolatile”
William Inmon
• Operational Data: Data used for the day-to-day needs of the
company.
• Informational Data: Supports other functions such as
planning and forecasting.
• Data mining tools often access data warehouses rather than
operational data.
Operational vs. Informational
              Operational Data      Data Warehouse
Application   OLTP                  OLAP
Use           Precise queries       Ad hoc
Temporal      Snapshot              Historical
Modification  Dynamic               Static
Orientation   Application           Business
Data          Operational values    Integrated
Size          Gigabytes             Terabytes
Level         Detailed              Summarized
Access        Often                 Less often
Response      Few seconds           Minutes
Data Schema   Relational            Star/Snowflake
OLAP
• OnLine Analytic Processing (OLAP): provides more complex
queries than OLTP.
• OnLine Transaction Processing (OLTP): traditional
database/transaction processing.
• Dimensional data; cube view
• Visualization of operations:
• Slice: examine sub-cube.
• Dice: rotate cube to look at another dimension.
• Roll Up/Drill Down
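Two of these cube operations can be sketched on a small in-memory cube. The representation (a dictionary keyed by a tuple of dimension values) and the function names are hypothetical choices for illustration. Dice is not shown, and roll up is simplified to summing a whole dimension away rather than moving one level up a hierarchy.

```python
from collections import defaultdict

# Hypothetical cube: (product, city, month) -> quantity sold.
cube = {
    ("milk", "Dallas", "Jan"): 10,
    ("milk", "Dallas", "Feb"): 12,
    ("milk", "Miami", "Jan"): 3,
    ("bread", "Dallas", "Jan"): 7,
}

def slice_cube(cube, axis, value):
    """Slice: fix one dimension to a single value and keep the sub-cube."""
    return {
        key[:axis] + key[axis + 1:]: qty
        for key, qty in cube.items()
        if key[axis] == value
    }

def roll_up(cube, axis):
    """Roll up: aggregate one dimension away by summing over all its values."""
    totals = defaultdict(int)
    for key, qty in cube.items():
        totals[key[:axis] + key[axis + 1:]] += qty
    return dict(totals)

january = slice_cube(cube, axis=2, value="Jan")
all_months = roll_up(cube, axis=2)
```

`january` keeps only the January cells, while `all_months` sums each (product, city) pair over every month. Drilling down is the reverse step, returning from `all_months` to the per-month cells in `cube`.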