Department of Information Technology: Data Warehousing and Data Mining IT4204 3
Course Title: Data Warehousing and Data Mining Course No: IT4204 Credit Hr: 3
By: TWW
Companies rely on this enterprise data to improve decision-making and to gain a competitive advantage; data has indeed become a highly valued business asset. The sheer amount of data exceeds the human ability to comprehend it and to make the best decisions without tools. The generation and storage of large volumes of data has reached a critical mass, and appropriate tools for comprehending the data have become vital.
Data mining can be viewed as a result of the natural evolution of information technology. This becomes clearer if we look at the evolution of database technology since the 1960s.
1970s:
Relational data model, relational DBMS implementation
Data modeling tools such as ER diagrams
Indexing and data organization techniques such as B+ trees, hashing, etc.
Query languages such as SQL
User interfaces, forms, and reports
Query processing and optimization techniques
Transaction management: recovery, concurrency control, etc.
Online transaction processing (OLTP)
Application-oriented DBMS (spatial, temporal, multimedia, active, scientific, engineering, knowledge base, etc.)
1990s-2000s:
Data mining and data warehousing, knowledge discovery, OLAP, and Web-based databases
Data mining is the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases (data warehouses).
The term "data mining" is a misnomer, as it does not directly describe what it does. For example, mining gold from rock is called gold mining, not rock mining. Similarly, oil mining is mining oil from the ground.
Data mining would be better described as "knowledge mining from data" rather than "data mining". In any case, we will use the term with this understanding.
Note that:
Query processing systems, expert systems (knowledge-based systems), and information retrieval systems do not perform data mining tasks
Aggregation as well as
The ability to view information from different angles
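As a rough illustration of aggregation and of viewing the same information from different angles, the sketch below sums a small, made-up sales table along whichever dimensions are requested. The `sales` rows, the `DIMENSIONS` mapping, and the `roll_up` helper are assumptions introduced for this example; they are not taken from any OLAP product.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, quarter, amount).
sales = [
    ("North", "PC", "Q1", 120),
    ("North", "Printer", "Q1", 30),
    ("South", "PC", "Q1", 80),
    ("North", "PC", "Q2", 150),
    ("South", "Printer", "Q2", 45),
]

# Position of each dimension inside a fact tuple (assumed layout).
DIMENSIONS = {"region": 0, "product": 1, "quarter": 2}


def roll_up(facts, *dims):
    """Sum the amount for every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[DIMENSIONS[d]] for d in dims)
        totals[key] += fact[3]
    return dict(totals)


# The same facts, viewed from two different angles.
by_region = roll_up(sales, "region")
by_product_quarter = roll_up(sales, "product", "quarter")
```

Calling `roll_up` with a different set of dimension names gives another view of the same facts without changing the underlying data.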
Although OLAP tools support multidimensional analysis and decision making, additional data analysis tools are required for in-depth data analysis, such as:
Data classification, clustering, and characterization of data changes over time
The abundance of data, coupled with the need for powerful data analysis tools, has been described as a "data rich but information poor" situation
April 7, 2012
[Figure: Data Cleaning and Data Integration, drawing on Databases]
[Figure: layers, in the order listed: Making Decisions; Data Presentation (visualization techniques); Data Mining (information discovery); Data Exploration (statistical analysis, querying and reporting); Data Warehouses / Data Marts (OLAP); Data Sources (paper, files, information providers, database systems, OLTP). Roles shown: Business Analyst, Data Analyst, DBA]
[Figure: components shown: pattern evaluation; data mining engine; knowledge base; database or data warehouse server; data cleaning, data integration, and filtering; databases; data warehouse]
Potential Applications:
Retail/Marketing Analysis
Market analysis is a tool companies use in order to better understand the environment in which they operate.
It is one of the main steps in the development of a marketing plan, and it involves critically reviewing and organizing collected data so that it can be used in making strategic marketing decisions.
Retailers can use information collected through affinity programs (e.g., shoppers' club cards, frequent flyer points, contests) to assess the effectiveness of product selection and placement decisions and of coupon offers, and to find out which products are often purchased together.
Companies such as banking service providers and music clubs can use data mining to perform churn analysis, that is, to assess which customers are likely to remain as subscribers and which ones are likely to switch to a competitor.
The sources of data can be credit card transactions, loyalty cards, discount coupons, customer complaint calls, and (public) lifestyle studies.
The potential applications involve:
Customer profiling
Market basket analysis
Market segmentation
Gathering demographic, geographic, behavioral, and psychographic information about customers and clustering them for proper handling
Potential Applications:
Detecting outliers and managing them before they damage the organization
Widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
Use historical data to build models of fraudulent behavior, and use data mining to help identify similar instances
Potential Applications:
Examples
Auto insurance:
Money laundering:
Detect suspicious money transactions in the banking network
Medical insurance:
Detect professional patients, rings of doctors, and rings of references
Telephone call model:
Destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
Telecom companies can identify discrete groups of callers with frequent intra-group calls, especially on mobile phones, and break a possible multimillion-dollar fraud.
Retail:
Analysts estimate that 38% of retail shrink is due to dishonest employees.
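One simple way to look for patterns that deviate from an expected norm, as in the telephone call model above, is to flag values that lie far from the mean in units of standard deviation. The sketch below is only an illustration under that assumption: the `call_minutes` data and the threshold of 2.0 are made up, and a z-score is a stand-in for whatever model a real fraud detection system would build from historical data.

```python
from statistics import mean, stdev


def flag_deviations(values, threshold=2.0):
    """Return the indices of values whose z-score magnitude exceeds threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation; needs at least 2 values
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical call durations in minutes for one account.
call_minutes = [3, 4, 2, 5, 3, 4, 3, 2, 4, 95]
suspicious = flag_deviations(call_minutes)
```

Here only the last call stands out from the rest. A flagged call is a candidate for review, not proof of fraud.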
Potential Applications:
Text mining (newsgroups, email, documents) and Web analysis
Intelligent query answering
Others
Sports
Astronomy
Potential Applications:
Finance planning and asset evaluation
Cash flow analysis and prediction
Contingent claim analysis to evaluate assets
Cross-sectional and time series analysis (financial ratios, trend analysis, etc.)
Resource planning:
summarize and compare the resources and spending
Competition:
Monitor competitors and market directions
Group customers into classes and apply a class-based pricing procedure
Set pricing strategy in a highly competitive market
There are different kinds of data mining functionalities (tasks) that can be used to extract various types of patterns from data
Finding models (functions) that describe and distinguish classes or concepts for future prediction
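To make the idea of a model that distinguishes classes for future prediction concrete, here is a minimal sketch of a nearest-centroid classifier. It is one of many possible classification techniques and is chosen only for brevity. The training records (age, income in thousands), the class labels, and the function names are all hypothetical.

```python
from math import dist


def train_centroids(samples):
    """Build the model: map each label to the mean of its feature vectors."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(s / counts[label] for s in acc)
            for label, acc in sums.items()}


def predict(centroids, features):
    """Predict the label whose centroid is closest to the new record."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical labeled history: ((age, income in thousands), class label).
training = [
    ((25, 25), "buys_pc"),
    ((28, 27), "buys_pc"),
    ((55, 60), "does_not_buy"),
    ((60, 58), "does_not_buy"),
]
centroids = train_centroids(training)
new_customer = predict(centroids, (30, 30))
```

The centroids are the learned model; `predict` applies it to records whose class is not yet known.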
Data characterization refers to summarizing the data of the class under consideration (the target class) in general terms
For example, one may characterize the item class as a class in which 90% of the objects are computers and their peripherals
Data discrimination is a description made by comparing the target class with one or more comparative classes (contrasting classes)
For example, one may discriminate the item class from other classes, such as the customer and order classes, by saying that the attributes of the item class are modified more frequently than those of the others
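The two examples above can be mimicked on toy data. In the sketch below, characterization reports the share of each category inside the target class, and discrimination compares an average between the target class and a contrasting class. The records, the `updates` field, and the helper names are assumptions made up for this illustration.

```python
from collections import Counter


def characterize(records, attribute):
    """Characterization: share of each value of `attribute` in the class."""
    counts = Counter(r[attribute] for r in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def mean_of(records, attribute):
    """Average of a numeric attribute over the records of one class."""
    return sum(r[attribute] for r in records) / len(records)


# Hypothetical target class (items) and contrasting class (customers).
items = [{"category": "computer/peripheral", "updates": 12}] * 9
items = items + [{"category": "other", "updates": 10}]
customers = [{"updates": 2}, {"updates": 1}, {"updates": 3}]

profile = characterize(items, "category")
# Discrimination: how much more often item records are modified.
contrast = mean_of(items, "updates") - mean_of(customers, "updates")
```

With this data, `profile` shows that 90% of the items are computers or peripherals, and a positive `contrast` says that item records are modified more often than customer records.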
Interpreted as: anyone whose age is between 20 and 29 and whose income is between 20K and 29K is likely to buy a PC, with support 2% and confidence 60%
Support is the probability that all the predicates in X and Y are satisfied together, i.e., P(X ∪ Y), read as the probability that both X and Y hold
Confidence is the probability that the predicates in Y are satisfied given that the predicates in X are satisfied, i.e., P(Y | X)
Interpreted as: if a transaction T contains a computer, it is also likely to contain software, with support 1% and confidence 75%
In the above two examples, age, income, buys, and contains are called attributes or predicates
An attribute is a value if it comes after the implication sign
An association rule can be multi-dimensional (more than one predicate in X and Y) or single-dimensional (the same single predicate in both X and Y)
For example, the association rule in example 1 is multi-dimensional, whereas the rule in example 2 is single-dimensional
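The support and confidence measures described above can be computed directly from a list of transactions. The sketch below does this for the rule "computer implies software". The eight transactions are invented so that the confidence comes out at 75%, as in the second example; the support in this tiny data set is much higher than the 1% quoted there. The function names are assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


# Hypothetical market baskets.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"software"},
    {"printer", "paper"},
    {"paper"},
]

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
```

Three of the eight baskets contain both items, and three of the four baskets that contain a computer also contain software.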
Cluster analysis groups data to form new classes, e.g., clustering houses to find distribution patterns
Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
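As a small illustration of grouping unlabeled data so that members of a cluster are similar to each other and dissimilar to other clusters, the sketch below runs a plain k-means procedure on one-dimensional values. k-means is just one clustering method, picked here for its simplicity. The house prices (in thousands) and the starting centers are made-up values.

```python
def k_means_1d(values, centers, iterations=10):
    """Plain k-means on numbers, starting from the given centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical house prices in thousands, with two rough starting guesses.
house_prices = [100, 110, 105, 300, 310, 295]
centers, clusters = k_means_1d(house_prices, centers=[100, 300])
```

No class labels are supplied; the two groups emerge from the distances between the values alone.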
For example, association algorithms do not find classification patterns or other kinds of patterns
[Figure: Data Mining and related disciplines: Database Technology, Statistics, Machine Learning, Visualization, Information Science, Other Disciplines]
Major Issues:
Mining methodology and user interaction
Reflects the kind of knowledge mined, the ability to mine knowledge at different granularities, the use of domain knowledge, and knowledge visualization
Mining different kinds of knowledge in databases for different users
Requires different data mining functionalities and algorithms
Handling noise, incomplete data, and exceptions
Pattern evaluation: the interestingness problem
Major Issues:
Performance and scalability
This includes:
Efficiency and scalability of data mining algorithms
Parallel, distributed, and incremental mining methods for huge amounts of data
Mining information from heterogeneous databases and global information systems (WWW)
This is a major issue when such systems are used as a source of data for data mining
Major Issues:
Issues related to applications and social impacts
Data mining is also concerned with issues related to its applications and the impact it may have on society. These include:
Identifying the applications of discovered knowledge, which can be:
Domain-specific data mining tools
Intelligent query answering
Process control and decision making
How to integrate the discovered knowledge with existing knowledge: a knowledge fusion problem
Protection of data security, integrity, and privacy, which may have social impact
Summary
Data mining: discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
Classification of data mining systems
Major issues in data mining
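The KDD steps listed in the summary can be pictured as a chain of small functions, each feeding the next. The sketch below is a toy under stated assumptions: the two source tables, the field names, and the very simple "pattern" (a count of PC buyers per age group) are invented, and each step is reduced to a single line of logic. Knowledge presentation is left out.

```python
def clean(rows):
    """Data cleaning: drop rows that have a missing value."""
    return [r for r in rows if None not in r.values()]


def integrate(*sources):
    """Data integration: combine rows from several sources."""
    return [row for source in sources for row in source]


def select(rows, fields):
    """Data selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in rows]


def transform(rows):
    """Transformation: bucket raw ages into decades such as '20s'."""
    return [{**r, "age_group": f"{r['age'] // 10 * 10}s"} for r in rows]


def mine(rows):
    """Data mining: count PC buyers per age group."""
    counts = {}
    for r in rows:
        if r["buys_pc"]:
            counts[r["age_group"]] = counts.get(r["age_group"], 0) + 1
    return counts


def evaluate(patterns, min_count=2):
    """Pattern evaluation: keep patterns that meet a minimum count."""
    return {k: v for k, v in patterns.items() if v >= min_count}


# Hypothetical source tables from two stores.
store_a = [
    {"age": 23, "buys_pc": True, "city": "X"},
    {"age": 27, "buys_pc": True, "city": "Y"},
    {"age": None, "buys_pc": False, "city": "X"},
]
store_b = [
    {"age": 45, "buys_pc": True, "city": "Z"},
    {"age": 29, "buys_pc": False, "city": "Y"},
]

rows = integrate(clean(store_a), clean(store_b))
rows = transform(select(rows, ["age", "buys_pc"]))
interesting = evaluate(mine(rows))
```

Only the age group with at least two buyers survives the evaluation step, which stands in for the interestingness check.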