
Department of Information Technology: Data Warehousing and Data Mining IT4204 3

This document provides an introduction to a course on data warehousing and data mining. It discusses the motivation for data mining due to the large amount of data being generated. It defines data mining as extracting useful patterns from large databases and notes it is sometimes called knowledge discovery. The document outlines the typical steps in a data mining process and common applications such as retail, marketing, banking, insurance, and healthcare.

Uploaded by

Yared Berhe
Copyright
© Attribution Non-Commercial (BY-NC)

Department of Information Technology

Course Title: Data Warehousing and Data Mining Course No: IT4204 Credit Hr: 3

By: TWW

Chapter one Introduction


In this chapter we will briefly cover the following topics:
Motivation: Why data mining?
What is data mining and data warehousing?
Architecture of a typical DM system
Why DM: Applications of data mining
Data mining: On what kind of data?
Data mining functionalities
Are all the patterns interesting?
Classification of data mining systems
Major issues in data mining

Motivation: Necessity is the Mother of Invention


Our capacity to generate and collect data has increased rapidly over the last several decades.
Huge amounts of data are available at our fingertips.
It is predicted that more data will be produced in the next year than has been generated during the entire existence of humankind!
A study by IBM predicted that by 2005 the amount of information on the planet would increase from 3.2 million exabytes to 43 million exabytes (1 exabyte = 2^60 bytes, roughly 10^18 bytes).

Motivation: Necessity is the Mother of Invention


Contributing factors include
Widespread use of bar codes for most commercial products
Computerization of many business, scientific, and governmental transactions
Advances in data collection tools (audio, video, satellite remote sensing, scanning, image capturing tools)
Use of the WWW as a global information system (the Internet in general)
Development of comprehensive application software
New computing and storage technologies

Motivation: Necessity is the Mother of Invention


All this has made it easier to create, collect, and store all types of data.
As a result, it creates a problem called data explosion.
Data explosion is the problem of an enterprise having a huge amount of data, generated by automated data collection tools and mature database technology and stored in databases, data warehouses and other information repositories, which has to be processed in order to make decisions.
As the size of the data gets larger, analyzing it becomes very difficult.

Motivation: Necessity is the Mother of Invention


Data can be managed and stored in
structured relational databases;
semi-structured file systems, such as e-mail;
unstructured fixed content, like documents and graphic files.

Companies rely on this enterprise data to improve decision making and to gain a competitive advantage; data has indeed become a highly valued business asset.
The huge amount of data exceeds the human ability to comprehend it and to make the best decisions without tools.
The generation and storage of large volumes of data has reached a critical mass, and appropriate tools for comprehending the data have become vital.

Motivation: Necessity is the Mother of Invention


We are drowning in data, but starving for knowledge!

The Solution: Data warehousing and data mining

Data mining can be viewed as a result of the natural evolution of information technology. This becomes clearer if we look at the evolution of database technology since the 1960s.

Motivation: Necessity is the Mother of Invention


1960s:
Known as the era of primitive file processing
Activities included data collection, database creation (no DBMS), and information management systems (IMS), mainly using COBOL

1970s:
Relational data model, relational DBMS implementation
Data modeling tools such as ER diagrams
Indexing and data organization techniques such as B+ trees, hashing, etc.
Query languages such as SQL
User interfaces, forms and reports
Query processing and optimization techniques
Transaction management: recovery, concurrency control, etc.
Online transaction processing (OLTP)

Motivation: Necessity is the Mother of Invention


1980s:
Period of advanced DB systems
Advanced data models (extended-relational, object-oriented, object-relational, deductive, etc.)
Application-oriented DBMS (spatial, temporal, multimedia, active, scientific, engineering, knowledge base, etc.)

1990s-2000s:
Data mining and data warehousing, knowledge discovery, OLAP and Web-based databases

What is Data Mining & Data warehousing?

Data mining is the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases (data warehouses).
The term data mining is a misnomer, as it does not directly describe what it does.
For example, mining gold from rock is called gold mining, not rock mining. Similarly, oil mining is mining oil from the ground.
Data mining would be better described as knowledge mining from data.
Anyway, we will use the term with this understanding.

What is Data Mining & Data warehousing?


Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Note that:
Query processing systems, expert systems (knowledge-based systems) and information retrieval systems are not data mining systems

What is Data Mining & Data warehousing?


Data can now be stored in many different types of databases.
A special database architecture that has recently emerged is the data warehouse.
A data warehouse is a repository of multiple heterogeneous data sources, organized under a unified schema at a single site in order to facilitate the management decision-making process.
Data warehouse technology includes
Data cleansing (removing noise and inconsistent data)
Data integration (combining multiple data sources into one data warehouse)
On-Line Analytical Processing (OLAP)
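
As a rough illustration of data cleansing and data integration, the following Python sketch cleans two small sources and merges them under one schema. The record layout, the field names (customer, amount) and the cleaning rule are assumptions made for this example only, not part of any particular warehouse product.

```python
def clean(records):
    """Drop records whose amount is missing or not positive; normalise names."""
    cleaned = []
    for rec in records:
        amount = rec.get("amount")
        if amount is None or amount <= 0:
            continue  # treated as noise or inconsistent data in this sketch
        cleaned.append({"customer": rec["customer"].strip().lower(),
                        "amount": amount})
    return cleaned


def integrate(*sources):
    """Combine several sources into one store: total amount per customer."""
    warehouse = {}
    for source in sources:
        for rec in clean(source):
            name = rec["customer"]
            warehouse[name] = warehouse.get(name, 0) + rec["amount"]
    return warehouse


# Two hypothetical operational sources with slightly different conventions.
store_a = [{"customer": " Alice ", "amount": 120},
           {"customer": "Bob", "amount": None}]
store_b = [{"customer": "alice", "amount": 80},
           {"customer": "Bob", "amount": 50},
           {"customer": "Carol", "amount": -5}]

warehouse = integrate(store_a, store_b)
print(warehouse)
```

Here the two spellings of the same customer end up as one entry, which is the point of integrating under a unified schema.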


What is Data Mining & Data warehousing?


OLAP is an analysis technique with functionalities such as
Summarization
Consolidation
Aggregation
The ability to view information from different angles

Although OLAP tools support multidimensional analysis and decision making, additional data analysis tools are required for in-depth analysis, such as
Data classification
Clustering
Characterization of data changes over time

The abundance of data, coupled with the need for powerful data analysis tools, has been described as a "data rich but information poor" situation.
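
The OLAP idea of aggregating a measure and viewing it from different angles can be sketched in a few lines of Python. The sales facts, the dimension names and the roll_up helper below are hypothetical and only illustrate the concept; real OLAP servers work on far larger, precomputed cubes.

```python
from collections import defaultdict

# Hypothetical fact table: each row has three dimensions and one measure.
sales = [
    {"region": "North", "quarter": "Q1", "product": "PC", "amount": 100},
    {"region": "North", "quarter": "Q2", "product": "PC", "amount": 150},
    {"region": "South", "quarter": "Q1", "product": "Printer", "amount": 40},
    {"region": "South", "quarter": "Q1", "product": "PC", "amount": 60},
]


def roll_up(facts, dimensions, measure="amount"):
    """Sum the measure over every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[d] for d in dimensions)
        totals[key] += fact[measure]
    return dict(totals)


# The same data viewed from two different angles.
by_region = roll_up(sales, ["region"])
by_quarter_product = roll_up(sales, ["quarter", "product"])
print(by_region)
print(by_quarter_product)
```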

What is Data Mining & Data warehousing?


Data mining (Knowledge Discovery in Databases) consists of an iterative sequence of the following seven steps:

1. Learning the application domain
Learn the relevant prior knowledge to be used in the DM process
Learn the goals of the DM application

2. Creating a target data set (data selection)
Data cleaning and preprocessing (may take 60% of the effort!)
Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation

3. Choosing the functions of data mining
Summarization, classification, regression, association, clustering

What is Data Mining & Data warehousing?


4. Choosing the mining algorithm(s)
Data mining: search for patterns of interest

5. Identifying the relevant knowledge by measuring the interestingness of the mining results
Pattern evaluation

6. Presenting the knowledge to the user
Knowledge presentation: visualization, transformation, removing redundant patterns, etc.

7. Using the discovered knowledge
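
The steps above can be pictured as a small pipeline. The Python sketch below is only a toy, with made-up purchase records, a simple frequency count standing in for the mining algorithm, and an assumed threshold (min_share) standing in for an interestingness measure.

```python
from collections import Counter

# Hypothetical raw records; one of them is noisy (missing item).
raw = [
    {"customer": "c1", "item": "bread"},
    {"customer": "c2", "item": "bread"},
    {"customer": "c3", "item": None},
    {"customer": "c4", "item": "milk"},
    {"customer": "c5", "item": "bread"},
]


def select(records):
    """Step 2: build the target data set (keep only the item attribute)."""
    return [r["item"] for r in records]


def clean(items):
    """Step 2: cleaning and preprocessing (drop missing values)."""
    return [i for i in items if i is not None]


def mine(items):
    """Steps 3-4: summarization by frequency counting."""
    return Counter(items)


def evaluate(patterns, total, min_share=0.5):
    """Step 5: keep only patterns whose share reaches the threshold."""
    return {item: n / total for item, n in patterns.items()
            if n / total >= min_share}


def present(patterns):
    """Step 6: turn the surviving patterns into readable statements."""
    return [f"{item}: {share:.0%} of purchases"
            for item, share in sorted(patterns.items())]


items = clean(select(raw))
report = present(evaluate(mine(items), total=len(items)))
print(report)
```

In practice the process is iterative: a disappointing report would send the analyst back to selection, cleaning or the choice of mining function.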


Data mining Models


The above seven steps are usually standardized in the form of process models.
The most common standard in data mining is the CRoss-Industry Standard Process for Data Mining (CRISP-DM).

CRISP-DM has the following steps:
Business/research understanding (learning the application domain)
Data understanding (data selection for the problem)
Data preparation, which involves collecting, cleaning, consolidating and amalgamating records, summarizing fields, checking for data integrity, detecting irregularities and illegal attributes, filling in missing values, and trimming outliers
Data modeling, which involves selecting data mining tools, transforming the data if the tools require it, generating samples for training and testing the model, and finally using the tools to build and select a model
Evaluating the model
Deploying the model
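
Two of the data preparation activities listed above, filling in missing values and trimming outliers, might look roughly like the following Python sketch. Filling with the median and trimming to a fixed valid range are assumptions chosen for this example; CRISP-DM itself does not prescribe a particular method.

```python
from statistics import median


def fill_missing(values):
    """Replace missing entries (None) with the median of the known values."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]


def trim_outliers(values, low, high):
    """Keep only the values inside the accepted range [low, high]."""
    return [v for v in values if low <= v <= high]


# Hypothetical age field with one missing value and one impossible value.
ages = [23, None, 31, 27, 240, 29]
prepared = trim_outliers(fill_missing(ages), low=0, high=120)
print(prepared)
```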



April 7, 2012


Data Mining: A KDD Process


Data mining is the core of the knowledge discovery in databases (KDD) process.

[Figure: the KDD process. Databases; data cleaning and data integration; data warehouse; selection and transformation; task-relevant data; data mining; pattern evaluation]

Data Mining and Business Intelligence


[Figure: layers of business intelligence, with increasing potential to support business decisions from bottom to top: data sources (paper, files, information providers, database systems, OLTP); data warehouses / data marts and OLAP; data exploration (statistical analysis, querying and reporting); data mining (information discovery); data presentation (visualization techniques); making decisions. Roles shown beside the layers: DBA, data analyst, business analyst, end user]

Architecture of a Typical Data Mining System


[Figure: components of a typical data mining system: graphical user interface; pattern evaluation; data mining engine; knowledge base; database or data warehouse server; data cleaning, data integration and filtering; databases and data warehouse]

Why Data Mining? Potential Applications


Data mining has many and varied fields of application, some of which are listed below.
1. Retail/Marketing
2. Banking
3. Insurance and Health Care
4. Transportation
5. Medicine

Potential Applications:

Retail/Marketing Analysis

Market analysis is a tool companies use in order to better understand the environment in which they operate.
It is one of the main steps in the development of a marketing plan, and involves critically reviewing and organizing collected data so that it can be used in making strategic marketing decisions.
Retailers can use information collected through affinity programs (e.g., shoppers' club cards, frequent flyer points, contests) to assess the effectiveness of product selection and placement decisions and coupon offers, and to find which products are often purchased together.

Potential Applications:

Retail/Marketing Analysis

Companies such as banking service providers and music clubs can use data mining to perform churn analysis, that is, to assess which customers are likely to remain subscribers and which are likely to switch to a competitor.
The sources of data can be credit card transactions, loyalty cards, discount coupons, customer complaint calls, and (public) lifestyle studies.

Potential Applications: Retail/Marketing Analysis


The potential applications include
Target marketing analysis: finding clusters of "model" customers who share the same characteristics (interests, income level, spending habits, etc.) in order to target a specific market
Predicting the response to mailing, calling and advertising campaigns
Determining customer purchasing patterns over time: finding customers' buying patterns based on an attribute or a combination of attributes such as marital status, age, salary, citizenship, type of credit card, family condition, etc.
Cross-market analysis
Associations/correlations between product sales
Prediction based on the association information

Potential Applications: Retail/Marketing Analysis

The potential applications include
Customer profiling
What types of customers buy what products (clustering or classification)
Find associations among customer demographic characteristics (age, name, address, profession, sex, etc.)
Identify buying patterns of customers

Identifying customer requirements
Identify the best products for different customers
Use prediction to find what factors will attract new customers

Providing summary information
Various multidimensional summary reports for decision making
Statistical summary information (data central tendency and variation)

Potential Applications: Retail/Marketing Analysis

The potential applications include
Market basket analysis
Identifying groups of items that customers usually buy together

Market segmentation
Gathering demographic, geographic, behavioral and psychographic information about customers and clustering them for proper handling

Risk analysis and management
Forecasting, customer retention, improved underwriting, quality control, competitive analysis
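
Market basket analysis, mentioned on this slide, can be illustrated by simply counting how often pairs of items appear in the same basket. The baskets below are invented for the example, and pair counting is only the simplest possible version of the idea, not a full association rule miner.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per customer visit.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
    {"bread", "milk"},
]

# Count every unordered pair of items that occurs together in a basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
```

A retailer could use the most frequent pairs as candidates for joint placement or joint promotion.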


Potential Applications: Fraud detection and management

Detecting outliers and managing them before they damage the organization
Widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
Use historical data to build models of fraudulent behavior, and use data mining to help identify similar instances

Potential Applications: Fraud detection and management

Examples
Auto insurance: detect groups of people who stage accidents to collect on insurance
Money laundering: detect suspicious money transactions in the banking network
Medical insurance: detect professional patients, rings of doctors and rings of references
Detecting inappropriate medical treatment
Detecting telephone fraud: detect users of a telephone line that has been either hijacked or stolen from a customer

Potential Applications: Fraud detection and management

Examples
Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
A telecom company can identify discrete groups of callers with frequent intra-group calls, especially on mobile phones, and so break up a possible multimillion-dollar fraud.
Retail: analysts estimate that 38% of retail shrink is due to dishonest employees.

Potential Applications: Banking


Detect patterns of fraudulent credit card use
Identify "loyal" customers
Predict customers likely to change their credit card affiliation
Determine credit card spending by customer groups
Find hidden correlations between different financial indicators
Identify stock trading rules from historical market data

Potential Applications: Insurance and Health Care


Claims analysis, i.e. which medical procedures are claimed together
Predict which customers will buy new policies
Identify behavior patterns of risky customers
Identify fraudulent behavior

Potential Applications: Transportation


Determine the distribution schedules among outlets
Analyze loading patterns

Potential Applications: Medicine


Characterize patient behavior to predict office visits
Identify successful medical therapies for different illnesses

Potential Applications: Others

Text mining (news groups, email, documents) and Web analysis
Intelligent query answering
Sports
Astronomy

Potential Applications: Corporate Analysis and Risk Management

Finance planning and asset evaluation:
cash flow analysis and prediction
contingent claim analysis to evaluate assets
cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)

Resource planning:
summarize and compare the resources and spending

Competition:
monitor competitors and market directions
group customers into classes and set a class-based pricing procedure
set pricing strategy in a highly competitive market

Data Mining: On What Kind of Data?


Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
Object-oriented and object-relational databases
Spatial databases
Time-series data and temporal data
Text databases and multimedia databases
Heterogeneous and legacy databases
WWW

Data Mining Functionalities


Data mining can be performed on various types of data stores and databases.
Data mining functionalities are used to specify the kinds of patterns to be found in a data mining task.
The kinds of patterns that can be mined from given data are usually not known to the user in advance.
Techniques should therefore be implemented to extract various patterns from the available data, so that users can choose the ones they need.

There are different kinds of data mining functionalities (tasks) that can be used to extract various types of patterns from data.

Data Mining Functionalities


Generally, data mining tasks can be broadly classified as
Descriptive (unsupervised)
Predictive (supervised)

Descriptive data mining tasks characterize the general properties of the data in a database. This kind of data mining is usually unsupervised.
Predictive data mining tasks perform inference on the current data in order to make predictions about future data. This is usually a supervised technique.

Data Mining Functionalities


The supervised predictive data mining functionalities include
Classification
Regression
Time series
Estimation

The unsupervised descriptive data mining functionalities include
Concept/class description: characterization (summarization) and discrimination
Association analysis
Clustering analysis
Outlier analysis
Evolution analysis

Data Mining Functionalities: Classification


Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, so that the model can be used to predict the class of an object whose class is unknown.
The derived model is based on a training data set and can be represented in various forms, such as classification IF-THEN rules, decision trees, mathematical formulae or neural networks.
Prediction, by contrast, is the process of predicting missing or unavailable data values rather than class labels.
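
A minimal sketch of learning a classification IF-THEN rule from a training set is shown below. It assumes a single numeric attribute (income in thousands) and a yes/no class label, both invented for this example, and it learns only one threshold; real classifiers such as decision trees combine many such tests.

```python
# Hypothetical training data: (income in thousands, buys a PC?).
training = [(18, "no"), (22, "no"), (30, "yes"), (35, "yes"), (41, "yes")]


def classify(value, threshold):
    """Apply the rule: IF value >= threshold THEN 'yes' ELSE 'no'."""
    return "yes" if value >= threshold else "no"


def learn_threshold(examples):
    """Pick the threshold that misclassifies the fewest training examples."""
    best_t, best_errors = None, len(examples) + 1
    for t in sorted(value for value, _ in examples):
        errors = sum(classify(value, t) != label for value, label in examples)
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t


threshold = learn_threshold(training)
print(threshold)                 # learned rule: IF income >= threshold THEN buys PC
print(classify(33, threshold))   # class predicted for a new, unlabeled object
```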


Data Mining Functionalities: Regression


Regression is the process of predicting missing or unavailable numerical data values rather than class labels.

It finds models (functions) that describe how a value depends on other attributes, for use in future prediction.
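
As one concrete instance of regression, the sketch below fits a straight line by ordinary least squares and uses it to predict an unseen value. The choice of a linear model and the salary figures are assumptions for illustration; the slide does not commit to any particular regression method.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: years of experience and salary in thousands.
years = [1, 2, 3, 4]
salary = [30.0, 35.0, 40.0, 45.0]

slope, intercept = fit_line(years, salary)
predicted = slope * 5 + intercept   # predict the unavailable value for year 5
print(slope, intercept, predicted)
```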


Data Mining Functionalities


Concept/class description: Characterization and discrimination
Given one or more classes, together with the data that belong to them, describe each class by observing its members.
In this way one can describe individual classes or concepts in summarized, concise and yet precise terms; this is called class/concept description.
These descriptions can be derived via data characterization, data discrimination, or both.

Data characterization refers to summarizing the data of the class under consideration (the target class) in general terms.
For example, one may characterize the item class as a class in which 90% of the objects are computers and their peripherals.

Data discrimination is a description made by comparing the target class with one or more other classes (the contrasting classes).
For example, one may discriminate the item class from other classes, such as the customer and order classes, by saying that the attributes of the item class are modified more frequently than those of the others.
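
Characterization and discrimination can both be sketched as simple summaries. In the Python example below, the two customer classes and the age_group attribute are hypothetical: characterize summarizes one class, and discriminate reports how the target class differs from a contrasting class.

```python
from collections import Counter


def characterize(objects, attribute):
    """Summarize a class: share of its members taking each attribute value."""
    counts = Counter(obj[attribute] for obj in objects)
    return {value: n / len(objects) for value, n in counts.items()}


def discriminate(target, contrast, attribute):
    """Compare two classes: difference in the share of each attribute value."""
    t = characterize(target, attribute)
    c = characterize(contrast, attribute)
    return {value: t.get(value, 0.0) - c.get(value, 0.0)
            for value in sorted(set(t) | set(c))}


# Hypothetical target class and contrasting class.
frequent_buyers = [{"age_group": "young"}, {"age_group": "young"},
                   {"age_group": "young"}, {"age_group": "old"}]
rare_buyers = [{"age_group": "young"}, {"age_group": "old"},
               {"age_group": "old"}, {"age_group": "old"}]

profile = characterize(frequent_buyers, "age_group")
difference = discriminate(frequent_buyers, rare_buyers, "age_group")
print(profile)
print(difference)
```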


Data Mining Functionalities: Association Analysis

Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data.
Association rules are of the form X => Y [support = s%, confidence = c%], where X and Y are conjunctions of attribute-value pairs (predicates). The rule is interpreted as: if X holds, then Y is likely to hold, with support s% and confidence c%.
For example
age(X, "20..29") ^ income(X, "20..29K") => buys(X, "PC") [support = 2%, confidence = 60%]

This is interpreted as: anyone whose age is between 20 and 29 and whose income is between 20K and 29K is likely to buy a PC, with support 2% and confidence 60%.
Support is the probability that all the predicates in X and Y are fulfilled together, i.e. P(X U Y), where X U Y denotes the union of the two sets of predicates.
Confidence is the probability that the predicates in Y are fulfilled given that the predicates in X are fulfilled, i.e. P(Y | X).
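
The two measures can be computed directly from a set of transactions. The following Python sketch uses four invented transactions; support is the fraction of transactions containing all the items, and confidence is the support of the whole rule divided by the support of its left-hand side.

```python
# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Estimate P(rhs | lhs) as support(lhs and rhs) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)


rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(rule_support, rule_confidence)
```

In this toy data the rule computer => software holds in 2 of 4 transactions (support 50%), and in 2 of the 3 transactions that contain a computer (confidence about 67%).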

Data Mining Functionalities: Association Analysis

Example 2:
contains(T, "computer") => contains(T, "software") [1%, 75%]

This is interpreted as: if a transaction T contains a computer, it is also likely to contain software, with support 1% and confidence 75%.
In the above two examples, age, income, buys and contains are called attributes or predicates.
The predicates before the implication sign form the condition of the rule; those after it form the conclusion.
An association rule can be multi-dimensional (it involves more than one distinct predicate) or single-dimensional (it involves only one predicate, possibly repeated).
For example, the association rule in example 1 is multi-dimensional, whereas the one in example 2 is single-dimensional.

Data Mining Functionalities Cluster analysis


In cluster analysis, class labels are unknown and a set of data is given to be grouped.

Cluster analysis groups data to form new classes, e.g., clustering houses to find distribution patterns.
Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity.
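
One well-known way to apply this principle is k-means clustering, sketched below for one-dimensional data. The slide does not name an algorithm, so k-means, the house prices and the starting centers are all assumptions chosen for illustration.

```python
def k_means_1d(points, centers, iterations=10):
    """Simple 1-D k-means: assign each point to its nearest center, then
    move each center to the mean of the points assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical house prices (in thousands) with two natural groups.
prices = [100, 110, 120, 300, 310, 320]
centers, clusters = k_means_1d(prices, centers=[100, 320])
print(centers)
print(clusters)
```

Points within each resulting cluster are close to one another (high intra-class similarity), while the two clusters lie far apart (low inter-class similarity).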


Data Mining Functionalities Outlier analysis


A database may contain data objects that do not comply with the general behavior or model of the data.
These data objects are outliers.
Outliers are usually treated as noise or exceptions in many data mining applications.
However, in some applications, such as fraud detection, the rare events can be more interesting than the regularly occurring ones.
The analysis of outlier data is referred to as outlier mining.
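
A very simple form of outlier mining flags values that lie far from the mean, measured in standard deviations. The sketch below assumes a z-score test with a threshold of 2, and the transaction amounts are invented; many other outlier definitions exist.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical card transaction amounts with one unusual purchase.
amounts = [20, 22, 19, 21, 23, 20, 500]
suspicious = find_outliers(amounts)
print(suspicious)
```

In a fraud detection setting the flagged value would be passed on for inspection rather than discarded as noise.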


Data Mining Functionalities Trend and evolution analysis


Describes and models regularities or trends for objects whose behavior changes over time.
Though this may include characterization, discrimination, association or clustering of time-related data, the distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.
It is also referred to as regression analysis, sequential pattern mining, periodicity analysis, or similarity-based analysis.
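
A basic tool in time-series data analysis is the moving average, which smooths short-term fluctuations so that the underlying trend becomes visible. The monthly sales figures and the window size in this Python sketch are hypothetical.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values in the series."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]


# Hypothetical monthly sales with small fluctuations around a rising trend.
monthly_sales = [10, 12, 11, 15, 18, 17, 21]
trend = moving_average(monthly_sales, window=3)
print(trend)
```

Although the raw series dips twice, the smoothed series rises steadily, which is the regularity a trend analysis is trying to describe.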


Are All the Discovered Patterns Interesting?


A data mining system/query may generate thousands of patterns, and not all of them are interesting.
Questions
What makes a pattern interesting?
Can a data mining system generate all of the interesting patterns?
Can a data mining system generate only interesting patterns?

Answers to all three questions are given below.

Question 1 Are All the Discovered Patterns Interesting?


A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm.
An interesting pattern represents knowledge.
Interestingness measures
Two types (objective vs. subjective)
Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
Subjective: based on the user's beliefs about the data, e.g., unexpectedness (contradicting a user's belief), novelty, actionability, etc.
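
Objective measures lend themselves to simple filtering: a rule is kept only if its support and confidence reach user-chosen minimums. The rules, their scores and the thresholds in the sketch below are made up for illustration.

```python
# Hypothetical mined rules with their objective measures.
rules = [
    {"rule": "computer => software", "support": 0.01, "confidence": 0.75},
    {"rule": "bread => milk", "support": 0.20, "confidence": 0.40},
    {"rule": "printer => paper", "support": 0.05, "confidence": 0.80},
]


def interesting(rules, min_support, min_confidence):
    """Keep the rules that meet both minimum thresholds."""
    return [r["rule"] for r in rules
            if r["support"] >= min_support and r["confidence"] >= min_confidence]


selected = interesting(rules, min_support=0.02, min_confidence=0.60)
print(selected)
```

Subjective measures such as unexpectedness cannot be computed this way, since they depend on what the individual user already believes.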


Question 2 Can We Find All Interesting Patterns?


This is referred to as the completeness of the data mining algorithm.
No single data mining system is complete, but users can set constraints on the type of patterns they are looking for, and the data mining function then generates all the patterns that satisfy the specified constraints.

Association algorithms, for example, do not find classification patterns or other kinds of patterns.

Question 3 Can We Find Only Interesting Patterns?


This is an optimization problem in data mining systems.
It remains a challenging issue.

Approaches in this research topic:
First generate all the patterns and then filter out the uninteresting ones
Generate only the interesting patterns: mining query optimization

Data Mining: Confluence of Multiple Disciplines

[Figure: data mining at the confluence of multiple disciplines: database technology, statistics, machine learning, visualization, information science and other disciplines]

Data Mining: Classification Schemes


General functionality
Descriptive data mining
Predictive data mining

Different views, different classifications
Kinds of databases to be mined
Kinds of knowledge to be discovered
Kinds of techniques utilized
Kinds of applications adapted

A Multi-Dimensional View of Data Mining Classification


Databases to be mined
Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc.

Knowledge to be mined
Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels

Techniques utilized
Database-oriented, data warehouse (OLAP) oriented, machine learning, statistics, visualization, neural network, etc.

Applications adapted
Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.

Major Issues in Data Mining


The major issues in data mining include the following aspects
Mining methodology and user interaction
Performance and scalability
Issues related to the diversity of data types
Issues related to applications and social impacts

These will be briefly discussed in the following slides

Major Issues: Mining methodology and user interaction

Reflects the kinds of knowledge mined, the ability to mine knowledge at different granularities, the use of domain knowledge, and knowledge visualization

Mining different kinds of knowledge in databases for different users
Requires different data mining functionalities and algorithms

Interactive mining of knowledge at multiple levels of abstraction
Makes it possible to verify whether the discovered knowledge is relevant to the intended purpose

Incorporation of background knowledge
Background knowledge is vital in refining data mining performance and in evaluating the relevance of its results

Data mining query languages and ad-hoc data mining
Such query languages make it possible to retrieve knowledge from a data warehouse, just as SQL retrieves data from a database in a DBMS

Presentation and visualization of data mining results
Results should be presented visually or in a high-level natural language

Handling noise, incomplete data and exceptions
Pattern evaluation: the interestingness problem

Major Issues: Performance and scalability

This includes
Efficiency and scalability of data mining algorithms
Parallel, distributed and incremental mining methods for huge amounts of data

Major Issues: Issues related to the diversity of data types


This involves
Handling relational and complex types of data
Relational data needs efficient and effective data mining systems, as it is widely used
Complex data refers to data such as hypertext, multimedia, spatial, temporal and transactional data, which require further attention before they can be mined

Mining information from heterogeneous databases and global information systems (WWW)
Using these as a source of data is a major issue in data mining

Major Issues: Issues related to applications and social impacts

Data mining is also concerned with issues related to its applications and the impact it may have on society. These include
Identifying the applications of discovered knowledge, which can be
Domain-specific data mining tools
Intelligent query answering
Process control and decision making
How to integrate the discovered knowledge with existing knowledge: a knowledge fusion problem
Protection of data security, integrity, and privacy, which may have social impact

Summary
Data mining: discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
Classification of data mining systems
Major issues in data mining
