DMDW-Unit II
2.Introduction
Data mining is the process of discovering patterns in large data sets involving methods
at the intersection of machine learning, statistics, and database systems. Data mining is an
interdisciplinary subfield of computer science and statistics with an overall goal to extract
information (with intelligent methods) from a data set and transform the information into a
comprehensible structure for further use. Data mining is the analysis step of the "knowledge
discovery in databases" process or KDD.
1 CS Department MTNC
DATA MINING AND WAREHOUSING- 18UITE64
Data mining is one of the important steps in the knowledge discovery process, which
encompasses data selection, data cleaning and preprocessing, data transformation and
reduction, choice of the data mining algorithm, and finally post-processing and interpretation of the
discovered knowledge.
The KDD process tends to be highly iterative and interactive. Data mining analysis tends to
work up from the data, and the best techniques are those developed with large volumes of data
in mind.
Stages of KDD
i. Association Rules:
This data mining technique helps to find associations between two or more items. It
discovers hidden patterns in the data set. Association rules are created by searching data for
frequent if-then patterns and using the criteria support and confidence to identify the most
important relationships. Support indicates how frequently the items appear in the data.
Confidence indicates how often the if-then statement is found to be true. A third metric,
called lift, can be used to compare confidence with expected confidence.
E.g.: If customers who purchase computers also tend to buy antivirus software at the same time, then
placing the hardware display close to the software display may help increase the sales of both
items.
ii. Clustering:
Clustering analysis is a data mining technique for identifying data items that are similar to each
other. This process helps to understand the differences and similarities between the data.
iii. Classification:
Classification involves finding rules that partition the data into disjoint groups. This
analysis is used to retrieve important and relevant information about data and metadata. This data
mining method helps to classify data into different classes.
Classification Models: Decision Trees, Neural Networks, Genetic Algorithms and Statistical Models
Applications: Credit Card Analysis, Banking and Medical.
iv. Regression:
Regression analysis is the data mining method of identifying and analyzing the relationship
between variables. It is used to estimate the likely value of a specific variable, given the values of
other variables.
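As an illustration of estimating one variable from another, the following is a minimal sketch of simple linear regression by least squares. The function name `fit_line` and the spend/sales figures are hypothetical and are not taken from these notes.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Covariance-like and variance-like sums (unnormalized)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical data: advertising spend versus units sold
spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

slope, intercept = fit_line(spend, sales)
# Estimate sales for a spend value that was not observed
predicted = slope * 6 + intercept
print(slope, intercept, predicted)
```

The fitted line summarizes the relationship between the two variables and can then be used to estimate the dependent variable for new values of the independent one.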
v. Outlier detection:
This type of data mining technique refers to the observation of data items in the dataset
which do not match an expected pattern or expected behavior. This technique can be
used in a variety of domains, such as intrusion detection, fraud detection, or fault detection.
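One simple way to flag items that do not match the expected behavior is a z-score test. The sketch below is only one of many possible approaches; the function name `zscore_outliers`, the threshold of 2.0 and the transaction amounts are assumptions made for illustration.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts; one is far from the rest
amounts = [20, 22, 19, 21, 23, 20, 500]
print(zscore_outliers(amounts))
```

With very small samples a single extreme value also inflates the standard deviation, so the threshold has to be chosen with the sample size in mind.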
vi. Sequential Patterns:
This data mining technique helps to discover or identify similar patterns or trends in
transaction data over a certain period.
vii. Prediction:
Prediction uses a combination of other data mining techniques such as trend analysis,
sequential patterns, clustering and classification. It analyzes past events or instances in the right
sequence to predict a future event.
2.7 Other Mining Problems
Sequence Mining
Sequential pattern mining is the mining of frequently occurring ordered events or
subsequences as patterns. An example of a sequential pattern is that users who purchase a Canon
digital camera are likely to purchase an HP color printer within a month.
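The idea of a frequent ordered pattern can be sketched by counting how many customer purchase sequences contain a pattern in order. The helper names and the purchase data below are hypothetical, and the sketch ignores the time window ("within a month") for simplicity.

```python
def contains_in_order(sequence, pattern):
    """True if the items of `pattern` occur in `sequence` in the same order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so each match must come after the previous one
    return all(item in remaining for item in pattern)

def sequence_support(sequences, pattern):
    """Fraction of sequences that contain `pattern` as an ordered subsequence."""
    hits = sum(contains_in_order(s, pattern) for s in sequences)
    return hits / len(sequences)

# Hypothetical purchase histories, one list per customer, in time order
purchases = [
    ["camera", "memory card", "printer"],
    ["printer", "camera"],
    ["camera", "printer"],
    ["camera", "tripod"],
]

print(sequence_support(purchases, ["camera", "printer"]))
```

Note that the second customer bought both items but in the opposite order, so that sequence does not count toward the pattern.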
Web Mining
Web mining is the application of data mining techniques to automatically discover and
extract information from Web documents and services. The main purpose of web mining is
discovering useful information from the World Wide Web and its usage patterns.
Text Mining
Text mining is a component of data mining that deals specifically with unstructured text
data. It involves the use of natural language processing (NLP) techniques to extract useful
information and insights from large amounts of unstructured text data.
Spatial Data Mining
A spatial database stores a huge amount of space-related data, including maps, preprocessed
remote sensing or medical imaging records, and VLSI chip design data. Spatial data mining refers
to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly
stored in spatial databases. Such mining demands the unification of data mining with spatial
database technologies.
2.8 Issues and Challenges in DM
Data mining systems depend on databases to supply their raw input, and the characteristics of
those databases raise problems. The difficulties in data mining are classified into:
i. Limited Information
ii. Noise or Missing Data
iii. User Interaction and Prior Knowledge
iv. Uncertainty
v. Size, Updates and Relevant Fields
ii. Insurance
Data mining helps insurance companies to price their products profitably and promote new
offers to their new or existing customers.
iii. Education
Data mining helps educators to access student data, predict achievement levels and find
students or groups of students who need extra attention,
for example, students who are weak in mathematics.
iv. Manufacturing
With the help of data mining, manufacturers can predict wear and tear of production assets.
They can anticipate maintenance needs, which helps them minimize downtime.
v. Banking
Data mining helps the finance sector to get a view of market risks and manage regulatory
compliance. It helps banks to identify probable defaulters when deciding whether to issue credit
cards, loans, etc.
vi. Retail
Data mining techniques help retail malls and grocery stores identify their best-selling items and
arrange them in the most attention-grabbing positions. It helps store owners come up with offers
that encourage customers to increase their spending.
vii. Service Providers
Service providers such as mobile phone and utility companies use data mining to predict when
and why a customer is likely to leave. They analyze billing details, customer service
interactions and complaints made to the company to assign each customer a probability score and
offer incentives.
viii. E-Commerce
E-commerce websites use data mining to offer cross-sells and up-sells through their
websites. One of the most famous names is Amazon, which uses data mining techniques to get more
customers into its e-commerce store.
ix. Super Markets
Data mining allows supermarkets to develop rules to predict whether their shoppers are likely to
be expecting. By evaluating buying patterns, they can find women customers who are most
likely pregnant and start targeting products like baby powder, baby soap, diapers and so on.
x. Crime Investigation
Data mining helps crime investigation agencies decide how to deploy the police workforce (where
is a crime most likely to happen, and when?), whom to search at a border crossing, etc.
xi. Bioinformatics
Data Mining helps to mine biological data from massive datasets gathered in biology and
medicine.
2.12 Association Rules
Introduction:
Association rule learning is a rule-based machine learning method for discovering
interesting relations between variables in large databases. It is intended to identify strong rules
discovered in databases using some measures of interestingness.
What is an association rule:
Association rules are if-then statements that help to show the probability of relationships
between data items within large data sets in various types of databases. Association rule mining has
a number of applications and is widely used to help discover sales correlations in transactional data
or in medical data sets.
Association rules are created by searching data for frequent if-then patterns and using the
criteria support and confidence to identify the most important relationships. Support is an
indication of how frequently the items appear in the data. Confidence indicates how often
the if-then statement is found to be true, that is, the fraction of transactions containing the "if"
part that also contain the "then" part. A third metric, called lift, can be used to compare
confidence with expected confidence.
Example:
Market Basket Analysis:
This process analyzes customer buying habits by finding associations between the different
items that customers place in their shopping baskets. The discovery of such associations can help
retailers develop marketing strategies by gaining insight into which items are frequently purchased
together by customers. For instance, if customers are buying milk, how likely are they to also buy
bread (and what kind of bread) on the same trip to the supermarket? Such information can lead to
increased sales by helping retailers do selective marketing and plan their shelf space.
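The milk-and-bread question can be made concrete by computing support, confidence and lift over a handful of baskets. This is a minimal sketch: the function names and the five baskets are hypothetical, and lift is taken here as confidence divided by the support of the "then" part, which is the usual reading of "expected confidence".

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the 'if' part, the fraction also containing the 'then' part."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence relative to how often the 'then' part appears anyway."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

# Hypothetical shopping baskets
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk"},
    {"bread"},
    {"milk", "bread", "eggs"},
]

print(support(baskets, {"milk", "bread"}))       # 3 of 5 baskets
print(confidence(baskets, {"milk"}, {"bread"}))  # about 0.75
print(lift(baskets, {"milk"}, {"bread"}))        # about 0.94
```

A lift below 1, as in this made-up data, means milk buyers are slightly less likely to buy bread than shoppers in general; a lift above 1 would indicate a positive association.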
Apriori Algorithm
Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to
explore (k+1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for
each item, and collecting those items that satisfy minimum support. The resulting set is denoted L1.
Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on,
until no more frequent k-itemsets can be found. Finding each Lk requires one full scan of the
database. Each level follows a two-step process consisting of candidate generation and
pruning.
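The level-wise search described above can be sketched as follows. This is an illustrative sketch, not a reference implementation: the function name `apriori`, the use of an absolute support count as the threshold, and the sample baskets are assumptions, and candidates are generated by a simple union of frequent k-itemsets rather than an optimized join.

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Return {itemset: count} for every itemset meeting the minimum support count."""
    transactions = [frozenset(t) for t in transactions]

    def count_frequent(candidates):
        # One full scan of the database for this level
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support_count}

    items = {item for t in transactions for item in t}
    level = count_frequent({frozenset([i]) for i in items})  # L1
    frequent = dict(level)
    k = 1
    while level:
        prev = set(level)
        # Candidate generation: combine frequent k-itemsets into (k+1)-itemsets
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Pruning: drop candidates that have an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        level = count_frequent(candidates)  # L(k+1)
        frequent.update(level)
        k += 1
    return frequent

# Hypothetical transactions
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "butter", "eggs"},
]
result = apriori(baskets, min_support_count=3)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

In this made-up data the candidate {milk, bread, butter} is pruned without being counted, because its subset {milk, butter} is not frequent; this is the saving that the pruning step provides.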
Example
Example:
Example:
PART B
PART C