DMDW-Unit II

Data mining is the process of analyzing large datasets to discover patterns and establish relationships. It involves techniques from machine learning, statistics, and database systems. The overall goal is to extract useful information from data to transform it into an understandable structure. Data mining is a key step in the knowledge discovery process. It allows organizations to predict trends and identify patterns in large amounts of data. Some common techniques include association rule learning, clustering, classification, regression, and prediction. Data mining has various applications in domains like communications, insurance, education, and manufacturing to gain insights, predict customer behavior, and identify at-risk groups.

Uploaded by

Devika G

DATA MINING AND WAREHOUSING- 18UITE64

2. Introduction
Data mining is the process of discovering patterns in large data sets involving methods
at the intersection of machine learning, statistics, and database systems. Data mining is an
interdisciplinary subfield of computer science and statistics with an overall goal to extract
information (with intelligent methods) from a data set and transform the information into a
comprehensible structure for further use. Data mining is the analysis step of the "knowledge
discovery in databases" process or KDD.

2.1 What Is Data Mining?


Data mining refers to extracting or mining knowledge from large amounts of data. The term
is actually a misnomer: data mining would have been more appropriately named
"knowledge mining from data", which emphasizes mining knowledge from large amounts of data.
The key properties of data mining are:
• Automatic discovery of patterns
• Prediction of likely outcomes
• Creation of actionable information
• Focus on large datasets and databases

2.2 Data mining definition


Data mining is the process of sorting through large data sets to identify patterns and
establish relationships to solve problems through data analysis. Data mining tools allow enterprises
to predict future trends.

1 CS Department MTNC

2.3 KDD vs. Data Mining


KDD refers to the overall process of discovering useful knowledge from data. It involves
the evaluation and possibly the interpretation of patterns to decide what qualifies as
knowledge. It also includes the choice of encoding schemes, preprocessing, sampling, and
projections of the data prior to the data mining step. Data mining refers to the application of
algorithms for extracting patterns from data, without the additional steps of the KDD process.

Data mining is thus one of the important steps of the knowledge discovery process, which
encompasses data selection, data cleaning and preprocessing, data transformation and
reduction, choice of the data mining algorithm, and finally post-processing and interpretation
of the discovered knowledge.
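
The steps listed above can be pictured as a chain of small functions. The following is only a
minimal sketch under stated assumptions: the record fields (customer, item, store), the
function names and the toy "mining" step (a plain frequency count) are all hypothetical and
are not taken from the course material.

```python
from collections import Counter

def select(records):
    # Data selection: keep only the fields relevant to the task (assumed fields).
    return [{"customer": r.get("customer"), "item": r.get("item")} for r in records]

def clean(records):
    # Data cleaning: drop records with missing values.
    return [r for r in records if r["customer"] and r["item"]]

def transform(records):
    # Data transformation: normalize the item names.
    return [{**r, "item": r["item"].strip().lower()} for r in records]

def mine(records):
    # Data mining step: here simply a frequency count of items.
    return Counter(r["item"] for r in records)

def interpret(patterns, min_count=2):
    # Post-processing: keep only the patterns judged interesting.
    return {item: n for item, n in patterns.items() if n >= min_count}

raw = [
    {"customer": "c1", "item": " Milk ", "store": "A"},
    {"customer": "c2", "item": "milk", "store": "B"},
    {"customer": None, "item": "bread", "store": "A"},
    {"customer": "c3", "item": "Bread", "store": "A"},
]
knowledge = interpret(mine(transform(clean(select(raw)))))
print(knowledge)
```

In practice the chain is rarely run once from left to right; results from a later stage often
send the analyst back to an earlier one.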

The KDD process tends to be highly iterative and interactive. Data mining analysis tends to
work up from the data, and the best techniques are developed with an orientation towards large
volumes of data.


Stages of KDD


2.4 DBMS vs. DM


A DBMS is a full-fledged system for housing and managing a set of digital databases.
Data mining, in contrast, is a technique or concept in computer science that deals with extracting
useful and previously unknown information from raw data. Most of the time, this raw data is
stored in very large databases. Data miners therefore use the existing functionality of a DBMS to
handle, manage and even preprocess raw data before and during the data mining process.
A DBMS alone cannot be used to analyze data, although some DBMSs now have
built-in data analysis tools or capabilities.
2.5 Other Related Areas
• Statistics
• Machine Learning
• Supervised Learning
• Unsupervised Learning
• Mathematical Processing
2.6 DM Techniques
The two fundamental goals of data mining are prediction and description. DM techniques are
classified into:
A. User-guided or verification-driven data mining
B. Discovery-driven data mining, or automatic discovery of rules
A. Verification Model
In this model of data mining, the user makes a hypothesis and tests it on the
data to verify its validity.
E.g., a supermarket with a limited budget for a mailing campaign to launch a new product
hypothesizes which customers are likely to respond and tests that hypothesis on the data.
B. Discovery Model
The system automatically discovers important information hidden in the data.
E.g., a supermarket lets the system discover the particular group of customers that the
mailing campaign should target.


i. Association Rules:
This data mining technique helps to find associations between two or more items. It
discovers hidden patterns in the data set. Association rules are created by searching data for
frequent if-then patterns and using the criteria support and confidence to identify the most
important relationships. Support indicates how frequently the items appear in the data.
Confidence indicates how often the if-then statement is found to be true. A third metric,
called lift, can be used to compare confidence with expected confidence.
Eg: If customers who purchase computers also tend to buy antivirus software at the same time, then
placing the hardware display close to the software display may help increase the sales of both
items.
ii. Clustering:
Clustering analysis is a data mining technique that groups data items that are similar to each
other. This process helps to understand the differences and similarities between the data.
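
As an illustration of grouping similar data items, here is a minimal sketch of k-means on
one-dimensional data. K-means is not named in these notes; it is chosen here only as a
well-known clustering method, and the spending figures and starting centroids are made up.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Group numbers around the nearest centroid (a tiny k-means sketch)."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put every value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical monthly spending of six customers.
spend = [12, 15, 14, 80, 85, 90]
centroids, clusters = kmeans_1d(spend, centroids=[0.0, 100.0])
print(clusters)  # [[12, 15, 14], [80, 85, 90]]
```

The two resulting groups separate low spenders from high spenders without any labels being
supplied in advance.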
iii. Classification:
Classification involves finding rules that partition the data into disjoint groups. This
analysis is used to retrieve important and relevant information about data and metadata, and
helps to assign data to different classes.
Classification Models: Decision Trees, Neural Networks, Genetic Algorithms and Statistical Models
Applications: Credit Card Analysis, Banking and Medical.
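
A rule that partitions records into disjoint groups can be as simple as a single threshold.
The sketch below fits a one-level decision tree (a "decision stump") on one numeric feature.
It is an illustrative assumption, not the method prescribed by these notes, and the income and
decision values are invented.

```python
def fit_stump(values, labels):
    """Find the threshold on one numeric feature that misclassifies the fewest records.

    Records with value <= threshold get `low_label`, all others get `high_label`.
    """
    best = None
    classes = sorted(set(labels))
    for threshold in sorted(set(values)):
        for low_label in classes:
            for high_label in classes:
                errors = sum(
                    (low_label if v <= threshold else high_label) != y
                    for v, y in zip(values, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, low_label, high_label)
    return best[1:]

# Hypothetical credit card applications: income (in thousands) and past decision.
incomes = [15, 22, 28, 35, 48, 60]
decisions = ["reject", "reject", "reject", "approve", "approve", "approve"]
threshold, low, high = fit_stump(incomes, decisions)
print(threshold, low, high)  # 28 reject approve
```

The learned rule reads "if income <= 28 then reject, else approve", which splits the records
into two disjoint classes.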
iv. Regression:
Regression analysis is the data mining method of identifying and analyzing the relationship
between variables. It is used to estimate the likely value of a specific variable, given the
values of other variables.
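
For two variables, the relationship can be estimated with an ordinary least-squares line. The
following is a minimal sketch assuming a simple linear relationship; the variable names and
numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend and resulting sales.
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(ad_spend, sales)
predicted = slope * 5.0 + intercept  # estimate sales for an unseen spend of 5.0
print(slope, intercept, predicted)
```

Here the fitted line is y = 2x + 1, so the estimate for a spend of 5.0 is 11.0.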
v. Outlier detection:
This type of data mining technique refers to the observation of data items in the dataset
which do not match an expected pattern or expected behavior. This technique can be
used in a variety of domains, such as intrusion detection, fraud detection, or fault detection.
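
One simple way to flag items that do not match the expected behavior is a z-score test. This is
only a sketch under an assumption of roughly bell-shaped data; the threshold of 2.0 and the
transaction amounts are hypothetical.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts; one is far from the rest.
amounts = [20, 22, 19, 21, 20, 23, 18, 500]
outliers = find_outliers(amounts)
print(outliers)  # [500]
```

A single extreme value inflates both the mean and the standard deviation, so more robust
measures (for example the median) are often preferred on real data.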
vi. Sequential Patterns:
This data mining technique helps to discover or identify similar patterns or trends in
transaction data over a certain period.


vii. Prediction:
Prediction uses a combination of the other data mining techniques, such as trends,
sequential patterns, clustering, and classification. It analyzes past events or instances in the
right sequence to predict a future event.
2.7 Other Mining Problems
Sequence Mining
Sequential pattern mining is the mining of frequently occurring ordered events or
subsequences as patterns. An example of a sequential pattern is that users who purchase a Canon
digital camera are likely to purchase an HP color printer within a month.
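
The camera-then-printer pattern can be checked by measuring how many customers bought the
first item and then the second within a time window. The sketch below is an assumption-laden
illustration: the purchase log, the item names and the 30-day window are all invented.

```python
from datetime import date

def sequence_support(purchases, first, then, max_days=30):
    """Fraction of customers who bought `first` and later `then` within `max_days`."""
    matches = 0
    for events in purchases.values():
        firsts = [d for d, item in events if item == first]
        thens = [d for d, item in events if item == then]
        # The order matters: `then` must come strictly after `first`.
        if any(0 < (t - f).days <= max_days for f in firsts for t in thens):
            matches += 1
    return matches / len(purchases)

# Hypothetical purchase log: customer -> list of (date, item).
purchases = {
    "c1": [(date(2024, 1, 5), "camera"), (date(2024, 1, 20), "printer")],
    "c2": [(date(2024, 2, 1), "camera"), (date(2024, 4, 15), "printer")],
    "c3": [(date(2024, 3, 3), "printer"), (date(2024, 3, 10), "camera")],
    "c4": [(date(2024, 3, 1), "camera"), (date(2024, 3, 25), "printer")],
}
support = sequence_support(purchases, "camera", "printer")
print(support)  # 0.5
```

Customer c2 is excluded because the printer came too late, and c3 because the order of the two
purchases is reversed.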
Web Mining
Web mining is the application of data mining techniques to automatically discover and
extract information from Web documents and services. The main purpose of web mining is
to discover useful information from the World Wide Web and its usage patterns.
Text Mining
Text mining is a component of data mining that deals specifically with unstructured text
data. It involves the use of natural language processing (NLP) techniques to extract useful
information and insights from large amounts of unstructured text data.
Spatial Data Mining
A spatial database stores a large amount of space-related data, including maps, preprocessed
remote sensing or medical imaging records, and VLSI chip design data. Spatial data mining refers
to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly
stored in spatial databases. Such mining demands the unification of data mining with spatial
database technologies.
2.8 Issues and Challenges in DM
Data mining systems depend on databases to supply their raw input, and this dependence raises
problems. The difficulties in data mining are classified into:
i. Limited Information
ii. Noise or Missing Data
iii. User Interaction and Prior Knowledge
iv. Uncertainty
v. Size, Updates and Relevant Fields


2.10 DM Application Areas


• Business and E-Commerce Data
• Scientific, Engineering and Healthcare Data
2.11 Data Mining Applications-Case Studies
i. Communications
Data mining techniques are used in the communications sector to predict customer behavior and
offer highly targeted and relevant campaigns.


ii. Insurance
Data mining helps insurance companies price their products profitably and promote new
offers to new or existing customers.
iii. Education
Data mining helps educators access student data, predict achievement levels and find
students or groups of students who need extra attention, for example students who are weak in
mathematics.
iv. Manufacturing
With the help of data mining, manufacturers can predict wear and tear of production assets.
They can anticipate maintenance, which helps them minimize downtime.
v. Banking
Data mining helps the finance sector get a view of market risks and manage regulatory
compliance. It helps banks identify probable defaulters when deciding whether to issue credit
cards, loans, etc.
vi. Retail
Data mining techniques help retail malls and grocery stores identify their best-selling
items and arrange them in the most prominent positions. It helps store owners come up with
offers that encourage customers to increase their spending.
vii. Service Providers
Service providers such as mobile phone and utility companies use data mining to predict
why and when a customer is likely to leave. They analyze billing details, customer service
interactions and complaints made to the company to assign each customer a probability score,
and offer incentives accordingly.
viii. E-Commerce
E-commerce websites use data mining to offer cross-sells and up-sells through their
websites. One of the most famous names is Amazon, which uses data mining techniques to attract
more customers to its e-commerce store.
ix. Super Markets
Data mining allows supermarkets to develop rules that predict whether their shoppers are
likely to be expecting. By evaluating buying patterns, they can find women customers who are most
likely pregnant and start targeting them with products like baby powder, baby soap, diapers and so on.
x. Crime Investigation
Data Mining helps crime investigation agencies to deploy police workforce (where is a
crime most likely to happen and when?), who to search at a border crossing etc.
xi. Bioinformatics
Data Mining helps to mine biological data from massive datasets gathered in biology and
medicine.
2.12 Association Rules
Introduction:
Association rule learning is a rule-based machine learning method for discovering
interesting relations between variables in large databases. It is intended to identify strong rules
discovered in databases using some measures of interestingness.
What is an association rule:
Association rules are if-then statements that help to show the probability of relationships
between data items within large data sets in various types of databases. Association rule mining has
a number of applications and is widely used to help discover sales correlations in transactional data
or in medical data sets.
Association rules are created by searching data for frequent if-then patterns and using the
criteria support and confidence to identify the most important relationships. Support is an
indication of how frequently the items appear in the data. Confidence indicates how often
the if-then statements are found to be true. A third metric, called lift, can be used to compare
confidence with expected confidence.
Example:
Market Basket Analysis:
This process analyzes customer buying habits by finding associations between the different
items that customers place in their shopping baskets. The discovery of such associations can help
retailers develop marketing strategies by gaining insight into which items are frequently purchased
together by customers. For instance, if customers are buying milk, how likely are they to also buy
bread (and what kind of bread) on the same trip to the supermarket? Such information can lead to
increased sales by helping retailers do selective marketing and plan their shelf space.
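
The milk and bread question can be answered numerically with support, confidence and lift.
The sketch below uses the usual definitions (support as the fraction of baskets containing an
itemset, confidence as support of both sides divided by support of the "if" side, lift as
confidence divided by support of the "then" side). The five baskets are invented for
illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """How often the rule antecedent -> consequent holds when the antecedent is present."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence relative to the confidence expected if both sides were independent."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

# Hypothetical shopping baskets.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "eggs"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
]
print(support(baskets, {"milk", "bread"}))       # 0.6
print(confidence(baskets, {"milk"}, {"bread"}))  # about 0.75
print(lift(baskets, {"milk"}, {"bread"}))        # about 0.94
```

A lift below 1, as in this toy data, suggests that buying milk does not make bread more likely
than it already is overall; a lift above 1 would indicate a positive association.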


2.13 Apriori Algorithm


Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining
frequent itemsets for Boolean association rules. The name of the algorithm is based on the fact that
the algorithm uses prior knowledge of frequent itemset properties. Apriori employs an iterative
approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets. First,
the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each
item, and collecting those items that satisfy minimum support. The resulting set is denoted L1.
Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on,
until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the
database. A two-step process is followed in Apriori consisting of join and prune action.
Apriori Property
All non-empty subsets of a frequent itemset must also be frequent. The key concept of the
Apriori algorithm is the anti-monotonicity of the support measure: if an itemset is infrequent,
all of its supersets are infrequent as well.


Apriori Algorithm
In each pass of the level-wise search, Apriori follows a two-step process consisting of candidate
generation and pruning: candidate (k+1)-itemsets are generated by joining the frequent k-itemsets,
and candidates that have an infrequent subset are pruned before their support is counted in the
next database scan.
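
The level-wise search with its join and prune steps can be sketched as follows. This is a
compact illustration written for these notes, not the original pseudocode of Agrawal and
Srikant; the function name, the use of an absolute support count and the five sample
transactions are all assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One full database scan: count each candidate, keep the frequent ones.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support_count}

    # L1: the frequent 1-itemsets.
    current = count({frozenset([i]) for t in transactions for i in t})
    frequent = dict(current)
    k = 1
    while current:
        prev = set(current)
        # Join step: unite pairs of frequent k-itemsets into (k+1)-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune step: drop any candidate with an infrequent k-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k))
        }
        current = count(candidates)
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical transaction database.
db = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(db, min_support_count=3)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With a minimum support count of 3, this toy database yields four frequent 1-itemsets and four
frequent 2-itemsets. The candidates {bread, diapers, beer} and {milk, diapers, beer} are pruned
without a scan because {bread, beer} and {milk, beer} are infrequent, and the only surviving
3-itemset candidate, {bread, milk, diapers}, occurs in just two transactions.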


Example


2.14 Partition Algorithm


The simplest and most fundamental version of cluster analysis is partitioning, which
organizes the objects of a set into several exclusive groups or clusters. To keep the problem
specification concise, we can assume that the number of clusters is given as background
knowledge. This parameter is the starting point for partitioning methods.
The pincer search algorithm attempts to find the
frequent item sets in a bottom-up manner but, at the same time, maintains a list of maximal
frequent item sets. While making a database pass, it also counts the support of these candidate
maximal frequent item sets to see if any one of them is actually frequent.
In that event, it can conclude that all the subsets of these frequent
sets are going to be frequent and, hence, they are not verified for the support count in the next pass.
If we are lucky, we may discover a very large maximal frequent item set very early in the
algorithm. If this set subsumes all the candidate sets of level k, then we need not proceed further
and thus we save many database passes. Clearly, pincer search has an advantage over the Apriori
algorithm when the largest frequent item set is long.
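
The notion of a maximal frequent item set used above can be made concrete with a few lines of
code. This sketch does not implement pincer search itself; it only shows, on an invented list
of frequent item sets, which of them are maximal (have no frequent proper superset).

```python
def maximal_itemsets(frequent_itemsets):
    """Keep only the itemsets that are not a proper subset of another frequent itemset."""
    sets = [frozenset(s) for s in frequent_itemsets]
    return [s for s in sets if not any(s < other for other in sets)]

# Hypothetical frequent item sets found by some earlier mining step.
frequent_sets = [
    {"a"}, {"b"}, {"c"}, {"d"},
    {"a", "b"}, {"a", "c"}, {"b", "c"},
    {"a", "b", "c"},
]
maximal = maximal_itemsets(frequent_sets)
print([sorted(s) for s in maximal])  # [['d'], ['a', 'b', 'c']]
```

Because every subset of a frequent item set is frequent, the two maximal sets here describe all
eight frequent item sets, which is why finding a long maximal set early can save database passes.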


2.15 Pincer Search Algorithm


Pincer search is an efficient algorithm for discovering the maximum frequent set. Discovering
frequent itemsets is a key problem in important data mining applications, such as the discovery of
association rules, strong rules, episodes, and minimal keys.
Algorithm


Example:


2.16 Border Algorithm.


Example:


PART B

1. What is data mining? Compare DBMS vs. DM.
2. Describe the stages of KDD.
3. Brief about issues and challenges in data mining.
4. Discuss the applications of data mining.
5. Write the Partition Algorithm and explain.
6. Make a note on the Border Algorithm.

PART C

1. Elaborate on data mining techniques.
2. Explain the Apriori Algorithm with an example.
3. Illustrate the Pincer Search Algorithm with an example.
