Unit 3 DW&DM Notes Mr. Rohit Pratap Singh

Data Mining Definition

1. Data Mining is defined as extracting information from huge sets of data.


2. Data mining is the procedure of mining knowledge from data.

The primary goal of data mining is to discover hidden patterns, predict future trends, and
support more informed business decisions. Data mining is also called Knowledge Discovery in
Databases (KDD).
There is a vast amount of information available on various platforms, but comparatively little
of it is turned into usable knowledge.

Technologies Used in Data Mining

KDD Process
KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful,
previously unknown, and potentially valuable information from large datasets.
The focus is on the discovery of useful knowledge, rather than simply finding patterns in data.
The main steps involved are:
1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge representation and visualization

i. Data cleaning (remove noise and inconsistent data)
ii. Data integration (multiple data sources may be combined)
iii. Data selection (data relevant to the analysis task are retrieved from the
database)
iv. Data transformation (data are transformed or consolidated into forms
appropriate for mining; typically done as part of data preprocessing)
v. Data mining (an essential process where intelligent methods are
applied to extract data patterns)
vi. Pattern evaluation (identify the truly interesting patterns)
vii. Knowledge presentation (mined knowledge is presented to the
user with visualization or representation techniques)

Advantages of KDD
1. Improved decision-making
2. Increased efficiency
3. Better customer service
4. Fraud detection

Disadvantages of KDD
1. Privacy concerns
2. Complexity
3. Data Quality
4. High cost.
Data Mining Applications
➢ Financial Data Analysis
➢ Retail Industry
➢ Telecommunication Industry
➢ Biological Data Analysis
➢ Other Scientific Applications
➢ Intrusion Detection

Major Issues in data mining:


The major issues can be divided into five groups:
1. Mining Methodology
2. User Interaction
3. Efficiency and Scalability
4. Diverse Data Types
5. Data Mining and Society

Data Preprocessing in Data Mining


Data preprocessing is an important step in the data mining process. Its purpose is to
improve the quality of the data and to make it more suitable for the specific data mining task.
Major Tasks in Data Preprocessing
There are 4 major tasks in data preprocessing –
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation

Data Preprocessing Techniques


1. Data cleaning can be applied to remove noise and correct inconsistencies in the data.
2. Data integration merges data from multiple sources into a coherent data store, such as
a data warehouse.
3. Data reduction can reduce the data size by aggregating, eliminating redundant
features, or clustering, for instance. These techniques are not mutually exclusive;
they may work together.
4. Data transformations, such as normalization, may be applied.
1. Data Cleaning

Data cleaning is the process of removing or correcting incorrect, incomplete, and inaccurate
data in a dataset; it also fills in missing values. Here are some techniques
for data cleaning:

Handling Missing Values

• Standard values like “Not Available” or “NA” can be used to replace the missing
values.

• Missing values can also be filled in manually, but this is not recommended when the
dataset is large.

• The attribute’s mean value can be used to replace a missing value when the data are
approximately normally distributed; for non-normal (skewed) distributions, the
attribute’s median is preferred.

• Regression or decision tree models can be used to estimate the most probable value
for a missing entry. (A minimal code sketch of these strategies follows this list.)
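A minimal sketch of these imputation strategies, assuming pandas is available; the DataFrame, its columns ("age", "city"), and the values are illustrative assumptions:

```python
# Sketch of common missing-value strategies on a hypothetical DataFrame.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":  [25, np.nan, 31, 40, np.nan, 29],
    "city": ["Delhi", "Pune", None, "Delhi", "Pune", None],
})

# Standard placeholder value for a missing categorical entry
df["city"] = df["city"].fillna("Not Available")

# Mean imputation (suitable when the attribute is roughly normally distributed)
df["age_mean_filled"] = df["age"].fillna(df["age"].mean())

# Median imputation (preferred for skewed / non-normal distributions)
df["age_median_filled"] = df["age"].fillna(df["age"].median())

print(df)
```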
Handling Noisy Data

Noise generally means random error or unnecessary data points. Handling
noisy data is one of the most important steps, as it helps optimize the model
we are using. Here are some of the methods to handle noisy data (a small smoothing
sketch follows this list):

• Binning: This method is used to smooth noisy data. First, the data are sorted, and
then the sorted values are divided and stored in bins. There are three methods for
smoothing the data in a bin. Smoothing by bin mean: the values in the bin are
replaced by the mean value of the bin. Smoothing by bin median: the values in the
bin are replaced by the median value of the bin. Smoothing by bin boundary: the
minimum and maximum values of the bin are taken as boundaries, and each value
is replaced by the closest boundary value.

• Regression: This is used to smooth the data and helps to handle data when
unnecessary variables are present. Regression also helps to decide which
variables are suitable for the analysis.

• Clustering: This is used for finding outliers and also for grouping the data.
Clustering is generally used in unsupervised learning.

• Combined computer and human inspection: Outliers can also be
identified with a combination of automated detection and human inspection. Detected
patterns may be genuinely interesting or simply garbage; patterns with a high
surprise value can be output to a list for a human to review.
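A minimal sketch of smoothing by bin mean and by bin boundaries using equal-frequency bins; the values and the bin size of 3 are illustrative assumptions:

```python
# Equal-frequency binning with smoothing by bin mean and by bin boundaries.
data = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted
bin_size = 3
bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

# Smoothing by bin mean: every value in a bin becomes the bin's mean
smoothed_by_mean = [[round(sum(b) / len(b), 2)] * len(b) for b in bins]

def smooth_by_boundary(b):
    lo, hi = min(b), max(b)
    # each value is replaced by whichever boundary (min or max) it is closer to
    return [lo if (v - lo) <= (hi - v) else hi for v in b]

smoothed_by_boundary = [smooth_by_boundary(b) for b in bins]

print("bins:             ", bins)
print("smoothed (mean):  ", smoothed_by_mean)
print("smoothed (bounds):", smoothed_by_boundary)
```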
2. Data Integration

Data integration is the process of combining data from multiple sources into a single dataset.
The data integration process is one of the main components of data management. There are
some problems to be considered during data integration.

• Schema integration: Integrates metadata (a set of data that describes other data)
from different sources.

• Entity identification problem: Identifying the same real-world entity across multiple
databases. For example, the system or the user should recognize that the student id
in one database and the student name in another database belong to the same entity.

• Detecting and resolving data value conflicts: The values taken from different
databases may differ when merged. The attribute values in one database may
differ from those in another database. For example, the date format may differ,
such as “MM/DD/YYYY” versus “DD/MM/YYYY”.
3. Data Reduction

This process helps in the reduction of the volume of the data, which makes the analysis
easier yet produces the same or almost the same result. This reduction also helps to reduce
storage space. Some of the data reduction techniques are dimensionality reduction,
numerosity reduction, and data compression.

• Dimensionality reduction: This process is necessary for real-world applications,
where the data size is large. In this process, the number of random variables or
attributes is reduced so that the dimensionality of the data set becomes smaller,
by combining or merging attributes without losing the original characteristics of
the data. This also reduces storage space and computation time. When the data
are highly dimensional, a problem called the “curse of dimensionality” occurs.
(See the sketch after this list.)

• Numerosity reduction: In this method, the representation of the data is made
smaller by reducing the volume, for example with parametric models, histograms,
or sampling, with little or no loss of information.

• Data compression: The data are encoded into a compressed representation. This
compression can be lossless or lossy. When there is no loss of information during
compression, it is called lossless compression; lossy compression discards some
information, ideally only information that is not essential.

• Data Cube Aggregation: Data cube aggregation involves summarizing the data in
a data cube by aggregating data across one or more dimensions. This technique is
useful when analyzing large datasets with many dimensions, as it can help reduce
the size of the data by collapsing it into a smaller number of dimensions.
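A minimal dimensionality-reduction sketch using PCA from scikit-learn; the random data, the 10 original attributes, and the choice of 3 components are illustrative assumptions:

```python
# Dimensionality reduction sketch: project 10 attributes onto 3 principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))        # illustrative data: 100 samples, 10 attributes

pca = PCA(n_components=3)             # keep 3 components
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)                       # (100, 10) -> (100, 3)
print("variance explained:", pca.explained_variance_ratio_.sum())
```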
4. Data Transformation

The change made in the format or the structure of the data is called data transformation.
This step can be simple or complex based on the requirements.

Some methods for data transformation are listed below.

• Smoothing: With the help of algorithms, we can remove noise from the dataset,
which helps in identifying the important features of the dataset. Smoothing also
makes it easier to detect even small changes that help in prediction.

• Aggregation: In this method, the data are stored and presented in the form of a
summary. Data collected from multiple sources are integrated and presented together
with a description of the analysis. This is an important step, since the accuracy of the
results depends on the quantity and quality of the data: when both are good, the
results are more relevant.

• Discretization: Continuous data are split into intervals. Discretization
reduces the data size. For example, rather than specifying the exact class time, we
can use an interval such as (3 pm-5 pm) or (6 pm-8 pm).

• Normalization: This is the method of scaling the data so that it can be represented
in a smaller range, for example from -1.0 to 1.0.
There are two main methods for data normalization (a sketch of both follows):

min-max normalization

z-score normalization
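A minimal sketch of min-max and z-score normalization on an illustrative column of values; the data and the target range [0, 1] are assumptions for the example:

```python
# Min-max and z-score normalization sketch (illustrative values).
import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization: rescale into the new range [new_min, new_max]
new_min, new_max = 0.0, 1.0
x_minmax = (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

# Z-score normalization: (value - mean) / standard deviation
x_zscore = (x - x.mean()) / x.std()

print("min-max:", x_minmax)
print("z-score:", np.round(x_zscore, 3))
```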
Functionalities of Data Mining
Data mining functionalities are used to represent the types of patterns that have to be
discovered in data mining tasks. Data mining tasks can be classified into two types:
descriptive and predictive. Descriptive mining tasks characterize the general properties of
the data in the database, while predictive mining tasks perform inference on the current
data in order to make predictions.

Data mining is extensively used in many areas or sectors. It is used to predict and
characterize data. The ultimate objective of these functionalities is to identify the kinds of
patterns and trends present in the data. The main data mining functionalities offered by
organized and scientific methods are:

1. Class/Concept Descriptions

A class or concept is defined by a data set or a set of features. A class can be a
category of items on a shop floor, while a concept can be the abstract idea by which data
are categorized, such as products to be put on clearance sale versus non-sale products.
There are two kinds of description here: one that helps with grouping and the other
that helps in differentiating.

o Data Characterization: This refers to the summary of the general characteristics or
features of a class, resulting in specific rules that define a target class. A data analysis
technique called attribute-oriented induction is employed on the data set to achieve
characterization.
o Data Discrimination: Discrimination is used to separate distinct data sets based on
the disparity in attribute values. It compares the features of a class with the features
of one or more contrasting classes. The results can be presented as, e.g., bar charts,
curves, and pie charts.
2. Mining Frequent Patterns

One of the functions of data mining is finding data patterns. Frequent patterns are things that
are discovered to be most common in data. Various types of frequency can be found in the
dataset.

o Frequent item set: This term refers to a group of items that are commonly found
together, such as milk and sugar.
o Frequent substructure: This refers to structural forms, such as trees and graphs, that
occur frequently, possibly in combination with item sets or subsequences.
o Frequent subsequence: A pattern that occurs regularly as a sequence, such as buying
a phone followed by a cover.
3. Association Analysis

It analyzes the sets of items that generally occur together in a transactional dataset. It is also
known as market basket analysis because of its wide use in retail sales. Two parameters are
used for determining the association rules (a small computation sketch follows this list):

o Support identifies how frequently the item set appears in the database.

o Confidence is the conditional probability that an item occurs given that another item
occurs in a transaction.
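A minimal sketch of computing support and confidence for a candidate rule; the transactions, items, and the rule {milk} -> {bread} are illustrative assumptions:

```python
# Support and confidence for an association rule, e.g. {milk} -> {bread}.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

antecedent, consequent = {"milk"}, {"bread"}
sup_rule = support(antecedent | consequent, transactions)
confidence = sup_rule / support(antecedent, transactions)

print(f"support  = {sup_rule:.2f}")     # 3/5 = 0.60
print(f"confidence = {confidence:.2f}") # 3/4 = 0.75
```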
4. Classification

Classification is a data mining technique that categorizes items in a collection based on some
predefined properties. It uses methods like if-then rules, decision trees, or neural networks to
predict a class, i.e., to classify a collection of items. A training set containing items
whose classes are known is used to train the system to predict the category of items in
an unlabeled collection.
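A minimal classification sketch using a decision tree from scikit-learn; the features (age, income), the labels, and the training set are illustrative assumptions:

```python
# Classification sketch: train a decision tree on a labeled training set,
# then predict the class of unseen items.
from sklearn.tree import DecisionTreeClassifier

# Illustrative training set: [age, income] -> class (1 = buys, 0 = does not buy)
X_train = [[22, 20000], [35, 60000], [45, 80000], [28, 30000], [52, 90000]]
y_train = [0, 1, 1, 0, 1]

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Predict the class of items whose label is unknown
print(clf.predict([[30, 40000], [50, 85000]]))
```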

5. Prediction

Prediction estimates unavailable data values or future trends, such as spending trends. The
value for an object can be predicted based on the attribute values of the object and the
attribute values of the classes. It can be a prediction of missing numerical values or of
increasing or decreasing trends in time-related data. There are primarily two types of
predictions in data mining: numeric and class predictions.

o Numeric predictions are made by creating a linear regression model that is based on
historical data. Prediction of numeric values helps businesses ramp up for a future
event that might impact the business positively or negatively.
o Class predictions are used to fill in missing class information for products using a
training data set where the class for products is known.
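A minimal numeric-prediction sketch: fitting a linear regression model to illustrative historical data and predicting a future value (scikit-learn assumed available; the month/sales data are assumptions):

```python
# Numeric prediction sketch: fit linear regression to history, predict the future.
from sklearn.linear_model import LinearRegression

# Illustrative history: month number -> sales
X_hist = [[1], [2], [3], [4], [5], [6]]
y_hist = [100, 120, 130, 150, 170, 180]

model = LinearRegression()
model.fit(X_hist, y_hist)

# Predict sales for month 7
print(model.predict([[7]]))
```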
6. Cluster Analysis

In image processing, pattern recognition, and bioinformatics, clustering is a popular data
mining functionality. It is similar to classification, but the classes are not predefined; the
groups emerge from the data attributes themselves. Similar data are grouped together, the
difference from classification being that class labels are not known in advance. Clustering
algorithms group data based on similar features and dissimilarities.
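A minimal clustering sketch using k-means from scikit-learn; the points and the choice of two clusters are illustrative assumptions:

```python
# Clustering sketch: group unlabeled points into k clusters.
from sklearn.cluster import KMeans

points = [[1, 2], [1, 4], [2, 3],      # one natural group
          [8, 8], [9, 10], [10, 9]]    # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print("cluster labels: ", labels)
print("cluster centers:", kmeans.cluster_centers_)
```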

7. Outlier Analysis

Outlier analysis is important for understanding the quality of the data. If there are too many
outliers, you cannot trust the data or the patterns drawn from it. Outlier analysis determines
whether there is something out of the ordinary in the data and whether it indicates a situation
that a business needs to consider and take measures to mitigate. Data points that cannot be
grouped into any cluster or class by the algorithms are flagged for closer inspection.
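A minimal outlier-detection sketch using the 1.5 × IQR rule, which is one common way to flag outliers (not necessarily the method the notes have in mind); the values are illustrative:

```python
# Outlier detection sketch using the 1.5 * IQR (interquartile range) rule.
import numpy as np

values = np.array([11, 12, 12, 13, 12, 11, 14, 13, 95])  # 95 looks suspicious

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print("bounds:  ", lower, upper)
print("outliers:", outliers)   # expected: [95]
```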

8. Evolution and Deviation Analysis

Evolution analysis pertains to the study of data sets that change over time. Evolution analysis
models are designed to capture evolutionary trends in data, helping to characterize, classify,
cluster, or discriminate time-related data.

9. Correlation Analysis

Correlation is a mathematical technique for determining whether and how strongly two
attributes are related to one another. It determines how well two numerically measured
continuous variables are linked. Researchers can use this type of analysis to see whether
there are any possible correlations between variables in their study.
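A minimal correlation-analysis sketch computing the Pearson correlation coefficient between two illustrative numeric attributes:

```python
# Correlation analysis sketch: Pearson correlation between two attributes.
import numpy as np

hours_studied = np.array([1, 2, 3, 4, 5, 6])
exam_score    = np.array([52, 55, 61, 66, 70, 78])

r = np.corrcoef(hours_studied, exam_score)[0, 1]
print(f"Pearson correlation: {r:.3f}")   # close to +1 => strong positive link
```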
Data Discretization & Concept Hierarchy Generation

A concept hierarchy refers to a sequence of mappings from a set of low-level (specific)
concepts to higher-level, more general concepts.

Concept hierarchy generation is a process that builds upon discretization to further abstract
the data. It is like creating a tree where leaves represent the most specific information, and
branches represent more general concepts.

There are two types of mapping: top-down mapping and bottom-up mapping.

Bottom-up mapping

Bottom-up mapping starts at the bottom with specialized (specific) concepts and moves up
toward the top with generalized concepts.
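A minimal sketch of a simple concept hierarchy for a "location" attribute (city -> state -> country); the mappings are illustrative assumptions:

```python
# Toy concept hierarchy: generalize a low-level city value upward.
city_to_state = {"Lucknow": "Uttar Pradesh", "Pune": "Maharashtra", "Mumbai": "Maharashtra"}
state_to_country = {"Uttar Pradesh": "India", "Maharashtra": "India"}

def generalize(city, level):
    """Climb the hierarchy: level 0 = city, 1 = state, 2 = country."""
    if level == 0:
        return city
    state = city_to_state[city]
    return state if level == 1 else state_to_country[state]

for c in ["Lucknow", "Pune", "Mumbai"]:
    print(c, "->", generalize(c, 1), "->", generalize(c, 2))
```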
Types of Concept Hierarchies
1. Schema Hierarchy
2. Set-Grouping Hierarchy
3. Operation-Derived Hierarchy
4. Rule-based Hierarchy

Need of Concept Hierarchy in Data Mining


There are several reasons why a concept hierarchy is useful in data mining:
a. Improved Data Analysis
b. Improved Data Visualization and Exploration
c. Improved Algorithm Performance
d. Data Cleaning and Pre-processing
e. Domain Knowledge
Applications of Concept Hierarchy
1. Data Warehousing
2. Business Intelligence
3. Online Retail
4. Healthcare
5. Natural Language Processing
6. Fraud Detection
Unit 3
Association rules
Market basket analysis
Association rule mining is closely related to frequent item sets:
relationships between items in a dataset that frequently occur together (e.g., milk and bread).
If a dataset contains 10 transactions and the item set (milk, bread) appears in 5 of
those transactions, the support count of (milk, bread) is 5 (a support of 5/10 = 50%).
Association rule mining algorithms such as (1) Apriori and
(2) FP-Growth are used to discover these patterns.

FP-Growth
FP stands for frequent pattern. The FP-growth algorithm uses a tree
structure, called an FP-tree, to map out relationships between individual items and
find the most frequently recurring patterns.
Example 1
Let the minimum support count be 3.
A frequent pattern set L is built containing all items whose frequency is greater than or
equal to the minimum support, sorted in descending order of frequency:
L = {K : 5, E : 4, M : 3, O : 3, Y : 3}

Item Frequency
A 1
C 2
D 1
E 4
I 1
K 5
M 3
N 2
O 3
U 1
Y 3

Solution
Each transaction is rewritten as an ordered item set (keeping only frequent items, sorted by
descending frequency in L), and the ordered sets are inserted into the FP-tree one by one
(see the code sketch after the steps below).
Step 1 Inserting the set {K, E, M, O, Y}
Initialize the support count of each node along the inserted path as 1.

Step 2 Inserting the set {K, E, O, Y}


Step 3 Inserting the set {K, E, M}

Step 4 Inserting the set {K, M, Y}


Step 5 Inserting the set {K, E, O}

Step 6 Conditional Pattern Base


Step 7 Conditional Frequent Pattern Tree is built

Step 8 Frequent Pattern rules
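A minimal sketch of the first pass of FP-growth for this example: counting item frequencies, keeping items that meet the minimum support count of 3, and producing the ordered item sets used in Steps 1-5. The transaction list is an assumption, reconstructed to be consistent with the frequency table and the ordered item sets shown above:

```python
# First pass of FP-growth: count frequencies, keep frequent items, order transactions.
from collections import Counter

# Assumed transactions, consistent with the frequency table above.
transactions = [
    {"E", "K", "M", "N", "O", "Y"},
    {"D", "E", "K", "N", "O", "Y"},
    {"A", "E", "K", "M"},
    {"C", "K", "M", "U", "Y"},
    {"C", "E", "I", "K", "O"},
]
min_support = 3

# Count every item and keep only the frequent ones, sorted by descending frequency
counts = Counter(item for t in transactions for item in t)
L = {item: c for item, c in counts.items() if c >= min_support}
order = sorted(L, key=lambda item: (-L[item], item))
print("L:", {item: L[item] for item in order})   # K:5, E:4, M:3, O:3, Y:3

# Rewrite each transaction with only frequent items, in descending frequency order;
# these ordered sets are what get inserted into the FP-tree in Steps 1-5.
for t in transactions:
    print([item for item in order if item in t])
```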

Apriori Algorithm
1. Finds association rules between objects.
2. Analyzes, for example, that people who bought product A also bought product B.
3. The Apriori algorithm helps customers buy their products with ease and
increases the sales performance of the particular store.
4. For example, the items customers buy at a Big Bazar.
5. The algorithm uses a breadth-first search and a hash tree.

When you go to Big Bazar, you will find biscuits, chips, and chocolate bundled
together. This shows that the shopkeeper makes it convenient for customers to
buy these products in the same place.
Components of Apriori algorithm
The Apriori algorithm is based on the following three components:

1. Support
2. Confidence
3. Lift

Steps for Apriori Algorithm


Below are the steps of the Apriori algorithm:

Step-1: Determine the support of the item sets in the transactional database, and
select the minimum support and confidence.

Step-2: Take all the item sets whose support value is greater than the minimum
(selected) support value.

Step-3: Find all the rules over these subsets that have a confidence value higher than
the threshold (minimum confidence).

Step-4: Sort the rules in decreasing order of lift.

Example 1
Solution:
Step-1: Compute the support count of each item individually in the dataset.

Item Set Support_Count


A 6
B 7
C 6
D 2
E 1

Step 2: Keep the item sets whose support count is greater than or equal to the minimum support (2).


Item Set Support_Count
A 6
B 7
C 6
D 2

Step 3: Create candidate pairs (2-item sets) from the remaining items and count their support.

Step 4: Keep the pairs whose support count is greater than or equal to the minimum support (2).


Step 5: Create candidate 3-item sets and count their support.
Item Set Support_Count
{A, B,C} 2
{B,C,D} 0
{A,C,D} 0
{A,B,D} 1

Only one 3-item set has a support count that meets the minimum support: {A, B, C}.

Step-6: Finding the association rules for the frequent subsets

For the rule A ^ B → C:
Confidence = sup(A ^ B ^ C) / sup(A ^ B) = 2/4 = 0.5 = 50%

With a minimum confidence of 50%, this rule meets the threshold.
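A minimal counting sketch for this example (it counts candidate item sets directly rather than implementing full Apriori candidate generation with pruning). The transaction table is an assumption, chosen to be consistent with the support counts shown above (A:6, B:7, C:6, D:2, E:1, sup{A,B}=4, sup{A,B,C}=2):

```python
# Apriori-style counting sketch for the worked example above.
from itertools import combinations

transactions = [
    {"A", "B"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]
min_support = 2

def support_count(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

# L1: frequent single items
items = sorted({i for t in transactions for i in t})
L1 = [i for i in items if support_count({i}) >= min_support]
print("L1:", {i: support_count({i}) for i in L1})        # A:6, B:7, C:6, D:2

# L2: frequent pairs, L3: frequent triples (counted directly for brevity)
L2 = [p for p in combinations(L1, 2) if support_count(p) >= min_support]
L3 = [tr for tr in combinations(L1, 3) if support_count(tr) >= min_support]
print("L2:", L2)
print("L3:", L3)                                          # [('A', 'B', 'C')]

# Rule A ^ B -> C : confidence = sup(A,B,C) / sup(A,B) = 2/4 = 50%
conf = support_count({"A", "B", "C"}) / support_count({"A", "B"})
print(f"confidence(A ^ B -> C) = {conf:.0%}")
```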


Disadvantages of Apriori Algorithm
1. The Apriori algorithm works slowly compared to other algorithms.
2. The overall performance can be reduced because it scans the database
multiple times.
3. The worst-case time and space complexity of the Apriori algorithm is
O(2^D), where D is the number of distinct items, which is very high.
Application of Apriori Algorithm
1. Market basket analysis

2. Product recommendations

3. Healthcare

4. Forestry

5. Autocomplete Tool

6. Education
