Unit 3 DW&DM Notes Mr. Rohit Pratap Singh
The primary goal of data mining is to discover hidden patterns, predict future trends, and support more informed business decisions. Data mining is also called Knowledge Discovery in Databases (KDD).
There is an enormous amount of information available on various platforms, but very little usable knowledge is extracted from it.
KDD Process
KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful,
previously unknown, and potentially valuable information from large datasets.
The focus is on the discovery of useful knowledge, rather than simply finding patterns in data.
Steps of the KDD Process
1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge representation and visualization
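As a rough illustration of how these stages fit together, the sketch below chains a few of them on a tiny made-up record set. Every function name, field, and value is hypothetical and only meant to show the flow from raw data to a discovered pattern.

# Illustrative sketch of the KDD stages as a simple pipeline (all names are hypothetical).
def clean(records):
    # Data cleaning: drop records with missing fields.
    return [r for r in records if None not in r.values()]

def select(records, columns):
    # Data selection: keep only the attributes relevant to the task.
    return [{c: r[c] for c in columns} for r in records]

def transform(records):
    # Data transformation: normalize the 'amount' attribute to the 0-1 range.
    amounts = [r["amount"] for r in records]
    lo, hi = min(amounts), max(amounts)
    for r in records:
        r["amount"] = (r["amount"] - lo) / (hi - lo) if hi > lo else 0.0
    return records

def mine(records):
    # Data mining: a trivial "pattern" -- the most common category.
    counts = {}
    for r in records:
        counts[r["category"]] = counts.get(r["category"], 0) + 1
    return max(counts, key=counts.get)

raw = [
    {"category": "snacks", "amount": 120, "region": "north"},
    {"category": "snacks", "amount": 80,  "region": "south"},
    {"category": "dairy",  "amount": None, "region": "east"},
]
prepared = transform(select(clean(raw), ["category", "amount"]))
print("Discovered pattern:", mine(prepared))   # pattern evaluation / presentation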
Advantages of KDD
1. Improves decision-making.
2. Increased efficiency
3. Better customer service
4. Fraud detection
Disadvantages of KDD
1. Privacy concerns
2. Complexity
3. Data Quality
4. High cost.
Data Mining Applications
➢ Financial Data Analysis
➢ Retail Industry
➢ Tele communication Industry
➢ Biological Data Analysis
➢ Other Scientific Applications
➢ Intrusion Detection
1. Data Cleaning
Data cleaning is the process of removing incorrect, incomplete, and inaccurate data from the dataset; it also replaces missing values. Here are some techniques for data cleaning:
• Standard values like “Not Available” or “NA” can be used to replace the missing
values.
• Missing values can also be filled in manually, but this is not recommended when the dataset is large.
• The attribute's mean value can be used to replace the missing value when the data is normally distributed, whereas for a non-normal (skewed) distribution the attribute's median value can be used.
• Regression or decision tree algorithms can also be used to estimate the most probable value and substitute it for the missing value.
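A minimal sketch of these replacement strategies, assuming pandas and NumPy are available; the column names and values are invented for illustration.

import pandas as pd
import numpy as np

# Hypothetical dataset with missing values (NaN) in the 'income' and 'city' columns.
df = pd.DataFrame({
    "income": [42000, 55000, np.nan, 61000, np.nan, 48000],
    "city":   ["Agra", None, "Delhi", "Delhi", "Agra", None],
})

# Replace missing categorical values with a standard label such as "NA".
df["city"] = df["city"].fillna("NA")

# For a roughly normally distributed attribute, fill with the mean;
# for a skewed (non-normal) attribute, the median is the safer choice.
df["income_mean_filled"]   = df["income"].fillna(df["income"].mean())
df["income_median_filled"] = df["income"].fillna(df["income"].median())

print(df)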
Handling Noisy Data
Noisy data contains random errors or unnecessary data points. Handling noisy data is one of the most important steps, as it leads to a better-optimized model. Here are some of the methods to handle noisy data:
• Binning: This method is used to smooth noisy data. First the data is sorted, and then the sorted values are divided and stored in the form of bins. There are three methods for smoothing the data in a bin (a short code sketch follows this list):
o Smoothing by bin mean: each value in a bin is replaced by the mean value of the bin.
o Smoothing by bin median: each value in a bin is replaced by the median value of the bin.
o Smoothing by bin boundary: the minimum and maximum values of the bin are taken as boundaries, and each value is replaced by the closest boundary value.
• Regression: Regression is used to smooth the data and helps handle unnecessary or noisy values. For analysis purposes, regression also helps in deciding which variables are suitable for the analysis.
• Clustering: This is used for finding the outliers and also in grouping the data.
Clustering is generally used in unsupervised learning.
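Below is a small sketch of the three binning methods on a made-up list of values, using equal-frequency bins of size 3; it is only meant to illustrate the idea, not a library implementation.

# Equal-frequency binning on a hypothetical list of values, bin size 3.
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26])
bins = [data[i:i + 3] for i in range(0, len(data), 3)]

smoothed_by_mean, smoothed_by_median, smoothed_by_boundary = [], [], []
for b in bins:
    mean = sum(b) / len(b)
    median = b[len(b) // 2]                      # bins are already sorted
    smoothed_by_mean.append([round(mean, 2)] * len(b))
    smoothed_by_median.append([median] * len(b))
    # Boundary smoothing: replace each value with the closer of min(b) / max(b).
    smoothed_by_boundary.append(
        [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
    )

print("bins:               ", bins)
print("smoothing by mean:   ", smoothed_by_mean)
print("smoothing by median: ", smoothed_by_median)
print("smoothing by boundary:", smoothed_by_boundary)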
2. Data Integration
Data integration is the process of combining data from multiple sources into a single, coherent dataset. It is one of the main components of data management. Some problems must be considered during data integration:
• Schema integration: Integrates metadata (a set of data that describes other data) from different sources.
3. Data Reduction
This process reduces the volume of the data, which makes analysis easier while producing the same or almost the same results. Data reduction also helps to reduce storage space. Some of the data reduction techniques are dimensionality reduction, numerosity reduction, and data compression.
• Data compression: Data compression reduces the volume of data by encoding it in a compressed form. The compression can be lossless or lossy. When there is no loss of information during compression, it is called lossless compression, whereas lossy compression removes information, but only information that is considered unnecessary.
• Data Cube Aggregation: Data cube aggregation involves summarizing the data in
a data cube by aggregating data across one or more dimensions. This technique is
useful when analyzing large datasets with many dimensions, as it can help reduce
the size of the data by collapsing it into a smaller number of dimensions.
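A small pandas-based sketch of data cube aggregation; the sales "cube" below is invented, and groupby / pivot_table stand in for rolling up across dimensions.

import pandas as pd

# Hypothetical sales "cube" with three dimensions: year, region, and item.
sales = pd.DataFrame({
    "year":   [2022, 2022, 2023, 2023, 2023, 2022],
    "region": ["North", "South", "North", "South", "North", "North"],
    "item":   ["TV", "TV", "TV", "Phone", "Phone", "Phone"],
    "amount": [400, 350, 500, 300, 280, 260],
})

# Aggregating (rolling up) across the 'item' and 'region' dimensions collapses
# the cube to yearly totals -- far fewer cells than the original detail data.
yearly_totals = sales.groupby("year")["amount"].sum()
print(yearly_totals)

# A two-dimensional summary (year x region) is another reduced view of the cube.
print(sales.pivot_table(values="amount", index="year", columns="region", aggfunc="sum"))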
4. Data Transformation
The change made in the format or the structure of the data is called data transformation.
This step can be simple or complex based on the requirements.
• Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps in identifying the important features of the dataset. Smoothing also makes it possible to detect even small changes that help in prediction.
• Aggregation: In this method, the data is stored and presented in summary form. Data from multiple sources is integrated and summarized for data analysis. This is an important step, since the accuracy of the results depends on the quantity and quality of the data: when both are good, the results are more relevant.
• Normalization: Attribute values are scaled so that they fall within a small, specified range. Common techniques are min-max normalization, which rescales each value into a range such as [0, 1] using v' = (v - min) / (max - min), and z-score normalization, which rescales each value using the attribute's mean and standard deviation, v' = (v - mean) / std.
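A short NumPy sketch of both normalization formulas on an invented attribute; the values are arbitrary.

import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])   # hypothetical attribute values

# Min-max normalization: rescale to the range [0, 1].
#   v' = (v - min) / (max - min)
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: centre on the mean and scale by the standard deviation.
#   v' = (v - mean) / std
z_score = (values - values.mean()) / values.std()

print("min-max:", np.round(min_max, 3))
print("z-score:", np.round(z_score, 3))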
Functionalities of Data Mining
Data mining functionalities are used to represent the type of patterns that have to be
discovered in data mining tasks. Data mining tasks can be classified into two types:
descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database, while predictive mining tasks perform inference on the current data in order to make predictions.
Data mining is used extensively in many areas and sectors to characterize data and make predictions. The ultimate objective of these data mining functionalities is to observe the various trends and patterns in the data. The main data mining functionalities offered by these organized, scientific methods are:
1. Class/Concept Descriptions
A class or concept implies there is a data set or set of features that define the class or a
concept. A class can be a category of items on a shop floor, and a concept could be the
abstract idea on which data may be categorized like products to be put on clearance sale and
non-sale products. There are two concepts here, one that helps with grouping and the other
that helps in differentiating.
2. Mining of Frequent Patterns
One of the functions of data mining is finding patterns in data. Frequent patterns are the patterns that occur most commonly in the data. Various types of frequent patterns can be found in a dataset:
o Frequent item set: This term refers to a group of items that are commonly found together, such as milk and sugar.
o Frequent substructure: It refers to the various types of data structures that can be
combined with an item set or subsequences, such as trees and graphs.
o Frequent Subsequence: A regular pattern series, such as buying a phone followed by
a cover.
3. Association Analysis
It analyses the sets of items that frequently occur together in a transactional dataset. It is also known as Market Basket Analysis because of its wide use in retail sales. Two parameters are used for determining the association rules: support and confidence.
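A minimal sketch of computing these two parameters for a hypothetical rule {milk} -> {sugar} over an invented transaction list.

# Hypothetical transaction database for a rule {milk} -> {sugar}.
transactions = [
    {"milk", "sugar", "bread"},
    {"milk", "sugar"},
    {"milk", "bread"},
    {"sugar", "bread"},
    {"milk", "sugar", "butter"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"milk", "sugar"} <= t)
milk = sum(1 for t in transactions if "milk" in t)

support = both / n          # fraction of transactions containing milk AND sugar
confidence = both / milk    # of the transactions with milk, how many also have sugar

print(f"support(milk -> sugar)    = {support:.2f}")    # 3/5 = 0.60
print(f"confidence(milk -> sugar) = {confidence:.2f}") # 3/4 = 0.75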
4. Classification
Classification is a data mining technique that categorizes items in a collection based on some predefined properties. It uses methods like if-then rules, decision trees, or neural networks to predict a class, essentially classifying a collection of items. A training set containing items whose class labels are known is used to train the system, which can then predict the category of items in an unknown collection.
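A minimal classification sketch, assuming scikit-learn is installed; the training set, features, and class labels are invented.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical training set: [age, annual_income_in_thousands] -> buys_product (0/1).
X_train = [[22, 25], [25, 32], [47, 80], [52, 95], [46, 60], [56, 75], [23, 28], [50, 88]]
y_train = [0, 0, 1, 1, 1, 1, 0, 1]

# Train a decision tree on items whose class is known ...
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# ... then predict the class of previously unseen items.
X_unknown = [[24, 30], [49, 85]]
print(model.predict(X_unknown))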
5. Prediction
Prediction is used to estimate unavailable data values or future trends. A value can be anticipated based on the attribute values of the object and the attribute values of the classes. Predictions may concern missing numerical values or increasing/decreasing trends in time-related information. There are primarily two types of predictions in data mining: numeric predictions and class predictions.
o Numeric predictions are made by creating a linear regression model that is based on
historical data. Prediction of numeric values helps businesses ramp up for a future
event that might impact the business positively or negatively.
o Class predictions are used to fill in missing class information for products using a
training data set where the class for products is known.
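A minimal numeric-prediction sketch using linear regression, assuming scikit-learn is installed; the historical data is invented.

from sklearn.linear_model import LinearRegression

# Hypothetical historical data: advertising spend (in lakh) vs. sales (in units).
spend = [[1.0], [2.0], [3.0], [4.0], [5.0]]
sales = [110, 205, 310, 390, 505]

# Fit a linear regression model on the historical data ...
model = LinearRegression().fit(spend, sales)

# ... and predict the (currently unavailable) sales value for a future spend level.
print(model.predict([[6.0]]))   # roughly 600 units for this made-up data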
6. Cluster Analysis
Cluster analysis groups similar objects together without using predefined class labels; objects in the same cluster are more similar to each other than to objects in other clusters.
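A minimal clustering sketch with k-means, assuming scikit-learn is installed; the points are invented.

from sklearn.cluster import KMeans

# Hypothetical 2-D points (e.g. customers described by age and spending score).
points = [[23, 80], [25, 85], [22, 78], [45, 20], [48, 25], [50, 18]]

# Group similar objects without any predefined class labels (unsupervised).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # e.g. [0 0 0 1 1 1] -- two natural groups
print(km.cluster_centers_)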
7. Outlier Analysis
Outlier analysis is important for understanding the quality of the data. If there are too many outliers, you cannot trust the data or the patterns drawn from it. Outlier analysis determines whether there is something unusual in the data and whether it indicates a situation that the business needs to consider and mitigate. Data objects that cannot be grouped into any class by the algorithms are flagged as outliers and examined separately.
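A small sketch of flagging outliers with the interquartile-range (IQR) rule, one common convention among several; the sales figures are invented.

import numpy as np

# Hypothetical daily sales figures with one suspicious value.
sales = np.array([120, 125, 130, 118, 122, 127, 640, 121, 124])

# IQR rule: values far outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers.
q1, q3 = np.percentile(sales, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = sales[(sales < lower) | (sales > upper)]
print("outliers:", outliers)   # [640]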
8. Evolution Analysis
Evolution analysis pertains to the study of datasets that change over time. Evolution analysis models are designed to capture evolutionary trends in data, helping to characterize, classify, cluster, or discriminate time-related data.
9. Correlation Analysis
Correlation is a mathematical technique for determining whether and how strongly two attributes are related to one another. It measures how closely two numerically measured, continuous variables are linked. Researchers can use this type of analysis to see whether there are any possible correlations between the variables in their study.
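A short sketch computing the Pearson correlation coefficient with NumPy on invented data.

import numpy as np

# Hypothetical continuous attributes: hours of study vs. exam score.
hours  = np.array([1, 2, 3, 4, 5, 6])
scores = np.array([35, 45, 50, 62, 68, 80])

# Pearson correlation coefficient: +1 strong positive, 0 none, -1 strong negative.
r = np.corrcoef(hours, scores)[0, 1]
print(f"correlation = {r:.2f}")   # close to +1 for this made-up data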
Data Discretization & Concept Hierarchy Generation
A concept hierarchy refers to a sequence of mappings from a set of low-level (specific) concepts to higher-level, more general concepts.
Concept hierarchy generation is a process that builds upon discretization to further abstract the data. It is like creating a tree where the leaves represent the most specific information and the branches represent more general concepts.
There are two types of mapping: top-down mapping and bottom-up mapping.
Top-down mapping
Top-down mapping starts at the top with generalized information and moves down toward more specialized information.
Bottom-up mapping
Bottom-up mapping starts at the bottom with specialized information and moves up toward generalized information.
Types of Concept Hierarchies
1. Schema Hierarchy
2. Set-Grouping Hierarchy
3. Operation-Derived Hierarchy
4. Rule-based Hierarchy
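For instance, discretization followed by a simple set-grouping style hierarchy might look like the sketch below, assuming pandas is available; the age values, interval labels, and the mapping to more general concepts are all invented.

import pandas as pd

ages = pd.Series([5, 17, 23, 34, 45, 61, 72])   # hypothetical raw (low-level) values

# Discretization: replace raw ages by interval labels.
groups = pd.cut(ages, bins=[0, 18, 40, 60, 100],
                labels=["child", "young adult", "middle-aged", "senior"])

# Concept hierarchy: map the interval labels to a still more general concept.
to_general = {"child": "minor", "young adult": "adult",
              "middle-aged": "adult", "senior": "adult"}
print(pd.DataFrame({"age": ages, "group": groups,
                    "general": groups.map(to_general)}))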
FP-Growth
FP stands for frequent pattern. The FP-growth algorithm uses a tree structure, called an FP-tree, to map out the relationships between individual items and find the most frequently recurring patterns.
Example 1
Let the minimum support be 3.
First, the frequency (support count) of each item in the transaction database is counted:
Item : Frequency
A : 1, C : 2, D : 1, E : 4, I : 1, K : 5, M : 3, N : 2, O : 3, U : 1, Y : 3
A Frequent Pattern set L is then built, containing all items whose frequency is greater than or equal to the minimum support, sorted in descending order of frequency:
L = {K : 5, E : 4, M : 3, O : 3, Y : 3}
Solution
For each transaction, an Ordered-Item set is built by keeping only the frequent items, written in the order in which they appear in L.
Step 1: Insert the set {K, E, M, O, Y} into the FP-tree, initializing the support count of each newly created node as 1.
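A sketch of running FP-growth with the third-party mlxtend library (assumed installed, e.g. pip install mlxtend pandas). The transactions below are invented so that the frequent-item counts match L above; the infrequent items (A, C, D, ...) from the original database are omitted for brevity.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Invented transactions giving K:5, E:4, M:3, O:3, Y:3.
transactions = [
    ["K", "E", "M", "O", "Y"],
    ["K", "E", "O", "Y"],
    ["K", "E", "M"],
    ["K", "M", "Y"],
    ["K", "E", "O"],
]

# One-hot encode the transactions, then let FP-growth build the FP-tree internally
# and return every itemset whose support meets the threshold (3/5 = 0.6 here).
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
print(fpgrowth(onehot, min_support=0.6, use_colnames=True))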
Apriori Algorithm
1. The Apriori algorithm finds association rules between objects.
2. It analyzes, for example, that people who bought product A also bought product B.
3. The Apriori algorithm helps customers buy their products with ease and increases the sales performance of the particular store.
4. For example, it can be applied to the items customers buy at a Big Bazar store.
5. The algorithm uses a breadth-first search and a Hash Tree structure.
When you go to Big Bazar, you will find biscuits, chips, and chocolate bundled together. This shows that the shopkeeper makes it convenient for customers to buy these products in the same place.
Components of Apriori algorithm
The following three components comprise the Apriori algorithm:
1. Support
2. Confidence
3. Lift
Steps of the Apriori Algorithm
Step-1: Determine the support (frequency) of each itemset in the transactional database, and set the minimum support and confidence values.
Step-2: Take all the itemsets in the transactions that have a support value higher than the minimum (selected) support value.
Step-3: Find all the rules from these subsets that have a confidence value higher than the threshold (minimum) confidence.
Example 1
Solution:
Step-1: Find the frequency (support count) of each itemset individually in the dataset.
Only the itemsets whose support count meets the minimum support count are retained; here, only one three-item combination qualifies, i.e., {A, B, C}.
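A sketch of these steps using the third-party mlxtend library (assumed installed); the transaction list and the min_support / min_threshold values below are invented and do not reproduce the dataset of Example 1.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["A", "B", "C"],
    ["A", "B"],
    ["A", "C"],
    ["A", "B", "C"],
    ["B", "C"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Step-1/2: keep itemsets whose support meets the minimum support threshold.
frequent = apriori(onehot, min_support=0.6, use_colnames=True)

# Step-3: derive rules whose confidence exceeds the minimum confidence;
# the result also reports the lift of each rule.
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])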
Applications of the Apriori Algorithm
2. Product recommendations
3. Healthcare
4. Forestry
5. Autocomplete tools
6. Education