Unit 3 DW&DM Notes Mr. Rohit Pratap Singh
The primary goal of data mining is to discover hidden patterns, predict future trends, and support more informed business decisions. Data mining is also called Knowledge Discovery in Databases (KDD).
There is an enormous amount of information available on various platforms, but very little usable knowledge is extracted from it.
KDD Process
KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful,
previously unknown, and potentially valuable information from large datasets.
The focus is on the discovery of useful knowledge, rather than simply finding patterns in data.
Steps of the KDD Process
1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge representation and visualization
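As a rough illustration of how these stages fit together, the sketch below chains a few of them on a tiny made-up record set. Every function name, field, and value is hypothetical and only meant to show the flow from raw data to a discovered pattern.

# Illustrative sketch of the KDD stages as a simple pipeline (all names are hypothetical).
def clean(records):
    # Data cleaning: drop records with missing fields.
    return [r for r in records if None not in r.values()]

def select(records, columns):
    # Data selection: keep only the attributes relevant to the task.
    return [{c: r[c] for c in columns} for r in records]

def transform(records):
    # Data transformation: normalize the 'amount' attribute to the 0-1 range.
    amounts = [r["amount"] for r in records]
    lo, hi = min(amounts), max(amounts)
    for r in records:
        r["amount"] = (r["amount"] - lo) / (hi - lo) if hi > lo else 0.0
    return records

def mine(records):
    # Data mining: a trivial "pattern" -- the most common category.
    counts = {}
    for r in records:
        counts[r["category"]] = counts.get(r["category"], 0) + 1
    return max(counts, key=counts.get)

raw = [
    {"category": "snacks", "amount": 120, "region": "north"},
    {"category": "snacks", "amount": 80,  "region": "south"},
    {"category": "dairy",  "amount": None, "region": "east"},
]
prepared = transform(select(clean(raw), ["category", "amount"]))
print("Discovered pattern:", mine(prepared))   # pattern evaluation / presentation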
Advantages of KDD
1. Improves decision-making.
2. Increased efficiency
3. Better customer service
4. Fraud detection
Disadvantages of KDD
1. Privacy concerns
2. Complexity
3. Data Quality
4. High cost.
Data Mining Applications
➢ Financial Data Analysis
➢ Retail Industry
➢ Tele communication Industry
➢ Biological Data Analysis
➢ Other Scientific Applications
➢ Intrusion Detection
1. Data Cleaning
Data cleaning is the process of removing incorrect, incomplete, and inaccurate data from the dataset; it also replaces missing values. Here are some techniques for data cleaning:
• Standard values like “Not Available” or “NA” can be used to replace the missing
values.
• Missing values can also be filled in manually, but this is not recommended when the dataset is large.
• The attribute's mean value can be used to replace the missing value when the data is normally distributed, whereas for a non-normal (skewed) distribution the attribute's median value can be used.
• Regression or decision tree algorithms can also be used to estimate the most probable value and substitute it for the missing value.
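A minimal sketch of these replacement strategies, assuming pandas and NumPy are available; the column names and values are invented for illustration.

import pandas as pd
import numpy as np

# Hypothetical dataset with missing values (NaN) in the 'income' and 'city' columns.
df = pd.DataFrame({
    "income": [42000, 55000, np.nan, 61000, np.nan, 48000],
    "city":   ["Agra", None, "Delhi", "Delhi", "Agra", None],
})

# Replace missing categorical values with a standard label such as "NA".
df["city"] = df["city"].fillna("NA")

# For a roughly normally distributed attribute, fill with the mean;
# for a skewed (non-normal) attribute, the median is the safer choice.
df["income_mean_filled"]   = df["income"].fillna(df["income"].mean())
df["income_median_filled"] = df["income"].fillna(df["income"].median())

print(df)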
Handling Noisy Data
Noisy data contains random errors or unnecessary data points. Handling noisy data is one of the most important steps, as it leads to a better-optimized model. Here are some of the methods to handle noisy data:
• Binning: This method is used to smooth noisy data. First the data is sorted, and then the sorted values are divided and stored in the form of bins. There are three methods for smoothing the data in a bin (a short code sketch follows this list):
o Smoothing by bin mean: each value in a bin is replaced by the mean value of the bin.
o Smoothing by bin median: each value in a bin is replaced by the median value of the bin.
o Smoothing by bin boundary: the minimum and maximum values of the bin are taken as boundaries, and each value is replaced by the closest boundary value.
• Regression: Regression is used to smooth the data and helps handle unnecessary or noisy values. For analysis purposes, regression also helps in deciding which variables are suitable for the analysis.
• Clustering: This is used for finding the outliers and also in grouping the data.
Clustering is generally used in unsupervised learning.
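Below is a small sketch of the three binning methods on a made-up list of values, using equal-frequency bins of size 3; it is only meant to illustrate the idea, not a library implementation.

# Equal-frequency binning on a hypothetical list of values, bin size 3.
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26])
bins = [data[i:i + 3] for i in range(0, len(data), 3)]

smoothed_by_mean, smoothed_by_median, smoothed_by_boundary = [], [], []
for b in bins:
    mean = sum(b) / len(b)
    median = b[len(b) // 2]                      # bins are already sorted
    smoothed_by_mean.append([round(mean, 2)] * len(b))
    smoothed_by_median.append([median] * len(b))
    # Boundary smoothing: replace each value with the closer of min(b) / max(b).
    smoothed_by_boundary.append(
        [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
    )

print("bins:               ", bins)
print("smoothing by mean:   ", smoothed_by_mean)
print("smoothing by median: ", smoothed_by_median)
print("smoothing by boundary:", smoothed_by_boundary)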
2. Data Integration
Data integration is the process of combining data from multiple sources into a single, coherent dataset. It is one of the main components of data management. Some problems must be considered during data integration:
• Schema integration: Integrates metadata (a set of data that describes other data) from different sources.
3. Data Reduction
This process reduces the volume of the data, which makes analysis easier while producing the same or almost the same results. Data reduction also helps to reduce storage space. Some of the data reduction techniques are dimensionality reduction, numerosity reduction, and data compression.
• Data compression: Data compression reduces the volume of data by encoding it in a compressed form. The compression can be lossless or lossy. When there is no loss of information during compression, it is called lossless compression, whereas lossy compression removes information, but only information that is considered unnecessary.
• Data Cube Aggregation: Data cube aggregation involves summarizing the data in
a data cube by aggregating data across one or more dimensions. This technique is
useful when analyzing large datasets with many dimensions, as it can help reduce
the size of the data by collapsing it into a smaller number of dimensions.
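A small pandas-based sketch of data cube aggregation; the sales "cube" below is invented, and groupby / pivot_table stand in for rolling up across dimensions.

import pandas as pd

# Hypothetical sales "cube" with three dimensions: year, region, and item.
sales = pd.DataFrame({
    "year":   [2022, 2022, 2023, 2023, 2023, 2022],
    "region": ["North", "South", "North", "South", "North", "North"],
    "item":   ["TV", "TV", "TV", "Phone", "Phone", "Phone"],
    "amount": [400, 350, 500, 300, 280, 260],
})

# Aggregating (rolling up) across the 'item' and 'region' dimensions collapses
# the cube to yearly totals -- far fewer cells than the original detail data.
yearly_totals = sales.groupby("year")["amount"].sum()
print(yearly_totals)

# A two-dimensional summary (year x region) is another reduced view of the cube.
print(sales.pivot_table(values="amount", index="year", columns="region", aggfunc="sum"))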
4. Data Transformation
The change made in the format or the structure of the data is called data transformation.
This step can be simple or complex based on the requirements.
• Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps in identifying the important features of the dataset. Smoothing also makes it possible to detect even small changes that help in prediction.
• Aggregation: In this method, the data is stored and presented in summary form. Data from multiple sources is integrated and summarized for data analysis. This is an important step, since the accuracy of the results depends on the quantity and quality of the data: when both are good, the results are more relevant.
• Normalization: Attribute values are scaled so that they fall within a small, specified range. Common techniques are min-max normalization, which rescales each value into a range such as [0, 1] using v' = (v - min) / (max - min), and z-score normalization, which rescales each value using the attribute's mean and standard deviation, v' = (v - mean) / std.
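A short NumPy sketch of both normalization formulas on an invented attribute; the values are arbitrary.

import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])   # hypothetical attribute values

# Min-max normalization: rescale to the range [0, 1].
#   v' = (v - min) / (max - min)
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: centre on the mean and scale by the standard deviation.
#   v' = (v - mean) / std
z_score = (values - values.mean()) / values.std()

print("min-max:", np.round(min_max, 3))
print("z-score:", np.round(z_score, 3))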
Functionalities of Data Mining
Data mining functionalities are used to represent the type of patterns that have to be
discovered in data mining tasks. Data mining tasks can be classified into two types:
descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database, while predictive mining tasks perform inference on the current data in order to make predictions.
Data mining is used extensively in many areas and sectors to characterize data and make predictions. The ultimate objective of these data mining functionalities is to observe the various trends and patterns in the data. The main data mining functionalities offered by these organized, scientific methods are:
1. Class/Concept Descriptions
A class or concept implies there is a data set or set of features that define the class or a
concept. A class can be a category of items on a shop floor, and a concept could be the
abstract idea on which data may be categorized like products to be put on clearance sale and
non-sale products. There are two concepts here, one that helps with grouping and the other
that helps in differentiating.
2. Mining of Frequent Patterns
One of the functions of data mining is finding patterns in data. Frequent patterns are the patterns that occur most commonly in the data. Various types of frequent patterns can be found in a dataset:
o Frequent item set: This term refers to a group of items that are commonly found together, such as milk and sugar.
o Frequent substructure: It refers to the various types of data structures that can be
combined with an item set or subsequences, such as trees and graphs.
o Frequent Subsequence: A regular pattern series, such as buying a phone followed by
a cover.
3. Association Analysis
It analyses the sets of items that frequently occur together in a transactional dataset. It is also known as Market Basket Analysis because of its wide use in retail sales. Two parameters are used for determining the association rules: support and confidence.
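A minimal sketch of computing these two parameters for a hypothetical rule {milk} -> {sugar} over an invented transaction list.

# Hypothetical transaction database for a rule {milk} -> {sugar}.
transactions = [
    {"milk", "sugar", "bread"},
    {"milk", "sugar"},
    {"milk", "bread"},
    {"sugar", "bread"},
    {"milk", "sugar", "butter"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"milk", "sugar"} <= t)
milk = sum(1 for t in transactions if "milk" in t)

support = both / n          # fraction of transactions containing milk AND sugar
confidence = both / milk    # of the transactions with milk, how many also have sugar

print(f"support(milk -> sugar)    = {support:.2f}")    # 3/5 = 0.60
print(f"confidence(milk -> sugar) = {confidence:.2f}") # 3/4 = 0.75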
4. Classification
Classification is a data mining technique that categorizes items in a collection based on some predefined properties. It uses methods like if-then rules, decision trees, or neural networks to predict a class, essentially classifying a collection of items. A training set containing items whose class labels are known is used to train the system, which can then predict the category of items in an unknown collection.
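A minimal classification sketch, assuming scikit-learn is installed; the training set, features, and class labels are invented.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical training set: [age, annual_income_in_thousands] -> buys_product (0/1).
X_train = [[22, 25], [25, 32], [47, 80], [52, 95], [46, 60], [56, 75], [23, 28], [50, 88]]
y_train = [0, 0, 1, 1, 1, 1, 0, 1]

# Train a decision tree on items whose class is known ...
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# ... then predict the class of previously unseen items.
X_unknown = [[24, 30], [49, 85]]
print(model.predict(X_unknown))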
5. Prediction
Prediction is used to estimate unavailable data values or future trends. A value can be anticipated based on the attribute values of the object and the attribute values of the classes. Predictions may concern missing numerical values or increasing/decreasing trends in time-related information. There are primarily two types of predictions in data mining: numeric predictions and class predictions.
o Numeric predictions are made by creating a linear regression model that is based on
historical data. Prediction of numeric values helps businesses ramp up for a future
event that might impact the business positively or negatively.
o Class predictions are used to fill in missing class information for products using a
training data set where the class for products is known.
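A minimal numeric-prediction sketch using linear regression, assuming scikit-learn is installed; the historical data is invented.

from sklearn.linear_model import LinearRegression

# Hypothetical historical data: advertising spend (in lakh) vs. sales (in units).
spend = [[1.0], [2.0], [3.0], [4.0], [5.0]]
sales = [110, 205, 310, 390, 505]

# Fit a linear regression model on the historical data ...
model = LinearRegression().fit(spend, sales)

# ... and predict the (currently unavailable) sales value for a future spend level.
print(model.predict([[6.0]]))   # roughly 600 units for this made-up data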
6. Cluster Analysis
Cluster analysis groups similar objects together without using predefined class labels; objects in the same cluster are more similar to each other than to objects in other clusters.
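A minimal clustering sketch with k-means, assuming scikit-learn is installed; the points are invented.

from sklearn.cluster import KMeans

# Hypothetical 2-D points (e.g. customers described by age and spending score).
points = [[23, 80], [25, 85], [22, 78], [45, 20], [48, 25], [50, 18]]

# Group similar objects without any predefined class labels (unsupervised).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # e.g. [0 0 0 1 1 1] -- two natural groups
print(km.cluster_centers_)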
7. Outlier Analysis
Outlier analysis is important for understanding the quality of the data. If there are too many outliers, you cannot trust the data or the patterns drawn from it. Outlier analysis determines whether there is something unusual in the data and whether it indicates a situation that the business needs to consider and mitigate. Data objects that cannot be grouped into any class by the algorithms are flagged as outliers and examined separately.
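A small sketch of flagging outliers with the interquartile-range (IQR) rule, one common convention among several; the sales figures are invented.

import numpy as np

# Hypothetical daily sales figures with one suspicious value.
sales = np.array([120, 125, 130, 118, 122, 127, 640, 121, 124])

# IQR rule: values far outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers.
q1, q3 = np.percentile(sales, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = sales[(sales < lower) | (sales > upper)]
print("outliers:", outliers)   # [640]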
8. Evolution Analysis
Evolution analysis pertains to the study of datasets that change over time. Evolution analysis models are designed to capture evolutionary trends in data, helping to characterize, classify, cluster, or discriminate time-related data.
9. Correlation Analysis
Correlation is a mathematical technique for determining whether and how strongly two attributes are related to one another. It measures how closely two numerically measured, continuous variables are linked. Researchers can use this type of analysis to see whether there are any possible correlations between the variables in their study.
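A short sketch computing the Pearson correlation coefficient with NumPy on invented data.

import numpy as np

# Hypothetical continuous attributes: hours of study vs. exam score.
hours  = np.array([1, 2, 3, 4, 5, 6])
scores = np.array([35, 45, 50, 62, 68, 80])

# Pearson correlation coefficient: +1 strong positive, 0 none, -1 strong negative.
r = np.corrcoef(hours, scores)[0, 1]
print(f"correlation = {r:.2f}")   # close to +1 for this made-up data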
Data Discretization & Concept Hierarchy Generation
A concept hierarchy refers to a sequence of mappings from a set of low-level (specific) concepts to higher-level, more general concepts.
Concept hierarchy generation is a process that builds upon discretization to further abstract the data. It is like creating a tree where the leaves represent the most specific information and the branches represent more general concepts.
There are two types of mapping: top-down mapping and bottom-up mapping.
Top-down mapping
Top-down mapping starts at the top with generalized information and moves down toward more specialized information.
Bottom-up mapping
Bottom-up mapping starts at the bottom with specialized information and moves up toward generalized information.
Types of Concept Hierarchies
1. Schema Hierarchy
2. Set-Grouping Hierarchy
3. Operation-Derived Hierarchy
4. Rule-based Hierarchy
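For instance, discretization followed by a simple set-grouping style hierarchy might look like the sketch below, assuming pandas is available; the age values, interval labels, and the mapping to more general concepts are all invented.

import pandas as pd

ages = pd.Series([5, 17, 23, 34, 45, 61, 72])   # hypothetical raw (low-level) values

# Discretization: replace raw ages by interval labels.
groups = pd.cut(ages, bins=[0, 18, 40, 60, 100],
                labels=["child", "young adult", "middle-aged", "senior"])

# Concept hierarchy: map the interval labels to a still more general concept.
to_general = {"child": "minor", "young adult": "adult",
              "middle-aged": "adult", "senior": "adult"}
print(pd.DataFrame({"age": ages, "group": groups,
                    "general": groups.map(to_general)}))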
FP-Growth
FP stands for frequent pattern. The FP-growth algorithm uses a tree structure, called an FP-tree, to map out the relationships between individual items and find the most frequently recurring patterns.
Example 1
Let the minimum support be 3.
First, the frequency (support count) of each item in the transaction database is counted:
Item : Frequency
A : 1, C : 2, D : 1, E : 4, I : 1, K : 5, M : 3, N : 2, O : 3, U : 1, Y : 3
A Frequent Pattern set L is then built, containing all items whose frequency is greater than or equal to the minimum support, sorted in descending order of frequency:
L = {K : 5, E : 4, M : 3, O : 3, Y : 3}
Solution
For each transaction, an Ordered-Item set is built by keeping only the frequent items, written in the order in which they appear in L.
Step 1: Insert the set {K, E, M, O, Y} into the FP-tree, initializing the support count of each newly created node as 1.
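A sketch of running FP-growth with the third-party mlxtend library (assumed installed, e.g. pip install mlxtend pandas). The transactions below are invented so that the frequent-item counts match L above; the infrequent items (A, C, D, ...) from the original database are omitted for brevity.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Invented transactions giving K:5, E:4, M:3, O:3, Y:3.
transactions = [
    ["K", "E", "M", "O", "Y"],
    ["K", "E", "O", "Y"],
    ["K", "E", "M"],
    ["K", "M", "Y"],
    ["K", "E", "O"],
]

# One-hot encode the transactions, then let FP-growth build the FP-tree internally
# and return every itemset whose support meets the threshold (3/5 = 0.6 here).
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
print(fpgrowth(onehot, min_support=0.6, use_colnames=True))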
Apriori Algorithm
1. The Apriori algorithm finds association rules between objects.
2. It analyzes, for example, that people who bought product A also bought product B.
3. The Apriori algorithm helps customers buy their products with ease and increases the sales performance of the particular store.
4. For example, it can be applied to the items customers buy at a Big Bazar store.
5. The algorithm uses a breadth-first search and a Hash Tree structure.
When you go to Big Bazar, you will find biscuits, chips, and chocolate bundled together. This shows that the shopkeeper makes it convenient for customers to buy these products in the same place.
Components of Apriori algorithm
The following three components comprise the Apriori algorithm:
1. Support
2. Confidence
3. Lift
Steps of the Apriori Algorithm
Step-1: Determine the support (frequency) of each itemset in the transactional database, and set the minimum support and confidence values.
Step-2: Take all the itemsets in the transactions that have a support value higher than the minimum (selected) support value.
Step-3: Find all the rules from these subsets that have a confidence value higher than the threshold (minimum) confidence.
Example 1
Solution:
Step-1: Find the frequency (support count) of each itemset individually in the dataset.
Only the itemsets whose support count meets the minimum support count are retained; here, only one three-item combination qualifies, i.e., {A, B, C}.
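A sketch of these steps using the third-party mlxtend library (assumed installed); the transaction list and the min_support / min_threshold values below are invented and do not reproduce the dataset of Example 1.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["A", "B", "C"],
    ["A", "B"],
    ["A", "C"],
    ["A", "B", "C"],
    ["B", "C"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Step-1/2: keep itemsets whose support meets the minimum support threshold.
frequent = apriori(onehot, min_support=0.6, use_colnames=True)

# Step-3: derive rules whose confidence exceeds the minimum confidence;
# the result also reports the lift of each rule.
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])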
Applications of the Apriori Algorithm
2. Product recommendations
3. Healthcare
4. Forestry
5. Autocomplete tools
6. Education