DWDM 3 Unit Notes
(KOE093)
UNIT-III
Data Mining
Data mining is the process of discovering patterns and relationships in large datasets using
techniques such as machine learning and statistical analysis. The goal of data mining is to
extract useful information from large datasets and use it to make predictions or inform
decision-making. Data mining is important because it allows organizations to uncover
insights and trends in their data that would be difficult or impossible to discover manually.
This can help organizations make better decisions, improve their operations, and gain a
competitive advantage. Data mining is also a rapidly growing field, with many new
techniques and applications being developed every year.
Applications of Data Mining:
2. Fraud Detection: Data mining is widely used in the financial industry to detect and
prevent fraud. It involves analyzing data on transactions and customer behavior to
identify patterns or anomalies that may indicate fraudulent activity.
5. Network Intrusion Detection: Data mining is used in the cyber security industry to
detect network intrusions and prevent cyber attacks. It involves analyzing data on
network traffic and behavior to identify patterns that may indicate an attempted
intrusion, and using this information to alert security teams and prevent attacks.
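Both of these applications lean on anomaly detection. Below is a minimal sketch in Python, assuming scikit-learn and a toy transactions table; the column names (amount, hour) and the contamination rate are illustrative, not taken from any real system.

# Minimal sketch: flagging anomalous transactions with an isolation forest.
import pandas as pd
from sklearn.ensemble import IsolationForest

transactions = pd.DataFrame({
    "amount": [25.0, 30.5, 19.9, 4999.0, 27.3, 22.1],
    "hour":   [13,   14,   12,   3,      15,   13],
})

detector = IsolationForest(contamination=0.1, random_state=42)
# fit_predict() returns -1 for anomalies and 1 for normal points
transactions["flag"] = detector.fit_predict(transactions[["amount", "hour"]])
print(transactions[transactions["flag"] == -1])  # candidate fraud cases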
Data Preprocessing: Data preprocessing is the process of preparing data for analysis.
This typically involves cleaning and transforming the data to remove errors,
inconsistencies, and irrelevant information, and to make it suitable for analysis. Data
preprocessing is an important step in data mining, as it ensures that the data is of high
quality and is ready for analysis.
Data Mining Algorithms: Data mining algorithms are the algorithms and models that
are used to perform data mining. These algorithms can include supervised and
unsupervised learning algorithms, such as regression, classification, and clustering, as
well as more specialized algorithms for specific tasks, such as association rule mining
and anomaly detection. Data mining algorithms are applied to the data to extract useful
insights and information from it.
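As a minimal sketch of the two main families, the snippet below fits a supervised classifier and an unsupervised clustering model on the same toy data, assuming scikit-learn; the data points and model choices are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.2, 1.8], [8.0, 9.0], [8.5, 9.5]])
y = np.array([0, 0, 1, 1])  # labels available -> supervised learning

clf = LogisticRegression().fit(X, y)   # classification (supervised)
print(clf.predict([[1.1, 2.1]]))       # -> [0]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # clustering (unsupervised)
print(km.labels_)                      # cluster assignment per row, no labels needed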
Data Visualization: Data visualization is the process of presenting data and insights
in a clear and effective manner, typically using charts, graphs, and other
visualizations. Data visualization is an important part of data mining, as it allows data
miners to communicate their findings and insights to others in a way that is easy to
understand and interpret.
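A minimal sketch of one such visualization, assuming matplotlib is available; the values are illustrative.

import matplotlib.pyplot as plt

ages = [22, 25, 25, 31, 34, 35, 38, 41, 45, 52, 58, 63]
plt.hist(ages, bins=5, edgecolor="black")  # summarize the distribution
plt.xlabel("Age")
plt.ylabel("Count")
plt.title("Distribution of customer ages")
plt.show()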
Overall, a data mining architecture typically includes several key components, which work
together to perform data mining tasks and extract useful insights and information from
data. These components include data sources, data preprocessing, data mining algorithms,
and data visualization, and are essential for enabling effective and efficient data mining.
There are many different types of data mining, but they can generally be grouped into
three broad categories: descriptive, predictive, and prescriptive.
Descriptive data mining involves summarizing and describing the characteristics of a
data set. This type of data mining is often used to explore and understand the data,
identify patterns and trends, and summarize the data in a meaningful way.
Predictive data mining involves using data to build models that can make predictions
or forecasts about future events or outcomes. This type of data mining is often used to
identify and model relationships between different variables, and to make predictions
about future events or outcomes based on those relationships.
Prescriptive data mining involves using data and models to make recommendations
or suggestions about actions or decisions. This type of data mining is often used to
optimize processes, allocate resources, or make other decisions that can help
organizations achieve their goals.
The data mining process typically begins by defining the problem to be solved and
collecting the relevant data. Next, the data is prepared for analysis. This involves
cleaning the data, transforming it into a usable format, and checking for errors or
inconsistencies.
Once the data is prepared, you can begin exploring it to gain insights and understand
its characteristics. This typically involves using visualization and summary statistics
to understand the distribution, patterns, and trends in the data.
The next step is to build models that can be used to make predictions or forecasts
based on the data. This involves choosing an appropriate modeling technique, fitting
the model to the data, and evaluating its performance.
After the model is built, it is important to validate its performance to ensure that it is
accurate and reliable. This typically involves using a separate data set (called a
validation set) to evaluate the model’s performance and make any necessary
adjustments.
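A minimal sketch of the build-and-validate steps just described, assuming scikit-learn; the synthetic dataset, the 25% hold-out fraction, and the decision-tree model are illustrative choices, not the only valid ones.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
# hold out 25% of the data as a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))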
The final step in the data mining process is to evaluate the results of the model and
determine its effectiveness in solving the problem or achieving the goals. This
involves measuring the model’s performance, comparing it to other models or
approaches, and making any necessary changes or improvements.
Data preprocessing is an important step in the data mining process that involves cleaning
and transforming raw data to make it suitable for analysis. Some common steps in data
preprocessing include:
Data Cleaning: This involves identifying and correcting errors or inconsistencies in the
data, such as missing values, outliers, and duplicates. Various techniques can be used for
data cleaning, such as imputation, removal, and transformation.
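A minimal sketch of these cleaning techniques with pandas; the toy table, the median imputation, and the outlier threshold are illustrative.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 31, 240],          # NaN = missing, 240 = outlier
    "income": [40000, 52000, 61000, 61000, 58000],
})

df["age"] = df["age"].fillna(df["age"].median())  # imputation of missing values
df = df.drop_duplicates()                         # removal of duplicate rows
df = df[df["age"] < 120]                          # crude outlier removal
print(df)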
Data Integration: This involves combining data from multiple sources to create a unified
dataset. Data integration can be challenging as it requires handling data with different
formats, structures, and semantics. Techniques such as record linkage and data fusion can
be used for data integration.
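A minimal sketch of integration as a key-based merge with pandas; real record linkage and data fusion are more involved, and the tables and the cust_id key here are illustrative.

import pandas as pd

crm   = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Ann", "Raj", "Li"]})
sales = pd.DataFrame({"cust_id": [1, 2, 4], "total": [120.0, 75.5, 40.0]})

# outer join keeps records that appear in only one source
unified = crm.merge(sales, on="cust_id", how="outer")
print(unified)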
Data Transformation: This involves converting the data into a suitable format for
analysis. Common techniques used in data transformation include normalization,
standardization, and discretization. Normalization is used to scale the data to a common
range, while standardization is used to transform the data to have zero mean and unit
variance. Discretization is used to convert continuous data into discrete categories.
Data Reduction: This involves reducing the size of the dataset while preserving the
important information. Data reduction can be achieved through techniques such as feature
selection and feature extraction. Feature selection involves selecting a subset of relevant
features from the dataset, while feature extraction involves transforming the data into a
lower-dimensional space while preserving the important information.
Data Discretization: This involves dividing continuous data into discrete categories or
intervals. Discretization is often used in data mining and machine learning algorithms that
require categorical data. Discretization can be achieved through techniques such as equal
width binning, equal frequency binning, and clustering.
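A minimal sketch of equal-width and equal-frequency binning with pandas; the income values and the number of bins are illustrative.

import pandas as pd

income = pd.Series([12, 15, 18, 22, 25, 30, 48, 50, 90])

equal_width = pd.cut(income, bins=3)   # 3 bins covering equal value ranges
equal_freq  = pd.qcut(income, q=3)     # 3 bins holding roughly equal counts
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())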
Data Normalization: This involves scaling the data to a common range, such as between
0 and 1 or -1 and 1. Normalization is often used to handle data with different units and
scales. Common normalization techniques include min-max normalization, z-score
normalization, and decimal scaling.
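A minimal sketch of the three techniques implemented directly with NumPy; the input values are illustrative.

import numpy as np

x = np.array([120.0, 250.0, 380.0, 990.0])

min_max = (x - x.min()) / (x.max() - x.min())           # scales into [0, 1]
z_score = (x - x.mean()) / x.std()                      # zero mean, unit variance
decimal = x / 10 ** np.ceil(np.log10(np.abs(x).max()))  # decimal scaling: |x'| < 1

print(min_max, z_score, decimal, sep="\n")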
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of
the analysis results. The specific steps involved in data preprocessing may vary depending
on the nature of the data and the analysis goals.
By performing these steps, the data mining process becomes more efficient and the results
become more accurate.
Preprocessing in Data Mining:
Data preprocessing is a data mining technique used to transform raw data into a useful
and efficient format.
1. Data Cleaning:
This step handles missing values and noisy data. Noisy data can be smoothed in the
following ways:
1. Binning:
This method smooths sorted data values by consulting their neighbourhood: the values
are distributed into bins, and each value is replaced by the bin mean or a bin boundary.
2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression
used may be linear (having one independent variable) or multiple (having multiple
independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they
will fall outside the clusters.
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining
process. This involves the following ways:
1. Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to
1.0).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help
the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute with interval levels or
conceptual levels.
3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the size
of the dataset while preserving the important information. This is done to improve the
efficiency of data analysis and to avoid overfitting of the model. Some common steps
involved in data reduction are:
Feature Selection: This involves selecting a subset of relevant features from the dataset.
Feature selection is often performed to remove irrelevant or redundant features from the
dataset. It can be done using various techniques such as correlation analysis, mutual
information, and principal component analysis (PCA).
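A minimal sketch of filter-style feature selection using correlation and mutual information, assuming scikit-learn and pandas; the synthetic features f1 and f2 are illustrative, with the target depending only on f1.

import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = pd.DataFrame({"f1": rng.normal(size=200), "f2": rng.normal(size=200)})
y = (X["f1"] > 0).astype(int)  # target depends only on f1

print(X.corrwith(y))                              # correlation analysis
print(mutual_info_classif(X, y, random_state=0))  # mutual information per feature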
Feature Extraction: This involves transforming the data into a lower-dimensional space
while preserving the important information. Feature extraction is often used when the
original features are high-dimensional and complex. It can be done using techniques such
as PCA, linear discriminant analysis (LDA), and non-negative matrix factorization (NMF).
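A minimal sketch of feature extraction with PCA, assuming scikit-learn; the Iris data stands in for any higher-dimensional table.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # 150 samples, 4 features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)      # 150 samples, 2 features

print(X_reduced.shape)
print(pca.explained_variance_ratio_)  # share of variance kept per component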
Sampling: This involves selecting a subset of data points from the dataset. Sampling is
often used to reduce the size of the dataset while preserving the important information. It
can be done using techniques such as random sampling, stratified sampling, and systematic
sampling.
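A minimal sketch of random and stratified sampling with pandas and scikit-learn; the table size and the 80/20 label split are illustrative.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"value": range(100), "label": [0] * 80 + [1] * 20})

random_sample = df.sample(frac=0.2, random_state=0)  # simple random sample
strat_sample, _ = train_test_split(                  # stratified sample
    df, train_size=0.2, stratify=df["label"], random_state=0)

print(strat_sample["label"].value_counts())  # preserves the 80/20 ratio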
Clustering: This involves grouping similar data points together into clusters. Clustering is
often used to reduce the size of the dataset by replacing similar data points with a
representative centroid. It can be done using techniques such as k-means, hierarchical
clustering, and density-based clustering.
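A minimal sketch of clustering-based reduction with k-means, assuming scikit-learn; the data sizes are illustrative.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))       # 1,000 points in 3 dimensions

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
X_reduced = km.cluster_centers_      # 10 representative centroids
print(X.shape, "->", X_reduced.shape)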
Compression: This involves compressing the dataset while preserving the important
information. Compression is often used to reduce the size of the dataset for storage and
transmission purposes. It can be done using techniques such as wavelet compression,
JPEG compression, and gzip compression.
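A minimal sketch of lossless compression using Python's standard-library gzip module; the CSV content is illustrative.

import gzip

csv_bytes = ("id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))).encode()

compressed = gzip.compress(csv_bytes)
print(len(csv_bytes), "->", len(compressed), "bytes")

restored = gzip.decompress(compressed)  # lossless: the original is recovered exactly
assert restored == csv_bytes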
Data Reduction in Data Mining
INTRODUCTION:
Data reduction is a technique used in data mining to reduce the size of a dataset while still
preserving the most important information. This can be beneficial in situations where the
dataset is too large to be processed efficiently, or where the dataset contains a large
amount of irrelevant or redundant information.
There are several different data reduction techniques that can be used in data
mining, including:
1. Data Sampling: This technique involves selecting a subset of the data to work with,
rather than using the entire dataset. This can be useful for reducing the size of a dataset
while still preserving the overall trends and patterns in the data.
2. Dimensionality Reduction: This technique involves reducing the number of features
in the dataset, either by removing features that are not relevant or by combining
multiple features into a single feature.
3. Data Compression: This technique involves using techniques such as lossy or lossless
compression to reduce the size of a dataset.
4. Data Discretization: This technique involves converting continuous data into discrete
data by partitioning the range of possible values into intervals or bins.
5. Feature Selection: This technique involves selecting a subset of features from the
dataset that are most relevant to the task at hand.
It is important to note that data reduction involves a trade-off between the size of the
data and the accuracy of the results: the more aggressively the data is reduced, the
more information may be lost, which can make the resulting model less accurate and
less generalizable.
For example, step-wise forward selection starts with an empty attribute set and, at each
step, adds the best of the remaining attributes:
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
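A minimal sketch of forward selection with scikit-learn's SequentialFeatureSelector on synthetic data; the attributes it keeps will not literally be X1, X2, X5.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000), n_features_to_select=3,
    direction="forward")
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the three attributes kept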
Data reduction in data mining can have a number of advantages and disadvantages.
Advantages:
1. Improved efficiency: Data reduction can help to improve the efficiency of machine
learning algorithms by reducing the size of the dataset. This can make it faster and
more practical to work with large datasets.
2. Improved performance: Data reduction can help to improve the performance of
machine learning algorithms by removing irrelevant or redundant information from the
dataset. This can help to make the model more accurate and robust.
3. Reduced storage costs: Data reduction can help to reduce the storage costs associated
with large datasets by reducing the size of the data.
4. Improved interpretability: Data reduction can help to improve the interpretability of
the results by removing irrelevant or redundant information from the dataset.
Disadvantages:
1. Loss of information: Reducing the data can discard useful detail, making the
resulting model less accurate and less generalizable, as noted in the trade-off above.
2. Additional processing overhead: The reduction step itself (e.g., PCA, clustering, or
sampling) adds computation time to the pipeline.