
DATA WAREHOUSING & DATA MINING

(KOE093)
UNIT-III
Data Mining
Data mining is the process of discovering patterns and relationships in large datasets using
techniques such as machine learning and statistical analysis. The goal of data mining is to
extract useful information from large datasets and use it to make predictions or inform
decision-making. Data mining is important because it allows organizations to uncover
insights and trends in their data that would be difficult or impossible to discover manually.

This can help organizations make better decisions, improve their operations, and gain a
competitive advantage. Data mining is also a rapidly growing field, with many new
techniques and applications being developed every year.

5 Use Cases of Data Mining


Data mining has a wide range of applications and use cases across many industries and
domains. Some of the most common use cases of data mining include:
1. Market Basket Analysis: Market basket analysis is a common use case of data
mining in the retail and e-commerce industries. It involves analyzing data on customer
purchases to identify items that are frequently purchased together, and using this
information to make recommendations or suggestions to customers (a small sketch of this
idea appears after this list).

2. Fraud Detection: Data mining is widely used in the financial industry to detect and
prevent fraud. It involves analyzing data on transactions and customer behavior to
identify patterns or anomalies that may indicate fraudulent activity.

3. Customer Segmentation: Data mining is commonly used in the marketing and
advertising industries to segment customers into different groups based on their
characteristics and behavior. This information can then be used to tailor marketing and
advertising campaigns to specific segments of customers.

4. Predictive Maintenance: Data mining is increasingly used in the manufacturing and
industrial sectors to predict when equipment or machinery is likely to fail or require
maintenance. It involves analyzing data on the performance and usage of equipment to
identify patterns that can indicate potential failures, and using this information to
schedule maintenance and prevent downtime.

5. Network Intrusion Detection: Data mining is used in the cyber security industry to
detect network intrusions and prevent cyber attacks. It involves analyzing data on
network traffic and behavior to identify patterns that may indicate an attempted
intrusion, and using this information to alert security teams and prevent attacks.
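
To make the market basket idea concrete, here is a minimal Python sketch of the item
co-occurrence counting that underlies it. The transactions are made up for illustration; a
real system would run an association-rule algorithm such as Apriori or FP-Growth on actual
purchase data.

from collections import Counter
from itertools import combinations

# Toy transaction data (hypothetical); each row is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

pair_counts = Counter()
for basket in transactions:
    # Count every unordered pair of items bought together.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

n = len(transactions)
for pair, count in pair_counts.most_common(3):
    support = count / n  # fraction of baskets containing both items
    print(pair, f"support={support:.2f}")
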

Data Mining Architecture


Data mining architecture refers to the overall design and structure of a data mining system.
A data mining architecture typically includes several key components, which work
together to perform data mining tasks and extract useful insights and information from
data. Some of the key components of a typical data mining architecture include:
 Data Sources: Data sources are the sources of data that are used in data mining. These
can include structured and unstructured data from databases, files, sensors, and other
sources. Data sources provide the raw data that is used in data mining and can be
processed, cleaned, and transformed to create a usable data set for analysis.

 Data Preprocessing: Data preprocessing is the process of preparing data for analysis.
This typically involves cleaning and transforming the data to remove errors,
inconsistencies, and irrelevant information, and to make it suitable for analysis. Data
preprocessing is an important step in data mining, as it ensures that the data is of high
quality and is ready for analysis.

 Data Mining Algorithms: Data mining algorithms are the algorithms and models that
are used to perform data mining. These algorithms can include supervised and
unsupervised learning algorithms, such as regression, classification, and clustering, as
well as more specialized algorithms for specific tasks, such as association rule mining
and anomaly detection. Data mining algorithms are applied to the data to extract useful
insights and information from it.

 Data Visualization: Data visualization is the process of presenting data and insights
in a clear and effective manner, typically using charts, graphs, and other
visualizations. Data visualization is an important part of data mining, as it allows data
miners to communicate their findings and insights to others in a way that is easy to
understand and interpret.

Overall, these components (data sources, data preprocessing, data mining algorithms, and
data visualization) work together to perform data mining tasks and extract useful insights
and information from data, and they are essential for enabling effective and efficient
data mining.

3 Types of Data Mining

There are many different types of data mining, but they can generally be grouped into
three broad categories: descriptive, predictive, and prescriptive.
 Descriptive data mining involves summarizing and describing the characteristics of a
data set. This type of data mining is often used to explore and understand the data,
identify patterns and trends, and summarize the data in a meaningful way.

 Predictive data mining involves using data to build models that can make predictions
or forecasts about future events or outcomes. This type of data mining is often used to
identify and model relationships between different variables, and to make predictions
about future events or outcomes based on those relationships.

 Prescriptive data mining involves using data and models to make recommendations
or suggestions about actions or decisions. This type of data mining is often used to
optimize processes, allocate resources, or make other decisions that can help
organizations achieve their goals.

How Does Data Mining Work?


Data mining is the process of extracting useful information and insights from large data
sets. It typically involves several steps, including defining the problem, preparing the
data, exploring the data, modeling the data, validating the model, implementing the
model, and evaluating the results. Let’s understand the process of Data Mining in the
following phases:
 The process of data mining typically begins with defining the problem or question
that you want to answer with your data. This involves understanding the business
context and goals and identifying the data that is relevant to the problem.

 Next, the data is prepared for analysis. This involves cleaning the data, transforming
it into a usable format, and checking for errors or inconsistencies.

 Once the data is prepared, you can begin exploring it to gain insights and understand
its characteristics. This typically involves using visualization and summary statistics
to understand the distribution, patterns, and trends in the data.

 The next step is to build models that can be used to make predictions or forecasts
based on the data. This involves choosing an appropriate modeling technique, fitting
the model to the data, and evaluating its performance.

 After the model is built, it is important to validate its performance to ensure that it is
accurate and reliable. This typically involves using a separate data set (called a
validation set) to evaluate the model’s performance and make any necessary
adjustments (a minimal code sketch follows this list).

 Once the model has been validated, it can be implemented in a production
environment to make predictions or recommendations. This involves deploying the
model and integrating it into the organization’s existing systems and processes.

 The final step in the data mining process is to evaluate the results of the model and
determine its effectiveness in solving the problem or achieving the goals. This
involves measuring the model’s performance, comparing it to other models or
approaches, and making any necessary changes or improvements.
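
As a concrete illustration of the modeling and validation steps above, here is a minimal
sketch using scikit-learn (an assumption; any modeling library would do). The data is
synthetic and the linear model is chosen only for simplicity.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                                 # hypothetical predictors
y = 4 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.5, size=200)

# Hold out a separate validation set to evaluate the model on unseen data.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=1)

model = LinearRegression().fit(X_train, y_train)              # modeling step
print("validation R^2:", model.score(X_val, y_val))           # validation step
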

Data Preprocessing in Data Mining


Data preprocessing is an important step in the data mining process. It refers to the
cleaning, transforming, and integrating of data in order to make it ready for analysis. The
goal of data preprocessing is to improve the quality of the data and to make it more
suitable for the specific data mining task.
Some common steps in data preprocessing include:
Data Cleaning: This involves identifying and correcting errors or inconsistencies in the
data, such as missing values, outliers, and duplicates. Various techniques can be used for
data cleaning, such as imputation, removal, and transformation.
Data Integration: This involves combining data from multiple sources to create a unified
dataset. Data integration can be challenging as it requires handling data with different
formats, structures, and semantics. Techniques such as record linkage and data fusion can
be used for data integration.
Data Transformation: This involves converting the data into a suitable format for
analysis. Common techniques used in data transformation include normalization,
standardization, and discretization. Normalization is used to scale the data to a common
range, while standardization is used to transform the data to have zero mean and unit
variance. Discretization is used to convert continuous data into discrete categories.
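
A minimal NumPy sketch of these three transformations, using made-up attribute values,
might look like this:

import numpy as np

values = np.array([12.0, 15.0, 18.0, 30.0, 45.0])   # hypothetical attribute values

# Min-max normalization: rescale the values to the range [0, 1].
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score standardization: zero mean and unit variance.
z_score = (values - values.mean()) / values.std()

# Equal-width discretization into 3 bins (returns a bin index per value).
edges = np.linspace(values.min(), values.max(), num=4)   # 3 intervals -> 4 edges
discrete = np.digitize(values, edges[1:-1])

print(min_max, z_score, discrete, sep="\n")
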
Data Reduction: This involves reducing the size of the dataset while preserving the
important information. Data reduction can be achieved through techniques such as feature
selection and feature extraction. Feature selection involves selecting a subset of relevant
features from the dataset, while feature extraction involves transforming the data into a
lower-dimensional space while preserving the important information.
Data Discretization: This involves dividing continuous data into discrete categories or
intervals. Discretization is often used in data mining and machine learning algorithms that
require categorical data. Discretization can be achieved through techniques such as equal
width binning, equal frequency binning, and clustering.
Data Normalization: This involves scaling the data to a common range, such as between
0 and 1 or -1 and 1. Normalization is often used to handle data with different units and
scales. Common normalization techniques include min-max normalization, z-score
normalization, and decimal scaling.
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of
the analysis results. The specific steps involved in data preprocessing may vary depending
on the nature of the data and the analysis goals.
By performing these steps, the data mining process becomes more efficient and the results
become more accurate.
Preprocessing in Data Mining:
Data preprocessing is a data mining technique used to transform raw data into a useful and
efficient format.

Steps Involved in Data Preprocessing:


1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning is
done. It involves handling of missing data, noisy data etc.

 (a). Missing Data:
This situation arises when some values are missing from the data set. It can be handled in
various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple
values are missing within a tuple.

2. Fill the Missing Values:
There are various ways to do this. You can choose to fill the missing values manually, with
the attribute mean, or with the most probable value.
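
For example, a small pandas sketch of mean imputation and most-frequent-value imputation
(the column names and values are hypothetical) could look like this:

import pandas as pd

# Hypothetical data set with missing values (NaN).
df = pd.DataFrame({
    "age": [25, None, 31, 40, None],
    "city": ["Delhi", "Mumbai", None, "Delhi", "Delhi"],
})

# Fill a numeric attribute with its mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Fill a categorical attribute with its most probable (most frequent) value.
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
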

 (b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated
by faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data set is divided into
segments of equal size and each segment is handled separately. All values in a segment can
be replaced by the segment mean, or the boundary values can be used to smooth them (a
minimal sketch follows this list).

2. Regression:
Here data can be smoothed by fitting it to a regression function. The regression used may
be linear (having one independent variable) or multiple (having multiple independent
variables).

3. Clustering:
This approach groups similar data into clusters. Outliers either remain undetected or fall
outside the clusters.
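
The binning sketch referred to above: smoothing sorted values by replacing each equal-size
segment with its mean (the values are illustrative only).

# Smoothing noisy data by bin means: a minimal sketch.
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bin_size = 3

smoothed = []
for i in range(0, len(data), bin_size):
    segment = data[i:i + bin_size]
    mean = sum(segment) / len(segment)
    # Replace every value in the segment by the segment mean.
    smoothed.extend([round(mean, 1)] * len(segment))

print(smoothed)
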

2. Data Transformation:
This step is taken to transform the data into forms appropriate for the mining process. It
involves the following approaches:
1. Normalization:
It is done in order to scale the data values into a specified range, such as -1.0 to 1.0 or
0.0 to 1.0.

2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help
the mining process.

3. Discretization:
This is done to replace the raw values of a numeric attribute with interval labels or
conceptual labels.

4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For
example, the attribute “city” can be generalized to “country”.
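
A minimal sketch of concept hierarchy generation, mapping the low-level attribute "city" to
the higher-level attribute "country" through a hypothetical lookup table:

# Concept hierarchy generation: the mapping itself is made up for illustration.
city_to_country = {
    "Delhi": "India",
    "Mumbai": "India",
    "Paris": "France",
    "Lyon": "France",
}

records = [{"city": "Delhi"}, {"city": "Paris"}, {"city": "Mumbai"}]
for record in records:
    record["country"] = city_to_country.get(record["city"], "Unknown")

print(records)
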

3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the size
of the dataset while preserving the important information. This is done to improve the
efficiency of data analysis and to avoid overfitting of the model. Some common steps
involved in data reduction are:
Feature Selection: This involves selecting a subset of relevant features from the dataset.
Feature selection is often performed to remove irrelevant or redundant features from the
dataset. It can be done using various techniques such as correlation analysis, mutual
information, and principal component analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-dimensional space
while preserving the important information. Feature extraction is often used when the
original features are high-dimensional and complex. It can be done using techniques such
as PCA, linear discriminant analysis (LDA), and non-negative matrix factorization (NMF).
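
As an illustration of feature extraction, here is a minimal scikit-learn sketch that
projects four hypothetical features onto two principal components. The data is made up, and
PCA is only one of the techniques mentioned above.

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 6 records with 4 features each.
X = np.array([
    [2.5, 2.4, 0.5, 1.0],
    [0.5, 0.7, 2.2, 0.9],
    [2.2, 2.9, 1.9, 1.1],
    [1.9, 2.2, 3.1, 0.8],
    [3.1, 3.0, 2.3, 1.2],
    [2.3, 2.7, 2.0, 1.0],
])

# Feature extraction with PCA: project the 4 features onto 2 components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (6, 2)
print(pca.explained_variance_ratio_)    # variance preserved by each component
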
Sampling: This involves selecting a subset of data points from the dataset. Sampling is
often used to reduce the size of the dataset while preserving the important information. It
can be done using techniques such as random sampling, stratified sampling, and systematic
sampling.
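
A small pandas sketch of simple random sampling and stratified sampling on a hypothetical
labelled data set:

import pandas as pd

# Hypothetical data set with a class label used for stratification.
df = pd.DataFrame({
    "value": range(100),
    "label": ["A"] * 70 + ["B"] * 30,
})

# Simple random sampling: keep 10% of the rows.
random_sample = df.sample(frac=0.1, random_state=42)

# Stratified sampling: keep 10% of each label so class proportions are preserved.
stratified_sample = (
    df.groupby("label", group_keys=False)
      .apply(lambda g: g.sample(frac=0.1, random_state=42))
)

print(len(random_sample), stratified_sample["label"].value_counts().to_dict())
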
Clustering: This involves grouping similar data points together into clusters. Clustering is
often used to reduce the size of the dataset by replacing similar data points with a
representative centroid. It can be done using techniques such as k-means, hierarchical
clustering, and density-based clustering.
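
A minimal scikit-learn sketch of clustering-based reduction, where six hypothetical points
are replaced by two representative centroids:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data points to be reduced.
X = np.array([[1.0, 2.0], [1.2, 1.9], [0.9, 2.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])

# Replace the 6 points with 2 representative centroids.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
reduced = kmeans.cluster_centers_

print(reduced)           # two centroids stand in for the original points
print(kmeans.labels_)    # which centroid each original point maps to
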
Compression: This involves compressing the dataset while preserving the important
information. Compression is often used to reduce the size of the dataset for storage and
transmission purposes. It can be done using techniques such as wavelet compression,
JPEG compression, and gzip compression.
Data Reduction in Data Mining

INTRODUCTION:

Data reduction is a technique used in data mining to reduce the size of a dataset while still
preserving the most important information. This can be beneficial in situations where the
dataset is too large to be processed efficiently, or where the dataset contains a large
amount of irrelevant or redundant information.

There are several different data reduction techniques that can be used in data
mining, including:

1. Data Sampling: This technique involves selecting a subset of the data to work with,
rather than using the entire dataset. This can be useful for reducing the size of a dataset
while still preserving the overall trends and patterns in the data.
2. Dimensionality Reduction: This technique involves reducing the number of features
in the dataset, either by removing features that are not relevant or by combining
multiple features into a single feature.
3. Data Compression: This technique involves using techniques such as lossy or lossless
compression to reduce the size of a dataset.
4. Data Discretization: This technique involves converting continuous data into discrete
data by partitioning the range of possible values into intervals or bins.
5. Feature Selection: This technique involves selecting a subset of features from the
dataset that are most relevant to the task at hand.
It is important to note that data reduction involves a trade-off between accuracy and data
size: the more the data is reduced, the greater the risk of discarding information, which
can make the resulting model less accurate and less generalizable.

Methods of data reduction:


These are explained as following below.
1. Data Cube Aggregation:
This technique is used to aggregate data into a simpler form. For example, imagine that the
data gathered for your analysis covers the years 2012 to 2014 and records your company's
revenue every three months. If the analysis needs annual sales rather than quarterly
figures, the data can be summarized so that the result shows total sales per year instead
of per quarter.
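
A minimal pandas sketch of this aggregation, rolling hypothetical quarterly revenue up to
annual totals:

import pandas as pd

# Hypothetical quarterly revenue for 2012-2014.
quarterly = pd.DataFrame({
    "year":    [2012] * 4 + [2013] * 4 + [2014] * 4,
    "quarter": [1, 2, 3, 4] * 3,
    "revenue": [224, 408, 350, 586, 310, 420, 390, 600, 280, 450, 410, 630],
})

# Aggregate up the time hierarchy: total sales per year instead of per quarter.
annual = quarterly.groupby("year", as_index=False)["revenue"].sum()
print(annual)
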
2. Dimension reduction:
Whenever we come across attributes that are only weakly relevant, we keep just the
attributes required for our analysis. This reduces the data size by eliminating outdated or
redundant features.
 Step-wise Forward Selection –
The selection begins with an empty set of attributes; at each step, the best of the
remaining original attributes is added to the set based on its relevance (judged, for
example, by a statistical significance test such as a p-value). A minimal code sketch
follows these bullets.
Suppose there are the following attributes in the data set in which few attributes are
redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }

Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}


 Step-wise Backward Selection –
This selection starts with the complete set of attributes in the original data and, at each
step, eliminates the worst remaining attribute from the set.
Suppose there are the following attributes in the data set in which few attributes are
redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: {X1, X2, X3, X4, X5, X6 }

Step-1: {X1, X2, X3, X4, X5}


Step-2: {X1, X2, X3, X5}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}


 Combination of Forward and Backward Selection –
It allows us to remove the worst and select the best attributes, saving time and making the
process faster.
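
The code sketch referred to above: a minimal greedy forward selection using scikit-learn's
SequentialFeatureSelector on synthetic data. The estimator and the data are assumptions;
setting direction="backward" gives step-wise backward elimination instead.

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))                    # six candidate attributes X1..X6
y = 3 * X[:, 0] + 2 * X[:, 1] - X[:, 4] + rng.normal(scale=0.1, size=100)

# Greedy forward selection: start from an empty set and add the best attribute
# at each step until three attributes remain.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward"
)
selector.fit(X, y)

print(selector.get_support())   # boolean mask over the six candidate attributes
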
3. Data Compression:
The data compression technique reduces the size of files using different encoding
mechanisms (e.g., Huffman encoding and run-length encoding). It can be divided into two
types based on the compression technique used:
 Lossless Compression –
Encoding techniques (Run Length Encoding) allow a simple and minimal data size
reduction. Lossless data compression uses algorithms to restore the precise original
data from the compressed data.
 Lossy Compression –
Methods such as the discrete wavelet transform and PCA (principal component analysis) are
examples of lossy compression. For example, the JPEG image format is lossy, yet the result
remains visually close to the original image. In lossy compression, the decompressed data
may differ from the original data but is still useful enough to retrieve the required
information.
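
A minimal sketch of the lossless case using Python's built-in zlib module (whose DEFLATE
algorithm combines LZ77 with Huffman coding); the input bytes are made up, but the round
trip recovers them exactly:

import zlib

# Lossless compression round trip: the original bytes are recovered exactly.
original = b"AAAAABBBCCCCCCCCDD" * 100      # repetitive, highly compressible data
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), len(compressed))       # compressed size is much smaller
print(restored == original)                 # True: no information was lost
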
4. Numerosity Reduction:
In this reduction technique, the actual data is replaced with a mathematical model or a
smaller representation of the data; for parametric methods, only the model parameters need
to be stored. Non-parametric alternatives include clustering, histograms, and sampling.
5. Discretization & Concept Hierarchy Operation:
Data discretization techniques are used to divide continuous attributes into intervals.
Many constant values of an attribute are replaced by labels of small intervals, so that
mining results can be presented in a concise and easily understandable way.
 Top-down discretization –
If you first consider one or a couple of points (so-called breakpoints or split points) to
divide the whole range of values and then repeat this method on the resulting intervals,
the process is known as top-down discretization, also called splitting.
 Bottom-up discretization –
If you first consider all of the distinct values as potential split points and then merge
neighbouring values into intervals, discarding some split points along the way, the process
is called bottom-up discretization, also known as merging.
Concept Hierarchies:
They reduce the data size by collecting and replacing low-level concepts (such as an age of
43) with high-level concepts (categorical labels such as middle-aged or senior).
For numeric data following techniques can be followed:
 Binning –
Binning is the process of changing numerical variables into categorical counterparts.
The number of categorical counterparts depends on the number of bins specified by
the user.
 Histogram analysis –
Like binning, histogram analysis partitions the values of an attribute X into disjoint
ranges called buckets (or brackets). There are several partitioning rules:
1. Equal Frequency partitioning: partition the values so that each bucket holds roughly
the same number of occurrences from the data set.
2. Equal Width partitioning: partition the values into intervals of fixed width based on
the number of bins, e.g. buckets spanning 0-20, 20-40, and so on.
3. Clustering: Grouping similar data together.
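
A small pandas sketch of equal-width and equal-frequency partitioning on hypothetical age
values:

import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 22, 25, 25, 30, 33, 35, 36, 40, 45, 46, 52, 70])

# Equal-width partitioning: 4 bins of equal range.
equal_width = pd.cut(ages, bins=4)

# Equal-frequency partitioning: 4 bins with (roughly) the same number of values.
equal_freq = pd.qcut(ages, q=4)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
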

ADVANTAGES OR DISADVANTAGES OF Data Reduction in Data Mining:

Data reduction in data mining can have a number of advantages and disadvantages.

Advantages:

1. Improved efficiency: Data reduction can help to improve the efficiency of machine
learning algorithms by reducing the size of the dataset. This can make it faster and
more practical to work with large datasets.
2. Improved performance: Data reduction can help to improve the performance of
machine learning algorithms by removing irrelevant or redundant information from the
dataset. This can help to make the model more accurate and robust.
3. Reduced storage costs: Data reduction can help to reduce the storage costs associated
with large datasets by reducing the size of the data.
4. Improved interpretability: Data reduction can help to improve the interpretability of
the results by removing irrelevant or redundant information from the dataset.

Disadvantages:

1. Loss of information: Data reduction can result in a loss of information if important
data is removed during the reduction process.
2. Impact on accuracy: Data reduction can impact the accuracy of a model, as reducing
the size of the dataset can also remove important information that is needed for
accurate predictions.
3. Impact on interpretability: Data reduction can make it harder to interpret the results, as
removing irrelevant or redundant information can also remove context that is needed
to understand the results.
4. Additional computational costs: Data reduction can add additional computational costs
to the data mining process, as it requires additional processing time to reduce the data.
In conclusion, data reduction can have both advantages and disadvantages. It can improve
the efficiency and performance of machine learning algorithms by reducing the size of the
dataset. However, it can also result in a loss of information and make the results harder
to interpret. It is important to weigh the pros and cons of data reduction and carefully
assess the risks and benefits before implementing it.

Prepared By:

Manoj Kumar Sharma


Assistant Professor
Department of CSE
VGI
