Data Mining
Data Source:
The actual sources of data are databases, data warehouses, the World Wide Web (WWW), text
files, and other documents. You need a huge amount of historical data for data mining to
be successful. Organizations typically store data in databases or data warehouses. Data
warehouses may comprise one or more databases, text files, spreadsheets, or other
repositories of data. Sometimes, even plain text files or spreadsheets may contain
information. Another primary source of data is the World Wide Web, or the internet.
Different processes:
Before passing the data to the database or data warehouse server, the data must be
cleaned, integrated, and selected. As the information comes from various sources and in
different formats, it cannot be used directly for the data mining procedure, because the data
may not be complete and accurate. So, the data first needs to be cleaned and unified.
More information than needed will be collected from the various data sources, and only the
data of interest has to be selected and passed to the server. These procedures are
not as easy as they sound. Several methods may be performed on the data as part of
selection, integration, and cleaning.
Data Mining Engine:
In other words, we can say the data mining engine is the core of the data mining architecture. It
comprises the instruments and software used to obtain insights and knowledge from data
collected from various data sources and stored within the data warehouse.
Pattern Evaluation Module:
This segment commonly employs interestingness measures that cooperate with the data mining
modules to focus the search towards interesting patterns. It might utilize an interestingness
threshold to filter out discovered patterns. Alternatively, the pattern evaluation module might
be integrated with the mining module, depending on the implementation of the data
mining techniques used. For efficient data mining, it is highly recommended to push the
evaluation of pattern interestingness as deep as possible into the mining procedure, so as to
confine the search to only the interesting patterns.
Knowledge Base:
The knowledge base is helpful in the entire process of data mining. It might be used to
guide the search or to evaluate the interestingness of the resulting patterns. The knowledge base may
even contain user views and data from user experiences that might be helpful in the data
mining process. The data mining engine may receive inputs from the knowledge base to
make the results more accurate and reliable. The pattern evaluation module regularly
interacts with the knowledge base to get inputs and also to update it.
KDD Process
KDD (Knowledge Discovery in Databases) is a process that involves the extraction of
useful, previously unknown, and potentially valuable information from large datasets.
The KDD process is iterative and usually requires multiple passes over its steps to
extract accurate knowledge from the data. The following steps are included in the
KDD process:
Data Cleaning
Data cleaning is defined as the removal of noisy and irrelevant data from the collection;
a short sketch follows the list below.
1. Cleaning of missing values.
2. Cleaning of noisy data, where noise is a random or variance error.
3. Cleaning with data discrepancy detection and data transformation tools.
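As a rough, hypothetical illustration of these cleaning steps (column names and values are invented), one might use pandas along the following lines:

import pandas as pd

# hypothetical raw records with missing values, a noisy outlier, and a duplicate
df = pd.DataFrame({
    "age":    [25, None, 31, 31, 120],     # None = missing value, 120 = noisy value
    "income": [50000, 42000, None, None, 61000],
})

df = df.drop_duplicates()                                # remove duplicate records
df["age"] = df["age"].fillna(df["age"].median())         # impute missing ages
df["income"] = df["income"].fillna(df["income"].mean())  # impute missing incomes
df = df[df["age"].between(0, 100)]                       # discard values flagged as noise
print(df)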
Data Integration
Data integration is defined as combining heterogeneous data from multiple sources into a
common source (a data warehouse). Data integration is performed using data migration tools,
data synchronization tools, and the ETL (Extract, Transform, Load) process.
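A minimal ETL-style sketch in pandas, assuming two hypothetical sources that share a customer_id key (the tables, columns, and output file name are illustrative only):

import pandas as pd

# Extract: two heterogeneous sources; in practice these might come from
# pd.read_csv, a database connection, or another system's export
sales = pd.DataFrame({"customer_id": [1, 2, 3], "amount": ["100", "250", "80"]})
crm = pd.DataFrame({"customer_id": [1, 2], "region": ["north", "south"]})

# Transform: unify formats before combining (amount arrives here as text)
sales["amount"] = sales["amount"].astype(float)

# Load: merge into one common, warehouse-style table and store it
warehouse = sales.merge(crm, on="customer_id", how="left")
warehouse.to_csv("warehouse_customers.csv", index=False)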
Data Selection
Data selection is defined as the process where the data relevant to the analysis is decided
upon and retrieved from the data collection. For this, we can use neural networks, decision
trees, Naive Bayes, clustering, and regression methods.
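As a simple, hypothetical illustration of selecting only the task-relevant attributes and records before mining (the table and threshold are made up):

import pandas as pd

# hypothetical integrated table, e.g. read back from the warehouse
df = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["north", "south", None],
                   "amount": [100.0, 2500.0, 80.0], "notes": ["", "vip", ""]})

relevant = df[["customer_id", "region", "amount"]]   # keep only attributes of interest
selected = relevant[relevant["amount"] > 1000]       # keep only records of interest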
Data Transformation
Data Transformation is defined as the process of transforming data into appropriate
form required by mining procedure. Data Transformation is a two step process:
1. Data Mapping: Assigning elements from source base to destination to capture
transformations.
2. Code generation: Creation of the actual transformation program.
Data Mining
Data mining is defined as the application of techniques to extract potentially useful
patterns. It transforms the task-relevant data into patterns and decides the purpose of the
model, using classification or characterization.
Pattern Evaluation
Pattern evaluation is defined as identifying the interesting patterns that represent
knowledge, based on given interestingness measures. It finds the interestingness score of
each pattern and uses summarization and visualization to make the data understandable to the user.
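As one concrete, invented example of an interestingness score, the snippet below computes the support and confidence of the association pattern "bread => butter" over a few toy transactions; the items and thresholds are assumptions:

# hypothetical market-basket transactions
transactions = [{"bread", "butter"}, {"bread"}, {"bread", "butter", "milk"}, {"milk"}]

n = len(transactions)
both = sum(1 for t in transactions if {"bread", "butter"} <= t)
bread = sum(1 for t in transactions if "bread" in t)

support = both / n           # fraction of all transactions containing the pattern
confidence = both / bread    # how often butter appears when bread does

# keep the pattern only if it clears an (arbitrary) interestingness threshold
if support >= 0.3 and confidence >= 0.6:
    print("interesting pattern: bread => butter", support, confidence)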
Knowledge Representation
This involves presenting the results in a way that is meaningful and can be used to make
decisions.
Data preprocessing is an important step in the data mining process that involves cleaning
and transforming raw data to make it suitable for analysis. Some common steps in data
preprocessing include:
Data Cleaning: This involves identifying and correcting errors or inconsistencies in the
data, such as missing values, outliers, and duplicates. Various techniques can be used for
data cleaning, such as imputation, removal, and transformation.
Data Integration: This involves combining data from multiple sources to create a
unified dataset. Data integration can be challenging as it requires handling data with
different formats, structures, and semantics. Techniques such as record linkage and data
fusion can be used for data integration.
Data Transformation: This involves converting the data into a suitable format for
analysis. Common techniques used in data transformation include normalization,
standardization, and discretization. Normalization is used to scale the data to a common
range, while standardization is used to transform the data to have zero mean and unit
variance. Discretization is used to convert continuous data into discrete categories.
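A short scikit-learn sketch of the three transformations just described, using an invented single-feature array:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler, StandardScaler

x = np.array([[18.0], [25.0], [40.0], [60.0]])   # one numeric feature, values invented

x_norm = MinMaxScaler().fit_transform(x)         # normalization: scale to the range [0, 1]
x_std = StandardScaler().fit_transform(x)        # standardization: zero mean, unit variance
x_disc = KBinsDiscretizer(n_bins=2, encode="ordinal",
                          strategy="uniform").fit_transform(x)   # discretization into 2 bins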
Data Reduction: This involves reducing the size of the dataset while preserving the
important information. Data reduction can be achieved through techniques such as feature
selection and feature extraction. Feature selection involves selecting a subset of relevant
features from the dataset, while feature extraction involves transforming the data into a
lower-dimensional space while preserving the important information.
Data Discretization: This involves dividing continuous data into discrete categories or
intervals. Discretization is often used in data mining and machine learning algorithms
that require categorical data. Discretization can be achieved through techniques such as
equal width binning, equal frequency binning, and clustering.
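A small pandas sketch of equal width versus equal frequency binning on invented ages (clustering-based discretization could be done in a similar spirit with k-means):

import pandas as pd

ages = pd.Series([18, 22, 25, 31, 40, 47, 55, 68])   # invented values

equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])   # equal width binning
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])             # equal frequency binning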
Data Normalization: This involves scaling the data to a common range, such as between
0 and 1 or -1 and 1. Normalization is often used to handle data with different units and
scales. Common normalization techniques include min-max normalization, z-score
normalization, and decimal scaling.
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of
the analysis results. The specific steps involved in data preprocessing may vary
depending on the nature of the data and the analysis goals.
By performing these steps, the data mining process becomes more efficient and the
results become more accurate.
Preprocessing in Data Mining:
Data preprocessing is a data mining technique that is used to transform the raw data into
a useful and efficient format.
Steps Involved in Data Preprocessing:
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is
done. It involves handling of missing data and noisy data. Noisy data can be smoothed
using, among others, the following approaches:
Regression:
Here the data can be made smooth by fitting it to a regression function. The regression
used may be linear (having one independent variable) or multiple (having multiple
independent variables).
Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they
will fall outside the clusters.
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining
process. It involves the following ways:
1. Normalization:
It is done in order to scale the data values to a specified range (-1.0 to 1.0 or 0.0 to
1.0).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help
the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or
conceptual levels.
3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the size
of the dataset while preserving the important information. This is done to improve the
efficiency of data analysis and to avoid overfitting of the model. Some common steps
involved in data reduction are:
Feature Selection: This involves selecting a subset of relevant features from the dataset.
Feature selection is often performed to remove irrelevant or redundant features from the
dataset. It can be done using various techniques such as correlation analysis, mutual
information, and principal component analysis (PCA).
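A possible feature selection sketch with scikit-learn, using mutual information on a synthetic dataset (the dataset and the choice of k are assumptions):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# synthetic data standing in for a real dataset
X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

# keep the 4 features with the highest mutual information with the class label
selector = SelectKBest(score_func=mutual_info_classif, k=4)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)   # (200, 4)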
Feature Extraction: This involves transforming the data into a lower-dimensional space
while preserving the important information. Feature extraction is often used when the
original features are high-dimensional and complex. It can be done using techniques such
as PCA, linear discriminant analysis (LDA), and non-negative matrix factorization
(NMF).
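A minimal PCA-based feature extraction sketch with scikit-learn (the dataset and number of components are just for illustration):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)    # 4-dimensional example data

# project onto 2 principal components while keeping most of the variance
pca = PCA(n_components=2)
X_low = pca.fit_transform(X)
print(X_low.shape, pca.explained_variance_ratio_)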
Sampling: This involves selecting a subset of data points from the dataset. Sampling is
often used to reduce the size of the dataset while preserving the important information. It
can be done using techniques such as random sampling, stratified sampling, and
systematic sampling.
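A small sampling sketch with pandas (invented data; the grouped sample call assumes a reasonably recent pandas version):

import pandas as pd

df = pd.DataFrame({"cls": ["a"] * 90 + ["b"] * 10, "x": range(100)})   # invented, imbalanced data

random_sample = df.sample(frac=0.2, random_state=0)   # simple random sampling

# stratified sampling: take 20% from each class so the class ratio is preserved
stratified = df.groupby("cls", group_keys=False).sample(frac=0.2, random_state=0)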
Clustering: This involves grouping similar data points together into clusters. Clustering
is often used to reduce the size of the dataset by replacing similar data points with a
representative centroid. It can be done using techniques such as k-means, hierarchical
clustering, and density-based clustering.
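A clustering-based reduction sketch with scikit-learn, replacing many invented points by a handful of representative centroids:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(1000, 3)   # 1000 invented data points

# replace the 1000 points with 10 representative cluster centroids
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
representatives = kmeans.cluster_centers_     # shape (10, 3)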
Compression: This involves compressing the dataset while preserving the important
information. Compression is often used to reduce the size of the dataset for storage and
transmission purposes. It can be done using techniques such as wavelet compression,
JPEG compression, and gzip compression.
Dimensionality reduction:
Whenever we come across data that is only weakly relevant to the analysis, we keep just the
attributes actually required. This reduces the data size by eliminating outdated or redundant features.
Step-wise Forward Selection –
The selection begins with an empty set of attributes. At each step, the best of the remaining
original attributes is added to the set, judged by a relevance measure such as statistical
significance (the p-value). A code sketch follows the example below.
Suppose the following attributes are in the data set, of which a few are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
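A possible sketch of step-wise forward selection using scikit-learn's SequentialFeatureSelector (the estimator, synthetic data, and number of attributes to keep are assumptions):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)

# greedily add one attribute at a time, keeping the one that helps the model most
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=3, direction="forward")
sfs.fit(X, y)
print(sfs.get_support())   # boolean mask of the selected attributes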
Data reduction in data mining can have a number of advantages and disadvantages.
Advantages:
1. Improved efficiency: Data reduction can help to improve the efficiency of machine
learning algorithms by reducing the size of the dataset. This can make it faster and more
practical to work with large datasets.
2. Improved performance: Data reduction can help to improve the performance of machine
learning algorithms by removing irrelevant or redundant information from the dataset. This
can help to make the model more accurate and robust.
3. Reduced storage costs: Data reduction can help to reduce the storage costs associated with
large datasets by reducing the size of the data.
4. Improved interpretability: Data reduction can help to improve the interpretability of the
results by removing irrelevant or redundant information from the dataset.
Disadvantages:
1. Loss of information: Data reduction can result in a loss of information, if important data is
removed during the reduction process.
2. Impact on accuracy: Data reduction can impact the accuracy of a model, as reducing the
size of the dataset can also remove important information that is needed for accurate
predictions.
3. Impact on interpretability: Data reduction can make it harder to interpret the results, as
removing irrelevant or redundant information can also remove context that is needed to
understand the results.
4. Additional computational costs: Data reduction can add additional computational costs to
the data mining process, as it requires additional processing time to reduce the data.
In conclusion, data reduction can have both advantages and disadvantages. It can improve
the efficiency and performance of machine learning algorithms by reducing the size of the
dataset. However, it can also result in a loss of information, and make it harder to interpret
the results. It's important to weigh the pros and cons of data reduction and carefully assess
the risks and benefits before implementing it.
Decision Tree
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each
internal node denotes a test on an attribute, each branch denotes the outcome of a test,
and each leaf node holds a class label. The topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer that indicates whether a
customer at a company is likely to buy a computer or not. Each internal node
represents a test on an attribute. Each leaf node represents a class.
A decision tree is a hierarchical model used in decision support that depicts decisions and
their potential outcomes, incorporating chance events, resource expenses, and utility. This
algorithmic model utilizes conditional control statements and is a non-parametric, supervised
learning method, useful for both classification and regression tasks. The tree structure is
comprised of a root node, branches, internal nodes, and leaf nodes, forming a hierarchical,
tree-like structure.
It is a tool that has applications spanning several different areas. Decision trees can be used
for classification as well as regression problems. The name itself suggests that it uses a
flowchart-like tree structure to show the predictions that result from a series of feature-based
splits. It starts with a root node and ends with a decision made by the leaves.
Before learning more about decision trees, let's get familiar with some of the terminologies:
Root Node: The initial node at the beginning of a decision tree, where the entire
dataset starts to be divided based on features or conditions.
Decision Nodes: Nodes resulting from the splitting of root nodes are known as
decision nodes. They represent intermediate decisions or conditions within the tree.
Leaf Nodes: Nodes where further splitting is not possible, often indicating the final
classification or outcome; these are the terminal nodes of the decision tree.
Branch/Sub-Tree: A subsection of the decision tree starting from an internal node.
It represents a specific path of decisions and outcomes within the tree.
Pruning: The process of removing or cutting down specific nodes in a decision tree
or sub-tree to simplify the model and prevent overfitting.
Parent and Child Node: In a decision tree, a node that is divided into sub-nodes is
known as a parent node, and the sub-nodes emerging from it are referred to as child
nodes. The parent node represents a decision or condition, while the child nodes
represent the potential outcomes or further decisions that follow from it.
A decision tree splits the data into several such nodes. Decision trees are nothing but a
bunch of if-else statements in layman's terms: the model checks whether a condition is
true, and if it is, it moves on to the next node attached to that decision.
In the below diagram, the tree will first ask: what is the weather? Is it sunny, cloudy, or
rainy? Depending on the answer, it will go to the next features, humidity and wind. It will
again check whether the wind is strong or weak; if it is a weak wind and it is rainy, then
the person may go and play. Why didn't it split more? Why did it stop there?
To answer this question, we need to know about a few more concepts like entropy,
information gain, and the Gini index. But in simple terms, I can say here that the output for the
training dataset is always yes for cloudy weather; since there is no disorderliness there, we
do not need to split that node any further.
The goal of machine learning is to decrease uncertainty or disorder in the dataset, and for
this we use decision trees.
Now you must be thinking: how do I know what should be the root node? What should be the
decision node? When should I stop splitting? To decide this, there is a metric called
information gain, which is based on entropy and measures how much a split reduces
uncertainty. Roughly, a decision tree is built as follows (a small code sketch follows the list):
1. Starting at the Root: The algorithm begins with the entire dataset at the root node.
2. Asking the Best Questions: It looks for the most important feature or question that
splits the data into the most distinct groups. This is like asking a question at a fork in
the tree.
3. Branching Out: Based on the answer to that question, it divides the data into smaller
subsets, creating new branches. Each branch represents a possible route through the
tree.
4. Repeating the Process: The algorithm continues asking questions and splitting the
data at each branch until it reaches the final “leaf nodes,” representing the predicted
outcomes or classifications.
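As a small, hypothetical sketch of these steps (the toy weather table is invented), scikit-learn can grow such a tree and print its if-else structure:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# toy data in the spirit of the weather example above
data = pd.DataFrame({
    "outlook": ["sunny", "sunny", "cloudy", "rainy", "rainy", "cloudy"],
    "windy":   [False, True, False, False, True, True],
    "play":    ["no", "no", "yes", "yes", "no", "yes"],
})

X = pd.get_dummies(data[["outlook", "windy"]])   # one-hot encode the categorical features
y = data["play"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))

A generic version of the decision tree induction algorithm follows in pseudocode form.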
Input:
Data partition, D, which is a set of training tuples
and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute selection method, a procedure to determine the
splitting criterion that best partitions the data
tuples into individual classes. This criterion includes a
splitting_attribute and either a splitting point or a splitting subset.
Output:
A Decision Tree
Method
create a node N;
if the tuples in D are all of the same class, C, then
    return N as a leaf node labeled with the class C;
if attribute_list is empty then
    return N as a leaf node labeled with the majority class in D;
apply the attribute selection method to (D, attribute_list) to find the best splitting criterion;
label node N with the splitting criterion;
for each outcome j of the splitting criterion
    let Dj be the set of data tuples in D satisfying outcome j;
    if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to
noise or outliers. The pruned trees are smaller and less complex.
Cost Complexity
The cost complexity is measured by the following two parameters: the number of leaves in
the tree, and the error rate of the tree.
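A short scikit-learn sketch of cost-complexity pruning (the dataset and the particular ccp_alpha picked from the path are just for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# grow a full tree, then inspect its cost-complexity pruning path
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
path = full.cost_complexity_pruning_path(X_tr, y_tr)

# refitting with a non-zero ccp_alpha yields a smaller, pruned tree:
# fewer leaves at the cost of a somewhat higher training error rate
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
print(full.get_n_leaves(), pruned.get_n_leaves())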