The document discusses classification and clustering in data mining, emphasizing the importance of extracting valuable insights from large datasets. It outlines various methods, including decision trees and K-means clustering, and their applications across sectors like finance, healthcare, and education. Additionally, it highlights the role of data mining in improving business strategies and decision-making processes.

Uploaded by Mitu

Classification

• What is classification?
• Data mining refers to the process of extracting important information
from raw data. It analyses the patterns in huge data sets with the
help of specialized software. Ever since its development, data
mining has been adopted by researchers in the research and
development field.
• With data mining, businesses have been found to gain more profit. It has
not only helped in understanding customer demand but also in
developing effective strategies to improve overall business
turnover. It has helped in determining business objectives and
making clear decisions.
• Data collection, data warehousing, and computer processing
are some of the strongest pillars of data mining. Data mining
uses mathematical algorithms to segment the
data and assess the probability of future events.
• Building the classifier or model:
• This step is the learning step or the learning phase.
• In this step the classification algorithm builds the classifier.
• The classifier is built from the training set, which is made up of database
tuples and their associated class labels.
• Each tuple that constitutes the training set belongs to a predefined category
or class. These tuples can also be referred to as samples, objects, or data points.

Name     Age         Income   Loan decision
Mahesh   Youth       Low      Risky
Vishal   Youth       Low      Risky
Chirag   Middle age  High     Safe
Amit     Middle age  Low      Risky
Abdul    Senior      Low      Safe
Sweta    Senior      Medium   Safe
Paresh   Middle age  High     Safe
• Decision Tree Method
• Decision tree-based classification methods are a type of machine
learning technique that builds a tree-like model to classify new data
points based on their features. The goal of decision tree-based
classification is to create a model that accurately predicts the class label
of a new observation by dividing the data into smaller and smaller
subsets, each characterized by a set of features.
• The decision tree is built using training data, with a set of features and a
known class label representing each data point. The tree is constructed
by recursively splitting the data based on the most informative feature
until the subsets become homogeneous with respect to class labels or a
stopping criterion is met. At each split, the feature that best separates
the data is selected based on a criterion such as information gain or
Gini index. Once the decision tree is built, it can be used to classify new
data points by traversing the tree based on the values of their features
until reaching a leaf node corresponding to a class label.
• Decision tree algorithm:
• Step-1: Begin the tree with the root node, say S, which
contains the complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute
Selection Measure (ASM).
• Step-3: Divide S into subsets, one for each possible
value of the best attribute.
• Step-4: Generate the decision tree node that contains the
best attribute.
• Step-5: Recursively make new decision trees using the
subsets of the dataset created in Step-3. Continue this
process until a stage is reached where the nodes cannot be
classified further; these final nodes are called leaf nodes.
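As an illustration, the steps above can be sketched as a minimal ID3-style tree builder over the loan-decision table, using information gain (covered in the next section) as the Attribute Selection Measure. This is a simplified sketch, not a production implementation; the dictionary-based tree representation is an arbitrary choice.

```python
import math
from collections import Counter

# Training set taken from the loan-decision table above.
DATA = [
    {"Age": "Youth",      "Income": "Low",    "Decision": "Risky"},
    {"Age": "Youth",      "Income": "Low",    "Decision": "Risky"},
    {"Age": "Middle age", "Income": "High",   "Decision": "Safe"},
    {"Age": "Middle age", "Income": "Low",    "Decision": "Risky"},
    {"Age": "Senior",     "Income": "Low",    "Decision": "Safe"},
    {"Age": "Senior",     "Income": "Medium", "Decision": "Safe"},
    {"Age": "Middle age", "Income": "High",   "Decision": "Safe"},
]

def entropy(rows, target="Decision"):
    """Info(D) = -sum(p_i * log2(p_i)) over the class distribution."""
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr, target="Decision"):
    """Gain(A) = Info(D) - sum(|Dj|/|D| * Info(Dj)) over A's values."""
    total = len(rows)
    remainder = 0.0
    for v in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == v]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(rows, target) - remainder

def build_tree(rows, attrs, target="Decision"):
    """Steps 1-5: pick the best attribute, split, and recurse until pure."""
    classes = {r[target] for r in rows}
    if len(classes) == 1:            # homogeneous subset -> leaf node
        return classes.pop()
    if not attrs:                    # no attributes left -> majority-class leaf
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    node = {}
    for v in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == v]
        node[v] = build_tree(subset, [a for a in attrs if a != best], target)
    return {best: node}

tree = build_tree(DATA, ["Age", "Income"])
print(tree)
```

On this table, Age has the highest gain, so it becomes the root; the "Middle age" branch is then split further on Income.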
• Information gain:
• The attribute with the highest information gain is chosen as the
splitting attribute. This attribute minimizes the information needed
to classify the tuples in the resulting partitions. Let D, the data
partition, be a training set of class-labeled tuples, and let the class
label attribute have m distinct values defining m distinct classes Ci
(for i = 1,..., m). Let Ci,D be the set of tuples of class Ci in D, and let
|D| and |Ci,D| denote the number of tuples in D and Ci,D, respectively.
• The expected information needed to classify a tuple in D is then:

    Info(D) = - sum(i = 1..m) pi * log2(pi)

where pi is the nonzero probability that an arbitrary tuple in D
belongs to class Ci, estimated by |Ci,D|/|D|. Info(D) is the average
amount of information needed to identify the class label of a tuple
in D; it is also known as the entropy of D.
• Now suppose we partition the tuples in D on some attribute A
having v distinct values, {a1, a2,..., av}, giving partitions D1,..., Dv.
The expected information still required to classify a tuple from D
after partitioning on A is:

    InfoA(D) = sum(j = 1..v) (|Dj|/|D|) * Info(Dj)

and the information gain of A is Gain(A) = Info(D) - InfoA(D).
• Example:
• For the set X = {a,a,a,b,b,b,b,b}:
Total instances: 8
Instances of a: 3
Instances of b: 5
Entropy(X) = -(3/8) log2(3/8) - (5/8) log2(5/8) ≈ 0.954
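This entropy value can be checked with a few lines of Python:

```python
import math

# Entropy of X = {a,a,a,b,b,b,b,b}: p(a) = 3/8, p(b) = 5/8.
p_a, p_b = 3 / 8, 5 / 8
entropy = -(p_a * math.log2(p_a) + p_b * math.log2(p_b))
print(round(entropy, 3))  # close to 0.954
```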
• Tree Pruning
• Pre-pruning approach
• In the pre-pruning approach, a tree is "pruned" by halting its
construction early (e.g., by deciding not to further divide or
partition the subset of training samples at a given node).
Upon halting, the node becomes a leaf. The leaf may hold
the most common class among the subset samples, or the
probability distribution of those samples.
• Post-pruning approach
• The post-pruning approach removes branches from a
"completely grown" tree. A tree node is pruned by removing
its branches. The cost-complexity pruning algorithm is an
instance of the post-pruning approach. The pruned node becomes
a leaf and is labeled with the most common class among
its former branches.
• What is cluster analysis?
• Clustering is an unsupervised machine learning
algorithm that groups data points into clusters
so that objects in the same cluster are similar
to one another.
• Clustering helps to split data into several
subsets. Each of these subsets contains data
similar to each other, and these subsets are
called clusters. For example, once the data from a
customer base is divided into clusters, we can
make an informed decision about which customers
are best suited for a given product.
• Overview of basic clustering methods:
• Partitioning Method: It is used to make partitions on the data
in order to form clusters. If "n" partitions are made on "p"
objects of the database, then each partition is represented by
a cluster, with n <= p. The two conditions which need to be
satisfied by this partitioning clustering method are:
1. Each object must belong to exactly one group.
2. Every group must contain at least one object.
• Hierarchical Method: In this method, a hierarchical
decomposition of the given set of data objects is created.
Hierarchical methods are classified by how the hierarchical
decomposition is formed. There are two types of approaches
for the creation of the hierarchical decomposition:
• Agglomerative Approach: The agglomerative approach is
also known as the bottom-up approach. Initially, each
object forms its own separate group. Thereafter the
method keeps merging the objects or the
groups that are close to one another, meaning that
they exhibit similar properties. This merging process
continues until the termination condition holds.
• Divisive Approach: The divisive approach is also known
as the top-down approach. In this approach, we
start with all data objects in a single cluster.
This cluster is divided into smaller
clusters by continuous iteration. The iteration continues
until the termination condition is met or until each
cluster contains one object.
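The agglomerative (bottom-up) approach can be sketched in a few lines of Python. The single-linkage distance and the sample 1-D points below are illustrative choices, not part of the original text:

```python
# A minimal sketch of agglomerative clustering on 1-D data,
# using single-linkage distance; the data values are invented.
def single_linkage(a, b):
    """Distance between two clusters = distance of their closest pair of points."""
    return min(abs(x - y) for x in a for y in b)

def agglomerative(points, k):
    """Start with each object in its own group; merge closest pairs until k remain."""
    clusters = [[p] for p in points]      # every object is its own cluster
    while len(clusters) > k:              # termination condition: k clusters left
        # find the pair of clusters with the smallest linkage distance
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into cluster i
        del clusters[j]
    return [sorted(c) for c in clusters]

print(agglomerative([1, 2, 9, 10, 25], k=3))  # [[1, 2], [9, 10], [25]]
```

The divisive approach would run the same idea in reverse, splitting one all-inclusive cluster until the termination condition holds.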
• Density-Based Method: The density-based method mainly
focuses on density. In this method, a given cluster keeps
growing as long as the density in its
neighbourhood exceeds some threshold, i.e., for each data
point within a given cluster, the neighbourhood of a given radius
has to contain at least a minimum number of points.
• Density-Based Clustering Methods
• DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
It depends on a density-based notion of cluster and identifies clusters of
arbitrary shape in a spatial database with noise (outliers).
• Eps: the maximum radius of the neighborhood.
• MinPts: the minimum number of points in an Eps-neighborhood of a point.
• NEps(i) = { k | k belongs to D and dist(i, k) <= Eps }
• Directly density-reachable:
• A point i is directly density-reachable from a point k with respect to
Eps and MinPts if
• i belongs to NEps(k), and
• |NEps(k)| >= MinPts (i.e., k is a core point).
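These definitions translate directly into Python. The point set, Eps, and MinPts below are made-up illustrative values:

```python
# A sketch of the Eps-neighborhood and direct density-reachability tests;
# D, EPS, and MIN_PTS are illustrative values, not from the original text.
D = [1.0, 1.5, 2.0, 8.0, 8.2, 25.0]
EPS = 1.0
MIN_PTS = 3

def n_eps(i):
    """NEps(i) = { k in D : dist(i, k) <= Eps } (includes i itself)."""
    return [k for k in D if abs(i - k) <= EPS]

def is_core(k):
    """k is a core point if its Eps-neighborhood holds at least MinPts points."""
    return len(n_eps(k)) >= MIN_PTS

def directly_density_reachable(i, k):
    """i is directly density-reachable from k iff i is in NEps(k) and k is core."""
    return i in n_eps(k) and is_core(k)

print(n_eps(1.5))                            # [1.0, 1.5, 2.0]
print(directly_density_reachable(1.0, 1.5))  # True: 1.5 is a core point
print(directly_density_reachable(8.0, 25.0)) # False: 25.0 has a sparse neighborhood
```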
• Partitioning Method
• Partitioning Method: This clustering method classifies the information
into multiple groups based on the characteristics and similarity of the
data. It is up to the data analyst to specify the number of clusters that
have to be generated. In the partitioning method, given a
database D containing N objects, the method constructs K
user-specified partitions of the data, in which each partition
represents a cluster and a particular region. Many algorithms come
under the partitioning method; some of the popular ones are K-Means,
PAM (K-Medoids), the CLARA algorithm (Clustering Large Applications), etc.
• K-Means (a centroid-based technique): The K-means algorithm takes the
input parameter K from the user and partitions the dataset containing N
objects into K clusters, so that the similarity among the data objects
inside a cluster (intracluster similarity) is high, while the similarity
with data objects outside the cluster (intercluster similarity) is low.
• Algorithm: K-means:
• Input: K, the number of clusters into which the dataset
has to be divided; D, a dataset containing N objects
• Output: A set of K clusters
• Method:
• Randomly pick K objects from the dataset D as
cluster centres C.
• (Re)assign each object to the cluster whose mean it
is most similar to.
• Update the cluster means, i.e., recalculate the mean of
each cluster with the updated assignments.
• Repeat steps 2-3 until no change occurs.
• Step-1: Select the number K to decide the number of
clusters.
• Step-2: Select K random points as centroids. (They need not
come from the input dataset.)
• Step-3: Assign each data point to its closest centroid,
which will form the predefined K clusters.
• Step-4: Compute a new centroid for each cluster (the mean
of its points).
• Step-5: Repeat the third step, i.e., reassign each
data point to the new closest centroid.
• Step-6: If any reassignment occurred, go to Step-4; else
go to FINISH.
• Step-7: The model is ready.
• Example: Suppose we want to group the visitors to a website
using just their age, as follows:
• 16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41,
42, 43, 44, 45, 61, 62, 66
• K = 2
• Centroid(C1) = 16 [16]
• Centroid(C2) = 22 [22]
• Note: these two points are chosen randomly from
the dataset.
• Iteration-1:
• C1 = 16.33 [16, 16, 17]
• C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
• Iteration-2:
• C1 = 19.56 [16, 16, 17, 20, 20, 21, 21, 22, 23]
• C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
• Iteration-3:
• C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
• C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
• Iteration-4:
• C1 = 20.50 and C2 = 48.89 are unchanged, so the algorithm stops.
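A minimal 1-D K-means sketch in Python reproduces this worked example (the helper name kmeans_1d is hypothetical, not a standard function):

```python
# Ages from the example above; initial centroids 16 and 22, K = 2.
ages = [16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]

def kmeans_1d(data, centroids):
    """Repeat assignment and mean-update steps until centroids stop changing."""
    while True:
        # Assign each point to its closest centroid.
        clusters = [[] for _ in centroids]
        for x in data:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Recompute each centroid as the mean of its cluster.
        new_centroids = [sum(c) / len(c) for c in clusters]
        if new_centroids == centroids:   # no change -> converged
            return centroids, clusters
        centroids = new_centroids

centroids, clusters = kmeans_1d(ages, [16.0, 22.0])
print([round(c, 2) for c in centroids])  # [20.5, 48.89]
```

Running it yields the same final clusters as Iteration-3 above, with C1 = 20.50 and C2 = 48.89.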
• Data Mining Applications
• Financial/Banking Sector: A credit card company can leverage its vast
warehouse of customer transaction data to identify customers most likely
to be interested in a new credit product.
• Credit card fraud detection.
• Identify 'loyal' customers.
• Extraction of information related to customers.
• Determine credit card spending by customer groups.
• Healthcare and Insurance: A pharmaceutical company can examine its
recent sales force activity and its outcomes to improve the targeting of
high-value physicians and determine which marketing activities will have
the best effect in the upcoming months. In the insurance sector, data
mining can help to predict which customers will buy new policies,
identify behavior patterns of risky customers, and identify
fraudulent behavior of customers.
• Claims analysis, i.e., which medical procedures are claimed together.
• Identify successful medical therapies for different illnesses.
• Characterize patient behavior to predict office visits.
• Transportation: A diversified transportation company with a large direct sales
force can apply data mining to identify the best prospects for its services. A large
consumer goods organization can apply data mining to improve its sales
process to retailers.
• Determine the distribution schedules among outlets.
• Analyze loading patterns.

• Research: Data mining techniques can perform prediction, classification,
clustering, association, and grouping of data in the research area, and the
rules they generate are useful for finding results. In most technical research
in data mining, we create a training model and a testing model. The
train/test approach is a strategy to measure the precision of the proposed
model: we split the data set into two sets, a training data set and a testing
data set. The training data set is used to build the model, whereas the
testing data set is used to evaluate it. Examples:
• Classification of uncertain data.
• Information-based clustering.
• Decision support systems
• Web mining
• Domain-driven data mining
• IoT (Internet of Things) and cybersecurity
• Smart farming with IoT
• Market Basket Analysis: Market basket analysis is a
technique that carefully studies the purchases made by
customers in a supermarket. It identifies the
patterns of items frequently purchased together. This
analysis helps companies promote deals, offers, and
sales, and data mining techniques help accomplish this
analysis task. Example:
• Data mining concepts are used in sales and marketing to
provide better customer service, to improve cross-selling
opportunities, and to increase direct mail response rates.
• Customer retention, in the form of pattern identification and
prediction of likely defections, is possible with data mining.
• Risk assessment and fraud detection also use data mining
concepts for identifying inappropriate or unusual behavior.
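At its core, market basket analysis counts how often items are purchased together. A minimal sketch, with basket data invented for illustration:

```python
from itertools import combinations
from collections import Counter

# Invented example baskets; each set is one customer's transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

# Count every pair of items that appears together in a basket.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs bought together in at least 3 of the 5 baskets ("frequent" pairs).
frequent = sorted((pair, n) for pair, n in pair_counts.items() if n >= 3)
print(frequent)
```

Real systems extend this idea to larger itemsets and association rules (e.g., via the Apriori algorithm) rather than stopping at pairs.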
• Education: For analyzing the education sector,
data mining uses Educational Data Mining (EDM)
methods. These methods generate patterns that can
be used both by learners and educators. By using
EDM we can perform educational tasks such as:
• Predicting students' admission to higher education
• Student profiling
• Predicting student performance
• Evaluating teachers' teaching performance
• Curriculum development
• Predicting student placement opportunities
• Business Transactions: Every business transaction is recorded
and kept in perpetuity. Such transactions are usually time-related and
can be inter-business deals or intra-business operations. The
effective and timely use of this data for competitive decision-making
is definitely one of the most important problems to solve for
businesses that struggle to survive in a highly competitive world.
Data mining helps to analyze these business transactions and
identify marketing approaches and support decision-making.
Examples:
• Direct mail targeting
• Stock trading
• Customer segmentation
• Churn prediction (churn prediction is one of the most popular
big data use cases in business)
• Scientific Analysis: Scientific simulations generate bulk data every day,
including data collected from nuclear laboratories, data about human
psychology, etc. Data mining techniques are capable of analyzing these
data; indeed, we can now capture and store new data faster than we can
analyze the old data already accumulated. Examples of scientific analysis:
• Sequence analysis in bioinformatics
• Classification of astronomical objects
• Medical decision support
• Intrusion Detection: A network intrusion refers to any unauthorized activity on
a digital network. Network intrusions often involve stealing
valuable network resources. Data mining techniques play a vital role in
detecting intrusions, network attacks, and anomalies. These
techniques help in selecting and refining useful and relevant information from
large data sets, and they help classify relevant data for an
intrusion detection system. An intrusion detection system raises alarms on
network traffic about foreign invasions of the system. For example:
• Detecting security violations
• Misuse detection
• Anomaly detection
