The document discusses classification and clustering in data mining, emphasizing the importance of extracting valuable insights from large datasets. It outlines various methods, including decision trees and K-means clustering, and their applications across sectors like finance, healthcare, and education. Additionally, it highlights the role of data mining in improving business strategies and decision-making processes.

Uploaded by Mitu

Classification

• What is classification?
• Data mining refers to the process of extracting important information
from raw data. It analyses the patterns in huge data sets with the
help of specialized software. Ever since its development, data
mining has been adopted by researchers in the research and
development field.
• With data mining, businesses have been found to gain more profit. It has
not only helped in understanding customer demand but also in
developing effective strategies to improve overall business
turnover. It has helped in determining business objectives and
making clear decisions.
• Data collection, data warehousing, and computer processing
are some of the strongest pillars of data mining. Data mining
uses mathematical algorithms to segment the
data and assess the probability of future events.
• Building the classifier or model:
• This step is the learning step or the learning phase.
• In this step the classification algorithm builds the classifier.
• The classifier is built from the training set, which is made up of database
tuples and their associated class labels.
• Each tuple that constitutes the training set belongs to a predefined category
or class. These tuples can also be referred to as samples, objects, or data points.

Name     Age         Income   Loan decision
Mahesh   Youth       Low      Risky
Vishal   Youth       Low      Risky
Chirag   Middle age  High     Safe
Amit     Middle age  Low      Risky
Abdul    Senior      Low      Safe
Sweta    Senior      Medium   Safe
Paresh   Middle age  High     Safe
• Decision Tree Method
• Decision tree-based classification methods are a type of machine
learning technique that builds a tree-like model to classify new data
points based on their features. The goal of decision tree-based
classification is to create a model that accurately predicts the class label
of a new observation by dividing the data into smaller and smaller
subsets, each characterized by a set of features.
• The decision tree is built using training data, with a set of features and a
known class label representing each data point. The tree is constructed
by recursively splitting the data based on the most informative feature
until the subsets become homogeneous with respect to class labels or a
stopping criterion is met. At each split, the feature that best separates
the data is selected based on a criterion such as information gain or
Gini index. Once the decision tree is built, it can be used to classify new
data points by traversing the tree based on the values of their features
until reaching a leaf node corresponding to a class label.
• Decision tree algorithm:
• Step-1: Begin the tree with the root node, say S, which
contains the complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute
Selection Measure (ASM).
• Step-3: Divide S into subsets, one for each possible
value of the best attribute.
• Step-4: Generate the decision tree node that contains the
best attribute.
• Step-5: Recursively make new decision trees using the
subsets of the dataset created in Step-3. Continue this
process until a stage is reached where the nodes cannot be
classified further; these final nodes are called leaf nodes.
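As an illustration, the steps above can be sketched as a minimal ID3-style tree builder over the loan-decision table, using information gain (covered in the next section) as the Attribute Selection Measure. This is a simplified sketch, not a production implementation; the dictionary-based tree representation is an arbitrary choice.

```python
import math
from collections import Counter

# Training set taken from the loan-decision table above.
DATA = [
    {"Age": "Youth",      "Income": "Low",    "Decision": "Risky"},
    {"Age": "Youth",      "Income": "Low",    "Decision": "Risky"},
    {"Age": "Middle age", "Income": "High",   "Decision": "Safe"},
    {"Age": "Middle age", "Income": "Low",    "Decision": "Risky"},
    {"Age": "Senior",     "Income": "Low",    "Decision": "Safe"},
    {"Age": "Senior",     "Income": "Medium", "Decision": "Safe"},
    {"Age": "Middle age", "Income": "High",   "Decision": "Safe"},
]

def entropy(rows, target="Decision"):
    """Info(D) = -sum(p_i * log2(p_i)) over the class distribution."""
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr, target="Decision"):
    """Gain(A) = Info(D) - sum(|Dj|/|D| * Info(Dj)) over A's values."""
    total = len(rows)
    remainder = 0.0
    for v in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == v]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(rows, target) - remainder

def build_tree(rows, attrs, target="Decision"):
    """Steps 1-5: pick the best attribute, split, and recurse until pure."""
    classes = {r[target] for r in rows}
    if len(classes) == 1:            # homogeneous subset -> leaf node
        return classes.pop()
    if not attrs:                    # no attributes left -> majority-class leaf
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    node = {}
    for v in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == v]
        node[v] = build_tree(subset, [a for a in attrs if a != best], target)
    return {best: node}

tree = build_tree(DATA, ["Age", "Income"])
print(tree)
```

On this table, Age has the highest gain, so it becomes the root; the "Middle age" branch is then split further on Income.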
• Information gain:
• The attribute with the highest information gain is chosen as the
splitting attribute. This attribute minimizes the information needed
to classify the tuples in the resulting partitions. Let D, the data
partition, be a training set of class-labeled tuples, and let the class
label attribute have m distinct values defining m distinct classes Ci
(for i = 1,..., m). Let Ci,D be the set of tuples of class Ci in D, and let
|D| and |Ci,D| denote the number of tuples in D and Ci,D, respectively.
• The expected information needed to classify a tuple in D is then:

    Info(D) = - sum(i = 1..m) pi * log2(pi)

where pi is the nonzero probability that an arbitrary tuple in D
belongs to class Ci, estimated by |Ci,D|/|D|. Info(D) is the average
amount of information needed to identify the class label of a tuple
in D; it is also known as the entropy of D.
• Now suppose we partition the tuples in D on some attribute A
having v distinct values, {a1, a2,..., av}, giving partitions D1,..., Dv.
The expected information still required to classify a tuple from D
after partitioning on A is:

    InfoA(D) = sum(j = 1..v) (|Dj|/|D|) * Info(Dj)

and the information gain of A is Gain(A) = Info(D) - InfoA(D).
• Example:
• For the set X = {a,a,a,b,b,b,b,b}:
Total instances: 8
Instances of a: 3
Instances of b: 5
Entropy(X) = -(3/8) log2(3/8) - (5/8) log2(5/8) ≈ 0.954
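This entropy value can be checked with a few lines of Python:

```python
import math

# Entropy of X = {a,a,a,b,b,b,b,b}: p(a) = 3/8, p(b) = 5/8.
p_a, p_b = 3 / 8, 5 / 8
entropy = -(p_a * math.log2(p_a) + p_b * math.log2(p_b))
print(round(entropy, 3))  # close to 0.954
```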
• Tree Pruning
• Pre-pruning approach
• In the pre-pruning approach, a tree is "pruned" by halting its
construction early (e.g., by deciding not to further divide or
partition the subset of training samples at a given node).
Upon halting, the node becomes a leaf. The leaf may hold
the most common class among the subset samples, or the
probability distribution of those samples.
• Post-pruning approach
• The post-pruning approach removes branches from a
"completely grown" tree. A tree node is pruned by removing
its branches. The cost-complexity pruning algorithm is an
instance of the post-pruning approach. The pruned node becomes
a leaf and is labeled with the most common class among
its former branches.
• What is cluster analysis?
• Clustering is an unsupervised machine learning
algorithm that groups data points into clusters
so that objects in the same cluster are similar
to one another.
• Clustering helps to split data into several
subsets. Each of these subsets contains data
similar to each other, and these subsets are
called clusters. For example, once the data from a
customer base is divided into clusters, we can
make an informed decision about which customers
are best suited for a given product.
• Overview of basic clustering methods:
• Partitioning Method: It is used to make partitions on the data
in order to form clusters. If "n" partitions are made on "p"
objects of the database, then each partition is represented by
a cluster, with n <= p. The two conditions which need to be
satisfied by this partitioning clustering method are:
1. Each object must belong to exactly one group.
2. Every group must contain at least one object.
• Hierarchical Method: In this method, a hierarchical
decomposition of the given set of data objects is created.
Hierarchical methods are classified by how the hierarchical
decomposition is formed. There are two types of approaches
for the creation of the hierarchical decomposition:
• Agglomerative Approach: The agglomerative approach is
also known as the bottom-up approach. Initially, each
object forms its own separate group. Thereafter the
method keeps merging the objects or the
groups that are close to one another, meaning that
they exhibit similar properties. This merging process
continues until the termination condition holds.
• Divisive Approach: The divisive approach is also known
as the top-down approach. In this approach, we
start with all data objects in a single cluster.
This cluster is divided into smaller
clusters by continuous iteration. The iteration continues
until the termination condition is met or until each
cluster contains one object.
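The agglomerative (bottom-up) approach can be sketched in a few lines of Python. The single-linkage distance and the sample 1-D points below are illustrative choices, not part of the original text:

```python
# A minimal sketch of agglomerative clustering on 1-D data,
# using single-linkage distance; the data values are invented.
def single_linkage(a, b):
    """Distance between two clusters = distance of their closest pair of points."""
    return min(abs(x - y) for x in a for y in b)

def agglomerative(points, k):
    """Start with each object in its own group; merge closest pairs until k remain."""
    clusters = [[p] for p in points]      # every object is its own cluster
    while len(clusters) > k:              # termination condition: k clusters left
        # find the pair of clusters with the smallest linkage distance
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into cluster i
        del clusters[j]
    return [sorted(c) for c in clusters]

print(agglomerative([1, 2, 9, 10, 25], k=3))  # [[1, 2], [9, 10], [25]]
```

The divisive approach would run the same idea in reverse, splitting one all-inclusive cluster until the termination condition holds.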
• Density-Based Method: The density-based method mainly
focuses on density. In this method, a given cluster keeps
growing as long as the density in its
neighbourhood exceeds some threshold, i.e., for each data
point within a given cluster, the neighbourhood of a given radius
has to contain at least a minimum number of points.
• Density-Based Clustering Methods
• DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
It depends on a density-based notion of cluster and identifies clusters of
arbitrary shape in a spatial database with noise (outliers).
• Eps: the maximum radius of the neighborhood.
• MinPts: the minimum number of points in an Eps-neighborhood of a point.
• NEps(i) = { k | k belongs to D and dist(i, k) <= Eps }
• Directly density-reachable:
• A point i is directly density-reachable from a point k with respect to
Eps and MinPts if
• i belongs to NEps(k), and
• |NEps(k)| >= MinPts (i.e., k is a core point).
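These definitions translate directly into Python. The point set, Eps, and MinPts below are made-up illustrative values:

```python
# A sketch of the Eps-neighborhood and direct density-reachability tests;
# D, EPS, and MIN_PTS are illustrative values, not from the original text.
D = [1.0, 1.5, 2.0, 8.0, 8.2, 25.0]
EPS = 1.0
MIN_PTS = 3

def n_eps(i):
    """NEps(i) = { k in D : dist(i, k) <= Eps } (includes i itself)."""
    return [k for k in D if abs(i - k) <= EPS]

def is_core(k):
    """k is a core point if its Eps-neighborhood holds at least MinPts points."""
    return len(n_eps(k)) >= MIN_PTS

def directly_density_reachable(i, k):
    """i is directly density-reachable from k iff i is in NEps(k) and k is core."""
    return i in n_eps(k) and is_core(k)

print(n_eps(1.5))                            # [1.0, 1.5, 2.0]
print(directly_density_reachable(1.0, 1.5))  # True: 1.5 is a core point
print(directly_density_reachable(8.0, 25.0)) # False: 25.0 has a sparse neighborhood
```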
• Partitioning Method
• Partitioning Method: This clustering method classifies the information
into multiple groups based on the characteristics and similarity of the
data. It is up to the data analyst to specify the number of clusters that
have to be generated. In the partitioning method, given a
database D containing N objects, the method constructs K
user-specified partitions of the data, in which each partition
represents a cluster and a particular region. Many algorithms come
under the partitioning method; some of the popular ones are K-Means,
PAM (K-Medoids), the CLARA algorithm (Clustering Large Applications), etc.
• K-Means (a centroid-based technique): The K-means algorithm takes the
input parameter K from the user and partitions the dataset containing N
objects into K clusters, so that the similarity among the data objects
inside a cluster (intracluster similarity) is high, while the similarity
with data objects outside the cluster (intercluster similarity) is low.
• Algorithm: K-means:
• Input: K, the number of clusters into which the dataset
has to be divided; D, a dataset containing N objects
• Output: A set of K clusters
• Method:
• Randomly pick K objects from the dataset D as
cluster centres C.
• (Re)assign each object to the cluster whose mean it
is most similar to.
• Update the cluster means, i.e., recalculate the mean of
each cluster with the updated assignments.
• Repeat steps 2-3 until no change occurs.
• Step-1: Select the number K to decide the number of
clusters.
• Step-2: Select K random points as centroids. (They need not
come from the input dataset.)
• Step-3: Assign each data point to its closest centroid,
which will form the predefined K clusters.
• Step-4: Compute a new centroid for each cluster (the mean
of its points).
• Step-5: Repeat the third step, i.e., reassign each
data point to the new closest centroid.
• Step-6: If any reassignment occurred, go to Step-4; else
go to FINISH.
• Step-7: The model is ready.
• Example: Suppose we want to group the visitors to a website
using just their age, as follows:
• 16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41,
42, 43, 44, 45, 61, 62, 66
• K = 2
• Centroid(C1) = 16 [16]
• Centroid(C2) = 22 [22]
• Note: these two points are chosen randomly from
the dataset.
• Iteration-1:
• C1 = 16.33 [16, 16, 17]
• C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
• Iteration-2:
• C1 = 19.56 [16, 16, 17, 20, 20, 21, 21, 22, 23]
• C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
• Iteration-3:
• C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
• C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
• Iteration-4:
• C1 = 20.50 and C2 = 48.89 are unchanged, so the algorithm stops.
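A minimal 1-D K-means sketch in Python reproduces this worked example (the helper name kmeans_1d is hypothetical, not a standard function):

```python
# Ages from the example above; initial centroids 16 and 22, K = 2.
ages = [16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]

def kmeans_1d(data, centroids):
    """Repeat assignment and mean-update steps until centroids stop changing."""
    while True:
        # Assign each point to its closest centroid.
        clusters = [[] for _ in centroids]
        for x in data:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Recompute each centroid as the mean of its cluster.
        new_centroids = [sum(c) / len(c) for c in clusters]
        if new_centroids == centroids:   # no change -> converged
            return centroids, clusters
        centroids = new_centroids

centroids, clusters = kmeans_1d(ages, [16.0, 22.0])
print([round(c, 2) for c in centroids])  # [20.5, 48.89]
```

Running it yields the same final clusters as Iteration-3 above, with C1 = 20.50 and C2 = 48.89.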
• Data Mining Applications
• Financial/Banking Sector: A credit card company can leverage its vast
warehouse of customer transaction data to identify customers most likely
to be interested in a new credit product.
• Credit card fraud detection.
• Identify 'loyal' customers.
• Extraction of information related to customers.
• Determine credit card spending by customer groups.
• Healthcare and Insurance: A pharmaceutical company can examine its
recent sales force activity and its outcomes to improve the targeting of
high-value physicians and determine which marketing activities will have
the best effect in the upcoming months. In the insurance sector, data
mining can help to predict which customers will buy new policies,
identify behavior patterns of risky customers, and identify
fraudulent behavior of customers.
• Claims analysis, i.e., which medical procedures are claimed together.
• Identify successful medical therapies for different illnesses.
• Characterize patient behavior to predict office visits.
• Transportation: A diversified transportation company with a large direct sales
force can apply data mining to identify the best prospects for its services. A large
consumer goods organization can apply data mining to improve its sales
process to retailers.
• Determine the distribution schedules among outlets.
• Analyze loading patterns.

• Research: Data mining techniques can perform prediction, classification,
clustering, association, and grouping of data in the research area, and the
rules they generate are useful for finding results. In most technical research
in data mining, we create a training model and a testing model. The
train/test approach is a strategy to measure the precision of the proposed
model: we split the data set into two sets, a training data set and a testing
data set. The training data set is used to build the model, whereas the
testing data set is used to evaluate it. Examples:
• Classification of uncertain data.
• Information-based clustering.
• Decision support systems
• Web mining
• Domain-driven data mining
• IoT (Internet of Things) and cybersecurity
• Smart farming with IoT
• Market Basket Analysis: Market basket analysis is a
technique that carefully studies the purchases made by
customers in a supermarket. It identifies the
patterns of items frequently purchased together. This
analysis helps companies promote deals, offers, and
sales, and data mining techniques help accomplish this
analysis task. Example:
• Data mining concepts are used in sales and marketing to
provide better customer service, to improve cross-selling
opportunities, and to increase direct mail response rates.
• Customer retention, in the form of pattern identification and
prediction of likely defections, is possible with data mining.
• Risk assessment and fraud detection also use data mining
concepts for identifying inappropriate or unusual behavior.
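At its core, market basket analysis counts how often items are purchased together. A minimal sketch, with basket data invented for illustration:

```python
from itertools import combinations
from collections import Counter

# Invented example baskets; each set is one customer's transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

# Count every pair of items that appears together in a basket.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs bought together in at least 3 of the 5 baskets ("frequent" pairs).
frequent = sorted((pair, n) for pair, n in pair_counts.items() if n >= 3)
print(frequent)
```

Real systems extend this idea to larger itemsets and association rules (e.g., via the Apriori algorithm) rather than stopping at pairs.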
• Education: For analyzing the education sector,
data mining uses Educational Data Mining (EDM)
methods. These methods generate patterns that can
be used both by learners and educators. By using
EDM we can perform educational tasks such as:
• Predicting students' admission to higher education
• Student profiling
• Predicting student performance
• Evaluating teachers' teaching performance
• Curriculum development
• Predicting student placement opportunities
• Business Transactions: Every business transaction is recorded
and kept in perpetuity. Such transactions are usually time-related and
can be inter-business deals or intra-business operations. The
effective and timely use of this data for competitive decision-making
is definitely one of the most important problems to solve for
businesses that struggle to survive in a highly competitive world.
Data mining helps to analyze these business transactions and
identify marketing approaches and support decision-making.
Examples:
• Direct mail targeting
• Stock trading
• Customer segmentation
• Churn prediction (churn prediction is one of the most popular
big data use cases in business)
• Scientific Analysis: Scientific simulations generate bulk data every day,
including data collected from nuclear laboratories, data about human
psychology, etc. Data mining techniques are capable of analyzing these
data; indeed, we can now capture and store new data faster than we can
analyze the old data already accumulated. Examples of scientific analysis:
• Sequence analysis in bioinformatics
• Classification of astronomical objects
• Medical decision support
• Intrusion Detection: A network intrusion refers to any unauthorized activity on
a digital network. Network intrusions often involve stealing
valuable network resources. Data mining techniques play a vital role in
detecting intrusions, network attacks, and anomalies. These
techniques help in selecting and refining useful and relevant information from
large data sets, and they help classify relevant data for an
intrusion detection system. An intrusion detection system raises alarms on
network traffic about foreign invasions of the system. For example:
• Detecting security violations
• Misuse detection
• Anomaly detection
