Classification Unit-4
What is Classification?
There are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. These two forms are as follows −
Classification
Prediction
Classification models predict categorical class labels, whereas prediction models predict continuous-valued functions.
Classification
Classification is a two-step process. In the first step (the learning step), a classifier is built from a set of training tuples whose class labels are known. In the second step, the classifier is used for classification: test data is used to estimate the accuracy of the classification rules, and if the accuracy is considered acceptable, the rules can be applied to new data tuples, as sketched below.
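A minimal sketch of this two-step process, assuming scikit-learn is available; the Iris dataset and the 0.9 acceptability threshold are illustrative stand-ins, not part of these notes.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # learning step
accuracy = clf.score(X_test, y_test)  # classification step: estimate accuracy on test data

if accuracy >= 0.9:  # illustrative acceptability threshold
    print(clf.predict(X_test[:3]))  # apply the classifier to new tuples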
Attribute Selection Measures
Attribute selection measures (splitting criteria) are used to decide how to best partition the dataset. These measures provide a ranking of the attributes for partitioning the training tuples.
1. Information Gain
This method is the main method that is used to build decision
trees. It reduces the information that is required to classify the
tuples. It reduces the number of tests that are needed to classify
the given tuple. The attribute with the highest information gain
is selected.
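A minimal sketch of computing information gain for candidate attributes; the toy weather/wind data is an illustrative assumption.

from collections import Counter
import math

def entropy(labels):
    # Expected information (in bits) needed to classify a tuple.
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr_index, labels):
    # Entropy of the whole set minus the weighted entropy after splitting on one attribute.
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

rows = [("sunny", "weak"), ("sunny", "strong"), ("rain", "weak"), ("rain", "strong")]
labels = ["no", "no", "yes", "no"]
print(information_gain(rows, 0, labels))  # gain from splitting on weather
print(information_gain(rows, 1, labels))  # gain from splitting on wind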
2. Gini Index
The Gini index measures the impurity of a data partition. Equivalently, if we select two items from a population at random, the probability that they belong to the same class is 1 when the population is pure.
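A minimal sketch of the Gini index for a set of class labels; the label lists are illustrative.

from collections import Counter

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities.
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

print(gini(["yes", "yes", "yes"]))       # 0.0 -> pure partition
print(gini(["yes", "no", "yes", "no"]))  # 0.5 -> maximally mixed (two classes)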
Tree Pruning
When a decision tree is built, some branches may reflect noise or outliers in the training data. Tree pruning removes such branches to reduce overfitting and improve accuracy on unseen data.
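A minimal sketch of post-pruning, using scikit-learn's cost-complexity pruning (the ccp_alpha parameter) as one concrete technique; the Iris dataset is an illustrative stand-in for the training data.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow a full tree, then compute the candidate pruning strengths (alphas).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Refit with each alpha and keep the tree that generalizes best on the test data.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_test, y_test),
)
print("pruned depth:", best.get_depth(), "test accuracy:", best.score(X_test, y_test))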
Clustering Methods
Clustering groups data objects so that objects in the same cluster are similar to one another and dissimilar to objects in other clusters. The major clustering methods can be classified as follows −
1. Partitioning Method
2. Hierarchical Method
3. Density-based Method
4. Grid-Based Method
5. Model-Based Method
6. Constraint-based Method
1. Partitioning Methods
Suppose we are given a database of ‘n’ objects and the partitioning method constructs ‘k’ partitions of the data, where each partition represents a cluster and k ≤ n. That is, it classifies the data into k groups, which satisfy the following requirements −
Each group contains at least one object.
Each object must belong to exactly one group.
Points to remember
For a given number of partitions (say k), the partitioning method creates an initial partitioning.
It then uses an iterative relocation technique to improve the partitioning by moving objects from one group to another, as in the k-means sketch below.
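A minimal sketch of a partitioning method (k-means) assuming scikit-learn; the two-dimensional toy points are illustrative.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # one tight group
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.3]])  # another tight group

# k = 2 partitions; each object ends up in exactly one cluster.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster membership of each object
print(kmeans.cluster_centers_)  # the relocated cluster centers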
2. Hierarchical Methods
This method creates a hierarchical decomposition of the given
set of data objects. We can classify hierarchical methods on the
basis of how the hierarchical decomposition is formed. There
are two approaches here −
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object forming a separate group. The method then keeps merging the objects or groups that are close to one another until all of the groups are merged into one or until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters. This continues until each object is in its own cluster or until the termination condition holds. Hierarchical methods are rigid, i.e., once a merging or splitting is done, it can never be undone. An agglomerative sketch follows.
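A minimal sketch of the agglomerative (bottom-up) approach assuming scikit-learn; the toy points and the choice of average linkage are illustrative.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])

# Start with each object as its own group and keep merging the closest
# groups until only n_clusters groups remain (the termination condition).
agg = AgglomerativeClustering(n_clusters=2, linkage="average").fit(X)
print(agg.labels_)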
3. Density-based Method
This method is based on the notion of density. The basic idea is to continue growing a given cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
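A minimal sketch of a density-based method (DBSCAN) assuming scikit-learn; eps is the neighborhood radius and min_samples is the minimum number of points required in that neighborhood. The toy points are illustrative.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2],
              [4.0, 4.0], [4.1, 4.2], [4.2, 4.1],
              [9.0, 0.0]])  # an isolated point

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)  # dense regions get cluster ids; sparse points are labelled -1 (noise)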
4. Grid-based Method
In this method, the objects together form a grid. The object space is quantized into a finite number of cells that form a grid structure.
Advantages
The major advantage of this method is its fast processing time.
It is dependent only on the number of cells in each dimension of the quantized space.
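A minimal sketch of the grid-based idea: quantize the object space into cells and work with cell counts instead of individual objects. The cell size and toy points are illustrative assumptions.

from collections import Counter
import numpy as np

points = np.array([[0.2, 0.3], [0.4, 0.1], [0.3, 0.35],
                   [2.1, 2.2], [2.3, 2.4]])
cell_size = 1.0

# Map each object to the grid cell that contains it.
cells = [tuple(np.floor(p / cell_size).astype(int)) for p in points]
counts = Counter(cells)

# Further processing depends only on the number of occupied cells,
# not on the number of objects.
print(counts)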