Unit 3 DM
• Classification
• Prediction
• We use classification and prediction to extract models that
describe important data classes or predict future data
trends. Classification models predict categorical class
labels, and prediction models predict continuous-valued
functions.
• What is Classification?
• Classification identifies the category, or class label, of a
new observation. First, a set of data is used as training data.
• The set of input data and the corresponding outputs are given
to the algorithm. So, the training dataset includes the input
data and their associated class labels.
• Using the training dataset, the algorithm derives a model,
called the classifier.
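As a minimal sketch of how an algorithm can derive a classifier from labeled training data, the example below uses a nearest-centroid rule. The rule, the function names, and the loan-risk data are all assumptions chosen for illustration, not the method these notes prescribe.

```python
from collections import defaultdict

def train_nearest_centroid(samples, labels):
    """Derive a model: one mean vector (centroid) per class label."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def classify(model, x):
    """Assign x the label of the closest centroid (squared Euclidean distance)."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(model, key=lambda y: dist2(model[y]))

# Hypothetical training data: (income in thousands, years employed) -> class label
train_x = [(20, 1), (25, 2), (80, 10), (90, 12)]
train_y = ["risky", "risky", "safe", "safe"]

model = train_nearest_centroid(train_x, train_y)
print(classify(model, (85, 9)))  # safe
print(classify(model, (22, 1)))  # risky
```

The inputs and their class labels go in together during training; afterwards the model alone is enough to label a new observation.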
• How does Classification Work?
• The functioning of classification can be illustrated with a
bank loan application example.
• There are two stages in the data classification system:
building the classifier (model creation) and applying the
classifier for classification.
• Developing the classifier or model creation: This stage is the
learning stage or the learning process. The classification algorithms
construct the classifier in this stage.
• A classifier is constructed from a training set composed of
database records and their corresponding class labels.
• Each record that makes up the training set belongs to a predefined
category or class. We may also refer to these records as samples,
objects, or data points.
• Applying the classifier for classification:
• The classifier is used for classification at this stage. The test data are
used here to estimate the accuracy of the classifier. If
the accuracy is deemed sufficient, the classification rules can be
applied to new data records. Applications of classification include:
– Sentiment Analysis: Sentiment analysis is highly helpful in social media
monitoring. We can use it to extract social media insights. Well-trained models
provide consistently accurate results in a fraction of the time.
– Document Classification: We can use document classification to organize
documents into sections according to their content. Document classification is a
form of text classification; we can classify the words in the entire document, and
with the help of machine learning classification algorithms, we can do so
automatically.
– Image Classification: Image classification assigns an image to one of the trained
categories. These could be the caption of the image, a statistical value, or a theme.
– Machine Learning Classification: It uses statistically demonstrable algorithm
rules to execute analytical tasks that would take humans hundreds of additional
hours to perform.
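The two stages described earlier in this section (building the classifier, then using test data to estimate its accuracy before applying it to new records) can be sketched as follows. The 1-nearest-neighbor rule, the records, and the 0.9 acceptance threshold are illustrative assumptions.

```python
def nearest_neighbor_label(train, x):
    """Return the label of the training record closest to x (1-nearest neighbor)."""
    def dist2(record):
        features, _ = record
        return sum((a - b) ** 2 for a, b in zip(features, x))
    return min(train, key=dist2)[1]

def accuracy(train, test):
    """Fraction of test records whose predicted label matches the known label."""
    correct = sum(1 for x, y in test if nearest_neighbor_label(train, x) == y)
    return correct / len(test)

# Hypothetical labeled records: ((income in thousands, years employed), class label)
records = [
    ((20, 1), "risky"), ((25, 2), "risky"), ((30, 1), "risky"),
    ((80, 10), "safe"), ((90, 12), "safe"), ((75, 8), "safe"),
    ((28, 3), "risky"), ((85, 9), "safe"),
]

# Stage 1: learn from the training portion (1-NN simply stores these records).
train, test = records[:6], records[6:]

# Stage 2: estimate accuracy on held-out test records before trusting the model.
acc = accuracy(train, test)
print(f"test accuracy: {acc:.2f}")  # test accuracy: 1.00

if acc >= 0.9:  # acceptance threshold is an arbitrary assumption
    print(nearest_neighbor_label(train, (22, 2)))  # risky
```

Keeping the test records out of training is what makes the accuracy estimate meaningful for new data.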
• What is Prediction?
• Prediction is another data analysis process. It is used to
find a numerical output. As in classification, the training
dataset contains the inputs and the corresponding numerical
output values.
• The algorithm derives the model, or predictor, from
the training dataset.
• The model should produce a numerical output when new data
is given. Unlike classification, this method does not produce a
class label. The model predicts a continuous-valued function
or ordered value.
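One common way to build such a predictor is a least-squares line fit. The sketch below assumes a single input attribute and uses made-up experience and salary figures; it is an illustration, not the only form a predictor can take.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: years of experience -> salary (in thousands)
years = [1, 2, 3, 4, 5]
salary = [30, 35, 40, 45, 50]

slope, intercept = fit_line(years, salary)

# Predict a continuous value for a new input (6 years of experience).
predicted = slope * 6 + intercept
print(predicted)  # 55.0
```

The output is a number on a continuous scale rather than one of a fixed set of class labels.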
Classification and Prediction Issues
Classification                            | Prediction
In classification, the model is known     | In prediction, the model is known
as the classifier.                        | as the predictor.
For example, grouping patients based on   | For example, predicting the correct
their medical records can be considered   | treatment for a particular disease
a classification.                         | for a person.
What are the various Issues regarding Classification and Prediction in data mining?
• Data cleaning − This refers to pre-processing the data to remove or reduce noise (by
applying smoothing methods) and to handle missing values (e.g., by replacing a missing
value with the most commonly occurring value for that attribute, or with the most
probable value based on statistics).
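The missing-value strategy mentioned above (replace a missing value with the most commonly occurring value for that attribute) might look like this. The record layout, the use of `None` as the missing-value marker, and the function name are assumptions for illustration.

```python
from collections import Counter

def fill_missing_with_mode(records, attribute):
    """Replace None in `attribute` with the most frequently occurring value.

    Assumes at least one record has a non-missing value for the attribute.
    """
    observed = [r[attribute] for r in records if r[attribute] is not None]
    most_common_value = Counter(observed).most_common(1)[0][0]
    return [
        {**r, attribute: most_common_value} if r[attribute] is None else r
        for r in records
    ]

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "job": "clerk"},
    {"id": 2, "job": None},
    {"id": 3, "job": "clerk"},
    {"id": 4, "job": "engineer"},
]

cleaned = fill_missing_with_mode(records, "job")
print(cleaned[1])  # {'id': 2, 'job': 'clerk'}
```

The original list is left unchanged, so the raw and cleaned data can be compared afterwards.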
• Relevance analysis − Many attributes in the data may be irrelevant to the
classification or prediction task. For instance, data recording the day of the week on
which a bank loan application was filed is unlikely to be relevant to the success of the
application. Moreover, some other attributes may be redundant.
• Therefore, relevance analysis can be performed on the data to remove irrelevant
or redundant attributes from the learning process. In machine learning, this step is
referred to as feature selection. Including such attributes may otherwise slow
down, and possibly mislead, the learning step.
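One simple way to detect redundant attributes is to drop any attribute that is almost perfectly correlated with one already kept. The sketch below covers only that part of relevance analysis; the 0.95 threshold, the column names, and the data are assumptions, and it presumes no attribute is constant (which would make the correlation undefined).

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def drop_redundant(columns, threshold=0.95):
    """Keep an attribute only if it is not highly correlated with one already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical attributes; the second repeats the first in a different unit.
columns = {
    "income_usd": [20000, 35000, 50000, 80000],
    "income_thousands": [20, 35, 50, 80],
    "age": [45, 23, 51, 30],
}

print(drop_redundant(columns))  # ['income_usd', 'age']
```

Correlation only catches linear redundancy between numeric attributes. Judging whether an attribute is irrelevant to the class label needs a separate measure.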
• Normalization involves scaling all values for a given attribute so that they fall within a
small specified range, such as -1.0 to 1.0, or 0.0 to 1.0. In methods that use
distance measurements, for instance, this prevents attributes with initially large ranges
(such as income) from outweighing attributes with initially smaller ranges.
• Data transformation − The data can be generalized to higher-level
concepts. Concept hierarchies can be used for this purpose. This is
especially helpful for continuous-valued attributes. For instance,
numeric values for the attribute income can be generalized to
discrete ranges such as low, medium, and high.
• Likewise, nominal-valued attributes, such as street, can be generalized
to higher-level concepts, such as city.
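The two generalizations described above can be sketched in a few lines. The income cut points and the street-to-city mapping are hypothetical. In practice they would come from a concept hierarchy defined for the dataset.

```python
def generalize_income(income, low_max=30000, high_min=70000):
    """Map a numeric income to a discrete concept; the cut points are assumptions."""
    if income < low_max:
        return "low"
    if income < high_min:
        return "medium"
    return "high"

# Hypothetical one-level concept hierarchy: street -> city
STREET_TO_CITY = {
    "Baker Street": "London",
    "Oxford Street": "London",
    "Fifth Avenue": "New York",
}

def generalize_street(street):
    """Replace a street name with its city, or 'unknown' if it is not in the hierarchy."""
    return STREET_TO_CITY.get(street, "unknown")

print([generalize_income(v) for v in (18000, 45000, 120000)])  # ['low', 'medium', 'high']
print(generalize_street("Baker Street"))  # London
```

After generalization, the attribute has far fewer distinct values, which tends to simplify the learning step.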
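Min-max scaling is one common way to normalize an attribute into a range such as 0.0 to 1.0 or -1.0 to 1.0. The sketch below assumes that technique; the function name and the income values are illustrative.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to the lower bound to avoid dividing by zero.
        return [new_min for _ in values]
    span = new_max - new_min
    return [new_min + (v - lo) * span / (hi - lo) for v in values]

# Hypothetical income values with a large original range
incomes = [20000, 50000, 80000]

print(min_max_scale(incomes))             # [0.0, 0.5, 1.0]
print(min_max_scale(incomes, -1.0, 1.0))  # [-1.0, 0.0, 1.0]
```

After scaling, income contributes to a distance calculation on the same footing as attributes that were already small in range.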
CLASSIFICATION BY DECISION TREE INDUCTION