DM - 06 Mar 2025
Classification and prediction are two data mining methods that help to analyze new data and to explore unknown data.
Classification is the process of finding a good model that can categorize the available data and predict the class of unknown data.
Examples of Classification -
o A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
o A marketing manager at a company needs to analyze whether a customer with a given profile will buy a new computer.
In prediction, we identify or predict the missing or unavailable data for a new observation, based on the previous data that we have and on future assumptions. In prediction, the output is a continuous value.
Examples of Prediction -
o Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company.
o Predicting the value of a house depending on facts such as the number of rooms, the total area, etc.
Regression is generally used for prediction.
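As a minimal sketch of prediction via regression, the snippet below fits an ordinary least-squares line with one feature (total area) to predict a continuous house price; the data points are illustrative, not taken from the text.

```python
# Ordinary least squares with a single feature: predict house price (a
# continuous value) from total area. The toy data below is illustrative.

def fit_line(xs, ys):
    """Fit y = a*x + b by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

areas = [50, 70, 90, 110]       # total area in square metres (illustrative)
prices = [100, 140, 180, 220]   # price in thousands (illustrative)

a, b = fit_line(areas, prices)
print(round(a * 80 + b))        # predicted price for an 80 m^2 house -> 160
```

Unlike classification, the output here is not a class label but a number on a continuous scale.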
How does Classification Work?
There are two stages in the data classification system:
o Classifier model creation:- The classification algorithm constructs the classifier in this stage. A classifier is constructed from a training set composed of database records and their corresponding class names. Each category that makes up the training set is referred to as a class. We may also refer to these records as samples, objects, or data points.
o Application of classifier for classification:- The test data are used here to estimate the accuracy of the classification algorithm. If the accuracy is deemed sufficient, the classification rules can be applied to new data records. Example applications are:- document classification, sentiment analysis, image classification, etc.
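The two stages above can be sketched in code: build a classifier from a labelled training set, then estimate its accuracy on held-out test records. The 1-nearest-neighbour rule and the toy loan records are illustrative assumptions, not the method prescribed by the text.

```python
# Stage 1: construct a classifier from a training set of (features, class)
# records. Stage 2: apply it to test records to estimate accuracy.
# Features here are hypothetical (income in thousands, age).

def classify(record, training_set):
    """Predict the class of `record` by its nearest training neighbour."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(training_set, key=lambda row: sq_dist(row[0], record))
    return nearest[1]

# Stage 1: training set, e.g. loan applicants labelled risky or safe
training_set = [((20, 25), "risky"), ((90, 45), "safe"),
                ((30, 30), "risky"), ((80, 50), "safe")]

# Stage 2: estimate accuracy on test records whose true labels are known
test_set = [((25, 28), "risky"), ((85, 48), "safe")]
correct = sum(classify(x, training_set) == y for x, y in test_set)
print(correct / len(test_set))   # fraction of test records classified correctly
```

If the measured accuracy is acceptable, the same `classify` call is then used on genuinely new, unlabelled records.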
Data Classification Process: The data classification process can be categorized into five steps:
o Create the goals, strategy, workflows, and architecture of data classification.
o Classify the confidential details that we store.
o Label the data using marks (data labelling).
o Use the results to improve data protection and compliance.
o Data is complex, and classification is a continuous process.
What is Data Classification Lifecycle?
The data classification life cycle produces an excellent structure for controlling the flow of data in an enterprise. Businesses need to account for data security and compliance at each level. With the help of data classification, we can perform it at every stage, from origin to deletion. The data life cycle has the following stages:
1. Origin: It produces sensitive data in various formats, with emails, Excel, Word, Google documents, social media, and websites.
2. Role-based practice: Role-based security restrictions apply to all delicate data by tagging based on in-house protection policies and agreement rules.
3. Storage: Here, we have the obtained data, including access controls and encryption.
4. Sharing: Data is continually distributed among agents, consumers, and co-workers from various devices and platforms.
5. Archive: Here, data is eventually archived within an industry's storage systems.
6. Publication: Through the publication of data, it can reach customers. They can then view and download it in the form of dashboards.
Issues regarding classification and prediction:
1. Data Cleaning: Data cleaning involves removing the noise and treatment of missing values. The noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
2. Relevance Analysis: The database may also have irrelevant attributes. Correlation analysis is used to know whether any two given attributes are related.
3. Data Transformation and Reduction: The data can be transformed by any of the following methods.
o Normalization: The data is transformed using normalization. Normalization involves scaling all values for a given attribute to make them fall within a small specified range. Normalization is used when neural networks or methods involving measurements are used in the learning step.
o Generalization: The data can also be transformed by generalizing it to the higher concept. For this purpose, we can use the concept hierarchies.
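Min-max normalization, the scaling described above, can be sketched as follows; the attribute values are illustrative.

```python
# Min-max normalization: scale all values of an attribute so they fall
# within a small specified range, [0, 1] by default.

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    return [(v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
            for v in values]

incomes = [20_000, 50_000, 80_000]     # illustrative attribute values
print(min_max_normalize(incomes))      # -> [0.0, 0.5, 1.0]
```

Keeping all attributes in the same small range prevents an attribute with large raw values from dominating distance-based or neural-network learning.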
Comparison of Classification and Prediction Methods
Here are the criteria for comparing the methods of classification and prediction:
Accuracy: The accuracy of the classifier can be referred to as the ability of the classifier to predict the class label correctly, and the accuracy of the predictor can be referred to as how well a given predictor can estimate the unknown value.
Speed: The speed of the method depends on the computational cost of generating and using the classifier or predictor.
Robustness: Robustness is the ability to make correct predictions or classifications. In the context of data mining, robustness is the ability of the classifier or predictor to make correct predictions from incoming unknown data.
Scalability: Scalability refers to an increase or decrease in the performance of the classifier or predictor based on the given data.
Interpretability: Interpretability is how readily we can understand the reasoning behind predictions or classifications made by the predictor or classifier.
(Figure: a decision tree splitting on age, with branches young, middle-aged, and senior, and leaf nodes labelled yes or no.)
1. Entropy: Entropy measures the randomness (impurity) of a dataset. A very random dataset has high entropy; a less random dataset has less entropy.
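The contrast between a very random and a less random dataset can be checked with a short entropy computation; the class labels below are illustrative.

```python
# Entropy of a set of class labels: sum over classes of -p * log2(p).
# A 50/50 split is maximally random (1 bit); a pure set has entropy 0.

from math import log2

def entropy(labels):
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

print(entropy(["yes", "no", "yes", "no"]))    # very random dataset -> 1.0
print(entropy(["yes", "yes", "yes", "yes"]))  # pure dataset        -> 0.0
```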
2. Information Gain: Information gain is the decrease in entropy when the dataset is split on an attribute. If E1 is the entropy before the split and E2 is the size-weighted entropy of the partitions after the split, then Information Gain = E1 - E2, where E1 > E2 for a useful split.
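The definition E1 - E2 can be sketched directly, weighting each partition's entropy by its relative size; the toy labels and the perfect two-way split are illustrative.

```python
# Information gain = E1 - E2: entropy of the whole label set minus the
# size-weighted entropy of the partitions produced by a split.

from math import log2

def entropy(labels):
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return sum(-p * log2(p) for p in probs)

def information_gain(labels, partitions):
    e1 = entropy(labels)                                  # before the split
    e2 = sum(len(part) / len(labels) * entropy(part)      # weighted by size
             for part in partitions)
    return e1 - e2                                        # E1 > E2 if useful

labels = ["yes", "yes", "no", "no"]
# A perfect split separates the classes completely, so E2 = 0 and gain = E1.
print(information_gain(labels, [["yes", "yes"], ["no", "no"]]))  # -> 1.0
```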
Input:
1) Data partition, D, which is a set of training tuples and their associated class labels;
2) attribute_list, the set of candidate attributes;
3) Attribute_selection_method, a procedure to determine the splitting criterion that "best" partitions the data tuples into individual classes. This criterion consists of a splitting attribute and, possibly, a split point or splitting subset.
Output: A decision tree.
Method:
(Step 1) Create a node N;
(Step 2) If the tuples in D are all of the same class, C, then
(Step 3) Return N as a leaf node labeled with the class C;
(Step 4) If attribute_list is empty then
(Step 5) Return N as a leaf node labeled with the majority class in D; // majority voting
(Step 6) Apply Attribute_selection_method(D, attribute_list) to find the "best" splitting criterion;
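The steps above can be sketched as a recursive function. For brevity this sketch simplifies Step 6 to "pick the first remaining attribute"; a real attribute-selection method would use information gain. Tuples are represented as dicts paired with a class label; the tiny dataset is illustrative.

```python
# A sketch of the decision-tree induction method: leaf on a pure class
# (Steps 2-3), leaf on majority vote when attributes run out (Steps 4-5),
# otherwise split and recurse on each attribute value (Step 6, simplified).

from collections import Counter

def generate_tree(D, attribute_list):
    labels = [label for _, label in D]
    # Steps 2-3: all tuples of the same class C -> leaf labeled C
    if len(set(labels)) == 1:
        return labels[0]
    # Steps 4-5: attribute_list empty -> leaf labeled with majority class
    if not attribute_list:
        return Counter(labels).most_common(1)[0][0]
    # Step 6 (simplified): select a splitting attribute, then recurse
    best = attribute_list[0]
    remaining = [a for a in attribute_list if a != best]
    tree = {best: {}}
    for value in {tup[best] for tup, _ in D}:
        subset = [(tup, label) for tup, label in D if tup[best] == value]
        tree[best][value] = generate_tree(subset, remaining)
    return tree

D = [({"outlook": "sunny"}, "no"), ({"outlook": "overcast"}, "yes")]
print(generate_tree(D, ["outlook"]))
# -> {'outlook': {'sunny': 'no', 'overcast': 'yes'}}
```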
Worked Example: Decision tree construction on the Play Tennis dataset

Training set S contains 14 examples (D1 to D14) with attributes Outlook, Temperature, Humidity, and Wind, and class PlayTennis:

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

S = [9+, 5-]
Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94

Step 1: Choose the root attribute.

Values(Outlook) = Sunny, Overcast, Rain
  S_Sunny    = {D1, D2, D8, D9, D11}  = [2+, 3-], Entropy = 0.971
  S_Overcast = {D3, D7, D12, D13}     = [4+, 0-], Entropy = 0
  S_Rain     = {D4, D5, D6, D10, D14} = [3+, 2-], Entropy = 0.971
  Gain(S, Outlook) = 0.94 - (5/14)(0.971) - (4/14)(0) - (5/14)(0.971) = 0.246

Similarly:
  Gain(S, Humidity)    = 0.151
  Gain(S, Wind)        = 0.048
  Gain(S, Temperature) = 0.029

Outlook gives the highest information gain, so it becomes the root node. The Overcast branch is pure ([4+, 0-]) and becomes a leaf labeled Yes.

Step 2: Split the Sunny branch. S_Sunny = [2+, 3-], Entropy = 0.971.

Values(Humidity) = High, Normal
  High   = {D1, D2, D8} = [0+, 3-], Entropy = 0
  Normal = {D9, D11}    = [2+, 0-], Entropy = 0
  Gain(S_Sunny, Humidity) = 0.971 - (3/5)(0) - (2/5)(0) = 0.971

  Gain(S_Sunny, Temperature) = 0.571
  Gain(S_Sunny, Wind)        = 0.019

Humidity is chosen: High -> No, Normal -> Yes.

Step 3: Split the Rain branch. S_Rain = [3+, 2-], Entropy = 0.971.

Values(Wind) = Weak, Strong
  Weak   = {D4, D5, D10} = [3+, 0-], Entropy = 0
  Strong = {D6, D14}     = [0+, 2-], Entropy = 0
  Gain(S_Rain, Wind) = 0.971 - (3/5)(0) - (2/5)(0) = 0.971

  Gain(S_Rain, Humidity) = 0.971 - (2/5)(1.0) - (3/5)(0.918) = 0.020

Wind is chosen: Strong -> No, Weak -> Yes.

Final decision tree:

Outlook = Sunny    -> Humidity (High: No, Normal: Yes)
Outlook = Overcast -> Yes
Outlook = Rain     -> Wind (Strong: No, Weak: Yes)
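The root-level gain values of this standard Play Tennis example can be checked numerically; the sketch below assumes the usual 14-example dataset (D1 to D14) and recomputes Entropy(S) and the information gain of each attribute at the root.

```python
# Recompute Entropy(S) and the root information gains for the standard
# 14-example Play Tennis dataset.

from math import log2

# (Outlook, Temperature, Humidity, Wind, PlayTennis) for D1..D14
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
attrs = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

def entropy(rows):
    total = len(rows)
    probs = [sum(r[4] == c for r in rows) / total for c in ("Yes", "No")]
    return sum(-p * log2(p) for p in probs if p > 0)

def gain(rows, attr):
    i = attrs[attr]
    return entropy(rows) - sum(
        len(part) / len(rows) * entropy(part)
        for v in {r[i] for r in rows}
        for part in [[r for r in rows if r[i] == v]])

print(round(entropy(data), 2))           # Entropy(S) -> 0.94
for a in attrs:
    print(a, round(gain(data, a), 3))    # Outlook has the largest gain
```

Outlook's gain comes out near 0.247 (the text's 0.246 reflects rounding the subset entropies first), confirming it as the root attribute.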