DM - 06 Mar 2025

Classification and Prediction

Classification and prediction are two methods used to mine data; they help to analyze new data and to explore more about unknown data.
Classification is the process of finding a good model that can categorize the available data and predict the class of unknown data.
Examples of classification:
o A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
o A marketing manager at a company needs to analyze a customer with a given profile to decide whether that customer will buy a new computer.
In prediction, we identify or predict the missing or unavailable value for a new observation based on the previous data that we have and on future assumptions. In prediction, the output is a continuous value.
Examples of prediction:
o Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company.
o Predicting the value of a house depending on facts such as the number of rooms, the total area, etc.
Regression is generally used for prediction. A minimal sketch follows.
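As an illustration (not from the original notes), here is a minimal regression sketch in Python. It assumes scikit-learn is available; the tiny house dataset below is invented for the example.

```python
# Hedged sketch: predicting a house price (a continuous value) with regression.
# The data are made up; each row is [number_of_rooms, total_area_sqft].
from sklearn.linear_model import LinearRegression

X = [[2, 800], [3, 1200], [4, 1600], [5, 2100]]
y = [150_000, 220_000, 290_000, 370_000]  # known sale prices

model = LinearRegression().fit(X, y)       # learn from previous data
print(model.predict([[3, 1300]]))          # predicted price for a new house
```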
How does Classification Work?
There are two stages in the data classification system:

o Classifier model creation: The classification algorithms construct the classifier in this stage. A classifier is constructed from a training set composed of database records and their corresponding class names. Each record in the training set belongs to a category or class; we may also refer to these records as samples, objects, or data points.
o Application of the classifier for classification: The test data are used here to estimate the accuracy of the classification algorithm. If the accuracy is deemed sufficient, the classification rules can be applied to new data records. Example applications are document classification, sentiment analysis, image classification, etc. A sketch of both stages appears below.
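As a hedged illustration of the two stages (my example, not part of the original notes), the following Python sketch builds a classifier from a training set and then estimates its accuracy on held-out test data, assuming scikit-learn and its bundled iris dataset:

```python
# Stage 1: classifier model creation from a training set.
# Stage 2: application of the classifier to test data to estimate accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, y_train)  # build the classifier
print(accuracy_score(y_test, clf.predict(X_test)))    # estimate its accuracy
```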
Data Classification Process: The data classification process can be categorized into five steps:
1. Create the goals, strategy, workflows, and architecture of data classification.
2. Classify the confidential details that we store.
3. Label the data using marks.
4. Use the results to improve protection and compliance.
5. Data is complex, and classification is a continuous method.
What is the Data Classification Lifecycle?

The data classification life cycle produces an excellent structure for controlling the flow of data in an enterprise. Businesses need to account for data security and compliance at each level. With the help of data classification, we can perform it at every stage, from origin to deletion. The data life cycle has the following stages:
1. Origin: It produces sensitive data in various formats, such as emails, Excel, Word, Google documents, social media, and websites.
2. Role-based practice: Role-based security restrictions apply to all delicate data by tagging based on in-house protection policies and agreement rules.
3. Storage: Here we have the obtained data, including access controls and encryption.
4. Sharing: Data is continually distributed among agents, consumers, and co-workers from various devices and platforms.
5. Archive: Here, data is eventually archived within an industry's storage systems.
6. Publication: Through the publication of data, it can reach customers. They can then view and download it in the form of dashboards.
Issues regarding classification and prediction:
1. Data Cleaning: Data cleaning involves removing the noise and treatment of missing values. The noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
2. Relevance Analysis: The database may also have irrelevant attributes. Correlation analysis is used to know whether any two given attributes are related.
3. Data Transformation and Reduction: The data can be transformed by any of the following methods.
o Normalization: The data is transformed using normalization. Normalization involves scaling all values for a given attribute to make them fall within a small specified range. Normalization is used when neural networks or methods involving measurements are used in the learning step (see the sketch after this list).
o Generalization: The data can also be transformed by generalizing it to a higher concept. For this purpose, we can use the concept hierarchies.
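As a hedged illustration (mine, not from the notes), min-max normalization scales all values of an attribute into a small specified range such as [0, 1]:

```python
# Min-max normalization: rescale an attribute's values into [0, 1].
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]  # assumes hi > lo

print(min_max_normalize([200, 300, 400, 600, 1000]))
# -> [0.0, 0.125, 0.25, 0.5, 1.0]
```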
Comparison of Classification and Prediction Methods
Here are the criteria for comparing the methods of classification and prediction:
o Accuracy: The accuracy of the classifier is its ability to predict the class label correctly; the accuracy of the predictor is how well it can estimate the unknown value.
o Speed: The speed of the method depends on the computational cost of generating and using the classifier or predictor.
o Robustness: Robustness is the ability to make correct predictions or classifications. In the context of data mining, robustness is the ability of the classifier or predictor to make correct predictions from incoming unknown data.
o Scalability: Scalability refers to an increase or decrease in the performance of the classifier or predictor based on the given data.
o Interpretability: Interpretability is how readily we can understand the reasoning behind the predictions or classifications made by the predictor or classifier.

Classification by Decision Tree Induction


Decision Tree is a supervised learning method used in data mining for classification and regression tasks. It is a tree that helps us in decision-making.
The decision tree creates classification or regression models as a tree structure. It separates a data set into smaller subsets, and at the same time, the decision tree is steadily developed. The final tree has decision nodes and leaf nodes. A decision node has at least two branches. The leaf nodes show a classification or decision; we cannot split a leaf node further. The uppermost decision node in a tree, which corresponds to the best predictor, is called the root node.
Decision trees can deal with both categorical and numerical data.
The benefits of having a decision tree are as follows:
o It does not require any domain knowledge.
o It is easy to comprehend.
o The learning and classification steps of a decision tree are simple and fast.
(Decision Tree: the root node tests age, with branches for young, middle-aged, and senior. The young branch leads to a Student? test and the senior branch to a Credit_rating? test with fair/excellent outcomes; the middle-aged branch and each test's outcomes end in yes/no leaf nodes.)

A tree may be binary or non-binary.


Key factors:
1. Entropy:
Entropy refers to a common way to measure impurity. In a decision tree, it measures the randomness or impurity in data sets: a very random dataset has high entropy, while a less random dataset has low entropy. See the formula sketch below.
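The standard definition (the usual formula, not one written out explicitly in the notes) for the entropy of a set S containing n classes is:

$$\text{Entropy}(S) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

where p_i is the proportion of tuples in S belonging to class i. For example, a set with 9 positive and 5 negative tuples has entropy -(9/14)log2(9/14) - (5/14)log2(5/14) ≈ 0.940.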

2. Information Gain:

Information Gain refers to the decline in entropy after the dataset is split; it is also called Entropy Reduction. Building a decision tree is all about discovering attributes that return the highest information gain.

(Figure: a split on an attribute reduces the entropy from E1 before the split to E2 after it; Information gain = E1 - E2, where E1 > E2.)
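In the usual notation (again the standard formula rather than one spelled out in the notes), the gain of splitting a set S on an attribute A is:

$$\text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v)$$

where S_v is the subset of S for which attribute A has value v; the weighted summation term plays the role of E2 above.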

Decision Tree Induction Algorithm


A machine learning researcher named J. Ross Quinlan developed, in 1980, a decision tree algorithm known as ID3 (Iterative Dichotomiser). Later, he presented C4.5, which was the successor of ID3. ID3 and C4.5 adopt a greedy approach. In this algorithm, there is no backtracking; the trees are constructed in a top-down, recursive, divide-and-conquer manner.

Algorithm: Generate_decision_tree. Generate a decision tree from the training tuples of data partition D.

Input:
1) Data partition, D, which is a set of training tuples and their associated class labels;

2) Attribute_list, the set of candidate attributes;

3) Attribute_selection_method, a procedure to determine the splitting criterion that "best" partitions the data tuples into individual classes. This criterion consists of a splitting attribute and, possibly, a split point or splitting subset.
Output: A decision tree.
Method:

1. (Step 1) Create a node N;
2. (Step 2) If tuples in D are all of the same class, C, then
3. (Step 3) Return N as a leaf node labeled with the class C;
4. (Step 4) If attribute_list is empty then
5. (Step 5) Return N as a leaf node labeled with the majority class in D; // majority voting
6. (Step 6) Apply Attribute_selection_method(D, attribute_list) to find the "best" splitting criterion;
7. (Step 7) Label node N with the splitting criterion;
8. (Step 8) If the splitting attribute is discrete-valued and multiway splits are allowed, then // not restricted to binary trees
9. (Step 9) attribute_list = attribute_list - splitting attribute; // remove the splitting attribute
10. (Step 10) For each outcome j of the splitting criterion // partition the tuples and grow subtrees for each partition
11. (Step 11) Let Dj be the set of data tuples in D satisfying outcome j; // a partition
12. (Step 12) If Dj is empty then
13. (Step 13) Attach a leaf labeled with the majority class in D to node N;
14. (Step 14) Else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
End for
15. (Step 15) Return N;

A hedged Python sketch of this procedure follows.

As noted above, there are three inputs to the algorithm: the data partition D, the attribute list, and the attribute selection method.
(Worked example, handwritten in the original notes and only partly legible: entropy and information gain are computed for the classic 14-tuple play-tennis dataset (D1-D14) with attributes Outlook {Sunny, Overcast, Rain}, Temperature, Humidity, and Wind. Outlook yields the highest gain and becomes the root; the Sunny partition is then split on Humidity and the Rain partition on Wind, giving the final decision tree.)
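For reference, the headline numbers of this standard example (reconstructed from the usual 14-tuple play-tennis data, since the handwritten arithmetic above is illegible) are:

$$\text{Entropy}(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940$$

$$\text{Gain}(S, \text{Outlook}) = 0.940 - \tfrac{5}{14}(0.971) - \tfrac{4}{14}(0) - \tfrac{5}{14}(0.971) \approx 0.246$$

Outlook's gain exceeds that of Humidity (≈ 0.151), Wind (≈ 0.048), and Temperature (≈ 0.029), so Outlook becomes the root; recursing on the Sunny and Rain partitions then selects Humidity and Wind respectively.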
