Unit - 3 Data Mining
= Higher entropy => higher uncertainty
= Lower entropy => lower uncertainty
= Conditional Entropy
H(Y|X) = Σ_x p(x) H(Y|X = x)
Attribute Selection Measure: Information Gain (ID3/C4.5)
= Select the attribute with the highest information gain
= Let p_i be the probability that an arbitrary tuple in D belongs to
class C_i, estimated by |C_{i,D}|/|D|
= Expected information (entropy) needed to classify a tuple in D:
Info(D) = -Σ_{i=1}^{m} p_i log2(p_i)
= Information needed (after using A to split D into v partitions) to classify D:
Info_A(D) = Σ_{j=1}^{v} (|D_j|/|D|) × Info(D_j)
= Information gained by branching on attribute A
Gain(A) = Info(D) - Info_A(D)
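As a concrete illustration of the formulas above, here is a minimal Python sketch (not from the notes) of Info(D), Info_A(D) and Gain(A) for a dataset held as a list of dicts; the list-of-dicts layout and the "class" label key are assumptions made for this example.

from collections import Counter
from math import log2

def info(labels):
    # Expected information (entropy) of a list of class labels: -sum(p_i * log2(p_i))
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_after_split(rows, attr, label="class"):
    # Weighted entropy Info_A(D) after partitioning rows on a discrete attribute attr
    n = len(rows)
    total = 0.0
    for value in set(r[attr] for r in rows):
        part = [r[label] for r in rows if r[attr] == value]
        total += len(part) / n * info(part)
    return total

def gain(rows, attr, label="class"):
    # Information gained by branching on attr: Info(D) - Info_A(D)
    return info([r[label] for r in rows]) - info_after_split(rows, attr, label)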
Attribute Selection: Information Gain
Class P: buys_computer = “yes” (9 tuples)
Class N: buys_computer = “no” (5 tuples)
Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
I(2,3) means “age <= 30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence
Gain(age) = Info(D) - Info_age(D) = 0.246
Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
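To check the slide's numbers, a short usage example reusing info() from the sketch above; the partition counts (2 yes/3 no, 4 yes/0 no, 3 yes/2 no) are the age groups of this buys_computer example.

info_D = info(["yes"] * 9 + ["no"] * 5)                   # I(9,5), about 0.940
info_age = (5/14) * info(["yes"]*2 + ["no"]*3) \
         + (4/14) * info(["yes"]*4) \
         + (5/14) * info(["yes"]*3 + ["no"]*2)            # about 0.694
print(round(info_D, 3), round(info_age, 3), round(info_D - info_age, 3))
# 0.94 0.694 0.247 (the slide's 0.246 comes from subtracting the rounded values)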
Computing Information-Gain for Continuous-Valued Attributes
= Let attribute A be a continuous-valued attribute
= Must determine the best split point for A
= Sort the values of A in increasing order
= Typically, the midpoint between each pair of adjacent values
is considered as a possible split point
= (a_i + a_{i+1})/2 is the midpoint between the values of a_i and a_{i+1}
= The point with the minimum expected information requirement for A is selected as the split-point for A
= Split:
= D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
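A hedged sketch of this procedure, reusing info() from the earlier snippet; the function name best_split_point and its signature are made up for illustration.

def best_split_point(values, labels):
    # Try the midpoint of each pair of adjacent sorted values and return the
    # (split_point, expected_information) pair with the lowest Info_A(D).
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                                   # equal values give no new midpoint
        mid = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [lab for v, lab in pairs if v <= mid]   # D1: A <= split-point
        right = [lab for v, lab in pairs if v > mid]   # D2: A > split-point
        expected = len(left) / n * info(left) + len(right) / n * info(right)
        if expected < best[1]:
            best = (mid, expected)
    return best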
Gain Ratio for Attribute Selection (C4.5)
= Information gain measure is biased towards attributes with a large number of values
= C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
SplitInfo_A(D) = -Σ_{j=1}^{v} (|D_j|/|D|) × log2(|D_j|/|D|)
= GainRatio(A) = Gain(A) / SplitInfo_A(D)
= Ex. SplitInfo_income(D) = -(4/14) × log2(4/14) - (6/14) × log2(6/14) - (4/14) × log2(4/14) = 1.557
= gain_ratio(income) = 0.029/1.557 = 0.019
= The attribute with the maximum gain ratio is selected as the
splitting attribute
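A small sketch of SplitInfo_A(D) and GainRatio(A) under the same assumed data layout (reusing Counter, log2 and gain() from the earlier snippets), plus a numeric check of the income example; the 4/6/4 partition sizes (low/medium/high) are taken from the buys_computer data.

def split_info(rows, attr):
    # SplitInfo_A(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|)) over the v partitions on attr
    n = len(rows)
    return -sum(s / n * log2(s / n) for s in Counter(r[attr] for r in rows).values())

def gain_ratio(rows, attr, label="class"):
    return gain(rows, attr, label) / split_info(rows, attr)

sizes = [4, 6, 4]          # income: 4 low, 6 medium, 4 high
print(round(-sum(s/14 * log2(s/14) for s in sizes), 3))   # 1.557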
Gini Index (CART, IBM IntelligentMiner)
= If a data set D contains examples from n classes, the gini index gini(D) is defined as
gini(D) = 1 - Σ_{j=1}^{n} p_j^2
where p_j is the relative frequency of class j in D
= If a data set D is split on A into two subsets D_1 and D_2, the gini index gini_A(D) is defined as
gini_A(D) = (|D_1|/|D|) gini(D_1) + (|D_2|/|D|) gini(D_2)
= Reduction in Impurity:
Δgini(A) = gini(D) - gini_A(D)
= The attribute that provides the smallest gini_A(D) (or the largest
reduction in impurity) is chosen to split the node (need to
enumerate all the possible splitting points for each attribute)
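A minimal Python sketch of these definitions (same assumed list-of-dicts layout as the earlier snippets); expressing the binary split as a predicate on a row is a choice made here, not part of the slide.

def gini(labels):
    # gini(D) = 1 - sum(p_j^2) over the class relative frequencies
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(rows, predicate, label="class"):
    # gini_A(D) for the binary split D1 = {predicate true}, D2 = {predicate false}
    d1 = [r[label] for r in rows if predicate(r)]
    d2 = [r[label] for r in rows if not predicate(r)]
    n = len(rows)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)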
Computation of Gini Index
= Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”
gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
= Suppose the attribute income partitions D into 10 in D_1: {low, medium} and 4 in D_2: {high}
gini_{income ∈ {low,medium}}(D) = (10/14) Gini(D_1) + (4/14) Gini(D_2)
= (10/14)(1 - (7/10)^2 - (3/10)^2) + (4/14)(1 - (2/4)^2 - (2/4)^2)
= 0.443 = Gini_{income ∈ {high}}(D)
Gini_{low,high} is 0.458; Gini_{medium,high} is 0.450. Thus, split on the
{low,medium} (and {high}) since it has the lowest Gini index
= All attributes are assumed continuous-valued
= May need other tools, e.g., clustering, to get the possible split
values
= Can be modified for categorical attributes
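Checking the example's numbers with gini() from the sketch above; the class counts (7 yes/3 no in {low, medium}, 2 yes/2 no in {high}) come from the buys_computer data.

print(round(gini(["yes"] * 9 + ["no"] * 5), 3))                        # 0.459
g = 10/14 * gini(["yes"]*7 + ["no"]*3) + 4/14 * gini(["yes"]*2 + ["no"]*2)
print(round(g, 3))                                                     # 0.443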
Decision tree construction becomes inefficient due to swapping of the training tuples in and out of main and cache memories.
Overfitting problem.
Decision Tree Induction algorithms for Scalability are:
SLIQ - Builds an index for each attribute; only the class list and the attribute list reside in memory.
Decision Tree Induction:
- Separates the data into smaller subsets.
- Classification is a form of data analysis that extracts models describing important data classes; such models are called classifiers.
How does classification work?
Data classification is a two-step process:
1. Learning step
2. Classification step
Learning step:
- A classifier is built describing a predetermined set of data classes or concepts.
- The classifier is built by analyzing a training set made up of tuples and their associated class labels.
- A tuple X is represented by an n-dimensional attribute vector X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n database attributes.
Unit III - Classification
Classification - It is also called Supervised Learning.
Supervision:
- The training data are used to learn the classifier.
- The training data are labeled data.
- New data (unlabeled) are classified using the training data.
Principles:
- Construct models (functions) based on some training examples.
- Describe and distinguish classes or concepts for future prediction.
- Predict some unknown class labels.
Comparing Classification and Prediction Methods:

S.No 1
Classification: predicts categorical (discrete, unordered) labels; uses the labels of the training data to classify new data.
Prediction: a predictor predicts a continuous-valued function or ordered value, i.e., predicts unknown or missing values; regression analysis is used for prediction; prediction is also called numeric prediction.

S.No 2
Classification: a classification model categorizes bank loan applications as either safe or risky.
Prediction: a prediction model predicts the expenditures of potential customers on purchases of computer equipment based on their income and occupation.
Accuracy:
Classifier Accuracy-the ability of a classifier to predict class labels.
Predictor Accuracy -how well a given predictor can guess the value of the predicted
attribute for new or previously unseen data.
Speed:
Time to construct the model(training time)
Time to use the model(classification/prediction time)
Robustness:
Handling noisy data or data with missing values.
Scalability:
This refers to the ability to construct the classifier or predictor efficiently given large amounts of
data.
Interpretability:
This refers to the level of understanding and insight that is provided by the classifier or predictor.
Applications
Credit/loan approval
Medical diagnosis: if a tumor is cancerous or benign
Fraud detection: if a transaction is fraudulent
Web page categorization: which category it is
Classification—A Two-Step Process
* Model construction (learning or training step): construct a classification model based on
training data.
Training Data:
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
The set of tuples used for model construction is the training set.
The model is represented as classification rules, decision trees, or mathematical formulae.
* Model usage: for classifying future or unknown objects
Estimate accuracy of the model
To measure the accuracy of a model we need test data
Accuracy rate is the percentage of test set samples that are correctly classified by the model
Test set is independent of training set (otherwise overfitting)
If the accuracy is acceptable, use the model to classify new data
Note: If the test set is used to select models, it is called validation (test) set
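As an end-to-end illustration of the two-step process (not part of the original notes), a short scikit-learn sketch: build a decision tree classifier on a training set, estimate its accuracy rate on an independent test set, and only then use it to classify new data. The iris dataset merely stands in for real training data.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: model construction (learning step) on labeled training tuples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(criterion="entropy").fit(X_train, y_train)

# Step 2: model usage - first estimate accuracy on the independent test set.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy rate on test set: {accuracy:.2%}")

# If the accuracy is acceptable, classify new (unlabeled) tuples.
print(model.predict(X_test[:3]))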