DWM - Classification-Unit7
[Figure: the classification process. A classification algorithm learns a classifier from the training data; the classifier is then applied to testing data and to unseen data.]
Training data:

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen instance: (Jeff, Professor, 4). Tenured?
August 20, 2020 PRASHASTI KANIKAR 5
Classification Examples
• Teachers classify students’ grades as A, B, C, D
or F.
• Identify mushrooms as poisonous or edible.
• Predict when a river will flood.
• Identify individuals who pose credit risks.
• Speech recognition
• Pattern recognition
Classification problem
The classification problem can be expressed as: given a training database, predict
the class label of a previously unseen instance.
[Figure: a decision tree whose root tests age?, with branches <=30, 31..40, and >40 leading to leaves labeled no, yes, and yes.]
Overfitting and Tree Pruning
• Overfitting: An induced tree may overfit the training data
– Too many branches, some may reflect anomalies due to
noise or outliers
– Poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early; do not split a node
if this would result in the goodness measure falling below a
threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown” tree—
get a sequence of progressively pruned trees
• Use a set of data different from the training data to
decide which is the “best pruned tree”
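The two approaches above can be sketched with scikit-learn (assumed available here; the parameter values are illustrative, not prescriptive). Prepruning corresponds to halting construction early via `max_depth` and `min_impurity_decrease`; postpruning corresponds to cost-complexity pruning of a fully grown tree via `ccp_alpha`.

```python
# Sketch of pre- and post-pruning with scikit-learn (assumed installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0)

# Fully grown tree, no pruning: branches may fit noise in the training data.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Prepruning: stop splitting early (depth cap + minimum impurity decrease).
pre = DecisionTreeClassifier(max_depth=3, min_impurity_decrease=0.01,
                             random_state=0).fit(X_train, y_train)

# Postpruning: grow fully, then remove branches by cost-complexity pruning.
post = DecisionTreeClassifier(ccp_alpha=0.02,
                              random_state=0).fit(X_train, y_train)

# Pruned trees have no more leaves than the fully grown tree.
print(full.get_n_leaves(), pre.get_n_leaves(), post.get_n_leaves())
print(pre.score(X_test, y_test), post.score(X_test, y_test))
```

In practice the pruning strength (`ccp_alpha` or the prepruning threshold) is chosen using a validation set held out from training, as the slide notes.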
Regression models
• Regression models can be used to approximate the given data.
• In (simple) linear regression, the data are modeled to fit a straight
line.
• For example, a random variable, y (called a response variable), can
be modeled as a linear function of another random variable, x
(called a predictor variable), with the equation
• y = wx + b
• where the variance of y is assumed to be constant. In the context of
data mining, x and y are numeric database attributes.
• The coefficients, w and b (called regression coefficients), specify the
slope of the line and the y-intercept, respectively.
• These coefficients can be solved for by the method of least squares,
which minimizes the error between the actual data points and the
line estimated from them.
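The least-squares coefficients have a closed form: w is the ratio of the covariance of x and y to the variance of x, and b = ȳ − w·x̄. A minimal sketch, using only the standard library:

```python
# Least-squares fit of y = w*x + b via the closed-form solution.
def least_squares(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: w = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    w = num / den
    b = y_bar - w * x_bar  # intercept
    return w, b

# Points lying exactly on y = 2x + 1 recover w = 2, b = 1.
w, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(w, b)  # → 2.0 1.0
```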
Holdout
• In this method, the given data are randomly partitioned into
two independent sets, a training set and a test set.
• Typically, two-thirds of the data are allocated to the training
set, and the remaining one-third is allocated to the test set.
• The training set is used to derive the model. The model’s
accuracy is then estimated with the test set.
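The holdout partition described above can be sketched as follows (the 2/3 train fraction and the `seed` parameter are illustrative defaults):

```python
import random

# Holdout method: randomly partition the data into two independent
# sets, ~2/3 for training and the remaining ~1/3 for testing.
def holdout_split(data, train_frac=2 / 3, seed=None):
    rng = random.Random(seed)
    shuffled = data[:]        # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

data = list(range(30))
train, test = holdout_split(data, seed=42)
print(len(train), len(test))  # → 20 10
```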
Random subsampling
• Random subsampling is a variation of the
holdout method in which the holdout method
is repeated k times.
• The overall accuracy estimate is taken as the
average of the accuracies obtained from each
iteration.
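Random subsampling, as described above, simply repeats the holdout split k times and averages the per-iteration accuracies. The sketch below makes the loop runnable by standing in a trivial majority-class predictor for the model; any real classifier would take its place.

```python
import random
from statistics import mean

# One holdout split: shuffle, then cut ~2/3 for training.
def holdout_split(data, train_frac=2 / 3, rng=random):
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Random subsampling: repeat the holdout method k times and
# return the average of the k accuracy estimates.
def random_subsampling(data, k=10, seed=0):
    rng = random.Random(seed)
    accuracies = []
    for _ in range(k):
        train, test = holdout_split(data, rng=rng)
        # "Train": a stand-in model that predicts the majority label.
        labels = [label for _, label in train]
        majority = max(set(labels), key=labels.count)
        # "Test": accuracy of that prediction on the held-out set.
        acc = mean(1.0 if label == majority else 0.0 for _, label in test)
        accuracies.append(acc)
    return mean(accuracies)

# Toy labeled data: (feature, label) pairs.
data = [(x, "yes" if x % 3 else "no") for x in range(30)]
print(random_subsampling(data, k=5))
```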