Predictive Modeling
1
In the process of discussing supervised segmentation, we
introduce one of the fundamental ideas of data mining:
finding or selecting important, informative variables or
“attributes” of the entities described by the data.
What exactly it means to be “informative” varies among applications, but generally, information is a quantity that reduces uncertainty about something.
2
Outline
Models, Induction, and Prediction
Supervised Segmentation
3
Models, Induction, and Prediction
A model is a simplified representation of reality created
to serve a purpose. It is simplified based on some
assumptions about what is and is not important for the
specific purpose, or sometimes based on constraints on
information or tractability.
For example, a map is a model of the physical world. It
abstracts away a tremendous amount of information that
the mapmaker deemed irrelevant for its purpose. It
preserves, and sometimes further simplifies, the relevant
information.
4
Models, Induction, and Prediction
In data science, a predictive model is a formula for
estimating the unknown value of interest: the target.
The formula could be mathematical, or it could be a
logical statement such as a rule. Often it is a hybrid
of the two.
Given our division of supervised data mining into
classification and regression, we will consider
classification models (and class-probability
estimation models) and regression models.
5
Terminology: Prediction
In common usage, prediction means to forecast a future
event.
In data science, prediction more generally means to estimate
an unknown value. This value could be something in the
future (true prediction, in the common-usage sense), but it could also
be something in the present or in the past.
Indeed, since data mining usually deals with historical data,
models very often are built and tested using events from the
past.
The key is that the model is intended to be used to estimate
an unknown value.
6
Models, Induction, and Prediction
Supervised learning is model creation where the model describes a relationship between a set of selected variables (attributes or features) and a predefined variable called the target variable. The model
estimates the value of the target variable as a function (possibly
a probabilistic function) of the features.
So, for our churn-prediction problem we would like to build a
model of the propensity to churn as a function of customer
account attributes, such as age, income, length of time with the
company, number of calls to customer service, overage
charges, customer demographics, data usage, and others.
7
Models, Induction, and Prediction
8
Many Names for the Same Things
The principles and techniques of data science historically have
been studied in several different fields, including machine
learning, pattern recognition, statistics, databases, and others.
As a result there often are several different names for the
same things. We typically will refer to a dataset, whose form usually is the
same as a table of a database or a spreadsheet: rows correspond to the entities described by the data, and columns to their attributes (features).
9
Many Names for the Same Things
The features (table columns) have many different
names as well. Statisticians speak of independent
variables or predictors as the attributes supplied
as input. In operations research you may also
hear explanatory variable.
The target variable, whose values are to be predicted, is commonly called the dependent variable in statistics.
10
Models, Induction, and Prediction
The creation of models from data is known as
model induction.
The procedure that creates the model from the data is called the induction algorithm or learner.
11
Outline
Models, Induction, and Prediction
Supervised Segmentation
12
Supervised Segmentation
13
Supervised Segmentation
14
Outline
Models, Induction, and Prediction
Supervised Segmentation
15
Selecting Informative Attributes
16
Selecting Informative Attributes
Attributes:
◦head-shape: square, circular
◦body-shape: rectangular, oval
◦body-color: gray, white
Target variable:
◦write-off: Yes, No
17
Selecting Informative Attributes
So let’s ask ourselves:
◦ which of the attributes would be best to segment these people into
groups, in a way that will distinguish write-offs from non-write-offs?
Technically, we would like the resulting groups to be as pure
as possible. By pure we mean homogeneous with respect to
the target variable. If every member of a group has the same
value for the target, then the group is pure. If there is at least
one member of the group that has a different value for the
target variable than the rest of the group, then the group is
impure.
Unfortunately, in real data we seldom expect to find a variable that will split the data into perfectly pure segments.
18
Selecting Informative Attributes
We therefore need a purity measure.
The most common splitting criterion is called information gain, and it is based on a purity measure called entropy.
19
Selecting Informative Attributes
Entropy is a measure of disorder that can be
applied to a set, such as one of our individual
segments.
Disorder corresponds to how mixed (impure) the segment is with respect to the value of the target variable. Formally, for a set whose members take target values with proportions p1, p2, ...:
entropy = - p1 × log2(p1) - p2 × log2(p2) - ...
20
Selecting Informative Attributes
21
Selecting Informative Attributes
p(non-write-off) = 7 / 10 = 0.7
p(write-off) = 3 / 10 = 0.3
entropy(S) = - 0.7 × log2(0.7) - 0.3 × log2(0.3)
           ≈ - 0.7 × (-0.51) - 0.3 × (-1.74)
           ≈ 0.88
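To make the calculation concrete, here is a minimal Python sketch of the same entropy computation; the function name and the list encoding of the segment are illustrative choices, not anything prescribed by the slides.

```python
import math

def entropy(labels):
    """Entropy of a set, given the target-variable values of its members."""
    n = len(labels)
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / n      # proportion of the set with this value
        result -= p * math.log2(p)       # accumulate - p * log2(p)
    return result

# The 10-person segment from the slide: 7 non-write-offs, 3 write-offs.
S = ["no"] * 7 + ["yes"] * 3
print(round(entropy(S), 2))              # 0.88
```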
22
Selecting Informative Attributes
23
Selecting Informative Attributes
entropy(parent)
= - p( • ) × log2 p( • ) - p( ☆ ) × log2 p( ☆ )
≈ - 0.53 × (-0.9) - 0.47 × (-1.1)
≈ 0.99 (very impure)
24
Selecting Informative Attributes
The entropy of the left child is:
entropy(Balance < 50K) = - p( • ) × log2 p( • ) - p( ☆ ) × log2 p( ☆ )
≈ - 0.92 × ( - 0.12) - 0.08 × ( - 3.7)
≈ 0.39
The entropy of the right child is:
entropy(Balance ≥ 50K) = - p( • ) × log2 p( • ) - p( ☆ ) × log2 p( ☆ )
≈ - 0.24 × ( - 2.1) - 0.76 × ( - 0.39)
≈ 0.79
25
Selecting Informative Attributes
Information Gain
= entropy(parent) - [ p(Balance < 50K) × entropy(Balance < 50K)
                    + p(Balance ≥ 50K) × entropy(Balance ≥ 50K) ]
≈ 0.99 - (0.43 × 0.39 + 0.57 × 0.79)
≈ 0.37
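A similarly minimal sketch of the information-gain formula, plugging in the parent entropy, the child proportions, and the child entropies from this slide; representing each child as a (proportion, entropy) pair is an assumption made just for illustration.

```python
def information_gain(parent_entropy, children):
    """children: list of (proportion_of_parent, entropy) pairs, one per child segment."""
    weighted_child_entropy = sum(p * e for p, e in children)
    return parent_entropy - weighted_child_entropy

# Numbers from the Balance split above.
ig = information_gain(0.99, [(0.43, 0.39), (0.57, 0.79)])
print(round(ig, 2))   # 0.37
```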
26
Selecting Informative Attributes
entropy(parent) ≈ 0.99
entropy(Residence=OWN) ≈ 0.54
entropy(Residence=RENT) ≈ 0.97
entropy(Residence=OTHER) ≈ 0.98
27
Numeric variables
We have not discussed what exactly to do if the attribute is
numeric.
Numeric variables can be “discretized” by choosing a split
point (or many split points).
For example, Income could be divided into two or more ranges.
Information gain can be applied to evaluate the segmentation
created by this discretization of the numeric attribute. We still
are left with the question of how to choose the split point(s) for
the numeric attribute.
Conceptually, we can try all reasonable split points, and choose
the one that gives the highest information gain.
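One way to sketch the “try all reasonable split points” idea in Python: evaluate the midpoints between consecutive distinct values of the numeric attribute and keep the split with the highest information gain. The income/write-off data at the bottom is hypothetical, purely for illustration.

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

def best_numeric_split(values, labels):
    """Return (split_point, information_gain) maximizing gain for the test value < split_point."""
    parent = entropy(labels)
    pairs = sorted(zip(values, labels))
    distinct = sorted(set(values))
    best = (None, 0.0)
    for lo, hi in zip(distinct, distinct[1:]):
        split = (lo + hi) / 2                       # candidate split: midpoint
        left = [label for v, label in pairs if v < split]
        right = [label for v, label in pairs if v >= split]
        gain = parent - (len(left) / len(labels)) * entropy(left) \
                      - (len(right) / len(labels)) * entropy(right)
        if gain > best[1]:
            best = (split, gain)
    return best

# Hypothetical example: income in $1000s and whether the account was written off.
income = [25, 32, 40, 55, 61, 78]
write_off = ["yes", "yes", "yes", "no", "no", "no"]
print(best_numeric_split(income, write_off))        # (47.5, 1.0)
```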
28
Outline
Models, Induction, and Prediction
Supervised Segmentation
29
Example: Attribute Selection with Information Gain
30
Example: Attribute Selection with Information Gain
31
Example: Attribute Selection with Information Gain
32
Outline
Models, Induction, and Prediction
Supervised Segmentation
33
Supervised Segmentation with Tree-Structured Models
34
Supervised Segmentation with Tree-Structured Models
Consider a segmentation of the data to take the form of a “tree,” such as that shown in Figure 3-10.
35
Supervised Segmentation with Tree-Structured Models
37
Supervised Segmentation with Tree-Structured Models
38
Supervised Segmentation with Tree-Structured Models
39
Selecting Informative Attributes
Attributes:
◦head-shape: square, circular
◦body-shape: rectangular, oval
◦body-color: gray, white
Target variable:
◦write-off: Yes, No
40
Supervised Segmentation with Tree-Structured Models
41
Supervised Segmentation with Tree-Structured Models
Figure 3-12. Second partitioning: the oval body people sub-grouped by head type.
42
Supervised Segmentation with Tree-Structured Models
Figure 3-13. Third partitioning: the rectangular body people sub-grouped by body color.
43
Figure 3-14. The classification tree resulting from the splits done in Figure 3-11 to Figure 3-13.
44
Supervised Segmentation with Tree-Structured Models
45
Visualizing Segmentations
46
Trees as Sets of Rules
You classify a new unseen instance by starting at
the root node and following the attribute tests
downward until you reach a leaf node, which
specifies the instance’s predicted class.
If we trace down a single path from the root node
to a leaf, collecting the conditions as we go, we
generate a rule.
Each rule consists of the attribute tests along the
path connected with AND.
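As an illustration of this traversal, here is one possible (hypothetical) Python representation of such a tree, with a classify function that follows the attribute tests from the root down to a leaf; the example tree mirrors the Balance/Age rules on the next slide, and the node layout is an assumption for illustration, not a standard library structure.

```python
# A leaf is just a class label (a string); an internal node holds an attribute
# test and the two subtrees for the test's outcomes.
class Node:
    def __init__(self, attribute, threshold, if_true, if_false):
        self.attribute = attribute     # e.g. "Balance"
        self.threshold = threshold     # test: instance[attribute] < threshold
        self.if_true = if_true
        self.if_false = if_false

def classify(tree, instance):
    """Follow the attribute tests downward until reaching a leaf (a class label)."""
    while isinstance(tree, Node):
        goes_left = instance[tree.attribute] < tree.threshold
        tree = tree.if_true if goes_left else tree.if_false
    return tree

# Tree mirroring the Balance/Age example on the next slide.
tree = Node("Balance", 50_000,
            Node("Age", 50, "Write-off", "No Write-off"),
            Node("Age", 45, "Write-off", "No Write-off"))

print(classify(tree, {"Balance": 30_000, "Age": 62}))   # No Write-off
```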
47
Trees as Sets of Rules
48
Trees as Sets of Rules
IF (Balance < 50K) AND (Age < 50) THEN Class=Write-off
IF (Balance < 50K) AND (Age ≥ 50) THEN Class=No Write-off
IF (Balance ≥ 50K) AND (Age < 45) THEN Class=Write-off
IF (Balance ≥ 50K) AND (Age ≥ 45) THEN Class=No Write-off
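The same four rules can be written directly as a small Python function; the function name and the units (dollars, years) are illustrative assumptions.

```python
def predict_write_off(balance, age):
    """The four rules above as nested conditions (balance in dollars, age in years)."""
    if balance < 50_000:
        return "Write-off" if age < 50 else "No Write-off"
    else:
        return "Write-off" if age < 45 else "No Write-off"

print(predict_write_off(balance=120_000, age=40))   # Write-off
```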
49
Trees as Sets of Rules
The classification tree is equivalent to this rule
set.
Every classification tree can be expressed as a set of rules in this way.
50
Probability Estimation
In many decision-making problems, we would like a
more informative prediction than just a classification.
For example, consider our churn-prediction problem: if we have a limited budget for retention incentives, we would like to target the customers most likely to churn, which requires an estimate of each customer's probability of churning rather than just a yes/no classification.
51
Probability Estimation Tree
52
Probability Estimation
If we are satisfied to assign the same class probability
to every member of the segment corresponding to a
tree leaf, we can use instance counts at each leaf to
compute a class probability estimate.
For example, if a leaf contains n positive instances and m negative instances, the probability that a new instance falling into that leaf belongs to the positive class can be estimated as n / (n + m).
53
Probability Estimation
A problem: we may be overly optimistic about the
probability of class membership for segments with
very small numbers of instances. At the extreme, if
a leaf happens to have only a single instance,
should we be willing to say that there is a 100%
probability that members of that segment will have
the class that this one instance happens to have?
This phenomenon is one example of a fundamental
issue in data science (“overfitting”).
54
Probability Estimation
Instead of simply computing the frequency, we would
often use a “smoothed” version of the frequency-
based estimate, known as the Laplace correction, the
purpose of which is to moderate the influence of
leaves with only a few instances.
The equation for binary class probability estimation becomes:
p(c) = (n + 1) / (n + m + 2),
where n is the number of instances in the leaf belonging to class c and m is the number of instances in the leaf not belonging to class c.
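A minimal sketch contrasting the raw frequency-based estimate with the Laplace-corrected estimate, using the single-instance leaf discussed on the previous slide.

```python
def frequency_estimate(n_pos, n_neg):
    """Raw frequency-based estimate: n / (n + m)."""
    return n_pos / (n_pos + n_neg)

def laplace_estimate(n_pos, n_neg):
    """Laplace-corrected estimate for binary class probability: (n + 1) / (n + m + 2)."""
    return (n_pos + 1) / (n_pos + n_neg + 2)

# A leaf with a single positive instance: the raw frequency claims 100%,
# while the smoothed estimate is far less extreme.
print(frequency_estimate(1, 0))   # 1.0
print(laplace_estimate(1, 0))     # 0.666...
```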
55
Example: Addressing the Churn
Problem with Tree Induction
We have a historical data set of 20,000 customers.
At the point of collecting the data, each customer either had stayed with the company or had left (churned).
56
Example: Addressing the Churn
Problem with Tree Induction
57
Example: Addressing the Churn
Problem with Tree Induction
How good is each of these variables individually?
For this we measure the information gain of each attribute, as discussed earlier.
58
59
Example: Addressing the Churn
Problem with Tree Induction
The answer is that the table ranks each feature
by how good it is independently, evaluated
separately on the entire population of instances.
Nodes in a classification tree depend on the instances above them in the tree.
60
Example: Addressing the Churn
Problem with Tree Induction
Therefore, except for the root node, features in a
classification tree are not evaluated on the entire
set of instances.
The information gain of a feature depends on the set of instances against which it is evaluated, so the ordering of features chosen by the tree may differ from their independent ranking over the whole population.
61