
Introduction to Predictive Modeling:
From Correlation to Supervised Segmentation

1
 In the process of discussing supervised segmentation, we introduce one of the fundamental ideas of data mining: finding or selecting important, informative variables or “attributes” of the entities described by the data.
 What exactly it means to be “informative” varies among applications, but generally, information is a quantity that reduces uncertainty about something.
 Finding informative attributes also is the basis for a widely used predictive modeling technique called tree induction, which we will introduce toward the end of this chapter as an application of this fundamental concept.

2
Outline
 Models, Induction, and Prediction
 Supervised Segmentation
◦ Selecting Informative Attributes
◦ Example: Attribute Selection with Information Gain
◦ Supervised Segmentation with Tree-Structured Models

3
Models, Induction, and Prediction
 A model is a simplified representation of reality created
to serve a purpose. It is simplified based on some
assumptions about what is and is not important for the
specific purpose, or sometimes based on constraints on
information or tractability.
 For example, a map is a model of the physical world. It
abstracts away a tremendous amount of information that
the mapmaker deemed irrelevant for its purpose. It
preserves, and sometimes further simplifies, the relevant
information.

4
Models, Induction, and Prediction
 In data science, a predictive model is a formula for
estimating the unknown value of interest: the target.
 The formula could be mathematical, or it could be a
logical statement such as a rule. Often it is a hybrid
of the two.
 Given our division of supervised data mining into
classification and regression, we will consider
classification models (and class-probability
estimation models) and regression models.

5
Terminology: Prediction
 In common usage, prediction means to forecast a future
event.
 In data science, prediction more generally means to estimate
an unknown value. This value could be something in the
future (in common usage, true prediction), but it could also
be something in the present or in the past.
 Indeed, since data mining usually deals with historical data,
models very often are built and tested using events from the
past.
 The key is that the model is intended to be used to estimate
an unknown value.

6
Models, Induction, and Prediction
 Supervised learning is model creation where the model describes a relationship between a set of selected variables (attributes or features) and a predefined variable called the target variable. The model estimates the value of the target variable as a function (possibly a probabilistic function) of the features.
 So, for our churn-prediction problem we would like to build a model of the propensity to churn as a function of customer account attributes, such as age, income, length of time with the company, number of calls to customer service, overage charges, customer demographics, data usage, and others.

7
Models, Induction, and Prediction

8
Many Names for the Same Things
 The principles and techniques of data science historically have been studied in several different fields, including machine learning, pattern recognition, statistics, databases, and others.
 As a result, there often are several different names for the same things. We typically will refer to a dataset, whose form usually is the same as a table of a database or a worksheet of a spreadsheet. A dataset contains a set of examples or instances. An instance is also referred to as a row of a database table or sometimes a case in statistics.

9
Many Names for the Same Things
 The features (table columns) have many different names as well. Statisticians speak of independent variables or predictors as the attributes supplied as input. In operations research you may also hear explanatory variable.
 The target variable, whose values are to be predicted, is commonly called the dependent variable in statistics.

10
Models, Induction, and Prediction
 The creation of models from data is known as model induction.
 The procedure that creates the model from the data is called the induction algorithm or learner. Most inductive procedures have variants that induce models both for classification and for regression.

11
Outline
 Models, Induction, and Prediction
 Supervised Segmentation
◦ Selecting Informative Attributes
◦ Example: Attribute Selection with Information Gain
◦ Supervised Segmentation with Tree-Structured Models

12
Supervised Segmentation

 A predictive model focuses on estimating the value of some particular target variable of interest.
 We try to segment the population into subgroups that have different values for the target variable.
 Segmentation may at the same time provide a human-understandable set of segmentation patterns.

13
Supervised Segmentation

 We might like to rank the variables by how good they are at predicting the value of the target.
 In our example, what variable gives us the most information about the future churn rate of the population? Being a professional? Age? Place of residence? Income? Number of complaints to customer service? Amount of overage charges?
 We now will look carefully into one useful way to select informative variables, and then later will show how this technique can be used repeatedly to build a supervised segmentation.

14
Outline
 Models, Induction, and Prediction
 Supervised Segmentation
◦ Selecting Informative Attributes
◦ Example: Attribute Selection with Information Gain
◦ Supervised Segmentation with Tree-Structured Models

15
Selecting Informative Attributes

 The label over each head represents the value of the target variable (write-off or not).
 Colors and shapes represent different predictor attributes.

16
Selecting Informative Attributes
 Attributes:
◦head-shape: square, circular
◦body-shape: rectangular, oval
◦body-color: gray, white
 Target variable:
◦write-off: Yes, No

17
Selecting Informative Attributes
 So let’s ask ourselves:
◦ which of the attributes would be best to segment these people into
groups, in a way that will distinguish write-offs from non-write-offs?
 Technically, we would like the resulting groups to be as pure
as possible. By pure we mean homogeneous with respect to
the target variable. If every member of a group has the same
value for the target, then the group is pure. If there is at least
one member of the group that has a different value for the
target variable than the rest of the group, then the group is
impure.
 Unfortunately, in real data we seldom expect to find a variable that will make the segments pure.

18
Selecting Informative Attributes
 We need a purity measure.
 The most common splitting criterion is called information gain, and it is based on a purity measure called entropy.
 Both concepts were invented by one of the pioneers of information theory, Claude Shannon, in his seminal work in the field (Shannon, 1948).

19
Selecting Informative Attributes
 Entropy is a measure of disorder that can be applied to a set, such as one of our individual segments.
 Disorder corresponds to how mixed (impure) the segment is with respect to these properties of interest.

20
Selecting Informative Attributes

21
Selecting Informative Attributes
p(non-write-off) = 7 / 10 = 0.7
p(write-off) = 3 / 10 = 0.3
entropy(S) = - 0.7 × log2(0.7) - 0.3 × log2(0.3)
           ≈ - 0.7 × (- 0.51) - 0.3 × (- 1.74)
           ≈ 0.88

22
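The calculation above is an instance of Shannon's entropy formula, entropy = - p1 × log2(p1) - p2 × log2(p2) - ..., summed over the class values. A minimal Python sketch of that computation (the counts are just the 7/3 write-off example from this slide):

import math

def entropy(class_counts):
    """Shannon entropy of a segment, given the count of members in each class."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# The ten-person segment above: 7 non-write-offs and 3 write-offs.
print(entropy([7, 3]))  # ~0.88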
Selecting Informative Attributes

23
Selecting Informative Attributes
entropy(parent) = - p( • ) × log2 p( • ) - p( ☆ ) × log2 p( ☆ )
                ≈ - 0.53 × (- 0.9) - 0.47 × (- 1.1)
                ≈ 0.99 (very impure)

24
Selecting Informative Attributes
 The entropy of the left child is:
entropy(Balance < 50K) = - p( • ) × log2 p( • ) - p( ☆ ) × log2 p( ☆ )
                       ≈ - 0.92 × (- 0.12) - 0.08 × (- 3.7)
                       ≈ 0.39
 The entropy of the right child is:
entropy(Balance ≥ 50K) = - p( • ) × log2 p( • ) - p( ☆ ) × log2 p( ☆ )
                       ≈ - 0.24 × (- 2.1) - 0.76 × (- 0.39)
                       ≈ 0.79

25
Selecting Informative Attributes
Information Gain = entropy(parent)
                   - [p(Balance < 50K) × entropy(Balance < 50K)
                      + p(Balance ≥ 50K) × entropy(Balance ≥ 50K)]
                 ≈ 0.99 - (0.43 × 0.39 + 0.57 × 0.79)
                 ≈ 0.37

26
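In code, information gain is the parent's entropy minus the weighted average of the children's entropies. A sketch using the entropy function from earlier, redefined here so the snippet stands alone; the instance counts are hypothetical, chosen only to roughly match the proportions on these slides:

import math

def entropy(class_counts):
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the weighted average entropy of the children."""
    parent_total = sum(parent_counts)
    weighted_children = sum(
        (sum(child) / parent_total) * entropy(child) for child in children_counts
    )
    return entropy(parent_counts) - weighted_children

# Hypothetical (dot, star) counts approximating the slide's proportions.
parent = [16, 14]               # ~0.53 / 0.47, entropy ~0.99
children = [[12, 1], [4, 13]]   # Balance < 50K and Balance >= 50K
print(information_gain(parent, children))  # ~0.38 (the slide's rounding gives 0.37)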
Selecting Informative Attributes

entropy(parent) ≈ 0.99
entropy(Residence=OWN) ≈ 0.54
entropy(Residence=RENT) ≈ 0.97
entropy(Residence=OTHER) ≈ 0.98

Information Gain ≈ 0.13

27
Numeric variables
 We have not discussed what exactly to do if the attribute is
numeric.
 Numeric variables can be “discretized” by choosing a split point (or many split points).
 For example, Income could be divided into two or more ranges.
Information gain can be applied to evaluate the segmentation
created by this discretization of the numeric attribute. We still
are left with the question of how to choose the split point(s) for
the numeric attribute.
 Conceptually, we can try all reasonable split points, and choose
the one that gives the highest information gain.

28
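A sketch of the brute-force approach just described: try the midpoint between each pair of consecutive distinct values as a candidate split point and keep the one with the highest information gain. The income values and churn labels below are made up purely for illustration:

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Return the numeric split point with the highest information gain."""
    parent_entropy = entropy(labels)
    pairs = list(zip(values, labels))
    best_gain, best_split = 0.0, None
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        split = (lo + hi) / 2
        left = [lab for v, lab in pairs if v < split]
        right = [lab for v, lab in pairs if v >= split]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = parent_entropy - weighted
        if gain > best_gain:
            best_gain, best_split = gain, split
    return best_split, best_gain

# Made-up incomes (in $K) and churn labels.
incomes = [20, 35, 40, 55, 70, 90, 120, 150]
churned = ["yes", "yes", "yes", "no", "no", "yes", "no", "no"]
print(best_split_point(incomes, churned))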
Outline
 Models, Induction, and Prediction
 Supervised Segmentation
◦ Selecting Informative Attributes
◦ Example: Attribute Selection with Information Gain
◦ Supervised Segmentation with Tree-Structured Models

29
Example: Attribute Selection with Information Gain

 Consider a dataset with instances described by attributes and a target variable.
 We can determine which attribute is the most informative with respect to estimating the value of the target variable.
 We can also rank a set of attributes by their informativeness, in particular by their information gain.

30
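As a sketch of this ranking idea, the snippet below computes the information gain of each categorical attribute over a toy dataset and sorts the attributes by gain. The rows and attribute names are invented, loosely in the spirit of the mushroom example, not the actual mushroom data:

import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, attribute, target):
    """Information gain of splitting the rows on one categorical attribute."""
    labels = [row[target] for row in rows]
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

rows = [
    {"odor": "almond", "cap-color": "brown", "edible?": "yes"},
    {"odor": "almond", "cap-color": "white", "edible?": "yes"},
    {"odor": "foul",   "cap-color": "brown", "edible?": "no"},
    {"odor": "foul",   "cap-color": "white", "edible?": "no"},
    {"odor": "none",   "cap-color": "brown", "edible?": "yes"},
    {"odor": "none",   "cap-color": "white", "edible?": "no"},
]
attributes = ["odor", "cap-color"]
ranking = sorted(attributes, key=lambda a: info_gain(rows, a, "edible?"), reverse=True)
print(ranking)  # 'odor' ranks first on this toy data, echoing the slides' conclusion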
Example: Attribute Selection with Information Gain

 This is a classification problem because we have a


target variable, called edible?, with two values yes
(edible) and no (poisonous), specifying our two
classes.
 Each of the rows in the training set has a value for
this target variable. We will use information gain to
answer the question: “Which single attribute is the
most useful for distinguishing edible (edible?=Yes)
mushrooms from poisonous (edible?=No) ones?”

31
Example: Attribute Selection with Information Gain

 If you’re going to build a model to determine mushroom edibility using only a single feature, you should choose its odor.
 If you were going to build a more complex model, you might start with the attribute ODOR before considering adding others.
 In fact, this is exactly the topic of the next section.

32
Outline
 Models, Induction, and Prediction
 Supervised Segmentation
◦ Selecting Informative Attributes
◦ Example: Attribute Selection with Information Gain
◦ Supervised Segmentation with Tree-Structured Models

33
Supervised Segmentation with Tree-Structured Models

 Let’s continue on the topic of creating a supervised segmentation, because as important as it is, attribute selection alone does not seem to be sufficient.
 If we select the single variable that gives the most information gain, we create a very simple segmentation.
 If we select multiple attributes each giving some information gain, it’s not clear how to put them together.
 We now introduce an elegant application of the ideas we’ve developed for selecting important attributes, to produce a multivariate (multiple attribute) supervised segmentation.

34
Supervised Segmentation with Tree-Structured Models

 Consider a segmentation of the data that takes the form of a “tree,” such as that shown in Figure 3-10.

35
Supervised Segmentation with Tree-Structured Models

 The values of Claudio’s attributes are Balance=115K, Employed=No, and Age=40.
36
Supervised Segmentation with Tree-Structured Models

 There are many techniques to induce a supervised segmentation from a dataset. One of the most popular is to create a tree-structured model (tree induction).
 These techniques are popular because tree models are easy to understand, and because the induction procedures are elegant (simple to describe) and easy to use. They are robust to many common data problems and are relatively efficient.

37
Supervised Segmentation with Tree-Structured Models

 Combining the ideas introduced above, the goal of the tree is to provide a supervised segmentation; more specifically, to partition the instances, based on their attributes, into subgroups that have similar values for their target variables.
 We would like each “leaf” segment to contain instances that tend to belong to the same class.

38
Supervised Segmentation with Tree-Structured Models

 To illustrate the process of classification tree induction, consider the very simple example set shown previously in Figure 3-2.

39
Selecting Informative Attributes
 Attributes:
◦head-shape: square, circular
◦body-shape: rectangular, oval
◦body-color: gray, white
 Target variable:
◦write-off: Yes, No

40
Supervised Segmentation with Tree-Structured Models

 Figure 3-11. First partitioning: splitting on body shape (rectangular versus oval).

41
Supervised Segmentation with Tree-Structured Models

 Figure 3-12. Second partitioning: the oval body people sub-grouped by head type.

42
Supervised Segmentation with Tree-Structured Models

 Figure 3-13. Third partitioning: the rectangular body people subgrouped by body color.

43
 Figure 3-14. The classification tree resulting from the splits done in Figure 3-11 to Figure 3-13.

44
Supervised Segmentation with Tree-Structured Models

 In summary, the procedure of classification tree induction is a recursive process of divide and conquer, where the goal at each step is to select an attribute to partition the current group into subgroups that are as pure as possible with respect to the target variable.
 We perform this partitioning recursively, splitting further and further until we are done. We choose the attributes to split upon by testing all of them and selecting whichever yields the purest subgroups.

45
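A bare-bones sketch of this recursive divide-and-conquer procedure, written from the description above rather than from any particular library: at each node, pick the attribute with the highest information gain, split on its values, and recurse until a group is pure or no attributes remain. The toy rows are invented and are not the actual data behind Figure 3-2:

import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, attribute, target):
    labels = [row[target] for row in rows]
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

def grow_tree(rows, attributes, target):
    """Recursive divide and conquer: return a majority-class leaf when the group
    is pure or no attributes remain, otherwise split on the best attribute."""
    labels = [row[target] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority  # leaf node
    best = max(attributes, key=lambda a: info_gain(rows, a, target))
    branches = defaultdict(list)
    for row in rows:
        branches[row[best]].append(row)
    remaining = [a for a in attributes if a != best]
    return {best: {value: grow_tree(subset, remaining, target)
                   for value, subset in branches.items()}}

rows = [
    {"body-shape": "oval", "head-shape": "square", "body-color": "white", "write-off": "no"},
    {"body-shape": "oval", "head-shape": "circular", "body-color": "white", "write-off": "yes"},
    {"body-shape": "rectangular", "head-shape": "square", "body-color": "gray", "write-off": "yes"},
    {"body-shape": "rectangular", "head-shape": "circular", "body-color": "white", "write-off": "no"},
]
print(grow_tree(rows, ["body-shape", "head-shape", "body-color"], "write-off"))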
Visualizing Segmentations

* The black dots correspond to instances of the class Write-off.
* The plus signs correspond to instances of the class non-Write-off.

46
Trees as Sets of Rules
 You classify a new unseen instance by starting at
the root node and following the attribute tests
downward until you reach a leaf node, which
specifies the instance’s predicted class.
 If we trace down a single path from the root node
to a leaf, collecting the conditions as we go, we
generate a rule.
 Each rule consists of the attribute tests along the
path connected with AND.

47
Trees as Sets of Rules

48
Trees as Sets of Rules
 IF (Balance < 50K) AND (Age < 50) THEN Class=Write-off
 IF (Balance < 50K) AND (Age ≥ 50) THEN Class=No Write-off
 IF (Balance ≥ 50K) AND (Age < 45) THEN Class=Write-off
 IF (Balance ≥ 50K) AND (Age ≥ 45) THEN Class=No Write-off

49
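Each rule is a conjunction of the attribute tests along one root-to-leaf path, so the whole rule set translates directly into code. A minimal sketch using the thresholds from the rules above (the function name and argument units are my own choices):

def classify(balance_k, age):
    """Apply the four mutually exclusive rules above; balance_k is in thousands."""
    if balance_k < 50:
        return "Write-off" if age < 50 else "No Write-off"
    else:  # balance_k >= 50
        return "Write-off" if age < 45 else "No Write-off"

print(classify(balance_k=30, age=55))   # matches rule 2 -> "No Write-off"
print(classify(balance_k=115, age=40))  # matches rule 3 -> "Write-off"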
Trees as Sets of Rules
 The classification tree is equivalent to this rule set.
 Every classification tree can be expressed as a set of rules this way.

50
Probability Estimation
 In many decision-making problems, we would like a more informative prediction than just a classification.
 For example, in our churn-prediction problem, if we know each customer’s probability of leaving when their contract is about to expire, we can rank the customers and direct a limited incentive budget to the highest-probability instances.
 Alternatively, we may want to allocate our incentive budget to the instances with the highest expected loss, for which we’ll need the probability of churn.

51
Probability Estimation Tree

52
Probability Estimation
 If we are satisfied to assign the same class probability to every member of the segment corresponding to a tree leaf, we can use instance counts at each leaf to compute a class probability estimate.
 For example, if a leaf contains n positive instances and m negative instances, the probability of any new instance being positive may be estimated as n/(n+m). This is called a frequency-based estimate of class membership probability.

53
Probability Estimation
 A problem: we may be overly optimistic about the
probability of class membership for segments with
very small numbers of instances. At the extreme, if
a leaf happens to have only a single instance,
should we be willing to say that there is a 100%
probability that members of that segment will have
the class that this one instance happens to have?
 This phenomenon is one example of a fundamental
issue in data science (“overfitting”).

54
Probability Estimation
 Instead of simply computing the frequency, we would often use a “smoothed” version of the frequency-based estimate, known as the Laplace correction, the purpose of which is to moderate the influence of leaves with only a few instances.
 The equation for binary class probability estimation becomes:
p(c) = (n + 1) / (n + m + 2)
where n is the number of instances in the leaf belonging to class c, and m is the number of instances not belonging to class c.

55
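A minimal sketch contrasting the raw frequency-based estimate n/(n+m) with the Laplace-corrected estimate (n+1)/(n+m+2). Note how the correction pulls a single-instance leaf away from an overconfident 100%, and how the two estimates converge as the leaf gathers more instances:

def frequency_estimate(n_pos, n_neg):
    """Raw frequency-based estimate of p(positive) at a leaf."""
    return n_pos / (n_pos + n_neg)

def laplace_estimate(n_pos, n_neg):
    """Laplace-corrected estimate; moderates leaves with few instances."""
    return (n_pos + 1) / (n_pos + n_neg + 2)

for n_pos, n_neg in [(1, 0), (5, 0), (50, 0), (20, 5)]:
    print(n_pos, n_neg,
          round(frequency_estimate(n_pos, n_neg), 3),
          round(laplace_estimate(n_pos, n_neg), 3))
# (1, 0): frequency 1.0 but Laplace 0.667; (50, 0): Laplace ~0.981, much closer to 1.0.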
Example: Addressing the Churn
Problem with Tree Induction
 We have a historical data set of 20,000 customers.
 At the point of collecting the data, each customer either had stayed with the company or had left (churned).

56
Example: Addressing the Churn
Problem with Tree Induction

57
Example: Addressing the Churn
Problem with Tree Induction
 How good is each of these variables individually?
 For this we measure the information gain of each attribute, as discussed earlier. Specifically, we apply Equation 3-2 to each variable independently over the entire set of instances, to see what each gains us.

58
59
Example: Addressing the Churn
Problem with Tree Induction
 The answer is that the table ranks each feature by how good it is independently, evaluated separately on the entire population of instances.
 Nodes in a classification tree depend on the instances above them in the tree.

60
Example: Addressing the Churn
Problem with Tree Induction
 Therefore, except for the root node, features in a classification tree are not evaluated on the entire set of instances.
 The information gain of a feature depends on the set of instances against which it is evaluated, so the ranking of features for some internal node may not be the same as the global ranking.

61
