Chapter 6 DATA MINING R1
Business Analytics
Chapter 6
DATA MINING
Introduction
• The increase in the use of data-mining techniques in business has
been caused largely by three events.
• Observation: set of recorded values of variables associated with a
single entity.
• Whether we are using a supervised or unsupervised learning
approach, the data-mining process comprises the following steps:
• Data Sampling
• Data Preparation
• Model Construction
• Model Assessment
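The four steps above can be sketched end to end. This is a minimal illustration with toy data and a simple least-squares model standing in for whatever model a real project would construct; every name and number here is hypothetical.

```python
# A minimal end-to-end sketch of the four data-mining steps, with toy
# stand-ins for each stage (nothing here comes from a real library workflow).
import random

random.seed(1)
raw = [(x, 2 * x + random.gauss(0, 1)) for x in range(100)]  # toy data

# 1. Data sampling: draw a representative subset
sample = random.sample(raw, k=40)

# 2. Data preparation: split records into input variable and outcome
xs = [x for x, _ in sample]
ys = [y for _, y in sample]

# 3. Model construction: fit a simple least-squares line to the sample
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

# 4. Model assessment: mean absolute error over the full data set
mae = sum(abs(y - (intercept + slope * x)) for x, y in raw) / len(raw)
print(round(slope, 2), round(mae, 2))
```

The fitted slope should land near the true value of 2, and the error stays small because the noise is mild.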
Data Sampling
• When dealing with large volumes of data, it is best practice to
extract a representative sample for analysis.
• A sample is representative if the analyst can draw the same conclusions from it as from the entire population of data.
• The sample of data must be large enough to contain significant
information, yet small enough to be manipulated quickly.
• Data-mining algorithms typically are more effective given more
data.
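A minimal sketch of drawing a simple random sample with Python's standard library; the customer records here are hypothetical, generated only to show that a random sample's summary statistics track the population's.

```python
import random

# Hypothetical population: 10,000 customer records as (id, balance)
random.seed(42)  # fixed seed for reproducibility
population = [(i, random.gauss(5000, 1500)) for i in range(10_000)]

# Simple random sampling without replacement keeps every record
# equally likely to be chosen, which supports representativeness.
sample = random.sample(population, k=500)

def mean(values):
    return sum(values) / len(values)

pop_mean = mean([bal for _, bal in population])
sam_mean = mean([bal for _, bal in sample])

# A representative sample supports the same conclusions as the population;
# the sample mean lands close to the population mean.
print(round(pop_mean), round(sam_mean))
```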
• When obtaining a representative sample, it is generally best to
include as many variables as possible in the sample.
Data Preparation
• The data in a data set are often said to be “dirty” and “raw”
before they have been preprocessed.
• We need to put them into a form that is best suited for a data-
mining algorithm.
• Treatment of Missing Data
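Two common treatments, sketched with hypothetical data: discard the observations with missing values, or impute a replacement such as the mean of the observed values. Which is appropriate depends on how much data can be spared and why the values are missing.

```python
from statistics import mean

# Hypothetical customer ages, with missing entries recorded as None
ages = [25, 31, None, 47, 52, None, 38]

# Option 1: discard observations with missing values
complete = [a for a in ages if a is not None]

# Option 2: impute the mean of the observed values
fill = mean(complete)
imputed = [a if a is not None else fill for a in ages]

print(complete, imputed)
```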
• Variable Representation
• In many data-mining applications, the number of variables for which data are recorded may be too large to analyze meaningfully.
• Dimension reduction: Process of removing variables from the
analysis without losing any crucial information.
• One way is to examine pairwise correlations to detect variables or
groups of variables that may supply similar information.
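The pairwise-correlation check can be sketched as follows. The variables below are hypothetical; `height_cm` and `height_in` are deliberate near-duplicates, so their high correlation flags one of them as a candidate for removal.

```python
from math import sqrt

def pearson(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical variables: height_cm and height_in carry the same information
data = {
    "height_cm": [160, 172, 168, 181, 175],
    "height_in": [63.0, 67.7, 66.1, 71.3, 68.9],
    "income":    [40, 52, 75, 48, 61],
}

# Flag highly correlated pairs as candidates for dimension reduction
names = list(data)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        r = pearson(data[names[i]], data[names[j]])
        if abs(r) > 0.9:
            print(names[i], names[j], round(r, 3))
```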
Unsupervised Learning
• Cluster Analysis: groups observations so that observations within a cluster are similar to one another, while observations in different clusters are dissimilar.
• Clustering methods:
• Hierarchical clustering
• k-means clustering
• When variables are measured on different scales, each variable is commonly standardized by replacing each value xij with its z-score, zij = (xij − x̄j) / sj, where x̄j and sj are the sample mean and standard deviation of variable j.
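A short sketch of z-score standardization, using the sample standard deviation (n − 1 denominator); the balance figures are hypothetical.

```python
from math import sqrt

def z_scores(values):
    """Standardize a variable: subtract the mean, divide by the
    sample standard deviation (n - 1 denominator)."""
    n = len(values)
    m = sum(values) / n
    s = sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    return [(v - m) / s for v in values]

# Hypothetical account balances, on a very different scale than, say, age
balances = [1000, 3000, 5000, 7000, 9000]
zs = z_scores(balances)
print([round(z, 2) for z in zs])  # [-1.26, -0.63, 0.0, 0.63, 1.26]
```

After standardization the variable has mean 0, so no variable dominates a distance calculation merely because of its units.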
• Weakness of the matching coefficient: if two observations both have a 0 entry for a categorical variable, this is counted as a sign of similarity between the two observations.
• However, matching 0 entries do not necessarily imply similarity.
• Jaccard’s coefficient: A similarity measure that does not count
matching zero entries.
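The contrast between the two measures can be sketched directly. The two customer profiles below are hypothetical 0–1 vectors over (Female, Married, Car Loan, Mortgage); the matching coefficient rewards their shared 0s, while Jaccard's coefficient ignores them.

```python
def matching_coefficient(a, b):
    """Fraction of variables with equal values (counts 0-0 matches)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def jaccard(a, b):
    """For 0-1 variables: matching 1s divided by the number of
    positions where either observation has a 1 (0-0 matches ignored)."""
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return both / either if either else 0.0

# Hypothetical customers: (Female, Married, Car Loan, Mortgage)
u = (1, 0, 0, 0)
v = (0, 0, 0, 1)
print(matching_coefficient(u, v), jaccard(u, v))  # 0.5 0.0
```

The pair shares no 1 entries, yet the matching coefficient still scores them 0.5 on the strength of two 0-0 matches; Jaccard's coefficient scores them 0.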
• Hierarchical clustering: Bottom-up approach
• Starts with each observation in its own cluster and then iteratively
combines the two clusters that are the most similar into a single
cluster.
• Methods to obtain a cluster similarity measure:
• Single linkage: the similarity between two clusters is defined by the similarity of the pair of observations (one from each cluster) that are the most similar.
• Complete linkage: the similarity between two clusters is defined by the similarity of the pair of observations (one from each cluster) that are the most different.
• Average linkage: the similarity between two clusters is the average similarity computed over all pairs of observations between the two clusters.
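The three linkage rules differ only in how they summarize the pairwise distances between two clusters, which a short sketch makes concrete (the clusters below are hypothetical 2-D points, with Euclidean distance as the dissimilarity measure).

```python
def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def cluster_distance(c1, c2, linkage):
    """Distance between two clusters (lists of points) under
    single, complete, or average linkage."""
    dists = [euclidean(p, q) for p in c1 for q in c2]
    if linkage == "single":    # closest (most similar) pair
        return min(dists)
    if linkage == "complete":  # farthest (most different) pair
        return max(dists)
    return sum(dists) / len(dists)  # average over all pairs

# Two hypothetical clusters of 2-D observations
c1 = [(0.0, 0.0), (1.0, 0.0)]
c2 = [(4.0, 0.0), (6.0, 0.0)]
print(cluster_distance(c1, c2, "single"),    # 3.0
      cluster_distance(c1, c2, "complete"),  # 6.0
      cluster_distance(c1, c2, "average"))   # 4.5
```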
Figure 6.2 - Measuring Similarity between Clusters
• Using XLMiner for hierarchical clustering
• We use XLMiner to construct hierarchical clusters for the KTC data.
• We base the clusters on a collection of 0–1 categorical variables
(Female, Married, Car Loan, and Mortgage).
• We use Jaccard’s coefficient to measure similarity between
observations and the average linkage clustering method to
measure similarity between clusters.
Figure 6.4 - Dendrogram for KTC
• A dendrogram is a chart that depicts the set of nested clusters resulting
at each step of aggregation.
• k-Means clustering: partitions observations into k clusters so that each observation is assigned to the cluster with the nearest centroid (cluster mean).
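A bare-bones sketch of the standard k-means iteration (Lloyd's algorithm) on hypothetical 2-D points: assign each point to its nearest centroid, recompute centroids as cluster means, repeat. Seeding with the first k points keeps the sketch deterministic; real implementations use better initializations.

```python
def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm on 2-D points; centroids seeded with the
    first k points for determinism (a sketch, not production code)."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Two well-separated hypothetical groups of observations
pts = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8),
       (8.0, 8.0), (8.5, 7.5), (9.0, 8.2)]
centroids, clusters = kmeans(pts, k=2)
print(sorted(len(cl) for cl in clusters))  # [3, 3]
```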
Table 6.1 - Average Distances within Clusters
• Confidence of the rule “if A, then B”: the fraction of transactions containing A that also contain B; it helps identify reliable association rules.
• Lift ratio: confidence divided by the overall fraction of transactions containing B.
• For the data in Table 6.3, the rule “if {bread, jelly}, then {peanut butter}” has confidence = 2/4 = 0.5 and a lift ratio = 0.5/(4/10) = 1.25.
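The arithmetic above can be sketched in code. Table 6.3 itself is not reproduced here, so the transactions below are hypothetical but constructed to match the quoted counts: 10 transactions, 4 containing {bread, jelly}, 2 of those also containing peanut butter, and 4 containing peanut butter overall.

```python
# Hypothetical market-basket transactions matching the counts in the text
transactions = [
    {"bread", "jelly", "peanut butter"},
    {"bread", "jelly", "peanut butter"},
    {"bread", "jelly"},
    {"bread", "jelly", "milk"},
    {"peanut butter", "milk"},
    {"peanut butter", "eggs"},
    {"bread"},
    {"milk"},
    {"jelly", "eggs"},
    {"eggs"},
]

def confidence(antecedent, consequent, transactions):
    """P(consequent present | antecedent present)."""
    has_a = [t for t in transactions if antecedent <= t]
    return sum(consequent <= t for t in has_a) / len(has_a)

def lift(antecedent, consequent, transactions):
    """Confidence divided by the overall share of transactions
    containing the consequent."""
    base = sum(consequent <= t for t in transactions) / len(transactions)
    return confidence(antecedent, consequent, transactions) / base

a, c = {"bread", "jelly"}, {"peanut butter"}
print(confidence(a, c, transactions), round(lift(a, c, transactions), 2))
```

A lift ratio above 1 means the antecedent raises the chance of seeing the consequent relative to chance alone.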
Supervised Learning
• The goal of a supervised learning technique is to develop a model
that predicts a value for a continuous outcome or classifies a
categorical outcome.
• Partitioning Data
• We can use the abundance of data to guard against the potential
for overfitting by decomposing the data set into three partitions
• the training set
• the validation set, and
• the test set
• Training set: Consists of the data used to build the candidate
models.
• Validation set: The data set to which a promising subset of candidate models is applied in order to identify which model is most accurate at predicting data that were not used to build the model.
• Test set: The data set to which the final model should be applied
to estimate this model’s effectiveness when applied to data that
have not been used to build or select the model.
• Classification Accuracy
Table 6.5 - Classification Probabilities
Table 6.6 - Classification Confusion Matrices and Error Rates for Various Cutoff Values
Figure 6.12 - Classification Error Rates versus Cutoff Value
• Cumulative lift chart: compares the number of actual Class 1 observations identified when observations are considered in decreasing order of their estimated probability of being in Class 1 with the number that would be identified by random selection.
• Decile-wise lift chart: Another way to view how much better a
classifier is at identifying Class 1 observations than random
classification.
• Observations are ordered in decreasing probability of Class 1
membership and then considered in 10 equal-sized groups.
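The cumulative lift calculation can be sketched directly; the scored observations below are hypothetical (estimated probability, actual class) pairs.

```python
# Hypothetical scored observations: (estimated P(Class 1), actual class)
scored = [(0.95, 1), (0.90, 1), (0.80, 0), (0.75, 1), (0.60, 0),
          (0.55, 1), (0.40, 0), (0.30, 0), (0.20, 1), (0.10, 0)]

# Sort by decreasing estimated probability of Class 1 membership
scored.sort(key=lambda t: t[0], reverse=True)

total_ones = sum(actual for _, actual in scored)
cumulative = []
running = 0
for i, (_, actual) in enumerate(scored, start=1):
    running += actual
    # Random selection would find total_ones * i / n Class 1s on average
    baseline = total_ones * i / len(scored)
    cumulative.append((i, running, round(baseline, 1)))

print(cumulative[:3])  # [(1, 1, 0.5), (2, 2, 1.0), (3, 2, 1.5)]
```

The gap between the running count and the random baseline is what the cumulative lift chart plots; the two curves meet once all observations have been considered.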
Figure 6.13 - Cumulative and Decile-wise Lift Charts
• Prediction Accuracy
• Average error = (Σ ei) / n
• Root mean squared error: RMSE = √((Σ ei²) / n)
• ei = error in estimating the outcome for observation i
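Both measures can be sketched in a few lines; the errors below are hypothetical. Note that average error is signed, so positive and negative errors can cancel, while RMSE cannot.

```python
from math import sqrt

def average_error(errors):
    """Mean of the estimation errors; signed, so it reveals bias."""
    return sum(errors) / len(errors)

def rmse(errors):
    """Root mean squared error; penalizes large errors more heavily."""
    return sqrt(sum(e ** 2 for e in errors) / len(errors))

# Hypothetical errors e_i = actual - predicted for five observations
errors = [2.0, -1.0, 0.0, 3.0, -4.0]
print(average_error(errors), round(rmse(errors), 3))  # 0.0 2.449
```

Here the average error is exactly 0 even though every prediction but one is wrong, which is why RMSE is the more informative accuracy measure.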
• k-nearest neighbors (k-NN): This method can be used either to classify a categorical outcome or to predict a continuous outcome.
• k-NN uses the k most similar observations from the training set,
where similarity is typically measured with Euclidean distance.
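A minimal k-NN classifier sketch using Euclidean distance and a majority vote; the training observations, variable names, and class labels are all hypothetical.

```python
from collections import Counter

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def knn_classify(train, new_point, k=3):
    """Classify new_point by majority vote among the k nearest
    training observations (Euclidean distance)."""
    nearest = sorted(train, key=lambda row: euclidean(row[0], new_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set: ((income, age), loan outcome)
train = [((30, 25), "default"), ((32, 27), "default"), ((35, 30), "default"),
         ((80, 45), "no default"), ((85, 50), "no default"), ((90, 48), "no default")]

print(knn_classify(train, (33, 28), k=3))  # default
print(knn_classify(train, (82, 47), k=3))  # no default
```

In practice variables are standardized first (otherwise the variable with the largest scale dominates the distance), and the value of k is chosen by comparing error rates on the validation set.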
• Classification trees: successively partition observations into increasingly homogeneous groups based on the values of the input variables.
Figure 6.20 - Construction Sequence of Branches in a Classification Tree
Figure 6.21 - Geometric Illustration of Classification Tree Partitions
Figure 6.22 - Classification Tree with One Pruned Branch
Table 6.7 - Classification Error Rates on a Sequence of Pruned Trees
Figure 6.27 - Best Pruned Classification Tree for Hawaiian Ham
Figure 6.28 - Best Pruned Tree Classification of Test Data for Hawaiian Ham
Figure 6.29 - Best Pruned Tree’s Classification Confusion Matrix on Test Data
• Regression trees: partition observations in the same manner, but predict a continuous outcome; each terminal node predicts the average outcome of the training observations that fall in it.
Figure 6.30 - XLMiner Steps for Regression Trees
Figure 6.31 - Full Regression Tree for Optiva Credit Union
Figure 6.32 - Regression Tree Pruning Log
Figure 6.33 - Best Pruned Regression Tree for Optiva Credit Union
Figure 6.34 - Best Pruned Tree Prediction of Test Data for Optiva Credit Union
Figure 6.35 - Prediction Error of Regression Trees
Supervised Learning
71
Supervised Learning
72
Supervised Learning
73
Supervised Learning
• Logistic regression estimates the probability p that an observation belongs to Class 1 by modeling the log odds as a linear function of the input variables: ln(p / (1 − p)) = β0 + β1x1 + ∙ ∙ ∙ + βqxq
• Logistic Function: p = 1 / (1 + e^−(β0 + β1x1 + ∙ ∙ ∙ + βqxq))
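The logistic function can be sketched directly. The coefficients below are hypothetical, not estimated from any data set; the point is only that the function maps any linear predictor to a probability between 0 and 1, and that taking log odds recovers the linear predictor.

```python
from math import exp, log

def logistic_probability(x, betas):
    """P(Class 1) from the logistic function, given coefficients
    (beta0, beta1, ..., betaq) and inputs (x1, ..., xq).
    Coefficients here are hypothetical, not estimated from data."""
    z = betas[0] + sum(b * v for b, v in zip(betas[1:], x))
    return 1 / (1 + exp(-z))

betas = (-2.0, 0.05, 1.2)  # hypothetical intercept and two slopes
p = logistic_probability((40, 1), betas)
print(round(p, 3))  # 0.769

# The log odds recover the linear predictor: -2 + 0.05*40 + 1.2*1 = 1.2
z = log(p / (1 - p))
print(round(z, 3))
```

A cutoff value applied to p (for example, classify as Class 1 when p ≥ 0.5) then turns the estimated probability into a classification.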
Figure 6.39 - XLMiner Steps for Logistic Regression
Figure 6.40 - XLMiner Logistic Regression Output
Figure 6.41 - XLMiner Steps for Refitting the Logistic Regression Model and Using It to Predict New Data
Figure 6.42 - Classification Error for the Logistic Regression Model
Figure 6.43 - Classification of 30 New Customer Observations
QUESTIONS/CLARIFICATIONS