
WELCOME

I'm GLAD you’re here!

CBA OM-BA
Business Analytics
Chapter 6

DATA MINING
Introduction
• The increase in the use of data-mining techniques in business has been caused largely by three events:

• The explosion in the amount of data being produced and electronically tracked

• The ability to electronically warehouse these data

• The affordability of computer power to analyze the data

3
Introduction
• Observation: set of recorded values of variables associated with a
single entity.

• Data-mining approaches can be separated into two categories.

• Supervised learning – For prediction and classification.

• Unsupervised learning – To detect patterns and relationships in the data.

4
Introduction
• Whether we are using a supervised or unsupervised learning
approach, the data-mining process comprises the following steps:

• Data Sampling

• Data Preparation

• Model Construction

• Model Assessment

5
Data Sampling

6
Data Sampling
• When dealing with large volumes of data, it is best practice to
extract a representative sample for analysis.
• A sample is representative if the analyst can draw the same conclusions from it as from the entire population of data.
• The sample of data must be large enough to contain significant
information, yet small enough to be manipulated quickly.
• Data-mining algorithms typically are more effective given more
data.
7
Data Sampling
• When obtaining a representative sample, it is generally best to
include as many variables as possible in the sample.

• After exploring the data with descriptive statistics and visualization, the analyst can eliminate variables that are not of interest.

8
Data Preparation

9
Data Preparation
• The data in a data set are often said to be “dirty” and “raw”
before they have been preprocessed.

• We need to put them into a form that is best suited for a data-mining algorithm.

• Data preparation makes heavy use of descriptive statistics and data visualization methods.

10
Data Preparation
• Treatment of Missing Data

• The primary options for addressing missing data

• To discard observations with any missing values

• To discard any variable with missing values

• To fill in missing entries with estimated values

• To apply a data-mining algorithm (such as classification and regression trees) that can handle missing values
11
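As a rough illustration of these options (the chapter itself works in Excel/XLMiner), the Python sketch below applies the discard and fill-in strategies to a small, invented data frame; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented customer data with missing entries (NaN).
df = pd.DataFrame({
    "Age":    [61, 35, np.nan, 48],
    "Income": [57881, np.nan, 42000, 61250],
    "Loan":   [0, 1, 1, 0],
})

# Option 1: discard observations (rows) with any missing values.
drop_rows = df.dropna()

# Option 2: discard any variable (column) with missing values.
drop_cols = df.dropna(axis=1)

# Option 3: fill in missing entries with estimated values (here, column means).
filled = df.fillna(df.mean(numeric_only=True))

print(drop_rows, drop_cols, filled, sep="\n\n")
```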
Data Preparation
• Identification of Outliers and Erroneous Data

• Examining the variables in the data set by means of summary statistics, histograms, PivotTables, scatter plots, and other tools can uncover data quality issues and outliers.

• Closer examination of outliers may reveal an error or a need for further investigation to determine whether the observation is relevant to the current analysis.

12
Data Preparation

Identification of Outliers and Erroneous Data

• A conservative approach is to create two data sets, one with and one without outliers, and then construct a model on both data sets.

• If a model's implications depend on the inclusion or exclusion of outliers, then one should spend additional time to track down the cause of the outliers.

13
Data Preparation
• Variable Representation
• In many data-mining applications, the number of variables for which data are recorded may be prohibitively large to analyze.
• Dimension reduction: Process of removing variables from the analysis without losing any crucial information.
• One way is to examine pairwise correlations to detect variables or groups of variables that may supply similar information.

• Such variables can be aggregated or removed to allow more parsimonious model development.
14
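A minimal sketch of the pairwise-correlation idea, using pandas on invented data; the 0.8 threshold is an arbitrary illustrative choice, not a rule from the chapter.

```python
import numpy as np
import pandas as pd

# Invented data set in which Income and Spending carry similar information.
rng = np.random.default_rng(0)
income = rng.normal(50000, 10000, 200)
df = pd.DataFrame({
    "Income":   income,
    "Spending": 0.4 * income + rng.normal(0, 1000, 200),
    "Age":      rng.integers(20, 70, 200),
})

# Pairwise correlations; pairs with |r| near 1 are candidates for
# aggregation or removal.
corr = df.corr()
print(corr.round(2))

# Flag strongly correlated pairs (0.8 is an illustrative threshold).
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
high = corr.abs().where(upper).stack()
print(high[high > 0.8])
```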
Data Preparation
• Variable Representation
• A critical part of data mining is determining how to represent the
measurements of the variables and which variables to consider.

• The treatment of categorical variables is particularly important.

• Often data sets contain variables that, considered separately, are not particularly insightful but that, when combined as ratios, may represent important relationships.

15
Unsupervised Learning

16
Unsupervised Learning
• Unsupervised learning application

• The goal is to use the variable values to identify relationships between observations.

• Qualitative assessments, such as how well the results match expert judgment, are used to assess unsupervised learning methods.

17
Unsupervised Learning
• Cluster Analysis

• The goal of clustering is to segment observations into similar groups based on the observed variables.

• Clustering can be employed during the data preparation step to identify variables or observations that can be aggregated or removed from consideration.

• Cluster analysis can also be used to identify outliers.

18
Unsupervised Learning
• Clustering methods:

• Hierarchical clustering

• k-means clustering

• Both methods depend on how the similarity between two observations is measured.

• Hence, we have to define a measure of similarity between observations.

19
Unsupervised Learning

Measuring similarity between observations

• Euclidean distance: Most common method to measure dissimilarity between observations when the observations include continuous variables.
• Let observations u = (u1, u2, . . . , uq) and v = (v1, v2, . . . , vq) each comprise measurements of q variables.
• The Euclidean distance between observations u and v is

  d(u, v) = sqrt[ (u1 − v1)² + (u2 − v2)² + . . . + (uq − vq)² ]

20
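A minimal sketch of this formula in Python; the two observation vectors are invented KTC-style records, not values from the chapter's data set.

```python
import numpy as np

# Two invented KTC-style observations:
# (Age, Female, Income, Married, Children, Car Loan, Mortgage)
u = np.array([61, 0, 57881, 1, 2, 0, 0])
v = np.array([42, 1, 52000, 1, 1, 0, 1])

# Euclidean distance: square root of the sum of squared differences
# over the q = 7 variables.
d_uv = np.sqrt(np.sum((u - v) ** 2))
print(round(d_uv, 2))
```

Because Income is measured on a much larger scale than the other variables, it dominates this raw distance, which motivates the standardization discussed on the following slides.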
Unsupervised Learning

Measuring similarity between observations

Illustration:
• KTC is a financial advising company that provides personalized financial advice to its clients.
• KTC would like to segment its customers into several groups (or clusters) so that the customers within a group are similar, and customers in different groups are dissimilar, with respect to key characteristics.

21
Unsupervised Learning

Measuring similarity between observations

Illustration (contd.):
• For each customer, KTC has an observation corresponding to a vector of measurements on seven customer variables, that is, (Age, Female, Income, Married, Children, Car Loan, Mortgage).
• Example: The observation u = (61, 0, 57881, 1, 2, 0, 0)
corresponds to a 61-year-old male with an annual income of
$57,881, married with two children, but no car loan and no
mortgage.
22
Figure 6.1 - Euclidean Distance

• Euclidean distance becomes smaller as a pair of observations become more similar with respect to their variable values.
23
Unsupervised Learning
• Euclidean distance is highly influenced by the scale on which
variables are measured.
• Therefore, it is common to standardize the units of each variable j of each observation u.
• For example, uj, the value of variable j in observation u, is replaced with its z-score, zj.

• The conversion to z-scores also makes it easier to identify outlier measurements, which can distort the Euclidean distance between observations.

24
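A sketch of the standardization step on invented data: each column is converted to z-scores before distances are computed.

```python
import numpy as np

# Invented matrix of observations (rows) by variables (columns: Age, Income, Children).
X = np.array([
    [61, 57881, 2],
    [35, 42000, 0],
    [48, 61250, 1],
    [29, 30500, 3],
], dtype=float)

# Replace each value with its z-score: subtract the column mean and divide by
# the column standard deviation, putting every variable on a comparable scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Distances between standardized observations are no longer dominated by Income.
d = np.sqrt(np.sum((Z[0] - Z[1]) ** 2))
print(Z.round(2))
print(round(d, 2))
```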
Unsupervised Learning
• Matching coefficient: Simplest overlap measure of similarity between two observations.
• Achieved by counting the number of variables with matching values.
• Used when clustering observations solely on the basis of categorical variables encoded as 0–1 (or dummy variables).
• Matching coefficient:

  Matching coefficient = (number of variables with matching values for observations u and v) / (total number of variables)
25
Unsupervised Learning
• Weakness of matching coefficient – If two observations both have a 0 entry for a categorical variable, this is counted as a sign of similarity between the two observations.
• However, matching 0 entries do not necessarily imply similarity.
• Jaccard's coefficient: A similarity measure that does not count matching zero entries.

  Jaccard's coefficient = (number of variables with matching nonzero values for observations u and v) / [(total number of variables) − (number of variables with matching zero values for observations u and v)]
26
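The two coefficients can be computed directly; the sketch below uses two invented observations on the four 0–1 variables, so the resulting values (0.75 and 0.5) are for illustration only.

```python
# Two invented observations on the 0-1 variables (Female, Married, Car Loan, Mortgage).
u = [1, 0, 1, 0]
v = [1, 0, 0, 0]

n = len(u)
matches        = sum(a == b for a, b in zip(u, v))        # matching entries (0s or 1s)
matching_ones  = sum(a == b == 1 for a, b in zip(u, v))   # matching 1 entries
matching_zeros = sum(a == b == 0 for a, b in zip(u, v))   # matching 0 entries

matching_coefficient = matches / n                           # 3/4 = 0.75
jaccard_coefficient = matching_ones / (n - matching_zeros)   # 1/(4-2) = 0.5

print(matching_coefficient, jaccard_coefficient)
```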
Unsupervised Learning
• Hierarchical clustering: Bottom-up approach

• Determines the similarity of two clusters by considering the similarity between the observations composing either cluster.

• Starts with each observation in its own cluster and then iteratively
combines the two clusters that are the most similar into a single
cluster.

27
Unsupervised Learning
• Methods to obtain a cluster similarity measure:
• Single linkage: The similarity between two clusters is defined by the similarity of the pair of observations (one from each cluster) that are the most similar.

• Complete linkage: This clustering method defines the similarity between two clusters as the similarity of the pair of observations (one from each cluster) that are the most different.

• Average linkage: Defines the similarity between two clusters to be the average similarity computed over all pairs of observations between the two clusters.
28
Unsupervised Learning
• Methods to obtain a cluster similarity measure (contd.):

• Average group linkage: Defines the similarity between two clusters as the similarity between the centroids (group averages) of the two clusters.

• Ward's method: Computes dissimilarity as the sum of the squared differences in similarity between each individual observation in the union of the two clusters and the centroid of the resulting merged cluster.

29
Figure 6.2 - Measuring Similarity between Clusters

30
Unsupervised Learning
• Using XLMiner for hierarchical clustering
• We use XLMiner to construct hierarchical clusters for the KTC data.
• We base the clusters on a collection of 0–1 categorical variables
(Female, Married, Car Loan, and Mortgage).
• We use Jaccard’s coefficient to measure similarity between
observations and the average linkage clustering method to
measure similarity between clusters.

31
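The chapter performs this clustering in XLMiner; as a stand-in for illustration, the SciPy sketch below applies Jaccard dissimilarity with average linkage to invented 0–1 data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Invented 0-1 data for (Female, Married, Car Loan, Mortgage).
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 0],
], dtype=bool)

# Pairwise Jaccard dissimilarities (1 minus Jaccard's coefficient), then
# bottom-up agglomerative clustering with average linkage.
d = pdist(X, metric="jaccard")
Z = linkage(d, method="average")

# Cut the tree into two clusters; scipy.cluster.hierarchy.dendrogram(Z)
# would draw the kind of chart shown in Figure 6.4.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```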
Figure 6.4 - Dendrogram for KTC
• A dendrogram is a chart that depicts the set of nested clusters resulting
at each step of aggregation.

32
Unsupervised Learning

k-Means clustering

• Given a value of k, the k-means algorithm randomly partitions the observations into k clusters.

• After all observations have been assigned to a cluster, the resulting cluster centroids are calculated.

• Using the updated cluster centroids, all observations are reassigned to the cluster with the closest centroid.

• This process repeats until there is no change in the cluster assignments.
33
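A minimal scikit-learn sketch of this procedure (the chapter uses XLMiner); the (Age, Income) values are invented, and standardizing them first keeps Income from dominating the Euclidean distances.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented (Age, Income) observations, converted to z-scores first so that
# Income does not dominate the Euclidean distances.
X = np.array([[61, 57881], [35, 42000], [48, 61250], [29, 30500],
              [55, 72000], [41, 39000], [33, 28000], [62, 66000]], dtype=float)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# k = 3: start from an initial partition, then alternate between computing
# cluster centroids and reassigning observations to the closest centroid.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Z)

print(kmeans.labels_)           # cluster assignment of each observation
print(kmeans.cluster_centers_)  # centroids in standardized units
```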
Figure 6.5 - Clustering observations by Age and
Income using k-means clustering with k = 3

34
Unsupervised Learning
Table 6.1 - Average Distances within Clusters

Table 6.2 - Distances between Cluster Centroids

35
Unsupervised Learning

Hierarchical clustering versus k-means clustering

Hierarchical clustering:
• Suitable when we have a small data set (e.g., fewer than 500 observations) and want to easily examine solutions with increasing numbers of clusters.
• Convenient if you want to observe how clusters are nested.

k-means clustering:
• Suitable when you know how many clusters you want and you have a larger data set (e.g., more than 500 observations).
• Partitions the observations, which is appropriate if trying to summarize the data with k "average" observations that describe the data with the minimum amount of error.

36
Unsupervised Learning

• Association rules: if-then statements that convey the likelihood of certain items being purchased together.
• Antecedent: The collection of items (or item set) corresponding to the if portion of the rule.
• Consequent: The item set corresponding to the then portion of the rule.
• Support count of an item set: Number of transactions in the data that include that item set.
37
Table 6.3 - Shopping Cart Transactions

38
Unsupervised Learning
• Confidence: Helps identify reliable association rules.

  Confidence = (support count of {antecedent AND consequent}) / (support count of antecedent)

• Lift ratio: Measure to evaluate the efficiency of a rule.

  Lift ratio = confidence / (support count of consequent / total number of transactions)

• For the data in Table 6.3, the rule "if {bread, jelly}, then {peanut butter}" has confidence = 2/4 = 0.5 and a lift ratio = 0.5/(4/10) = 1.25.

39
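A sketch of these calculations in Python; the ten transactions are invented but arranged so the rule's confidence and lift match the numbers quoted above. They are not the actual contents of Table 6.3.

```python
# Ten invented shopping-cart transactions in the spirit of Table 6.3.
transactions = [
    {"bread", "jelly", "peanut butter"},
    {"bread", "jelly", "peanut butter", "milk"},
    {"bread", "jelly"},
    {"bread", "jelly", "milk"},
    {"peanut butter", "milk"},
    {"bread", "milk"},
    {"peanut butter"},
    {"jelly", "milk"},
    {"bread"},
    {"milk"},
]

antecedent = {"bread", "jelly"}
consequent = {"peanut butter"}

n = len(transactions)
support_antecedent = sum(antecedent <= t for t in transactions)           # 4
support_both = sum((antecedent | consequent) <= t for t in transactions)  # 2
support_consequent = sum(consequent <= t for t in transactions)           # 4

confidence = support_both / support_antecedent        # 0.5
lift_ratio = confidence / (support_consequent / n)    # 1.25
print(confidence, lift_ratio)
```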
Unsupervised Learning
• Evaluating association rules

• An association rule is ultimately judged on how actionable it is and how well it explains the relationship between item sets.

• For example, Wal-Mart mined its transactional data to uncover strong evidence of the association rule, "If a customer purchases a Barbie doll, then a customer also purchases a candy bar."

• An association rule is useful if it is well supported and explains an important previously unknown relationship.
40
Supervised Learning

41
Supervised Learning
• The goal of a supervised learning technique is to develop a model
that predicts a value for a continuous outcome or classifies a
categorical outcome.
• Partitioning Data
• We can use the abundance of data to guard against the potential
for overfitting by decomposing the data set into three partitions
• the training set
• the validation set, and
• the test set
42
Supervised Learning

Partitioning Data
• Training set: Consists of the data used to build the candidate models.
• Validation set: The data set to which a promising subset of models is applied in order to identify which model is the most accurate at predicting when applied to data that were not used to build the model.
• Test set: The data set to which the final model should be applied to estimate this model's effectiveness when applied to data that have not been used to build or select the model.
43
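A sketch of one common way to create the three partitions; the 60/20/20 split and the random data are illustrative choices, not prescriptions from the chapter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented feature matrix X and 0-1 outcome y.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Split off the training set first, then divide the remainder equally into
# validation and test sets (a 60% / 20% / 20% split).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_valid), len(X_test))  # 600 200 200
```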
Supervised Learning
• Classification Accuracy

• By counting the classification errors on a sufficiently large validation set and/or test set that is representative of the population, we will generate an accurate measure of the model's classification performance.

• Classification confusion matrix: Displays a model's correct and incorrect classifications.

44
Supervised Learning

Table 6.4 - Classification Confusion Matrix

• Overall error rate: Percentage of misclassified observations.

• Measures of classification accuracy are based on the classification confusion matrix.

45
Supervised Learning

• We define error rates with respect to the individual classes to account for the asymmetric costs of misclassification:

  Class 1 error rate = n10 / (n11 + n10);  Class 0 error rate = n01 / (n01 + n00)

  where nij denotes the number of observations of actual Class i that the model classifies as Class j.

• Cutoff value: Probability value used to understand the tradeoff between the Class 1 error rate and the Class 0 error rate.
46
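A sketch of these class-specific error rates on invented probabilities, using a cutoff of 0.5; the nij counts correspond to the confusion-matrix cells described above.

```python
import numpy as np

# Invented actual classes and estimated probabilities of Class 1 membership.
actual = np.array([1, 1, 1, 0, 0, 1, 0, 0, 0, 1])
prob_class1 = np.array([0.9, 0.7, 0.4, 0.6, 0.2, 0.8, 0.3, 0.1, 0.5, 0.6])

cutoff = 0.5  # observations with probability >= cutoff are classified as Class 1
predicted = (prob_class1 >= cutoff).astype(int)

# Confusion-matrix cells: n_ij = actual Class i classified as Class j.
n11 = np.sum((actual == 1) & (predicted == 1))
n10 = np.sum((actual == 1) & (predicted == 0))
n01 = np.sum((actual == 0) & (predicted == 1))
n00 = np.sum((actual == 0) & (predicted == 0))

overall_error_rate = (n10 + n01) / len(actual)
class1_error_rate = n10 / (n11 + n10)
class0_error_rate = n01 / (n01 + n00)
print(overall_error_rate, class1_error_rate, class0_error_rate)
```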
Table 6.5 - Classification Probabilities

47
Table 6.6 - Classification Confusion Matrices and
Error Rates for Various Cutoff Values

48
Figure 6.12 - Classification Error Rates versus Cutoff
Value

49
Supervised Learning
• Cumulative lift chart: Compares the number of actual Class 1 observations identified when observations are considered in decreasing order of their estimated probability of being in Class 1 with the number of actual Class 1 observations identified by random selection.
• Decile-wise lift chart: Another way to view how much better a classifier is at identifying Class 1 observations than random classification.
• Observations are ordered in decreasing probability of Class 1 membership and then considered in 10 equal-sized groups.

50
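A sketch of the ordering behind a cumulative lift chart, on invented data: observations are sorted by estimated Class 1 probability and the cumulative count of actual Class 1 cases is compared with a random-selection baseline.

```python
import numpy as np

# Invented actual classes and estimated Class 1 probabilities.
actual = np.array([1, 0, 1, 0, 1, 0, 0, 1, 0, 0])
prob_class1 = np.array([0.95, 0.30, 0.80, 0.45, 0.70, 0.20, 0.15, 0.60, 0.40, 0.10])

# Order observations by decreasing estimated probability of Class 1 membership.
order = np.argsort(-prob_class1)
cumulative_model = np.cumsum(actual[order])

# Random-selection baseline: Class 1 observations accumulate in proportion
# to the overall Class 1 rate.
cumulative_random = np.arange(1, len(actual) + 1) * actual.mean()

print(cumulative_model)   # here the top 4 cases already contain all 4 Class 1 observations
print(cumulative_random)
```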
Figure 6.13 - Cumulative and Decile-wise Lift Charts

51
Supervised Learning
• Prediction Accuracy

• The measures of accuracy are some function of ei, the error in estimating the outcome for observation i.

  Average error = (Σ ei) / n

  Root mean squared error (RMSE) = sqrt( (Σ ei²) / n )
52
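A sketch of both measures on invented actual and predicted values; here the error ei is taken as predicted minus actual, a sign convention chosen for illustration.

```python
import numpy as np

# Invented actual and predicted values for a continuous outcome.
actual    = np.array([12.0, 30.0, 18.5, 22.0, 40.0])
predicted = np.array([14.0, 27.0, 20.0, 21.0, 43.0])

errors = predicted - actual           # e_i for each observation i
average_error = errors.mean()         # signed; reveals systematic over- or under-prediction
rmse = np.sqrt((errors ** 2).mean())  # penalizes large errors more heavily

print(round(average_error, 3), round(rmse, 3))
```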
Supervised Learning
• k-nearest neighbors (k-NN): This method can be used either to
classify an outcome category or predict a continuous outcome.

• k-NN uses the k most similar observations from the training set, where similarity is typically measured with Euclidean distance.

• Using XLMiner to classify with k-nearest neighbors: XLMiner provides the capability to apply the k-nearest neighbors method for classifying a 0–1 categorical outcome.

• Using XLMiner to predict with k-nearest neighbors: XLMiner provides the capability to apply the k-nearest neighbors method for prediction of a continuous outcome.
53
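The chapter applies k-NN through XLMiner; the scikit-learn sketch below is an illustrative equivalent on invented (Age, Income) data, with standardization so Euclidean distance is meaningful.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Invented training data: (Age, Income) and a 0-1 outcome.
X_train = np.array([[61, 57881], [35, 42000], [48, 61250], [29, 30500],
                    [55, 72000], [41, 39000]], dtype=float)
y_train = np.array([0, 1, 0, 1, 0, 1])

# Standardize so Euclidean distance is not dominated by Income.
scaler = StandardScaler().fit(X_train)

# Classify a new observation by majority vote among its k = 3 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=3).fit(scaler.transform(X_train), y_train)
new_obs = np.array([[45, 50000]], dtype=float)
print(knn.predict(scaler.transform(new_obs)))
```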
Supervised Learning

• Classification and Regression Trees (CART)

• Partition a data set of observations into increasingly smaller and more homogeneous subsets.

• At each iteration of the CART method, a subset of observations is split into two new subsets based on the values of a single variable.

• The result is a series of questions that successively narrow down observations into smaller and smaller groups of decreasing impurity.
54
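A hedged scikit-learn sketch in the spirit of the HHI example; the eight observations, the class labels, and the depth limit are all invented for illustration and are not the chapter's actual data or pruning procedure.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented data in the spirit of the HHI example: two percentage variables
# and a 0-1 class label (assumed here to mean spam vs. not spam).
X = np.array([[0.00, 0.10], [0.05, 0.00], [0.30, 0.45], [0.25, 0.60],
              [0.02, 0.05], [0.40, 0.30], [0.01, 0.02], [0.35, 0.50]])
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])

# Each split asks a question about a single variable, partitioning the
# observations into two purer subsets; max_depth simply limits growth here
# and is not the pruning procedure described on the following slides.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["Percent_$", "Percent_!"]))
```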
Supervised Learning

• Classification and Regression Trees (CART)

• Classification trees – The impurity of a group of observations is based on the proportion of observations belonging to the same class.
• Regression trees – The impurity of a group of observations is based on the variance of the outcome values of the observations in the group.

55
Supervised Learning

Classifying a categorical outcome with a classification tree

• To explain how a classification tree categorizes observations, we use a small sample of data from HHI consisting of 46 observations and only two variables from HHI: the percentage of the $ character (denoted Percent_$) and the percentage of the ! character (denoted Percent_!).

56
Figure 6.20 - Construction Sequence of Branches in
a Classification Tree

57
Figure 6.21 - Geometric Illustration of Classification
Tree Partitions

58
Figure 6.22 - Classification Tree with one Pruned
Branch

59
Supervised Learning
Table 6.7 - Classification Error Rates on Sequence of Pruned Trees

Figure 6.23 - Best Pruned Classification Tree

60
Figure 6.27 - Best Pruned Classification Tree for
Hawaiian Ham

61
Figure 6.28 - Best Pruned Tree Classification of test
data for Hawaiian Ham

62
Figure 6.29 - Best Pruned Tree’s Classification
Confusion Matrix on Test Data

63
Supervised Learning

• Predicting a continuous outcome with a regression tree

• A regression tree bases the impurity of a partition on the variance of the outcome values for the observations in the group.

• After the final tree is constructed, the predicted outcome value of an observation is the mean outcome value of the partition into which the new observation falls.
64
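A minimal regression-tree sketch on invented data (the chapter's Optiva example is run in XLMiner); predictions are the mean outcome of the leaf a new observation falls into.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Invented data: predicting a continuous outcome (e.g., average balance)
# from (Age, Income).
X = np.array([[25, 30000], [32, 42000], [47, 61000], [51, 58000],
              [60, 72000], [38, 39000], [29, 28000], [55, 66000]], dtype=float)
y = np.array([900.0, 1500.0, 5200.0, 4800.0, 8800.0, 1700.0, 700.0, 7600.0])

# Splits are chosen to reduce the variance of the outcome within each partition;
# a new observation's prediction is the mean outcome of the leaf it falls into.
reg_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
print(reg_tree.predict(np.array([[45, 60000]], dtype=float)))
```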
Figure 6.30 - XLMiner steps for regression trees

65
Figure 6.31 - Full Regression Tree for Optiva Credit
Union

66
Figure 6.32 - Regression Tree Pruning Log

67
Figure 6.33 - Best Pruned Regression tree for Optiva
Credit Union

68
Figure 6.34 - Best Pruned Tree Prediction of Test
Data for Optiva Credit Union

69
Figure 6.35 - Prediction Error of Regression Trees

70
Supervised Learning

• Logistic regression: Attempts to classify a categorical outcome (y = 0 or 1) as a linear function of explanatory variables.

71
Supervised Learning

72
Supervised Learning

• Odds: a measure related to probability.

• If an estimate of the probability of an event is p̂, then the equivalent odds measure is p̂ / (1 − p̂).

• The odds metric ranges between zero and positive infinity.

• We eliminate the fit problem by using the logit, ln( p̂ / (1 − p̂) ), which can take any value from negative infinity to positive infinity.

• Estimating the logit with a linear function results in the estimated logistic regression equation.

73
Supervised Learning

• Estimated Logistic Regression Equation

  ln( p̂ / (1 − p̂) ) = b0 + b1x1 + b2x2 + · · · + bqxq

• Given a set of explanatory variables, a logistic regression algorithm determines the values of b0, b1, . . . , bq that best estimate the log odds.

• Logistic Function

  p̂ = 1 / (1 + e^−(b0 + b1x1 + b2x2 + · · · + bqxq))
74
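A sketch tying the pieces together on invented data: fit a logistic regression, then recover p̂, the odds, and the logit from the estimated coefficients (scikit-learn is used here as a stand-in for the XLMiner workflow shown in the following figures).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: one explanatory variable x and a 0-1 outcome y.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

model = LogisticRegression().fit(x, y)
b0, b1 = model.intercept_[0], model.coef_[0][0]

# The fitted model estimates the log odds: ln(p/(1-p)) = b0 + b1*x.
x_new = 4.5
logit = b0 + b1 * x_new
p_hat = 1 / (1 + np.exp(-logit))   # logistic function
odds = p_hat / (1 - p_hat)         # equivalent odds measure

print(round(p_hat, 3), round(odds, 3))
# predict_proba returns [P(y=0), P(y=1)]; the second entry matches p_hat.
print(model.predict_proba(np.array([[x_new]])))
```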
Figure 6.39 - XLMiner steps for logistic regression

75
Figure 6.40 - XLMiner logistic regression output

76
Figure 6.41 - XLMiner steps for refitting Logistic
Regression Model and using it to Predict new Data

77
Figure 6.42 - Classification Error for Logistic
Regression Model

78
Figure 6.43 - Classification of 30 new Customer
Observations

79
QUESTIONS/CLARIFICATIONS
