BA 2023 - 2024 T03 Descriptive Data Mining
Source: Camm, J. D., Cochran, J. J., Fry, M. J., & Ohlmann, J. W. (2021). Business Analytics (4th ed.). Boston, MA: Cengage.
Introduction
Note: After conversion to z-scores, unequal weighting of variables can also be considered
by multiplying the variables of each observation by a selected set of weights.
For instance, after standardizing the units on customer observations so that income and
age are expressed as their respective z-scores (instead of expressed in dollars and years),
we can multiply the income z-scores by 2 if we wish to treat income with twice the
importance of age. Therefore, standardizing removes bias due to the difference in
measurement units, and variable weighting allows the analyst to introduce any desired bias
based on the business context.
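The standardize-then-weight idea above can be sketched in a few lines of NumPy. The customer values below are hypothetical, invented purely to illustrate the computation; the weight of 2 on income mirrors the example in the text.

```python
import numpy as np

# Hypothetical customer observations: each row is (income in dollars, age in years).
customers = np.array([
    [55000.0, 34.0],
    [72000.0, 51.0],
    [31000.0, 27.0],
    [64000.0, 45.0],
])

# Standardize each variable to z-scores: (value - mean) / standard deviation.
# This removes the bias caused by the different measurement units.
z = (customers - customers.mean(axis=0)) / customers.std(axis=0)

# Weight income twice as heavily as age, introducing a deliberate,
# business-driven bias into any subsequent distance calculation.
weights = np.array([2.0, 1.0])
weighted = z * weights

print(weighted)
```

After weighting, a one-unit difference in income z-score contributes twice as much to a Euclidean distance as a one-unit difference in age z-score.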
Figure 3: Measuring Similarity Between Clusters
Cluster 1: {4, 5, 6, 11, 19, 28, 1, 7, 21, 22, 23, 30, 13, 17, 18, 15, 27}
Mix of males and females, 15 out of 17 married, no car loans, 5 out of 17 with
mortgages
k-Means Clustering:
• Given a value of k, the k-means algorithm randomly assigns each
observation to one of the k clusters.
• After all observations have been assigned to a cluster, the resulting
cluster centroids are calculated.
• Using the updated cluster centroids, all observations are reassigned to
the cluster with the closest centroid.
• The centroid-update and reassignment steps are repeated until no
observation changes clusters.
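The steps above can be sketched as a minimal k-means implementation in NumPy. This is an illustrative sketch, not the textbook's code: the function name, the iteration cap, and the handling of empty clusters (reseeding with a random observation) are my own assumptions.

```python
import numpy as np

def k_means(points, k, n_iter=100, seed=0):
    """Minimal k-means sketch: random initial assignment,
    centroid update, reassignment, repeat until stable."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign each observation to one of the k clusters.
    labels = rng.integers(0, k, size=len(points))
    for _ in range(n_iter):
        # Step 2: calculate the centroid of each cluster.
        # (Assumption: an empty cluster is reseeded with a random observation.)
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else points[rng.integers(len(points))]
            for j in range(k)
        ])
        # Step 3: reassign each observation to the closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop once no observation changes clusters.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids
```

On two well-separated groups of points, the algorithm typically stabilizes after only a few reassignment passes.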
Figure 5: Clustering Observations by Age and Income Using k-Means Clustering with k = 3
• For the data in Table 4, the rule “if {bread, jelly}, then {peanut butter}”
has confidence = 2/4 = 0.5 and a lift ratio = 0.5/(4/10) = 1.25.
• A lift ratio greater than 1 suggests that there is some usefulness to the rule
and that it is better at identifying cases when the consequent occurs than
no rule at all.
• A lift ratio of 1.25 means the rule is 25% better at identifying cases in
which the consequent occurs than guessing at random.
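The confidence and lift calculations above can be reproduced directly in Python. The ten transactions below are hypothetical, constructed only so the counts match the numbers in the text (4 transactions contain {bread, jelly}, 2 of those also contain peanut butter, and peanut butter appears in 4 of the 10 transactions overall); they are not the actual contents of Table 4.

```python
# Hypothetical transactions consistent with the counts quoted above.
transactions = [
    {"bread", "jelly", "peanut butter"},
    {"bread", "jelly", "peanut butter"},
    {"bread", "jelly"},
    {"bread", "jelly", "milk"},
    {"peanut butter", "milk"},
    {"peanut butter"},
    {"bread"},
    {"milk"},
    {"jelly"},
    {"eggs"},
]

antecedent = {"bread", "jelly"}
consequent = {"peanut butter"}

# Count transactions containing the antecedent, both itemsets, and the consequent.
n_antecedent = sum(antecedent <= t for t in transactions)
n_both = sum((antecedent | consequent) <= t for t in transactions)
n_consequent = sum(consequent <= t for t in transactions)

# Confidence: of the transactions with the antecedent, the fraction
# that also contain the consequent.
confidence = n_both / n_antecedent                      # 2/4 = 0.5
# Lift: confidence relative to the consequent's overall frequency.
lift = confidence / (n_consequent / len(transactions))  # 0.5 / 0.4 = 1.25

print(confidence, lift)
```

The `<=` operator on Python sets tests for the subset relation, so `antecedent <= t` is True exactly when a transaction contains every item in the antecedent.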