BA 2023 - 2024 T03 Descriptive Data Mining

The document discusses descriptive data mining techniques used to identify relationships between observations without an outcome variable. It focuses on cluster analysis, which segments observations into similar groups based on measured variables. Two common clustering methods are hierarchical clustering, which sequentially merges the most similar clusters, and k-means clustering, which assigns observations to clusters to maximize similarity within clusters. The document also discusses different ways to measure similarity between observations, including Euclidean distance, Manhattan distance, and similarity coefficients for categorical variables.


Business Analytics | 2023-2024

Descriptive data mining


João Lourenço
joao.lourenco@tecnico.ulisboa.pt

Source: Camm, J. D., Cochran, J. J., Fry, M. J., & Ohlmann, J. W. (2021). Business Analytics (4th ed.). Boston, MA: Cengage.
Introduction

The increase in the use of data-mining techniques in business has been caused largely by three events:
• The explosion in the amount of data being produced and electronically
tracked.
• The ability to electronically warehouse these data.
• The affordability of computer power to analyze the data.

Academic Year 2023/2024 Business Analytics 2


Introduction

• Observation: Set of recorded values of variables associated with a single entity.
• Unsupervised learning: A descriptive data-mining
technique used to identify relationships between
observations.
• Thought of as high-dimensional descriptive analytics.
• There is no outcome variable to predict; instead, qualitative
assessments are used to assess and compare the results.

Cluster Analysis
Measuring Similarity Between Observations
Hierarchical Clustering
k-Means Clustering
Hierarchical Clustering versus k-Means Clustering

Cluster Analysis
• Goal of clustering is to segment observations into similar
groups based on observed variables.
• Can be employed during the data-preparation step to
identify variables or observations that can be aggregated or
removed from consideration.
• Commonly used in marketing to divide customers into
different homogenous groups; known as market
segmentation.
• Used to identify outliers.

Cluster Analysis
• Clustering methods:
• Bottom-up hierarchical clustering starts with each observation
belonging to its own cluster and then sequentially merges the most
similar clusters to create a series of nested clusters.
• k-means clustering assigns each observation to one of k clusters in
a manner such that the observations assigned to the same cluster
are as similar as possible.
• Both methods depend on how two observations are
similar—hence, we have to measure similarity between
observations.
Measuring Similarity Between Observations:
When observations include numeric variables, Euclidean distance is
the most common method to measure dissimilarity between
observations.
Let observations \(u = (u_1, u_2, \ldots, u_q)\) and \(v = (v_1, v_2, \ldots, v_q)\) each comprise
measurements of \(q\) variables.
The Euclidean distance between observations \(u\) and \(v\) is:

\[ d_{uv} = \sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2 + \cdots + (u_q - v_q)^2} \]

Cluster Analysis
Measuring Similarity Between Observations:
Illustration:
• KTC is a financial advising company that provides personalized
financial advice to its clients.
• KTC would like to segment its customers into several groups (or
clusters) so that the customers within a group are similar to one another
and dissimilar to customers in other groups with respect to key characteristics.
• For each customer, KTC has an observation of seven variables: Age,
Female, Income, Married, Children, Car Loan, Mortgage.
Example: The observation u = (61, 0, 57881, 1, 2, 0, 0) corresponds
to a 61-year-old male with an annual income of $57,881, married
with two children, but no car loan and no mortgage.

Cluster Analysis

Figure 1: Euclidean Distance

Euclidean distance becomes smaller as a pair of observations become
more similar with respect to their variable values.

Cluster Analysis
Let observation u = (23, $20,375) correspond to a 23-year-old
customer with an annual income of $20,375 and observation v =
(36, $19,475) correspond to a 36-year-old with an annual income
of $19,475. As measured by Euclidean distance, the dissimilarity
between these two observations is

\[ d_{uv} = \sqrt{(23 - 36)^2 + (20{,}375 - 19{,}475)^2} = \sqrt{810{,}169} \approx 900 \]

We see that when using the raw variable values, the amount of dissimilarity
between observations is dominated by the Income variable because of the
difference in the magnitude of the measurements.
For the data we are using, the standardized (or normalized) values of observations u and v are
(−1.76, −0.56) and (−0.76, −0.62), respectively. The dissimilarity between these two observations
based on standardized values is

\[ d_{uv} = \sqrt{(-1.76 - (-0.76))^2 + (-0.56 - (-0.62))^2} = \sqrt{1.0036} \approx 1.002 \]

Observe that, on the standardized scale, observations u and v are
actually much more different in age than in income.
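
The raw-versus-standardized comparison can be checked numerically. The following is a minimal sketch; the function name `euclidean_distance` is an illustrative choice, and the standardized values are simply the ones quoted in the example rather than recomputed from the full KTC data set.

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two equal-length numeric observations."""
    if len(u) != len(v):
        raise ValueError("observations must have the same number of variables")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Raw (Age, Income) values from the example.
raw_u, raw_v = (23, 20375), (36, 19475)
# Standardized values as quoted in the example (not recomputed here).
std_u, std_v = (-1.76, -0.56), (-0.76, -0.62)

d_raw = euclidean_distance(raw_u, raw_v)   # dominated by the income difference
d_std = euclidean_distance(std_u, std_v)   # dominated by the age difference

print(round(d_raw, 2), round(d_std, 3))
```

On the raw scale almost all of the distance comes from the $900 income gap; on the standardized scale almost all of it comes from the age gap.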

Cluster Analysis
• Euclidean distance is highly influenced by the scale on which
variables are measured:
• Common to standardize the units of each variable j of each
observation u.
Example: \(u_j\), the value of variable \(j\) in observation \(u\), is replaced with
its z-score \(z_j\).
• The conversion to z-scores also makes it easier to identify outlier
measurements, which can distort the Euclidean distance between
observations.
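
As a rough illustration of the standardization step, the sketch below converts one variable to z-scores. The ages are hypothetical, and using the sample standard deviation (`statistics.stdev`) rather than the population version is an assumption, since the slides do not say which one is used.

```python
import statistics

def z_scores(values):
    """Replace each value with (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    # Assumption: sample standard deviation; statistics.pstdev would give
    # the population version instead.
    sd = statistics.stdev(values)
    return [(x - mean) / sd for x in values]

ages = [23, 36, 45, 52, 61]  # hypothetical values of one variable
age_z = z_scores(ages)
print([round(z, 2) for z in age_z])
```

After the conversion the variable has mean 0 and standard deviation 1, so a value with a large absolute z-score stands out as a potential outlier.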

Measuring Similarity Between Observations (cont.):
Manhattan distance is a dissimilarity measure that is more robust to
outliers than Euclidean distance. The Manhattan distance between
observations u and v is

\[ d^{\text{man}}_{uv} = |u_1 - v_1| + |u_2 - v_2| + \cdots + |u_q - v_q| \]

Cluster Analysis
Figure 2: Manhattan distance

From Figure 2, we observe that the Manhattan distance between two
observations is the sum of the lengths of the perpendicular line segments
connecting observations u and v.
In contrast to Euclidean distance, which corresponds to the straight-line
“as the crow flies” segment between two observations, Manhattan distance
corresponds to the distance as if travelled along rectangular city blocks.

Cluster Analysis

The Manhattan distance between the standardized
observations u = (–1.77, –0.56) and v = (0.15, –0.62)
is

\[ |-1.77 - 0.15| + |-0.56 - (-0.62)| = 1.92 + 0.06 = 1.98 \]

Note: After conversion to z-scores, unequal weighting of variables can also be considered
by multiplying the variables of each observation by a selected set of weights.
For instance, after standardizing the units on customer observations so that income and
age are expressed as their respective z-scores (instead of expressed in dollars and years),
we can multiply the income z-scores by 2 if we wish to treat income with twice the
importance of age. Therefore, standardizing removes bias due to the difference in
measurement units, and variable weighting allows the analyst to introduce any desired bias
based on the business context.
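
A small sketch of Manhattan distance with optional variable weights follows. The `weights` parameter is an illustrative addition, and the reading of the two components as (age, income) z-scores is an assumption carried over from the earlier example. For non-negative weights, multiplying each variable by its weight before measuring distance is equivalent to weighting each absolute difference.

```python
def manhattan_distance(u, v, weights=None):
    """Sum of (optionally weighted) absolute differences between u and v."""
    if len(u) != len(v):
        raise ValueError("observations must have the same number of variables")
    if weights is None:
        weights = [1.0] * len(u)  # equal weighting by default
    return sum(w * abs(a - b) for w, a, b in zip(weights, u, v))

# Standardized observations from the example above, read as (age, income).
u = (-1.77, -0.56)
v = (0.15, -0.62)

d_plain = manhattan_distance(u, v)                     # equal weights
d_weighted = manhattan_distance(u, v, weights=(1, 2))  # income counts twice

print(round(d_plain, 2), round(d_weighted, 2))
```

Doubling the income weight only adds 0.06 here because the two observations barely differ in income.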

Cluster Analysis
• When clustering observations solely on the basis of categorical variables
encoded as 0–1, a better measure of similarity between two observations
can be achieved by counting the number of variables with matching values.
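• The simplest overlap measure is called the matching coefficient and is
computed as:

\[ \text{matching coefficient} = \frac{\text{number of variables with matching values for observations } u \text{ and } v}{\text{total number of variables}} \]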

Cluster Analysis
• A weakness of the matching coefficient is that if two observations
both have a 0 entry for a categorical variable, this is counted as a
sign of similarity between the two observations.
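• To avoid misstating similarity due to the absence of a feature, a
similarity measure called Jaccard’s coefficient does not count
matching zero entries and is computed as:

\[ \text{Jaccard's coefficient} = \frac{\text{number of variables with matching nonzero values for } u \text{ and } v}{\text{total number of variables} - \text{number of variables with matching zero values for } u \text{ and } v} \]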

Cluster Analysis

Table 1: Comparison of Similarity Matrixes for Observations with Binary Variables
Observation   Female   Married   Loan   Mortgage
1             1        0         0      0
2             0        1         1      1
3             1        1         1      0
4             1        1         0      0
5             1        1         0      0

Cluster Analysis

Table 1: Comparison of Similarity Matrixes for Observations with Binary Variables (cont.)
• Similarity Matrix Based on Matching Coefficient:
Observation   1      2      3      4      5
1             1
2             0      1
3             0.5    0.5    1
4             0.75   0.25   0.75   1
5             0.75   0.25   0.75   1      1

Cluster Analysis

Table 1: Comparison of Similarity Matrixes for Observations with Binary Variables (cont.)
• Similarity Matrix Based on Jaccard’s Coefficient:
Observation   1       2      3       4      5
1             1
2             0       1
3             0.333   0.5    1
4             0.5     0.25   0.667   1
5             0.5     0.25   0.667   1      1
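
The two similarity matrixes in Table 1 can be reproduced with a few lines of code. This is a sketch only: the function names are illustrative, and returning 0.0 when every entry is a matching zero is an assumed convention that the slides do not address.

```python
def matching_coefficient(u, v):
    """Share of variables on which two binary observations agree."""
    matches = sum(1 for a, b in zip(u, v) if a == b)
    return matches / len(u)

def jaccard_coefficient(u, v):
    """Like the matching coefficient, but matching zeros are not counted."""
    both_one = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    both_zero = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    denominator = len(u) - both_zero
    # Assumed convention: similarity 0.0 if all entries are matching zeros.
    return both_one / denominator if denominator else 0.0

# (Female, Married, Loan, Mortgage) for the five observations in Table 1.
observations = {
    1: (1, 0, 0, 0),
    2: (0, 1, 1, 1),
    3: (1, 1, 1, 0),
    4: (1, 1, 0, 0),
    5: (1, 1, 0, 0),
}

# Print the lower triangle: (pair, matching coefficient, Jaccard coefficient).
for i in observations:
    for j in observations:
        if j <= i:
            m = matching_coefficient(observations[i], observations[j])
            jc = jaccard_coefficient(observations[i], observations[j])
            print(i, j, round(m, 3), round(jc, 3))
```

Observations 1 and 4 illustrate the difference: they share two matching zeros (Loan and Mortgage), so their matching coefficient is 0.75 while their Jaccard coefficient is only 0.5.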

Cluster Analysis
Matching distance:
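Subtracting the matching coefficient from 1 results in a distance measure
for binary variables. The matching distance between observations u and v
(consisting entirely of binary variables) is

\[ \text{matching distance} = 1 - \text{matching coefficient} = \frac{\text{number of variables with different values for } u \text{ and } v}{\text{total number of variables}} \]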

Cluster Analysis
Jaccard distance:
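Subtracting Jaccard’s coefficient from 1 results in the Jaccard distance
measure for binary variables. That is, the Jaccard distance between
observations u and v (consisting entirely of binary variables) is

\[ \text{Jaccard distance} = 1 - \text{Jaccard's coefficient} = \frac{\text{number of variables with different values for } u \text{ and } v}{\text{total number of variables} - \text{number of variables with matching zero values for } u \text{ and } v} \]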

Cluster Analysis
Hierarchical Clustering:
• Determines the similarity of two clusters by considering the similarity
between the observations composing either cluster.
• Starts with each observation in its own cluster and then iteratively combines
the two clusters that are the most similar into a single cluster.
• Given a way to measure similarity between observations, there are several
clustering method alternatives for comparing observations in two clusters to
obtain a cluster similarity measure:
• Single linkage.
• Complete linkage.
• Group average linkage.
• Median linkage.
• Centroid linkage.
Cluster Analysis
• Single linkage: The similarity between two clusters is defined by the similarity
of the pair of observations (one from each cluster) that are the most similar.
• Complete linkage: This clustering method defines the similarity between two
clusters as the similarity of the pair of observations (one from each cluster)
that are the most different.
• Group Average linkage: Defines the similarity between two clusters to be the
average similarity computed over all pairs of observations between the two
clusters.
• Median linkage: Analogous to group average linkage except that it uses the
median of the similarities computed between all pairs of observations
between the two clusters.
• Centroid linkage: Uses the averaging concept of cluster centroids to define
between-cluster similarity.
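
The linkage definitions above can be sketched in code by working with distances (dissimilarities), so that "most similar" means "smallest distance". The two clusters are hypothetical, Euclidean distance is an assumed choice of dissimilarity, and `median_linkage` follows the definition given on this slide, which may differ from the method some software packages call "median" linkage.

```python
import math
from itertools import product
from statistics import mean, median

def pairwise_distances(cluster_a, cluster_b):
    """Euclidean distances over all pairs with one observation from each cluster."""
    return [math.dist(u, v) for u, v in product(cluster_a, cluster_b)]

def single_linkage(a, b):
    return min(pairwise_distances(a, b))     # most similar pair

def complete_linkage(a, b):
    return max(pairwise_distances(a, b))     # most different pair

def group_average_linkage(a, b):
    return mean(pairwise_distances(a, b))    # average over all pairs

def median_linkage(a, b):
    return median(pairwise_distances(a, b))  # median over all pairs

def centroid(cluster):
    """Variable-by-variable average of the observations in a cluster."""
    return tuple(mean(column) for column in zip(*cluster))

def centroid_linkage(a, b):
    return math.dist(centroid(a), centroid(b))

# Two hypothetical clusters of two-variable observations.
cluster_a = [(0, 0), (0, 2)]
cluster_b = [(3, 0), (4, 0)]

for linkage in (single_linkage, complete_linkage, group_average_linkage,
                median_linkage, centroid_linkage):
    print(linkage.__name__, round(linkage(cluster_a, cluster_b), 3))
```

Single linkage always gives the smallest of these pairwise-based values and complete linkage the largest, with the average and median in between.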
Cluster Analysis

Figure 3: Measuring Similarity Between Clusters

Cluster Analysis
• Ward’s method merges two clusters such that the
dissimilarity of the observations with the resulting single
cluster increases as little as possible.
• When McQuitty’s method considers merging two clusters A
and B, the dissimilarity of the resulting cluster AB to any other
cluster C is calculated as: ((dissimilarity between A and C) +
(dissimilarity between B and C)) divided by 2.
• A dendrogram is a chart that depicts the set of nested
clusters resulting at each step of aggregation.

Cluster Analysis
Figure 4: Dendrogram for KTC Using Matching Coefficients and Group Average Linkage

Cluster Analysis
Three clusters

Composition of these three clusters

Cluster 1: {4, 5, 6, 11, 19, 28, 1, 7, 21, 22, 23, 30, 13, 17, 18, 15, 27}
• Mix of males and females, 15 out of 17 married, no car loans, 5 out of 17 with
mortgages

Cluster 2: {2, 26, 8, 10, 20, 25}
• All males with car loans, 5 out of 6 married, 2 out of 6 with mortgages

Cluster 3: {3, 9, 14, 16, 12, 24, 29}
• All females with car loans, 4 out of 7 married, 5 out of 7 with mortgages
Cluster Analysis

k-Means Clustering:
• Given a value of k, the k-means algorithm randomly assigns each
observation to one of the k clusters.
• After all observations have been assigned to a cluster, the resulting
cluster centroids are calculated.
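• Using the updated cluster centroids, all observations are reassigned to
the cluster with the closest centroid.
• The centroid calculation and reassignment steps are repeated until the
cluster assignments no longer change or a maximum number of iterations
is reached.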
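
The steps above can be sketched as a short function. This is an illustrative implementation, not the one used to produce the slides' results: the `initial_labels`, `max_iter`, and `seed` parameters, the use of Euclidean distance, and the handling of an empty cluster (re-seeding it with a randomly chosen observation) are all assumptions.

```python
import math
import random

def k_means(observations, k, initial_labels=None, max_iter=100, seed=0):
    """Return (labels, centroids) after iterating the k-means steps."""
    rng = random.Random(seed)
    # Step 1: randomly assign each observation to one of the k clusters
    # (or start from a caller-supplied assignment).
    if initial_labels is not None:
        labels = list(initial_labels)
    else:
        labels = [rng.randrange(k) for _ in observations]

    centroids = []
    for _ in range(max_iter):
        # Step 2: compute the centroid of each cluster.
        centroids = []
        for c in range(k):
            members = [obs for obs, lab in zip(observations, labels) if lab == c]
            if not members:
                # Assumption: re-seed an empty cluster with a random observation.
                members = [rng.choice(observations)]
            centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))

        # Step 3: reassign every observation to the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(obs, centroids[c]))
            for obs in observations
        ]
        if new_labels == labels:  # no change, so the clustering has converged
            break
        labels = new_labels
    return labels, centroids

# Hypothetical two-variable observations forming two well-separated groups.
points = [(1, 1), (1.2, 0.8), (0.8, 1.1), (9, 9), (9.2, 8.8), (8.9, 9.1)]
labels, centroids = k_means(points, k=2, initial_labels=[0, 1, 0, 1, 0, 1])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In practice the variables would usually be standardized first, for the scale reasons discussed earlier, and the algorithm is often rerun from several random starts because the result can depend on the initial assignment.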

Cluster Analysis

Figure 5: Clustering Observations by Age and Income Using k-Means Clustering with k = 3

Cluster Analysis
Three clusters

Cluster 1 is characterized by relatively younger, lower-income customers (Cluster 1’s
centroid is at [33 years, $20,364]).

Cluster 2 is characterized by relatively older, higher-income customers (Cluster 2’s
centroid is at [58 years, $47,729]).

Cluster 3 is characterized by relatively older, lower-income customers (Cluster 3’s
centroid is at [53 years, $21,416]).

Cluster Analysis
Table 2: Average Distances Within Clusters
            No. of Observations   Average Distance Between Observations in Cluster
Cluster 1   12                    0.622
Cluster 2   8                     0.739
Cluster 3   10                    0.520

Table 3: Distances Between Cluster Centroids
            Cluster 1   Cluster 2   Cluster 3
Cluster 1   0           2.784       1.529
Cluster 2   2.784       0           1.964
Cluster 3   1.529       1.964       0

Cluster Analysis

Hierarchical Clustering versus k-Means Clustering


Hierarchical Clustering:
• Suitable when we have a small data set (e.g., fewer than 500 observations)
and want to easily examine solutions with increasing numbers of clusters.
• Convenient method if you want to observe how clusters are nested.

k-Means Clustering:
• Suitable when you know how many clusters you want and you have a larger
data set (e.g., more than 500 observations).
• Partitions the observations, which is appropriate if trying to summarize
the data with k “average” observations that describe the data with the
minimum amount of error.

Association Rules
Evaluating Association Rules

Association Rules
• Association rules: If-then statements which convey the likelihood of
certain items being purchased together.
• Although association rules are an important tool in market basket
analysis, they are also applicable to other disciplines.
• Antecedent: The collection of items (or item set) corresponding to the if
portion of the rule.
• Consequent: The item set corresponding to the then portion of the rule.
• Support count of an item set: Number of transactions in the data that
include that item set.

Association Rules
Table 4: Shopping-Cart Transactions
Transaction Shopping Cart
1 bread, peanut butter, milk, fruit, jelly
2 bread, jelly, soda, potato chips, milk, fruit, vegetables, peanut butter
3 whipped cream, fruit, chocolate sauce, beer
4 steak, jelly, soda, potato chips, bread, fruit
5 jelly, soda, peanut butter, milk, fruit
6 jelly, soda, potato chips, milk, bread, fruit
7 fruit, soda, potato chips, milk
8 fruit, soda, peanut butter, milk
9 fruit, cheese, yogurt
10 yogurt, vegetables, beer

• Investigating the rule “if {bread, jelly}, then {peanut butter}” we see the support count of
{bread, jelly, peanut butter} is 2.
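
Support counts for Table 4 can be verified with a short sketch. The function name `support_count` and the representation of each shopping cart as a Python set are illustrative choices.

```python
# The ten shopping carts from Table 4, one set of items per transaction.
transactions = [
    {"bread", "peanut butter", "milk", "fruit", "jelly"},
    {"bread", "jelly", "soda", "potato chips", "milk", "fruit",
     "vegetables", "peanut butter"},
    {"whipped cream", "fruit", "chocolate sauce", "beer"},
    {"steak", "jelly", "soda", "potato chips", "bread", "fruit"},
    {"jelly", "soda", "peanut butter", "milk", "fruit"},
    {"jelly", "soda", "potato chips", "milk", "bread", "fruit"},
    {"fruit", "soda", "potato chips", "milk"},
    {"fruit", "soda", "peanut butter", "milk"},
    {"fruit", "cheese", "yogurt"},
    {"yogurt", "vegetables", "beer"},
]

def support_count(item_set, carts):
    """Number of transactions that contain every item in item_set."""
    item_set = set(item_set)
    return sum(1 for cart in carts if item_set <= cart)

print(support_count({"bread", "jelly", "peanut butter"}, transactions))  # 2
print(support_count({"bread", "jelly"}, transactions))                   # 4
print(support_count({"peanut butter"}, transactions))                    # 4
```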
Association Rules
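• Confidence: Helps identify reliable association rules:

\[ \text{confidence} = \frac{\text{support of \{antecedent and consequent\}}}{\text{support of antecedent}} \]

• Lift ratio: Measure to evaluate the efficiency of a rule:

\[ \text{lift ratio} = \frac{\text{confidence}}{\text{support of consequent} / \text{total number of transactions}} \]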

• For the data in Table 4, the rule “if {bread, jelly}, then {peanut butter}”
has confidence = 2/4 = 0.5 and a lift ratio = 0.5/(4/10) = 1.25.

Association Rules

• This measure of confidence can be viewed as the conditional probability
that the consequent item set occurs given that the antecedent item set
occurs.
• A high value of confidence suggests a rule in which the consequent is
frequently true when the antecedent is true, but a high value of
confidence can be misleading: if the consequent is very frequent on its
own, confidence can be high even when there is little or no association
with the antecedent.

Association Rules

• A lift ratio greater than 1 suggests that there is some usefulness to the rule
and that it is better at identifying cases when the consequent occurs than
no rule at all.
• For the data in Table 4, the rule “if {bread, jelly}, then {peanut butter}” has
confidence = 2/4 = 0.5 and a lift ratio = 0.5/(4/10) = 1.25.
• A lift ratio of 1.25 means the rule is 25% better at identifying the consequent than guessing at random.
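
The confidence and lift ratio for this rule can be recomputed from the support counts. A minimal sketch, with illustrative function and parameter names, taking the counts for Table 4 as inputs:

```python
def confidence(support_both, support_antecedent):
    """Share of antecedent transactions that also contain the consequent."""
    return support_both / support_antecedent

def lift_ratio(rule_confidence, support_consequent, n_transactions):
    """Confidence relative to the consequent's overall frequency."""
    return rule_confidence / (support_consequent / n_transactions)

# Rule: if {bread, jelly}, then {peanut butter}, using counts from Table 4.
rule_confidence = confidence(support_both=2, support_antecedent=4)
rule_lift = lift_ratio(rule_confidence, support_consequent=4, n_transactions=10)

print(rule_confidence, round(rule_lift, 2))
```

A lift ratio of exactly 1 would mean the antecedent tells us nothing: the consequent is no more likely in carts containing the antecedent than in carts overall.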

Association Rules
Table 5: Association Rules for Hy-Vee
Antecedent (A)         Consequent (C)         Support for A   Support for C   Support for A & C   Confidence (%)   Lift Ratio
Bread                  Fruit, Jelly           4               5               4                   100.0            2.00
Bread                  Jelly                  4               5               4                   100.0            2.00
Bread, Fruit           Jelly                  4               5               4                   100.0            2.00
Fruit, Jelly           Bread                  5               4               4                   80.0             2.00
Jelly                  Bread                  5               4               4                   80.0             2.00
Jelly                  Bread, Fruit           5               4               4                   80.0             2.00
Fruit, Potato Chips    Soda                   4               6               4                   100.0            1.67
Peanut Butter          Milk                   4               6               4                   100.0            1.67
Peanut Butter          Milk, Fruit            4               6               4                   100.0            1.67
Peanut Butter, Fruit   Milk                   4               6               4                   100.0            1.67
Potato Chips           Fruit, Soda            4               6               4                   100.0            1.67
Potato Chips           Soda                   4               6               4                   100.0            1.67
Fruit, Soda            Potato Chips           6               4               4                   66.7             1.67
Milk                   Peanut Butter          6               4               4                   66.7             1.67
Milk                   Peanut Butter, Fruit   6               4               4                   66.7             1.67
Milk, Fruit            Peanut Butter          6               4               4                   66.7             1.67
Soda                   Fruit, Potato Chips    6               4               4                   66.7             1.67
Soda                   Potato Chips           6               4               4                   66.7             1.67
Fruit, Soda            Milk                   6               6               5                   83.3             1.39
Milk                   Fruit, Soda            6               6               5                   83.3             1.39
Milk                   Soda                   6               6               5                   83.3             1.39
Milk, Fruit            Soda                   6               6               5                   83.3             1.39
Soda                   Milk                   6               6               5                   83.3             1.39
Soda                   Milk, Fruit            6               6               5                   83.3             1.39

Association Rules
Evaluating Association Rules:
• An association rule is ultimately judged on how actionable it is and how
well it explains the relationship between item sets.
• For example, Walmart mined its transactional data to uncover strong
evidence of the association rule, “If a customer purchases a Barbie doll,
then a customer also purchases a candy bar.”
• An association rule is useful if it is well supported and explains an
important previously unknown relationship.

Text Mining
Voice of the Customer at Triad Airline
Preprocessing Text Data for Analysis
Movie Reviews

Text Mining
• Text, like numerical data, may contain information that can help solve
problems and lead to better decisions.
• Text mining is the process of extracting useful information from text
data.
• Text data is often referred to as unstructured data because in its raw
form, it cannot be stored in a traditional structured database (rows and
columns).
• Audio and video data are also examples of unstructured data.
• Data mining with text data is more challenging than data mining with
traditional numerical data, because it requires more preprocessing to
convert the text to a format amenable for analysis.

Text Mining
Voice of the Customer at Triad Airline:
• Triad solicits feedback from its customers through a follow-up e-mail
the day after the customer has completed a flight.
• Survey asks the customer to rate various aspects of the flight and asks
the respondent to type comments into a dialog box in the e-mail;
includes:
• Quantitative feedback from the ratings.
• Comments entered by the respondents which need to be analyzed.
• A collection of text documents to be analyzed is called a corpus.

Text Mining
Table 6: Ten Respondents’ Concerns for Triad Airlines
Concerns
The wi-fi service was horrible. It was slow and cut off several times.
My seat was uncomfortable.
My flight was delayed 2 hours for no apparent reason.
My seat would not recline.
The man at the ticket counter was rude. Service was horrible.
The flight attendant was rude. Service was bad.
My flight was delayed with no explanation.
My drink spilled when the guy in front of me reclined his seat.
My flight was canceled.
The arm rest of my seat was nasty.

Text Mining
Voice of the Customer at Triad Airline:
• To be analyzed, text data needs to be converted to structured data (rows
and columns of numerical data) so that the tools of descriptive statistics,
data visualization and data mining can be applied.
• Think of converting a group of documents into a matrix of rows and
columns where each row corresponds to a document and each column
corresponds to a particular word.
• A presence/absence or binary term-document matrix is a matrix with the
rows representing documents and the columns representing words.
• Entries in the columns indicate either the presence or the absence of a particular
word in a particular document.

Text Mining
Voice of the Customer at Triad Airline (continued):
• Creating the list of terms to use in the presence/absence matrix can be a
complicated matter:
• Too many terms result in a matrix with many columns, which may be difficult to
manage and could yield meaningless results.
• Too few terms may miss important relationships.
• Term frequency along with the problem context are often used as a guide.
• In Triad’s case, management used word frequency and the context of
having a goal of satisfied customers to come up with the following list of
terms they feel are relevant for categorizing the respondents’ comments:
delayed, flight, horrible, recline, rude, seat, and service.

Text Mining
Table 7: The Presence/Absence Term-Document Matrix for Triad Airlines
                                       Term
Document   Delayed   Flight   Horrible   Recline   Rude   Seat   Service
1          0         0        1          0         0      0      1
2          0         0        0          0         0      1      0
3          1         1        0          0         0      0      0
4          0         0        0          1         0      1      0
5          0         0        1          0         1      0      1
6          0         1        0          0         1      0      1
7          1         1        0          0         0      0      0
8          0         0        0          1         0      1      0
9          0         1        0          0         0      0      0
10         0         0        0          0         0      1      0
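
Table 7 can be rebuilt from the ten comments in Table 6 with a short sketch. The prefix test used to match a term (so that "reclined" counts as "recline") is a simplifying assumption standing in for proper stemming, and the function names are illustrative.

```python
import re

TERMS = ["delayed", "flight", "horrible", "recline", "rude", "seat", "service"]

comments = [
    "The wi-fi service was horrible. It was slow and cut off several times.",
    "My seat was uncomfortable.",
    "My flight was delayed 2 hours for no apparent reason.",
    "My seat would not recline.",
    "The man at the ticket counter was rude. Service was horrible.",
    "The flight attendant was rude. Service was bad.",
    "My flight was delayed with no explanation.",
    "My drink spilled when the guy in front of me reclined his seat.",
    "My flight was canceled.",
    "The arm rest of my seat was nasty.",
]

def tokenize(text):
    """Lowercase the text and keep only runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def presence_row(text, terms=TERMS):
    """1 if any token starts with the term (crude stemming), else 0."""
    tokens = tokenize(text)
    return [int(any(tok.startswith(term) for tok in tokens)) for term in terms]

matrix = [presence_row(comment) for comment in comments]
for doc_number, row in enumerate(matrix, start=1):
    print(doc_number, row)
```

Each row of `matrix` is one document and each column one term, which is the layout described for the presence/absence term-document matrix.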

Text Mining
Preprocessing Text Data for Analysis:
• The text-mining process converts unstructured text into numerical data
and applies quantitative techniques.
• Which terms become the headers of the columns of the term-document
matrix can greatly impact the analysis.
• Tokenization is the process of dividing text into separate terms, referred to
as tokens:
• Symbols and punctuation must be removed from the document, and all
letters should be converted to lowercase.
• Different forms of the same word, such as “stacking,” “stacked,” and
“stack” probably should not be considered as distinct terms.
• Stemming is the process of converting a word to its stem or root word.

Text Mining
Preprocessing Text Data for Analysis (continued):
• The goal of preprocessing is to generate a list of most-relevant terms that is
sufficiently small so as to lend itself to analysis:
• Frequency can be used to eliminate words from consideration as tokens.
• Low-frequency words probably will not be very useful as tokens.
• Consolidating words that are synonyms can reduce the set of tokens.
• Most text-mining software gives the user the ability to manually specify
terms to include or exclude as tokens.
• The use of slang, humor, and sarcasm can cause interpretation problems and
might require more sophisticated data cleansing and subjective intervention
on the part of the analyst to avoid misinterpretation.
• Data preprocessing parses the original text data down to the set of tokens
deemed relevant for the topic being studied.
Text Mining
Preprocessing Text Data for Analysis (continued):
• When the documents in a corpus contain many words and when the
frequency of word occurrence is important to the context of the
business problem, preprocessing can be used to develop a frequency
term-document matrix.
• A frequency term-document matrix is a matrix whose rows represent
documents and columns represent tokens, and the entries in the
matrix are the frequency of occurrence of each token in each
document.
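
A frequency term-document matrix differs from the presence/absence version only in that it counts occurrences. The sketch below uses two hypothetical review snippets (not the reviews behind Table 8) and an illustrative function name.

```python
import re
from collections import Counter

def frequency_row(text, tokens_of_interest):
    """Count how often each token of interest occurs in one document."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[token] for token in tokens_of_interest]

tokens_of_interest = ["great", "terrible"]
reviews = [  # hypothetical documents for illustration only
    "Great cast and great stunts, but a terrible ending.",
    "Terrible plot, terrible dialogue.",
]

freq_matrix = [frequency_row(review, tokens_of_interest) for review in reviews]
print(freq_matrix)  # [[2, 1], [0, 2]]
```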

Text Mining
Movie Reviews:
• A new action film has been released, and we now have a sample of 10
reviews from movie critics.
• Using preprocessing techniques, we have reduced the number of
tokens to only two: “great” and “terrible.”
• Table 8 displays the corresponding frequency term-document matrix.
• To demonstrate the analysis of a frequency term-document matrix with
descriptive data mining, we apply k-means clustering with k = 2 to the
frequency term-document matrix to obtain the two clusters in Figure 6.

Text Mining
Table 8: The Frequency Term-Document Matrix for Movie Reviews
                  Term
Document   Great   Terrible
1          5       0
2          5       1
3          5       1
4          3       3
5          5       1
6          0       5
7          4       1
8          5       3
9          1       3
10         1       2

Text Mining
Figure 6: Two Clusters Using k-Means Clustering on Movie Reviews

Questions?
