Meet Your Customers v4.7 Ebook
ISBN: 978-3-9523926-3-8
www.knime.com
Preface
Machine learning promises great value for marketing-related applications. However,
the proliferation of data types, methods, tools, and programming languages hampers
knowledge integration amongst marketing analytics teams, making collaboration and
extraction of actionable insights difficult.
Visual programming tools come to the rescue. Their intuitive UI and straightforward
data flow building process help marketing and data teams develop, orchestrate, and
deploy advanced machine learning projects in a visual, easy-to-create, and easy-to-share
fashion.
Motivated by the desire to share the technical expertise and knowledge gathered over
the years, this book is a collection of experiences solving some of the most popular
challenges in Marketing Analytics using a no-code/low-code approach with KNIME
Analytics Platform.
All examples described in this book are the result of a prolific collaboration between
business and academia. We teamed up to create a live repository of Machine Learning
and Marketing solutions tailored to new and expert users, market analysts, students,
data scientists, marketers, researchers, and data analysts. All workflow solutions are
available for free on the KNIME Community Hub.
We will update this book regularly with descriptions and workflows from the newest
projects in Marketing Analytics, as they become available.
We hope this collection of Marketing Analytics experiences will help foster the growth
of practical data science skills in the next generation of market and data professionals.
Table of Contents
Customer Segmentation
Market Basket Analysis with the Apriori Algorithm
Movie Recommendations with Spark Collaborative Filtering
Marketing Analytics with KNIME
Authors: Francisco Villarroel Ordenes, LUISS Guido Carli University & Rosaria Silipo, KNIME
Many businesses are currently expanding their adoption of data science techniques to
include machine learning, and marketing analytics is no exception. Anything can be
reduced to numbers, including customer behavior and color perception, and therefore
anything can be analyzed, modeled, and predicted.
Machine Learning in Marketing Analytics
The public workflow repository for marketing analytics solutions on the KNIME Community Hub.
Since its creation in 2021, we have maintained this repository and will continue to do so,
updating the existing workflows and adding new ones every time a solution from a new
project becomes available.
Note. This solution repository has been designed, implemented, and maintained by
a mixed team of KNIME users and marketing experts from the KNIME Evangelism
Team in Constance (Germany), headed by Rosaria Silipo, and Francisco
Villarroel Ordenes, Professor of Marketing at LUISS Guido Carli University in Rome
(Italy).¹
¹ F. Villarroel Ordenes & R. Silipo, “Machine learning for marketing on the KNIME Hub: The development of a
live repository for marketing applications”, Journal of Business Research, 137(1):393-410,
DOI: 10.1016/j.jbusres.2021.08.036.
Churn prediction
If the churn probability is very high and the customer is valuable, the firm
might want to take action to prevent the churn.
The “Churn Prediction” subfolder in the Machine Learning and Marketing space on the
KNIME Community Hub contains the corresponding workflow solutions.
The dashboard reporting the churn risk in orange for all new customers.
Sentiment analysis
Sentiment is another popular metric used in marketing to evaluate the reactions of
users and customers to a given initiative, product, event, etc. Following the popularity
of this topic, we have dedicated a few solutions to the implementation of a sentiment
evaluator for text documents. Such solutions are contained in the “Sentiment Analysis”
subfolder. All solutions focus on three sentiment classes: positive, negative, and
neutral.
• Traditional machine learning algorithms. In this case, texts are transformed into
numerical vectors, where each unit represents the presence/absence or the
frequency of a given word from the corpus dictionary. After that, traditional
machine learning algorithms, such as Random Forest, Support Vector Machine, or
Logistic Regression, can be applied to classify the text polarity. Notice that in the
vectorization process the order of the words in the text is not preserved (see the
sketch after this list).
• Deep Learning-based. Deep learning-based solutions are becoming more and
more popular for sentiment analysis, since some deep learning architectures can
exploit the word context (i.e., the sequence history) for better sentiment
estimation. In this case, texts are one-hot encoded into vectors, and the sequence
of such vectors is presented to a neural network, which is trained to recognize the
text polarity. Often, the architecture of the neural network includes a layer of Long
Short-Term Memory (LSTM) units, since LSTMs take into account the order of
appearance of the input vectors (the words), i.e., the word context.
• Language models. They are also referred to as deep contextualized language
models because they reflect the context-dependent meaning of words. It has been
argued that these methods are more efficient than Recurrent Neural Networks
because they allow parallelized (rather than sequential) encoding of word and sub-word
tokens contingent on their context. Recent language model algorithms include
ULMFiT, BERT, RoBERTa, XLNet, etc. In the Machine Learning repository, we
provide a straightforward implementation of BERT.
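To make the first approach concrete, here is a minimal Python sketch of bag-of-words sentiment classification with scikit-learn. The texts, labels, and three-class setup are toy assumptions for illustration; the KNIME solutions implement this visually with nodes rather than code.

```python
# Minimal sketch of the "traditional machine learning" approach: texts are
# vectorized into word-frequency features and a standard classifier predicts
# the polarity. Data and labels are invented toy examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "I love this product, it works great",
    "Absolutely terrible, it broke after one day",
    "It is okay, nothing special",
    "Fantastic service and fast delivery",
    "Worst purchase I have ever made",
    "The package arrived on the expected date",
]
labels = ["positive", "negative", "neutral",
          "positive", "negative", "neutral"]

# Vectorization discards word order: each document becomes a vector of
# weighted word frequencies over the corpus dictionary.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

print(model.predict(["great product, I am happy",
                     "this was a horrible experience"]))
```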
Visualization of tweets with estimated sentiment (red = negative, green = positive, light orange = neutral).
Analysis of the image in the top left corner through Google Vision services.
If you are working in the field of cybersecurity, including these words in your web page
should increase your page ranking.
Top co-occurring keywords around the topic of cybersecurity using different methods.
How a Marketer Built a Data App: No Code Required
I’m a content marketer who writes about data science. My acquaintance with data
science began when I started working as an assistant at the Chair for Bioinformatics
and Information Mining at Konstanz University. Initially, copywriting papers about
pruning and Decision Trees, fuzzy logic, or Bisociative Information Networks, I
gradually became more familiar with information mining and increasingly in awe of the
data scientists around me.
Since then, I’ve moved into a marketing role, and have written 20+ articles, edited over
1,000, and interviewed dozens of data scientists. But I’ve never done much analysis
myself. I occasionally look at Google Analytics (GA), but I don’t download .csv files,
use Tableau, or even always check the traffic to my blogs. Curious to finally step into
this world of data science myself, I wanted to see how I could use data analytics to
make our content stronger.
KNIME is one of the low-code tools making data analytics accessible to even non-
technical users, like me. So, I decided to set myself a challenge: Build a Data App with
the low-code KNIME Analytics Platform (that would be more helpful than GA).
• Why this is important: It will show how the blog is growing and give insight into
monthly trends. Getting this figure as a percentage lets me compare it more
easily with industry benchmarks. Such insights will help me plan content better.
• Why Google Analytics doesn’t quite hit the mark: In GA, this involves manually
setting time frames for each comparison, and not all metrics can be combined in
custom reports.
• Why my workflow helps: I only need to do the manual work once, configuring the
workflow to query GA correctly. And my dashboard can combine data from any
• Why these are important: They should give me an indication of how interesting
our content is. Are people arriving and staying to read, or are they arriving, looking,
and leaving?
• Why Google Analytics doesn’t quite hit the mark: While I can set a custom report
to get me these figures and send a report to my boss to review, the report is static.
She can’t delve deeper if she wants to compare a different time period.
3. Maarit Widmann, my mentor! I'm lucky enough to work with her directly, but she
regularly teaches courses and her peers run Data Connects, where you can
directly ask data scientists questions and discuss projects you’re working on.
These four steps ultimately translated into different sections of my workflow. I’ll
describe now how I got there, what was easy, and what stumbling blocks I
encountered.
MoM blog performance workflow to connect to Google Analytics, remove outlier articles and produce an
interactive dashboard showing blog traffic, MoM growth, time spent on blog, and bounce rates.
The first step was to set up a project on Google Cloud Console. I found some useful instructions on how to do this
in related editorial content on the KNIME Blog (e.g., Querying Google Analytics), which
I could adapt for my purposes.
However, I was able to benefit from using a so-called "component". A component is a
group of nodes (i.e., a sub-workflow) that encapsulates and abstracts the
functionality of a logical block. Components serve a similar purpose as nodes, but
let you bundle functionality for sharing and reusing. In addition, components can have
their own configuration dialog, and custom interactive views. One of my colleagues
already had a workflow that connects to Google’s API. She preconfigured some nodes
with the right settings to connect to the API and GA, and wrapped them together into
a component. All I had to do was insert this component into my workflow, and I was
ready to go.
After connecting to Google’s API, I needed two more nodes to connect with GA (the
Google Analytics Connection node) and fetch the data I needed (the Google Analytics
Query node). To specify the metrics and dimensions for a query, you have to know the
terms. I kept this overview of the names of dimensions and metrics on hand while
doing this.
Data App for blog performance over time (left). I can select any article and get the scroll-depth for that
article (right), here showing the data for an article on sentiment analysis.
I used this app to identify five well-performing articles to remove from my analysis, and
I then manually listed them in a Google Sheet. My workflow could now be configured
to access this sheet (with the Google Sheets node), take whichever articles are listed
there, and remove them. The Reference Row Filter node performs this task. Finding
this node was my first major stumbling block. Searching the Node Repository for
anything to do with “Row” helped me find it, but it took a lot of trial and error.
Finding the Reference Row Filter node in the Node Repository and adding it to the section of the workflow
that accesses a Google Sheet to get the current list of outliers to remove from the analysis.
Tip. Setting the search box in the Node Repository to enable fuzzy searches was a
useful hint I got on the Forum to make finding things easier. You get all the results
for the word you enter, not just specific names.
I then noticed that the date column was a string. Maarit pointed out that I wouldn’t be able to set time frames in my
interactive dashboard if the format of that column stayed as a string.
Checking the progress of my data as it flows through each node in the workflow. Here, a right-click to open
Filtered Table shows me all my articles with outliers removed.
So I had to do a bit of so-called “data processing” and convert the date column –a
string– into a date format to enable me to enter a given time frame for later analysis.
I found a formula to calculate MoM growth online (I’m not so hot at math). Searching
the Node Repository for the respective node was time-consuming since I used the
wrong search terms (“MoM growth” and “Metrics”). Finally, the word “Calculate”
brought up the Math Formula node.
Translating “Subtract the first month from the second month, then divide that by the last
month’s total, then multiply the result by 100” into a mathematical expression was
tricky. I had vague memories of formulas from school, but I needed help to configure
the node.
Throughout my challenge, I kept forgetting that I had to tell the workflow each step of
the process. For example, I discovered I needed another Math Formula node to convert
the average time in seconds to minutes. I also stumbled when working out how to enter
MoM growth percentage “one row down” in my table. Such a simple thing, but it was
hard to know what to search for in the Node Repository. My mentors came to the
rescue and told me about the Lag Column node to enter values “lagged” by one row.
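For anyone curious about the underlying arithmetic, here is a small Python sketch of the same calculations: month-over-month growth computed against the previous (lagged) month, and average time converted from seconds to minutes. The column names and figures are illustrative; the workflow itself uses the Math Formula and Lag Column nodes instead of code.

```python
import pandas as pd

# Toy monthly figures; in the workflow these come from the Google Analytics query.
df = pd.DataFrame({
    "month": ["2023-01", "2023-02", "2023-03"],
    "unique_pageviews": [12000, 13800, 13100],
    "avg_time_on_page_sec": [95.0, 102.0, 88.0],
})

# Lag the pageviews by one row (what the Lag Column node does), then apply
# MoM growth = (current - previous) / previous * 100.
df["prev_pageviews"] = df["unique_pageviews"].shift(1)
df["mom_growth_pct"] = (
    (df["unique_pageviews"] - df["prev_pageviews"]) / df["prev_pageviews"] * 100
)

# Convert the average time on page from seconds to minutes.
df["avg_time_on_page_min"] = df["avg_time_on_page_sec"] / 60

print(df)
```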
Table view showing MoM growth in %, sum of unique page views, and average bounce rates and time on
page (mins).
Working out how to set the workflow to perform the analysis on any given time period
was the most complicated step. How could I even begin to explain to the workflow
which reference date it should use as a basis? I learned about variables: how to set
them up, and how to configure nodes to take variables as input. To be honest, I found
this really hard. The reference date problem was ultimately resolved when I was able
to copy a colleague’s component that works out “today’s date.”
I found a solution for inserting interactive fields to set time frames by exploring the
wonderful world of widgets. Designing the layout of a Data App is explained clearly in
the “Cheat Sheet: Components with KNIME Analytics Platform”. I liked the fact that I
could preview all my layout designs by right-clicking “Interactive View.”
The dashboard in my Data App showing a line plot of overall blog performance by unique views and a table
of month-on-month growth.
Developing a more logical approach to the required steps will help prevent stumbles in
future, too. I frequently left out steps, then didn’t understand why the workflow wasn’t
doing what I wanted. Maarit’s initial advice to break down my project into steps was
spot-on. I realize now that going forward, splitting things into even smaller steps will
help.
Non-technical marketers might shy away from learning a data science tool, fearing it
will be too time-consuming, but a low-code environment helps here. It’s intuitive. If I
needed to leave the project for a few days, it was easy to pick it up again later, with all
my progress documented visually.
Shareable components were my friends, not just for getting non-technical people like
me started, but also for easier team collaboration. Components can be built to bundle
any repeatable operation. Standardized logging operations, ETL clean-up tasks, or
model API interfaces can be wrapped into components, shared, and reused within the
team.
Segmentation and Personalization
In this chapter, we will understand who our customers are, whether they can be
clustered into similar groups, what they usually purchase, and which products we can
recommend to best meet their needs. More specifically, we will look at use cases to
perform customer segmentation and market basket analysis, and to build a
recommendation engine.
• Customer Segmentation – Elisabeth Richter, KNIME
Customer Segmentation
Workflows on the KNIME Community Hub: Basic Customer Segmentation and Customer Segmentation
Geographic segmentation
Splitting customers based on their geographic locations, as they might have different
needs in different areas. For example, segmenting them based on country or state, or
the characteristics of an area (e.g., rural vs. urban vs. suburban).
Demographic segmentation
Splitting up customers based on features like age, sex, marital status, or income. This
is a rather basic form of segmentation, using easily accessible information. This is,
hence, one of the most common forms of customer segmentation.
Psychographic segmentation
Splitting up customers based on psychological characteristics, such as values, interests,
lifestyle, or personality traits.
Behavioral segmentation
Splitting up customers based on their behaviors, i.e., how they respond to and interact
with the brand. Such criteria include loyalty, shopping habits (e.g., the dropout rate on
a website), or the frequency of product purchases.
Rule-based segmentation
Customers are segmented into groups based on manually designed rules. This usually
involves domain and target knowledge, and thus requires at least one business expert.
For example, segmenting customers based on their purchase history as “first-time
customers,” “occasional customers,” “frequent customers,” or “inactive customers” is
highly interpretable, and an expert will know which customer falls into which group. A
drawback is that this type of segmentation is not portable to other analyses. So, with
a new goal, new knowledge, or new data, the whole rule system needs to be
redesigned. Implementing rule-based segmentation in KNIME Analytics Platform can
be done using the Rule Engine node.
This type of segmentation is simple and easy to implement: data is binned based on one or more features.
It does not necessarily require domain knowledge, but some knowledge about the
target is required – i.e., the business goal must be clear. For example, when
considering a clothing line designed for teenagers, the age of the target audience is
clear and non-negotiable. In KNIME Analytics Platform, the Auto-Binner or the Numeric
Binner node can be used.
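As a rough illustration of what the Rule Engine and binning nodes express, here is a minimal pandas sketch that segments customers with hand-written rules on purchase counts and bins them by age. The thresholds, labels, and column names are illustrative assumptions, not taken from the workflows.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "age": [17, 24, 35, 52, 68],
    "purchases_last_year": [0, 1, 4, 12, 2],
})

# Rule-based segmentation: manually designed, interpretable rules
# (roughly what the Rule Engine node expresses).
def purchase_segment(n: int) -> str:
    if n == 0:
        return "inactive customer"
    if n == 1:
        return "first-time customer"
    if n <= 5:
        return "occasional customer"
    return "frequent customer"

customers["segment"] = customers["purchases_last_year"].apply(purchase_segment)

# Binning on a single feature (roughly what the Numeric Binner node does).
customers["age_group"] = pd.cut(
    customers["age"],
    bins=[0, 19, 39, 59, 120],
    labels=["teen", "young adult", "middle-aged", "senior"],
)

print(customers)
```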
If nothing is known about the domain or target, common clustering algorithms can be
applied to segment the data. This approach works for many different use cases. There are many
clustering algorithms in KNIME Analytics Platform, such as k-Means and DBSCAN.
Our example is a simple use case in which we have some customer data for a telephone
company. The data was originally available on Iain Pardoe’s website, and can be
downloaded from Kaggle. The data set has been split into two files: the contract-
related data (ContractData.csv), which contains information about telco plans, fees,
etc., and the telco operational data (CallsData.xls), which contains information about
call times at different times of the day and the corresponding amounts paid.
Each customer is identified by a unique combination of their phone number
(“Phone”) and its corresponding area code (“Area Code”).
After reading and joining the two datasets, the data is preprocessed. The
preprocessing tasks depend on the nature of the data and the business. In our case,
we first join the information about each customer from the two datasets using the
telephone number. Then, we convert the columns “Area Code,” “Churn,” “Int’l Plan,” and
“VMail Plan” from integer to string. The “Area Code” column is a part of the unique
identifier, which is why we don’t include it as an input column for clustering. The other
columns are excluded because they are categorical values, to which the k-Means
algorithm is not applicable, as categorical variables have no natural notion of distance. Lastly, we
normalize all remaining numerical columns.
Note. It is usually recommended that you normalize the data before clustering,
especially when dealing with attributes with vastly different scales.
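A rough Python equivalent of these preprocessing and clustering steps (join the two files, drop identifier and categorical columns, normalize, run k-Means) might look like the sketch below. The file names match the ones above, but the exact column handling and k = 10 are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Read the two files and join them on the customer identifier.
contracts = pd.read_csv("ContractData.csv")
calls = pd.read_excel("CallsData.xls")
data = contracts.merge(calls, on=["Area Code", "Phone"])

# Exclude identifier and categorical columns; k-Means needs numeric features.
excluded = ["Area Code", "Phone", "Churn", "Int'l Plan", "VMail Plan"]
features = data.drop(columns=[c for c in excluded if c in data.columns])
features = features.select_dtypes("number")

# Normalize all remaining numerical columns, then cluster into k = 10 segments.
X = MinMaxScaler().fit_transform(features)
data["Cluster"] = KMeans(n_clusters=10, n_init=10, random_state=42).fit_predict(X)

print(data["Cluster"].value_counts())
```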
To open the interactive view of the Cluster Viz component, right-click the component and select “Interactive View: Cluster Viz.”
Because our component encapsulates two scatter plots, our interactive view also
contains two scatter plots.
The sub-workflow, encapsulated by the Cluster Viz component. This component de-normalizes the clustered
data back to its original range, and visualizes the clustered data colored by cluster, with the cluster centers in
one scatter plot each.
The composite view of the component contains two scatter plots: the visualization of
the telco data, colored by cluster, and the prototypes (i.e., centers) of each cluster. In
the figure below, the view of the second scatterplot is displayed. It shows the attributes
“VMail Message” (the number of voicemail messages) on the X-axis and “CustServ
Calls” (the number of customer service calls) on the Y-axis. From there, some
implications can be drawn.
• The group of data points on the right shows the prototypes of the clusters of
customers who use voicemail, and the group of data points on the left shows the
prototypes of the clusters of customers who don’t use voicemail.
• The data points in the top left and top right corners show the cluster
representatives of those customers who complain a lot. On average, they call
customer service almost twice a day.
The visualization of the cluster analysis for k = 10, plotting “VMail Message” vs. “CustServ Calls.” The scatter
plot shows the center of each cluster.
We can now easily change the view of the scatter plot by changing the axes of the plot
directly from the interactive view without changing the node settings. In the interactive
view, we can click the settings button in the top-right corner (list icon) and then change
the X and Y columns as we wish.
The view of the scatter plot can easily be changed directly from the interactive view without changing the
node settings.
The resulting plot when changing the axes to “Day Mins” (minutes during day time) vs.
“Night Mins” (minutes during night time) is shown below. Here, we can distinguish
between customers who have significantly higher call minutes during the day (the Day
Callers) and those who have significantly higher call minutes during the night (Night
Callers). The cluster representative in the top-right corner indicates the group of
customers who call a lot during the day as well as at night (the Always Caller).
The visualization of the cluster analysis for k = 10, plotting “Day Mins” vs. “Night Mins.” The scatter plot
shows the cluster center of each cluster.
Et voilà, after only a handful of nodes, we have already completed a basic customer
segmentation task. Without any particular domain or target knowledge, we were able
to gain some meaningful insights about our customers.
In order to call the workflow from a web browser, we need to connect to KNIME
Business Hub. One of the most useful features of KNIME Business Hub is the
possibility to deploy workflows as browser-based applications. It allows interaction
with the workflow at predetermined points, or the adjustment of certain parameters.
When executing the workflow, users are guided through each step using a sequence
of interactive webpages. Each composite view of the deployed workflow becomes a
webpage. The following section introduces a Data App deployed on the KNIME
Business Hub that makes use of several Widgets nodes.
The extended customer segmentation workflow from the previous example. With the help of some Widget
nodes wrapped inside the components, it enables immediate integration of knowledge and direct interaction
with the segmentation result.
With the help of the Integer Widget and Column Filter Widget nodes, the user can define
the parameters of the clustering: the number of clusters, k, and the attributes
used as input columns. The two nodes are wrapped inside the “Define Cluster
Parameters” component.
On the KNIME Business Hub, the user can execute the customer segmentation workflow as a Data App. In
the first step, the number of clusters as well as the columns used for clustering must be defined. This page
is the composite view of the “Define Cluster Parameters” component.
Once the customer segmentation parameters are defined, the k-Means clustering is
performed. The necessary nodes for clustering are wrapped inside the “Customer
Segmentation” metanode, following the “Define Cluster Parameters” component.
Once the data has been clustered, the results are displayed on the next webpage. The
“Display Cluster Result” component in the workflow (see figure above) contains the
nodes responsible for visualizing the results. The composite view of the component
contains three scatter plots: the cluster centers (top left), the PCA-reduced telco data
colored by cluster (top right), and the non-reduced telco data colored by cluster
(bottom). Also on the KNIME Business Hub, the views of the scatter plot can be easily
changed by tweaking the X and Y axes, as described.
The clustering results for k = 10 using all attributes as input columns. The scatter plots show the attribute
“Day Mins” against “Night Mins”. The PCA clusters plot shows the data when reduced to two dimensions.
Once the data is clustered, the workflow iterates over all clusters, presenting a cluster-
wise visualization and providing the possibility to add comments and annotations for
each customer segment. These cluster-wise visualizations are implemented in the
“Label Cluster” component inside the group loop block. Here the expert analyst can
take advantage of their knowledge and annotate the clusters accordingly.
At the bottom, you can give each customer group a unique name. In addition, you can add a more detailed
description of the customers belonging to that group.
After the group labels and annotations have been assigned, the clusters are updated
accordingly. This last webpage of the Data App refers to the “Displayed Label Clusters”
component of the workflow. This dashboard shows not only the cluster centers and
the colored scatter plot, but also the cluster statistics. Note that each customer group
now has a meaningful name that clearly distinguishes it from the others. A more
detailed description of the customer group from the annotation box was added as a
“Cluster Annotation” column.
Cluster statistics. The table shows for each customer group its
coverage, mean distance, standard deviation, and skewness.
Market Basket Analysis with the Apriori Algorithm
Workflows on the KNIME Community Hub: Market Basket Analysis: Building Association Rules and Market
Basket Analysis: Apply Association Rules
What we need
We need a dataset with examples of past shopping baskets and the shopping items
(products) they contain. If products are identified via product IDs, then a shopping basket is a
series of product IDs that looks like:
<Prod_ID1, Prod_ID2, Prod_ID3, ….>
The dataset used here was artificially generated with the “Shopping Basket
Generation” workflow available for download from the KNIME Community Hub. This
workflow generates a set of Gaussian random basket IDs and fills them with a set of
Gaussian random product IDs. After performing a few adjustments on the a priori
probabilities, our artificial shopping basket dataset is ready to use.²
This dataset consists of two KNIME tables: one containing the transaction data –i.e.,
the sequences of product IDs in imaginary baskets– and one containing product info
– i.e., product ID, name, and price. The sequence of product IDs (the basket) is output
as a string value, concatenating many substrings (the product IDs).
² I. Adä, M. Berthold, “The New Iris Data: Modular Data Generators”, SIGKDD, 2010.
An association rule states that customers who buy a certain set of products also tend to
buy another product. The first part of the rule is known as the “antecedent”, and the
second part the “consequent”. A few measures, such as support, confidence, and lift,
define how reliable each rule is. The most famous algorithm for generating these rules is
the Apriori algorithm.³
The central part in building a recommendation engine is the Association Rule Learner
node, which implements the Apriori algorithm in either the traditional³ or the Borgelt⁴
version. The Borgelt implementation offers a few performance improvements over the
traditional algorithm; the produced association rule set, however, remains the same.
Both Association Rule Learner nodes work on a collection of product IDs.
A workflow to train association rules using Borgelt's variation of the Apriori algorithm.
³ R. Agrawal and R. Srikant, Proc. 20th Int. Conf. on Very Large Databases (VLDB 1994, Santiago de Chile),
487-499, Morgan Kaufmann, San Mateo, CA, USA, 1994.
⁴ “Find Frequent Item Sets and Association Rules with the Apriori Algorithm”, C. Borgelt’s home page:
http://www.borgelt.net/doc/apriori/apriori.html.
A collection is a particular data cell type that assembles several data cells together. There are
many ways of producing a collection data cell from other data cells. The Cell Splitter
node, for example, generates collection-type columns when the configuration setting
“as set (remove duplicates)” is enabled. We use a Cell Splitter node to split the basket
strings into product ID substrings, setting the space as the delimiter character. The
product ID substrings are then assembled together and output in a collection column
to feed the Association Rule Learner node.
After running on a dataset with past shopping basket examples, the Association Rule
Learner node produces a number of rules. Each rule includes a collection of product
IDs as antecedent, one product ID as consequent, and a few quality measures, such as
support, confidence, and lift.
In Borgelt’s implementation of the Apriori algorithm, three support measures are
available for each rule. If A is the antecedent and C is the consequent, then:
Body Set Support = support(A) = number of items/transactions containing A
Head Set Support = support(C) = number of items/transactions containing C
Item Set Support = support(A ∪ C) = number of items/transactions containing both A and C
Item Set Support tells us how often antecedent A and consequent C are found together
in an item set in the whole dataset. However, the same antecedent can produce a
number of different consequents. So, another measure of the rule quality is how often
antecedent A produces consequent C among all possible consequents. This is the Rule
Confidence.
Rule Confidence = support(A ∪ C) / support(A)
One more quality measure ‒the Rule Lift‒ tells us how precise this rule is compared to
the a priori probability of consequent C alone.
Rule Lift = Rule Confidence(A → C) / Rule Confidence(∅ → C)
∅ is the whole dataset and support(∅) is the number of items/transactions in the
dataset.
You can make your association rule engine larger or smaller, restrictive or tolerant, by
changing a few threshold values in the Association Rule Learner configuration settings,
like the “minimum set size”, the “minimum rule confidence”, and the “minimum
support” referring to the minimum Item Set Support value.
We also associate a potential revenue with each rule:
Revenue = price of consequent product × rule item set support
Based on this set of association rules, we can say that if a customer buys wine, pasta,
and garlic (the antecedent), they usually ‒as often as the support says‒ also buy pasta-
sauce (the consequent); we can trust this statement with the confidence percentage that
comes with the rule.
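To make these measures concrete, here is a small self-contained Python sketch that computes item set support, rule confidence, and rule lift for one candidate rule on a toy set of baskets. The products and numbers are invented for illustration and are not taken from the generated dataset used in the workflows.

```python
# Toy transactions: each basket is a set of product names.
baskets = [
    {"wine", "pasta", "garlic", "pasta-sauce"},
    {"wine", "pasta", "garlic"},
    {"pasta", "pasta-sauce"},
    {"wine", "pasta", "garlic", "pasta-sauce"},
    {"bread", "butter"},
]

antecedent = {"wine", "pasta", "garlic"}   # A
consequent = {"pasta-sauce"}               # C

n = len(baskets)
support_A = sum(antecedent <= b for b in baskets) / n                   # body set support
support_C = sum(consequent <= b for b in baskets) / n                   # head set support
support_AC = sum((antecedent | consequent) <= b for b in baskets) / n   # item set support

confidence = support_AC / support_A   # how often A also leads to C
# Rule Confidence(∅ → C) equals support(C) when supports are relative,
# so the lift is the confidence divided by the a priori probability of C.
lift = confidence / support_C

print(f"support(A ∪ C) = {support_AC:.2f}, confidence = {confidence:.2f}, lift = {lift:.2f}")
```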
After some preprocessing to add the product names and prices to the plain product
IDs, the association rule engine with its antecedents and consequents is saved in the
form of a KNIME table file.
Deployment
Let’s move away now from the dataset with past examples of shopping baskets and
into real life. Customer X enters the shop and buys pasta and wine. Are there any other
products we can recommend?
The second workflow prepared for this use case takes a real-life customer basket and
looks for the closest antecedent among all antecedents in the association rule set. The
central node of this workflow is the Subset Matcher node.
Extracting the top recommended items for the current basket. Here the Subset Matcher node explores all rule
antecedents to find the appropriate match with the current basket items.
The Subset Matcher node takes two collection columns as input: the antecedents in
the rule set (top input port) and the content of the current shopping basket (lower input
port). It then matches the current basket item set with all possible subsets in the rule
antecedent item sets. The output table contains pairs of matching cells: the current
shopping basket and the totally or partially matching antecedents from the rule set.
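Conceptually, the matching step works like the following Python sketch: keep the rules whose antecedent is contained in the current basket and rank them by a quality measure. The rule list, basket, and confidence values are invented for illustration and only approximate what the Subset Matcher node does.

```python
# Each rule: (antecedent set, consequent, confidence).
rules = [
    ({"wine", "pasta"}, "pasta-sauce", 0.78),
    ({"wine", "pasta", "garlic"}, "parmesan", 0.64),
    ({"bread"}, "butter", 0.55),
]

current_basket = {"pasta", "wine"}

# Keep rules whose antecedent is a subset of the current basket,
# then sort by confidence to get the top recommendations.
matches = [r for r in rules if r[0] <= current_basket]
matches.sort(key=lambda r: r[2], reverse=True)

for antecedent, consequent, conf in matches:
    print(f"{sorted(antecedent)} -> recommend '{consequent}' (confidence {conf:.2f})")
```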
Movie Recommendations with Spark Collaborative Filtering
Workflow on the KNIME Community Hub: Movie Recommendation Engine with Spark Collaborative Filtering
Collaborative Filtering (CF) based on the Alternating Least Squares (ALS) technique⁵
is another algorithm used to generate recommendations. CF makes automatic
predictions (filtering) about the interests of a user by collecting preferences from many
other users (collaborating). The underlying assumption of the collaborative filtering
approach is that if a person A has the same opinion as person B on an issue, A is more
likely to have B's opinion on a different issue than that of a randomly chosen person.
This algorithm gained a lot of traction in the data science community after it was used
by the team that won the Netflix prize.
The algorithm has also been implemented in Spark MLlib⁶ with the aim of fast
execution even on very large datasets. KNIME Analytics Platform with its Big Data
Extensions offers the CF algorithm in the Spark Collaborative Filtering Learner (MLlib)
node. We will use it, in this section, to recommend movies to a new user. This use case
is a KNIME implementation of the Collaborative Filtering solution originally provided
by Infofarm.
What we need
For this use case, we used the large MovieLens dataset. This dataset contains many
different files all related to movies and movie ratings. We’ll use the files “ratings.csv”
and “movies.csv”.
The dataset in the file “ratings.csv” contains 20M movie ratings by circa 130K users,
organized as: “userID”, “movieID”, “rating”, “timestamp”. Each row contains the rating
given by one user to one movie.
⁵ Y. Koren, R. Bell, C. Volinsky, “Matrix Factorization Techniques for Recommender Systems“, in Computer
Journal, Volume 42, Issue 8, August 2009, Pages 30-37: https://dl.acm.org/citation.cfm?id=1608614.
⁶ “Collaborative Filtering. RDD-based API”, The Spark MLlib implementation:
http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html.
The idea of the ALS algorithm is to find other users in the training set with preferences
similar to those of the currently selected user. Recommendations for the current user are then
created based on the preferences of such similar profiles. This means that we need a
profile for the current user to match against the profiles of the other existing users in the
training set.
Let’s suppose that you are the current user, with assigned userID=999999. It is likely
that the MovieLens dataset has no data about your movie preferences. Thus, in order
to issue some movie recommendations, we would first need to build your movie
preference profile. So, we will start the workflow by asking you to rate 20 movies,
randomly extracted from the movie list in the “movies.csv” file. Ratings range between
0 and 5 (0 – horrible movie; 5 – fantastic movie). You can use the rating -1 if you have not
seen the proposed movie. Movies with a rating of -1 will be removed from the list;
movies with other ratings will become training set material.
The web page below is the result of a Text Output Widget node and a Table Editor
node displayed via a component interactive view. Your rating can be manually inserted
in the last column to the right.
A Spark Context
The CF-ALS algorithm has been implemented in KNIME Analytics Platform via the
Spark Collaborative Filtering Learner (MLlib) node. This node belongs to the KNIME
Extension for Apache Spark, which needs to be installed on your KNIME Analytics
Platform to run this use case.
The Spark Collaborative Filtering Learner node executes within a Spark context, which
means that you also need a big data platform and a Spark context to run this use case.
This is usually a showstopper due to the difficulty and potential cost of installing a big
data platform, especially if the project is just a proof of concept. Indeed, installing a
big data platform is a complex operation and might not be worth it for just a prototype
workflow. Installing it on the cloud might also carry additional unforeseeable costs.
Note. Version 3.6 or higher (currently 4.7) of KNIME Analytics Platform and KNIME
Extension for Local Big Data Environments include a precious node: the Create
Local Big Data Environment node. This node creates a simple but complete local
big data environment with Apache Spark, Apache Hive and Apache HDFS, and does
not require any further software installation. While it may not provide the desired
scalability and performance, it is useful for prototyping and offline development.
The Create Local Big Data Environment node has no input port, since
it needs no input data, and produces three output objects: an HDFS connection, a Hive
database connection, and a Spark context.
The configuration window of the Create Local Big Data Environment node includes a
frame with options related to the “on dispose” action.
• “Destroy Spark Context” destroys the Spark context and all allocated resources;
this is the most destructive, but cleanest, option.
• “Delete Spark DataFrames” deletes the intermediate results of the Spark nodes
in the workflow and keeps the Spark context open to be reused.
• “Do nothing” keeps both the Spark DataFrames and context alive. If you save the
already executed workflow and reopen it later, you can still access the
intermediate results of the Spark nodes within. This is the most conservative
option, but also keeps space and memory busy on the execution machine.
The second option is the default, as a compromise between resource consumption
and reuse.
Note. It is necessary that the movie preferences of the current user are part of the
training set. This is why we ask the current user to rate 20 random movies: to
get a sample of his/her preferences.
Note. The matrix factorization model output by the node contains references to the
Spark DataFrames/RDDs used in execution and thus is not self-contained. The
referenced Spark DataFrames/RDDs are deleted, like for any other Spark nodes,
when the node is reset or the workflow is closed. Therefore, the model cannot be
reused in another context in another workflow.
Like the KNIME native Numeric Scorer node, the Spark Numeric Scorer node
calculates a number of numeric error metrics between the original values –in this case
the ratings– and the predicted values. Ratings range between 0 and 5.
Deployment
We previously asked the current user to rate 20 randomly chosen movies. These
ratings were added to the training set. Using a generic Spark Predictor node, we now
estimate the ratings of our current user (ID=999999) on all remaining unrated movies.
Movies are then sorted by predicted ratings and the top 10 are recommended to the
current user on a web page on the KNIME Business Hub.
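A rough PySpark sketch of the same pipeline is shown below: the new user's ratings are appended to the MovieLens ratings, an ALS model is trained, and the top 10 recommendations are produced for that user. The file path, column names, and example ratings are assumptions for illustration; the KNIME workflow does this with the Spark Collaborative Filtering Learner (MLlib) and Spark Predictor nodes instead.

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.master("local[*]").appName("movie-recs").getOrCreate()

# MovieLens ratings: userId, movieId, rating, timestamp.
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)

# Ratings collected from the new user (userId = 999999) for a few movies;
# movies rated -1 ("not seen") are assumed to have been filtered out already.
new_user = spark.createDataFrame(
    [(999999, 1, 4.0), (999999, 50, 5.0), (999999, 296, 3.0)],
    ["userId", "movieId", "rating"],
)
training = ratings.select("userId", "movieId", "rating").union(new_user)

# Train the ALS collaborative filtering model.
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=10, maxIter=10, regParam=0.1, coldStartStrategy="drop")
model = als.fit(training)

# Top 10 movie recommendations for the new user.
top10 = model.recommendForUserSubset(new_user.select("userId").distinct(), 10)
top10.show(truncate=False)
```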
Since I volunteered to be the current user for this experiment, based on my ratings of
20 randomly selected movies, I got back a list of 10 recommended movies shown
below. I haven’t seen most of them. However, some of them I do know and appreciate.
I will now add “watch recommended movies” to my list of things to do for the next
month.
Note. Please note that this is one of the rare cases where training and deployment
are included in the same workflow.
The Collaborative Filtering model produced by the Spark Collaborative Filtering Learner
node is not self-contained but depends on the Spark DataFrames/RDDs used during
training execution and, therefore, cannot be reused later in a separate deployment
workflow.
The Collaborative Filtering algorithm is not computationally heavy and does not take
long to execute. So, including the training phase in the deployment workflow does not
noticeably hinder recommendation performance. However, if recommendation
performance is indeed a problem, the workflow could be partially executed on KNIME
Analytics Platform, or as a Data App on the KNIME Business Hub, until the collaborative
filtering model is trained; the rest of the workflow can then be executed on demand
for each existing user in the training set.
This workflow asks the new user to rate 20 randomly selected movies via a web browser, trains a Collaborative
Filtering model on this data, evaluates the model performance via some numeric error metrics, and finally
proposes a list of the top 10 recommended movies based on the provided ratings.
Consumer Mindset Metrics
In this chapter, we gauge how our customers perceive and experience our products
and services. We will start off by introducing techniques to improve SEO and make our
business relevant and easy to reach on the web. We will then analyze customers'
reviews and sentiment to learn about their opinions, perceptions, and mindsets
regarding our products and brand. More specifically, we will look at use cases in SEO,
customer experience evaluation, brand reputation measurement and sentiment
analysis.
Improve SEO with Semantic Keyword Search
Workflow on the KNIME Community Hub: Search Engine Optimization (SEO) with Verified Components
Semantic search techniques with unstructured data are becoming more and more
common in most search engines. In the context of search engine optimization (SEO),
a semantic keyword search aims to find the most relevant keywords for a given search
query. Keywords can include frequently occurring single words, as well as words
considered in context, like co-occurring words or synonyms of the current keywords.
Modern search engines, like Google and Bing, can perform semantic searches,
incorporating some understanding of natural language and of how keywords relate to
each other for better search results.
Semantic search.
As Paul Shapiro explains in “Semantic Keyword Research with KNIME and Social
Media Data Mining”, a semantic search can be carried out in two ways: with structured
or unstructured data.
A structured semantic search takes advantage of the schema markup —the semantic
vocabulary provided by the user and incorporated in the form of metadata in the HTML
of a website. Such metadata will be displayed in the Search Engine Results Page
(SERP). In the figure below, you can see an example snippet of a SERP as returned by
Google. Notice how it shows the operating system support, the latest stable release,
the link to a GitHub repository, and other metadata from the found webpage. Creating
metadata in HTML syntax can be done with schema.org, a collaborative work by
Google, Bing, Yahoo, and Yandex, used to create appropriate semantic markups for
websites. Relevant metadata helps to improve the webpage ranking in search engines.
Meanwhile, unstructured semantic searches use machine learning and natural
language processing to analyze text in webpages for better SERP ranking. The more
relevant the text to the search, the higher the web position in the SERP list.
While structured searches on metadata are present in all search engines, techniques
with unstructured data are now also becoming more common. There are two places
where searches happen: search engines (of course) and social media. We want to
explore the results of search queries on SERP and social media, and learn from the text
in the top-performing pages.
From tweets
Here we use the Twitter URLs Extractor component. This component extracts tweets
from Twitter around a given search query and then extracts the URLs in them. We
could have also chosen to consume Facebook, LinkedIn, or any other social media site.
In case you are wondering why we opt for the URLs and not the main text bodies of the
tweets, we believe that relying on text from linked URLs provides more information
than what is contained in a simple short tweet.
To run the component, you need a Twitter Developer account and the corresponding
API credentials.
Twitter URLs Extractor component configuration window.
With the URLs from both SERP and tweets, the application must proceed with scraping
their texts to get the corpus for keyword extraction. For this step, we have another
verified component, the “Web Text Scraper”. This component connects to an external
web library —BoilerPipe— to extract the text from the input URL. It then filters out
unneeded HTML tags and boilerplate text (headers, menus, navigation blocks, span
texts: all the unneeded content outside the core text).
KNIME verified component for scraping meaningful text from web pages.
The details about algorithms used in Boilerpipe and its performance optimization are
mentioned in the research paper “Boilerplate Detection using Shallow Text Features”.
This component takes a column of URLs as input and produces a column with the texts
in String format as output. Only valid HTML content is supported —all other associated
scripts are skipped. With long lists of URLs to be crawled, this component can become
slow, because identifying the boilerplates for each URL is time-consuming.
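As a rough stand-in for what such a scraper does, here is a simple Python sketch that downloads a page and strips scripts, styles, and common boilerplate containers before keeping the visible text. It uses requests and BeautifulSoup rather than the BoilerPipe library wrapped by the verified component, so treat it as an approximation of the idea, not a reimplementation.

```python
import requests
from bs4 import BeautifulSoup

def scrape_text(url: str) -> str:
    """Download a page and return its visible text, minus obvious boilerplate."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Drop scripts, styles, and typical boilerplate containers
    # (headers, footers, navigation blocks).
    for tag in soup(["script", "style", "header", "footer", "nav", "aside"]):
        tag.decompose()

    # Collapse the remaining text into whitespace-normalized, non-empty lines.
    lines = (line.strip() for line in soup.get_text("\n").splitlines())
    return "\n".join(line for line in lines if line)

print(scrape_text("https://www.knime.com")[:500])
```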
Workflow identifying URLs from SERP and tweets, scraping texts from URLs, and extracting and visualizing
keywords from scraped texts.
The top 10 topic-related keywords, sorted by their LDA weight, are reported in the bar
chart below. The words “data” and “science” have the highest weight, which makes
sense but is also obvious. The next highly meaningful keywords are “machine” and
“learning,” followed by “course.” The reference to “machine learning” is clear, while
“course” is probably due to the recent blossoming of educational material on this
subject. Adding a reference to machine learning and a course or two to your data
science page might improve its ranking.
Co-occurring keywords
However, in the network graph below, we observe a different set of keywords. “Data”
and “science” still represent the core of the keyword set, as they are located at the
center of the graph and have the highest degree. “Machine” and “learning” are also
present again, and this time we learn that they are connected to each other. In the same
way, “program” is connected to “courses,” as those words are often found together in
highly ranked pages about “data science.”
We can learn as much from this graph about the words that don’t occur together. For
example, “program” only co-occurs with “science” and “courses,” while “skills” and
“course” only co-occur with “data” and not with “science,” “machine,” “learning,” or
“business.” It thus becomes clear which pairs of words you should include in your data
science page.
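A simple way to compute such keyword co-occurrences in Python is to build a binary document-term matrix and multiply it by its transpose. The sketch below, using scikit-learn and a few toy documents, only illustrates the idea behind the network graph; it does not reproduce the workflow's exact keyword extraction.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "data science program with machine learning courses",
    "learn data science skills online",
    "machine learning courses for business",
]

# Binary document-term matrix: 1 if a word appears in a document.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

# Co-occurrence counts: entry (i, j) = number of documents containing both terms.
cooc = (X.T @ X).toarray()

for i in range(len(terms)):
    for j in range(i + 1, len(terms)):
        if cooc[i, j] > 0:
            print(terms[i], "<->", terms[j], cooc[i, j])
```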
Lastly, if we look at single keywords in the Tag Cloud, we can see that the terms
“master,” “insight,” “book,” and “specialization” appear most prominently. This makes
sense, since the web contains a large number of offers for specialized courses with
universities, online platforms, and books. In the word cloud, you can also see the
names of institutes, publishers, and universities offering such learning options. Only
at a second glance do you see the words “data,” “science,” “machine,” “learning,”
“philosophy,” “model,” and so on.
Single keywords seem to better describe one aspect of the “data science” community,
i.e., teaching, while co-occurring keywords and topics best capture the content of the
“data science” discipline.
Notice that the multifaceted approach to visualizing keywords that we have adopted
in this project allows for an exploration of the results from different perspectives.
Evaluate CX with Stars and Reviews
Authors: Francisco Villarroel Ordenes, LUISS Guido Carli University & Rosaria Silipo, KNIME
A valuable customer experience (CX) seems to be at the heart of all the businesses we
happen to visit. After using a website, booking system, restaurant, or hotel, we are often
(always?) asked to rate the experience. We often do, and we also often wonder what,
if anything, the business is going to do with our rating. The rating embeds an overall
summary, but stars alone cannot say much about an (un)pleasant experience, or what
needs to improve in the customer journey.
⁷ De Keyser, A., Verleye, K., Lemon, K. N., Keiningham, T. L., & Klaus, P. (2020). Moving the Customer
Experience Field Forward: Introducing the Touchpoints, Context, Qualities (TCQ) Nomenclature. Journal of
Service Research, 23(4), 433–455. https://doi.org/10.1177/1094670520928390.
The TripAdvisor dataset containing the reviews and associated star rankings related to a specific hotel stay.
This is not enough, though. We need to identify which touchpoints, context, and
qualities might have been critical in the customer experience. For example, this
customer gave only 2 stars for their visit to “Hotel_1,” with this explanation:
An example of a review associated with a 2-star rating from the dataset (review ID 201855460).
Lots of text, huh? The review points out a crucial element of this experience: This is a
frequent customer, with 100+ visits, and it seems that the key reasons for the
complaint relate to the front desk being non-responsive and room service failing with
the daily cleaning. In this case, the low star rating seems to truly reflect a negative
experience that relates to a frequent customer with several failures of two touchpoints
(reception and room service). Firms and researchers can perform this type of analysis
at an aggregated level using all reviews posted by customers.
In this work, we are going to show how to extract useful information from the pairs
“(review text, star ranking)” that customers leave when describing their experience at
a hotel (but the same could be applied to any kind of business with the appropriate
categories). We are going to do that by considering reviews for two hotels from the
TripAdvisor dataset to show customer journey differences.
The TripAdvisor dataset contains customer experience evaluations for 2,580 visits to
Hotel_1 and 2,437 visits to Hotel_2. Each evaluation contains the hotel name (Hotel_1
vs Hotel_2), the number of reviews provided in total by an author, the number of
“helpful” votes the author got, the number of “helpful” votes each specific review got,
the date each review was left, and the associated number of stars.
The workflow we implemented to discover the relation between the steps in a customer journey and a poor
star ranking.
Starting from the top left square in the solution workflow, the first steps implement the
text preprocessing operations for the reviews.
The first steps are the classic ones in text preprocessing: transform the review texts
into Document objects; clean out punctuation, standard stop words, prepositions and other
short words, and numbers; reduce all words to lowercase and to their stem; and remove
infrequent words.
Then, instead of using single words, we focus on bigrams. In the English language,
bigrams and trigrams store more information than single words. In the “N-grams”
metanode, we collect all bigrams in the review dataset, and we keep only the 20 most
frequent ones. “Front desk” is the most frequent bigram found in the reviews, which
already makes us suspect that the “front desk” is present in many reviews, for good or
bad. Finally, all bigrams are identified and tagged as such in the review Document
objects via a recursive loop.
Note that this listing of the 20 most frequent bigrams is arbitrary. It could have been
larger; it could have been smaller. We choose 20 as a compromise between the light
weight of the application and a sufficient number of bigrams for the next data analysis.
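For illustration, extracting the most frequent bigrams from a set of reviews could look like the Python sketch below with scikit-learn. The toy reviews are invented; the KNIME workflow does this with its text processing nodes inside the “N-grams” metanode.

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "the front desk was not responsive and room service never came",
    "great location but the front desk staff was rude",
    "room service forgot the daily cleaning twice",
]

# Count bigrams (pairs of adjacent words) across all reviews,
# after removing standard English stop words.
vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words="english")
counts = vectorizer.fit_transform(reviews).sum(axis=0).A1

bigrams = sorted(zip(vectorizer.get_feature_names_out(), counts),
                 key=lambda x: x[1], reverse=True)

# Keep only the 20 most frequent bigrams.
for bigram, count in bigrams[:20]:
    print(bigram, count)
```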
The last step is implemented in the “Filter Reviews with less than 10 words” metanode,
which counts the words in each review and removes the shorter reviews, those with
fewer than 10 words.
We are now left with 432 reviews (246 for Hotel_1, 186 for Hotel_2) and each review
has been reduced to its minimal terms. Remember the review with the ID 201855460?
This is what remains after all preprocessing steps:
The remaining text in review 201855460 after the text processing phase.
The processed review documents are now ready for the topic extraction algorithm.
Latent Dirichlet Allocation (LDA) assumes that each document is a mixture of latent
topics, and each topic can be described by its most frequent words. In that context,
LDA is often used for topic modeling.
The KNIME node that implements LDA is the Topic Extractor (Parallel LDA). The node
takes a number of Documents at its input port and produces the same Document table
with topic probabilities and the assigned topic at its top output port, the words related
to each topic with the corresponding weight at the middle output port, and the iteration
statistics at the lowest port. The configuration window allows you to set a number of
parameters; the most important ones for the algorithm itself are the number
of topics, the number of keywords to describe each topic, and the parameters alpha
and beta. Parameter alpha sets the prior weight of each topic in a document.
Parameter beta sets the prior weight of each keyword in a topic. A small alpha (e.g.,
0.1) produces a sparse topic distribution —that is, few prominent topics for each
document. A small beta (e.g., 0.001) produces a sparse keyword distribution —few
prominent keywords to describe each topic.
As in every machine learning algorithm, there is no a priori recipe to select the best
hyperparameters. Of course, the number of topics k has to be manageable if we want
to represent them in a bar chart. It must be lower than 100, but between 1 and 100,
there are still a lot of options. Alpha can be taken empirically as 50/k, where k is the
number of topics.
The number of keywords and the value of parameter beta are less critical, since we will
not use that information in the final visualization. Hence, we use 10 keywords per topic
and beta = 0.01, as we found this to work in previous experiments.
The other configuration settings define the length and speed of the algorithm
execution. We set 1,000 iterations on 16 parallel threads.
We are left with the choice of the best k. We start off by running the LDA algorithm for a number of different values of k, calculating the perplexity as 2^(-log likelihood) from the last iteration of the LDA algorithm for each k, and visualizing the perplexity values against the number of topics k with a line plot. Initially, we run this loop for k = 2, 4, 6, 8, 10, 15, 20, 25, 30, 40, 50, 60, 80. The perplexity plot shows that the only useful range for k is up to 20. Thus, we focus on that range and run the loop again for k = 13, 14, 15, 16, 17. The perplexity plot in the figure below shows a minimum at k = 15, and this is the value we use for the number of topics.
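The KNIME loop does this with the Topic Extractor node. Purely as an illustration outside KNIME, a rough Python sketch with the gensim library could run the same kind of perplexity sweep; the toy documents below stand in for the real reviews, and the parameter values simply mirror the ones discussed above.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy token lists standing in for the preprocessed, stemmed reviews
docs = [["front", "desk", "staff", "rude"],
        ["room", "clean", "bed", "comfort"],
        ["front", "desk", "slow", "checkin"],
        ["breakfast", "good", "coffee", "cold"]]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

perplexities = {}
for k in [2, 4, 6, 8, 10, 15, 20]:
    lda = LdaModel(corpus, num_topics=k, id2word=dictionary,
                   alpha=50.0 / k, eta=0.01, iterations=1000, random_state=42)
    bound = lda.log_perplexity(corpus)   # per-word log-likelihood bound
    perplexities[k] = 2 ** (-bound)      # perplexity = 2^(-bound)

best_k = min(perplexities, key=perplexities.get)
print(best_k, perplexities)
```

The k with the lowest perplexity is then a reasonable candidate for the number of topics, just as in the line plot produced by the workflow.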
Perplexity plot from the last iteration of the LDA algorithm for different values of
number of topics k.
Let’s move now to square #3 in the workflow, the one in the lower left corner. Here we
extract 15 topics from the reviews in the dataset using the Topic Extractor (Parallel
LDA) node. After concatenating the keywords for each topic via a GroupBy node, we
report the topic descriptions in the table below, to which we attach our own
interpretation/identification of the touchpoint, context, or quality of the customer
journey.
Finally, the “Topic Analysis” component displays the average number of stars
calculated on all reviews assigned to the same topic for Hotel_1 on the left and Hotel_2
on the right. Just a few comments:
• The overall experience at Hotel_2 is generally rated lower than the overall
experience at Hotel_1. This can also be seen by comparing the other single bars
in the chart.
And so on. Just by pairing the average star number and the review topic, this bar chart
can visually indicate to us where the failures might be found in the customer journeys
for both hotels.
Average number of stars for each assigned topic for Hotel_1 and Hotel_2.
The other component, named “Topic View,” also shows a bar chart, displaying the
average score for the assigned topics for each hotel.
In this final part of the workflow, we will identify which customer experience topics are
predictors of the star ratings (both good and bad). We’ll train a linear regression model
to predict the star rating based on each topic score for that review and the hotel name.
To do so, we select a “baseline” topic that is not included as a predictor in the regression; including all topic scores would make the predictors perfectly collinear. We then interpret the coefficients of the remaining topics (predictors) relative to the baseline (topic 0 = Check In).
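As a rough illustration of this modeling choice outside KNIME, a statsmodels sketch could look like the following. The column names (stars, hotel, topic_0 … topic_14) and the randomly generated toy table are assumptions used only for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
# Toy data standing in for the real review table (column names are assumptions)
df = pd.DataFrame(rng.random((n, 15)), columns=[f"topic_{i}" for i in range(15)])
df["hotel"] = rng.choice(["Hotel_1", "Hotel_2"], size=n)
df["stars"] = rng.integers(1, 6, size=n)

predictors = [f"topic_{i}" for i in range(1, 15)]   # topic_0 (Check In) is the baseline, so it is dropped
X = pd.concat([df[predictors], pd.get_dummies(df["hotel"], drop_first=True)], axis=1)
X = sm.add_constant(X).astype(float)

model = sm.OLS(df["stars"].astype(float), X).fit()
print(model.summary())   # coefficients are interpreted relative to the baseline topic
```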
Based on the coefficient values, we first notice that the variable Hotel_2 is non-
significant (p>0.1), which indicates that the hotels do not differ significantly in their
star rating evaluations. The journey touchpoints “common areas” (Coeff = 4.04, p < 0.05) and “car park entrance” (Coeff = 3.89, p < 0.05) show the strongest positive association with the star rating. On the other hand, the topics “location around museums” (Coeff = -3.95, p < 0.05) and “free features” (Coeff = -2.63, p = 0.09) are associated with lower star ratings.
• And visually and numerically inspect the relationship between the customer journey topics and the star rating.
Brand Reputation Measurement
Authors: Francisco Villarroel Ordenes & Konstantin Pikal, LUISS Guido Carli University
“Your brand is what others say about you when you are not in the room”. This famous
quote, attributed to Jeff Bezos, Amazon’s founder, is one of our favorites. It is not
without criticism, but it is an apt way to approach the measurement of branding.
Building a brand is on every marketer’s task description. But how do you do it? And
foremost: How do you measure what you have built (and hopefully are still building)?
How do you analyze what your stakeholders (e.g., customers, media, competitors)
have to say about you? We could turn our heads and point them towards social media,
one of the channels where people “are talking about us”.
8 Rust, R. T., Rand, W., Huang, M.-H., Stephen, A. T., Brooks, G., & Chabuk, T. (2021). Real-Time Brand Reputation Tracking Using Social Media. Journal of Marketing, 85(4), 21–43. https://doi.org/10.1177/0022242921995173.
First, we need some data to run the brand reputation tracker on. Therefore, we suggest you get developer access to the Twitter API. It sounds complicated, but it really isn’t, and it gives you access to the tweets that you will text-mine later on.
Before we start analyzing the data, we have to do some cleaning; or rather, the workflow does the cleaning for us. It excludes retweets and keeps only tweets in English (this is important because our text-mining dictionary is only in English).
Now that we have cleaned our data, the most important part begins. We are going to
work with tweet texts and extract insights using the KNIME Textprocessing extension.
Preprocessing
We start off with pre-processing. For this, we first must convert Strings (e.g., text on
tweets) back into Document data type.
We need the Document data type to be able to perform text-mining operations in
KNIME. In the first step, we stem all our words to make them easier to interpret for the
machine. For example, “exciting” becomes “excit” and “inspiring” becomes “inspir”.
After KNIME has done this for us, our documents are fed into the Dictionary Tagger
node.
The Dictionary Tagger consists of two inputs: a dictionary, including all the relevant
words; and a tagger, where we specify what tag applies to which word. For example,
“trendi” and “hip” are part of the positive “Cool-dictionary”, whereas “ancient” and
“lame” are part of the negative “Cool-dictionary”. The tagger uses the dictionaries to
tag the document as follows: When the tagger finds a word in the document that is
also in the dictionary, e.g., “modern”, it tags that word with the corresponding tag (FTB-
A).
You might have noticed that in the paragraph above we used a tag type called “FTB”
and its values (e.g., A). This is because KNIME does not have a custom tagger for brand
reputation drivers. Next, we create a “bag of words”. A bag of words is simply a list of
all single words occurring in the dataset.
The KNIME workflow for tagging.
The “cool” dictionary: positive words on the left, negative words on the right.
Analyze Customer Sentiment
Workflows on the KNIME Community Hub: Building Sentiment Predictor - Lexicon Based and Deploying
Sentiment Analysis Predictive Model - Lexicon Based Approach
Before purchasing a product, people often search for reviews online to help them
decide if they want to buy it. These reviews usually contain expressions that carry so-
called emotional valence, such as “great” (positive valence) or “terrible” (negative
valence), leaving readers with a positive or negative impression.
In lexicon-based sentiment analysis, words in texts are labeled as positive or negative
(and sometimes as neutral) with the help of a so-called “valence dictionary”. Take the
phrase: “Good people sometimes have bad days”. A valence dictionary would label the
word “Good” as positive; the word “bad” as negative; and possibly the other words as
neutral.
Once each word in the text is labeled, we can derive an overall sentiment score by
counting the numbers of positive and negative words and combining these values
mathematically. A popular formula to calculate the sentiment score (StSc) is:
StSc = (number of positive words − number of negative words) / (total number of words)
If the sentiment score is negative, the text is classified as negative. It follows that a
positive score means a positive text, and a score of zero means the text is classified
as neutral.
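A minimal Python sketch of this scoring logic, not part of the KNIME workflow and using toy word sets instead of a real valence dictionary, might look like this:

```python
def sentiment_score(tokens, positive, negative):
    """StSc = (#positive - #negative) / total number of words."""
    pos = sum(t in positive for t in tokens)
    neg = sum(t in negative for t in tokens)
    return (pos - neg) / len(tokens) if tokens else 0.0

def classify(stsc):
    return "positive" if stsc > 0 else "negative" if stsc < 0 else "neutral"

tokens = "good people sometimes have bad days".lower().split()
stsc = sentiment_score(tokens, positive={"good"}, negative={"bad"})
print(stsc, classify(stsc))   # 0.0 neutral: one positive and one negative word cancel out
```

Note that in the workflow described below, neutral words are filtered out before counting, so the denominator effectively becomes the sum of positive and negative words rather than the total word count.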
Note that in the lexicon-based approach we don’t use machine learning models: The
overall sentiment of the text is determined on-the-fly, depending only on the dictionary
that is used for labeling word valence.
Tip. Valence dictionaries are language-dependent and are usually available at the
linguistic department of national universities. As an example, consider these
valence dictionaries for:
Chinese (http://compling.hss.ntu.edu.sg/omw/)
Thai (http://compling.hss.ntu.edu.sg/omw/)
The workflow we now want to walk through builds a lexicon-based predictor for sentiment analysis. We used a Kaggle dataset containing over 14K customer reviews of six US airlines.
Preprocessing steps
Tip. Document is the required data type for most text mining tasks in KNIME, and
the best way to inspect it is with the Document Viewer node.
These documents contain all the information we need for our lexicon-based analyzer,
so we can now exclude all the other columns from the processed dataset with
the Column Filter node.
We can now move on and use a valence dictionary to label all the words in the
documents we have created for each tweet. We used an English dictionary from
the MPQA Opinion Corpus which contains two lists: one list of positive words and one
list of negative words.
Tip. Alternative formulas for sentiment scores calculate the frequencies of neutral
words, either using a specific neutral list or by tagging any word that is neither
positive nor negative as neutral.
Note that since we’re not tagging neutral words here, all words that are not marked as
either positive or negative are removed in this step.
Now, the total number of words per tweet, which we need to calculate the sentiment
scores (see formula above), is equivalent to the sum of positive and negative words.
What we now want to do is generate sentiment scores for each tweet. Using our filtered
lists of tagged words, we can determine how many positive and negative words are
present in each tweet.
We start this process by creating bags of words for each tweet with the Bag Of Words
Creator node. This node creates a long table that contains all the words from our
preprocessed documents, placing each one into a single row.
Next, we count the frequency of each tagged word in each tweet with the TF node. This
node can be configured to use integers or weighted values, relative to the total number
of words in each document. Since tweets are very short, using relative frequencies
(weighted values) is not likely to offer any additional normalization advantage for the
frequency calculation. For this reason, we use integers to represent the words’
absolute frequencies.
We then extract the sentiment of the words in each tweet (positive or negative) with the Tags to String node, and finally calculate the overall numbers of positive and negative words per document by summing their frequencies with the Pivoting node. For consistency, if a tweet does not have any negative or positive words at all, we set the corresponding number to zero with the Missing Value node. For better readability, this process is encapsulated in a metanode in the workflow.
Tagging words in tweets as positive or negative based on an MPQA dictionary. These nodes are wrapped in the Number of Positive and Negative Words per Tweet metanode.
We’re now ready to start the fun part and calculate the sentiment score (StSc) for each tweet. We get the sentiment score by calculating the difference between the numbers of positive and negative words, divided by their sum (see the StSc formula above), with the Math Formula node. Based on this value, the Rule Engine node decides whether the tweet has positive or negative sentiment.
Determining the sentiment of tweets by calculating sentiment scores (StSc).
Since the tweets were annotated by actual contributors as positive, negative, or neutral,
we have gold data against which we can compare the lexicon-based predictions.
Remember that StSc > 0 corresponds to a positive sentiment; StSc = 0, to a neutral
sentiment; and StSc < 0, to a negative sentiment. These three categories can be seen
as classes, and we can easily frame our prediction task as a classification problem.
To perform this comparison, we start by setting the column containing the
contributors’ annotations as the target column for classification, using the Category
to Class node. Next, we use the Scorer node to compare the values in this column
against the lexicon-based predictions.
It turns out that the accuracy of this approach is rather low: the sentiment of only 43%
of the tweets was classified correctly. This approach performed especially badly for
neutral tweets, probably because both the formula for sentiment scores and the
dictionary we used do not handle neutral words directly. Most tweets, however, were
annotated as negative or positive, and the lexicon-based predictor performed slightly
better for the former category (F1-scores of 53% and 42%, respectively).
The process of tagging words in tweets, deriving sentiment scores and finally
predicting their sentiments is no different from what we described for the first
workflow. However, since we do not have labels for the tweets here, we can only
assess the performance of the lexicon-based predictor subjectively, relying on our own
judgment.
The words “pretty” and “better” were tagged by our predictor as positive; no words were
tagged as negative. This led to StSc = 1, meaning the tweet was then classified as
having positive sentiment –something with which most annotators may agree. Let’s
now take a look at a different example, for which our predictor fails:
The word “kind” was tagged as positive, even though it does not correspond to a
positive adjective in this context, and no words were tagged as negative. Consequently,
the tweet was classified as positive even though it in fact corresponds to a complaint.
Dashboard of unlabeled tweets and their predicted sentiments per date. A quick inspection confirms the low
performance of this predictor, assessed quantitatively during training.
These examples illustrate a few limitations of the lexicon-based approach: it does not
take the context around words into consideration, nor is it powerful enough to handle
homonyms, such as “kind”.
Note. The workflow that we’ll illustrate in this section is a slight variation of the
workflows stored in the Machine Learning and Marketing repository: Building a
Sentiment Analysis Predictive Model - Supervised Machine Learning and Deploying
a Sentiment Analysis Predictive Model - Supervised Machine Learning. Nevertheless,
the underlying rationale and the use of the Learner-Predictor construct for
supervised learning is the same.
The workflow starts with the CSV Reader node, which reads a CSV file containing the review texts, their associated sentiment labels, the IMDb URL of the corresponding movie, and the review index in the Large Movie Review Dataset v1.0. The important columns are the text and the sentiment. In the first metanode, "Document Creation", Document cells are created from the string cells using the Strings To Document node. The sentiment labels are stored in the category field of each document so that they remain available for later tasks, and all columns except the Document column are filtered out.
The output of the first metanode "Document Creation" is a data table with only one
column containing the Document cells.
First, punctuation marks are removed by the Punctuation Erasure node, then numbers
and stop words are filtered, and all terms are converted to lowercase. After that, the
stem is extracted from each word using the Snowball Stemmer node. Indeed, the
words “selection”, “selecting” and “to select” refer to the same lexical concept and
carry the same information in a document classification or topic detection context.
Besides English, the Snowball Stemmer node can be applied to texts in various other languages, e.g., German, French, Italian, Spanish, etc. The node uses the Snowball stemming library.
After all this preprocessing, we reach the central point of the analysis which is to
extract the terms to use as components of the document vectors and as input features
to the classification model.
To create the document vectors for the texts, first we create their bag of words using
the Bag Of Words Creator node; then we feed the data tables containing the bag of
words into the Document Vector node. The Document Vector node will consider all
terms contained in the bag of words to create the corresponding document vector.
Notice that a text is made of words, whereas a document (a text enriched with additional information such as category or authors) contains terms, i.e., words enriched with additional information such as grammar, gender, or stem.
Since texts are quite long, the corresponding bags of words can contain many words, the resulting document vectors can have a very high dimensionality (too many components), and the classification algorithm can suffer in terms of speed. However, not all words in a text document are equally important. A common practice is to filter out the least informative words and keep only the most significant ones. A good measure of word importance is the number of occurrences of a word in each single document as well as in the whole dataset.
Based on this consideration, after the bag of words has been created, we filter out all terms that occur in fewer than 20 documents in the dataset. Within a GroupBy node, we group by term and count the unique documents containing each term at least once. The output is a list of terms with the number of documents in which they occur. We keep only the terms that occur in at least 20 documents and then filter the terms in each bag of words accordingly with the Reference Row Filter node. In this way, we reduce the feature space from 22,379 distinct words to 1,499. This feature extraction process is part of the "Preprocessing" metanode and can be seen in the figure below.
We set the minimum number of documents to 20 since we assume that a term has to
occur in at least 1% of all documents (20 of 2000) in order to represent a useful feature
for classification. This is a rule of thumb and of course can be optimized.
Document vectors are now created based on these extracted words (features). Document vectors are numerical representations of documents: each word in the dictionary becomes a component of the vector and, for each document, takes a numerical value. This value can be 0/1 (0 for absence, 1 for presence of the word in the document) or a measure of the word’s importance within the document (e.g., word scores or frequencies). The Document Vector node allows for the creation of either bit vectors (0/1) or numerical vectors; as numerical values, previously calculated word scores or frequencies can be used, e.g., from the TF or IDF nodes. In our case, bit vectors are used.
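For comparison, a scikit-learn sketch of the same idea (bit vectors over terms that appear in at least 20 documents, feeding a decision tree) could look like the snippet below. The variables `texts` and `labels` are assumed inputs standing in for the preprocessed review documents and their sentiment categories; this is an illustration, not the workflow itself.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# texts: preprocessed review strings; labels: "positive"/"negative" (assumed inputs)
vectorizer = CountVectorizer(binary=True, min_df=20)   # bit vectors; keep terms in >= 20 documents
X = vectorizer.fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(accuracy_score(y_test, tree.predict(X_test)))
```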
For classification we can use any of the traditional supervised mining algorithms
available in KNIME Analytics Platform (e.g., Decision Tree, Random Forest, Support
Vector Machines, Neural Networks, and much more).
As in all supervised mining algorithms we need a target variable. In our example, the
target is the sentiment label, stored in the document category. Therefore, the target or
class column is extracted from the documents and appended as a string column, using
the Category To Class node. Based on the category, a color is assigned to each
document by the Color Manager node. Documents with the label "positive" are colored
green, documents with the label "negative" are colored red.
As classification algorithms, we use the Decision Tree Learner and the XGBoost Tree Ensemble Learner nodes, applied to a training set (70%) and a test set (30%) randomly partitioned from the original dataset. The accuracy of the Decision Tree is 91.3%,
whereas the accuracy of the XGBoost Tree Ensemble is 92.0%. Both values are
obtained using the Scorer node. The corresponding ROC curves are presented below.
ROC Curve for the Decision Tree and the XGBoost models.
While performing slightly better, the XGBoost Tree Ensemble doesn’t have the
interpretability that the Decision Tree can provide. The next figure displays the first two
levels of the Decision Tree. The most discriminative terms with respect to the
separation of the two classes are "bad", "wast", and "film". If the term "bad" occurs in a
document, it is likely to have a negative sentiment. If "bad" does not occur but "wast"
(stem of waste), it is again likely to be a negative document, and so on.
Decision Tree view allows us to investigate which features, in our case words or their
stems, contribute the most to separating documents in the two sentiment classes.
What we’ve shown in this section is just a quick tutorial on how to approach sentiment
classification with supervised machine learning algorithms. Of course, much more
than that can be done on much larger datasets and on much more complex sentences.
There are three broader ways we can improve this classification:
Workflows on the KNIME Community Hub: Building a Sentiment Analysis Predictive Model - Deep Learning
using an RNN and Deploying a Sentiment Analysis Predictive Model - Deep Learning using an Recurrent
Neural Network (RNN)
• Use the Conda Environment Propagation node for the workflows in this tutorial,
which ensures the existence of a Conda environment with all the needed
packages. An alternative to using this node is setting up your own Python
integration to use a Conda environment with all packages.
Note. These workflows were tailored for Windows. If you execute them on another
system, you may have to adapt the environment of the Conda Environment
Propagation node.
• For each word that is present in a tweet, set its corresponding entry in the tweet’s
X-sized vector as 1.
LSTM Layer
We connect the Keras Embedding Layer node to the star of our neural network:
the Keras LSTM Layer node. When configuring this layer, we need to set its number
of units. The more units this layer has, the more context is kept – but the workflow
execution also becomes slower. Since we are working with short text in this application
(tweets), 256 units suffice.
Dense Layer
Finally, we connect the Keras LSTM Layer node to a Keras Dense Layer node that
represents the output of our neural network. We set the activation function of this layer
as Softmax to output the sentiment class probabilities (positive, negative, or neutral)
for each tweet. The use of Softmax in the output layer is very appropriate for multiclass
classification tasks, with the number of units being equal to the number of classes.
Since we have 3 classes, we set the number of units in the layer as 3.
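In plain Keras (Python), a rough equivalent of this three-layer architecture might be sketched as follows; the vocabulary size and embedding dimension below are placeholders, since the real values come from the index-encoding step and the Keras Embedding Layer node configuration.

```python
from tensorflow import keras

vocab_size = 20000   # placeholder: vocabulary size from the index-encoding step
embedding_dim = 128  # placeholder: output dimension of the embedding layer

model = keras.Sequential([
    keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim),
    keras.layers.LSTM(256),                       # 256 units keep enough context for short tweets
    keras.layers.Dense(3, activation="softmax"),  # one unit per sentiment class
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```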
The network architecture that we use for this Sentiment Analysis task.
Before building our sentiment predictor, we need to preprocess our data. After reading
our dataset with the CSV Reader node, we associate each word in the tweets’
vocabulary with an index. In this step, which is implemented in the Index Encoding and
Zero Padding metanode, we break the tweets into words (tokens, more specifically)
with the Cell Splitter node.
Here is where the native encoding of your operating system may make a difference in
the number of words you end up with, leading to a larger or smaller vocabulary. After
you execute this step, you may have to update the parameter input dimension in
the Keras Embedding Layer to make sure that it is equal to the number of words
extracted in this metanode.
Since it is important to work with same-length tweets in the training phase of our
predictor, we also add zeros to the end of their encodings in the Index Encoding and
Zero Padding metanode, so that they all end up with the same length. This approach is
known as zero padding.
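A small pure-Python sketch of the idea, using toy tweets rather than the actual metanode logic, shows both steps at once:

```python
tweets = [["the", "flight", "was", "great"], ["luggage", "lost", "again"]]  # tokenized toy tweets

# Build the word -> index dictionary (index 0 is reserved for padding)
vocab = {word: i + 1 for i, word in enumerate(sorted({w for t in tweets for w in t}))}
encoded = [[vocab[w] for w in tweet] for tweet in tweets]

# Zero padding: append zeros so every encoded tweet has the same length
max_len = max(len(seq) for seq in encoded)
padded = [seq + [0] * (max_len - len(seq)) for seq in encoded]
print(vocab)
print(padded)
```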
Already outside of the metanode, the next step is to use the Table Writer node to save
the index encoding for the deployment phase. Note that this encoding can be seen as
a dictionary that maps words into indices.
The zero-padded tweets output by the metanode also come with their annotated
sentiment classes, which also need to be encoded as numerical indices for the training
phase. To do so, we use the Category to Number and the Create Collection
Column nodes. The Category to Number node also generates an encoding model with
a mapping between sentiment class and index. We save this model for later use with
the PMML Writer node.
Finally, we use the Partitioning node to separate 80% of the processed tweets for
training, and 20% of them for evaluation.
With our neural network architecture set up, and with our dataset preprocessed, we
can now move on to training the neural network.
To do so, we use the Keras Network Learner node, which takes the defined network
and the data as input. This node has four tabs, and for now we will focus on the first
three: the Input Data, the Target Data, and the Options tabs.
In the Input Data tab, we include the zero-padded tweets (column “ColumnValues”) as
input. In the Target Data tab, we use column Class as our target and set
the conversion parameter as “From Collection of Number (integer) to One-Hot-Tensor”
because our network needs a sequence of one-hot-vectors for learning.
In the Target Data tab, we also have to choose a loss function. Since this is a multiclass
classification problem, we use the loss function “Categorical Cross Entropy”.
In the Options tab, we can define our training parameters, such as the number
of epochs for training, the training batch size, and the option to shuffle data before
each epoch. For our application, 50 epochs, a training batch size of 128, and the
shuffling option led to a good performance.
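Assuming the Keras model sketched earlier and arrays `X_train` (zero-padded index sequences) and `y_train` (one-hot encoded sentiment classes), the corresponding training call in plain Keras would be roughly:

```python
# X_train: zero-padded index sequences; y_train: one-hot encoded sentiment classes (assumed inputs)
history = model.fit(
    X_train, y_train,
    epochs=50,             # number of passes over the training data
    batch_size=128,        # training batch size
    shuffle=True,          # shuffle the data before each epoch
    validation_split=0.1,  # hold out a slice of the data to monitor the loss
)
```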
The Keras Network Learner node also has an interactive view that shows you how the
learning evolves over the epochs –that is, how the loss function values drop and how
the accuracy increases over time. If you are satisfied with the performance of your
training before it reaches its end, you can click the “Stop Learning” button.
One of the tabs of the Keras Network Learner node’s interactive view, showing how accuracy increases over
time.
We set the Python environment for the Keras Network Learner node, and for the nodes
downstream, with the Conda Environment Propagation node. This is a great way of
encapsulating all the Python dependencies for this workflow, making it very portable.
Alternatively, you can set up your own Python integration to use a Conda environment
with all the packages required.
After the training is complete, we save the generated model with the Keras Network
Writer node. In parallel, we use the Keras Network Executor node to get the class
probabilities for each tweet in the test set. Recall that we obtain class probabilities
here because we use Softmax as the activation function of our network’s output layer.
We show how to extract class
predictions from these probabilities
next.
To evaluate how well our LSTM-based predictor works, we must first obtain actual
predictions from our class probabilities. There are different approaches to this task,
and in this application, we choose to always predict the class with the highest
probability. We do so because this approach is easy to implement and interpret,
besides always producing the same results given a certain class probability
distribution.
The class extraction takes place in the Extract Prediction metanode. First, the Many to One node is used to generate a column that contains the class names with the highest probabilities for each test tweet. We post-process the class names a bit with the Column Expressions node, and then map the class indices into positive, negative, or neutral using the encoding model we built in the preprocessing part of the workflow.
Structure of the Extract Prediction metanode.
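In code, picking the class with the highest probability is essentially an argmax over the probability columns; here is a small NumPy sketch, where the class order and the example probabilities are assumptions for illustration.

```python
import numpy as np

class_names = ["negative", "neutral", "positive"]   # class order is an assumption
probabilities = np.array([[0.70, 0.20, 0.10],       # stand-in for the Keras Network Executor output
                          [0.15, 0.25, 0.60]])

predictions = [class_names[i] for i in probabilities.argmax(axis=1)]
print(predictions)   # ['negative', 'positive']
```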
With the predicted classes at hand, we use the Scorer node to compare them against
the annotated labels (gold data) in each tweet. It turns out that the accuracy of this
approach is slightly above 74%: significantly better than what we obtained with
a lexicon-based approach.
Note that our accuracy value, here 74%, can change from one execution to another
because the data partitioning uses stratified random sampling.
Interestingly, the isolated performance for negative tweets –which correspond to 63%
of the test data– was much better than that (F1-score of 85%). Our predictor learned
very discernible and generalizable patterns for negative tweets during training.
However, patterns for neutral and positive tweets were not as clear, leading to an
imbalance in performance for these classes (F1-scores of 46% and 61%, respectively).
Perhaps if we had more training data for classes neutral and positive, or if their
sentiment patterns were as clear as those in negative tweets, we would obtain a better
overall performance.
After evaluating our predictor over some annotated test data, we implement a second
workflow to show how our LSTM-based model could be deployed on unlabeled data.
First, this deployment workflow re-uses the Tweet Extraction component introduced in
the lexicon-based sentiment analysis. This component enables users to enter their
Twitter credentials and specify a search query. The component in turn connects with
the Twitter API, which returns the tweets from last week, along with data about the
tweet, the author, the time of tweeting, the author’s profile image, the number of
followers, and the tweet ID.
The next step is to read the dictionary created in the training workflow and use it to
encode the words in the newly extracted tweets. The "Index Encoding Based on
Dictionary" metanode breaks the tweets into words and performs the encoding.
In parallel, we again set up a Python environment with the Conda Environment
Propagation node, and connect it to the Keras Network Reader node to read the model
created in the training workflow. This model is then fed into the Keras Network
Executor node along with the encoded tweets, so that sentiment predictions can be
generated for them.
We then read the encoding for sentiment classes created in the training workflow and
send both encoding and predictions to the Extract Prediction metanode. Similar to
what occurs in the training workflow, this metanode predicts the class with the highest
probability for each tweet.
Here we do not have annotations or labels for the tweets: we are extracting them on-
the-fly using the Twitter API. For this reason, we can only verify how well our predictor
is doing subjectively.
To help us in this task, we implement a dashboard that is very similar to the one in our
lexicon-based sentiment analysis project. It combines (1) the extracted tweets; (2) a
word cloud in which the tweets appear with sizes that correspond to their frequency;
and (3) a bar chart with the number of tweets per sentiment per date.
In terms of structure, we use the Joiner node to combine the tweets’ content and their
predicted sentiments with some of their additional fields –including profile images.
Next, we send this information to the Visualization component, which implements the
dashboard.
Note. This component is almost identical to the visualization one discussed in our
lexicon-based sentiment analysis post. Both components just differ a bit in how
they process words and sentiments for the tag cloud, since they receive slightly
different inputs.
A quick inspection of the tweets’ content in the dashboard also suggests that our
predictor is relatively good at identifying negative tweets. An example is the following
tweet, which is clearly a complaint and got correctly classified (negative):
@AmericanAir’s standards are what these days? Luggage lost for over 48 hours and no response AT ALL. All good as long as there’s a customer service standard which there isn’t.
This is aligned with the performance metrics we calculated in the training workflow.
The bar chart also suggests that most tweets correspond to negative comments or
reviews. Although it is hard to verify if this is really the case, this distribution of
sentiment is compatible with the one in the Kaggle dataset we used to create our
model. In other words, this adds to the hypothesis that airline reviews on Twitter tend
to be negative.
Finally, the tweets below give us an opportunity to discuss the limitations of our
sentiment analysis. The most common tweet, which is the largest one in the tag cloud,
is a likely scammy spam that got classified as positive:
@AmericanAir Who wants to earn over £5000 weekly from bitcoin mining? You can do it all by yourself right from the comfort of your home without stress. For more information WhatsApp +447516977835.
Although this tweet addresses an American airline company, it does not correspond to
a review. This is indicative of how hard it is to isolate high quality data for a task
through simple Twitter API queries.
Note that our model did capture the positivity of the tweet: after all, it offers an "interesting opportunity" to make money in a comfortable, stress-free way (many positive and pleasant words). However, this positivity is not of the type we see in good
reviews of products or companies —human brains quickly understand that this tweet
is too forcedly positive, and likely fake. Our model would need much improvement to
discern these linguistic nuances.
Workflows on the KNIME Community Hub: Building Sentiment Predictor - BERT and Deploying a Sentiment
Analysis Predictive Model - BERT
In recent years, large transformer-based language models have taken the data
analytics community by storm, obtaining state-of-the-art performances in a wide range
of tasks that are typically very hard, such as human-like text generation, question-
answering, caption creation, speech recognition, etc. These language models, which
rely on advanced deep learning architectures, are also referred to as deep
contextualized language models, because they are exceptionally good at mapping and
leveraging the context-dependent meaning of words to return meaningful predictions.
But there’s no such thing as a free lunch. While amazingly powerful, these models require humongous computational resources and amounts of data to be effectively trained, so much so that they are usually developed and released only by the world’s top tech labs and companies.
One of the most ground-breaking examples in the space is BERT. It stands
for Bidirectional Encoder Representations from Transformers and is a deep neural
network architecture built on the latest advances in deep learning for NLP. It was
released in 2018 by Google, and achieved State-Of-The-Art (SOTA) performance in
multiple natural language understanding (NLU) benchmarks. These days, other
transformer-based models, such as GPT-3 or PaLM, have outperformed BERT.
Nevertheless, BERT represents a fundamental breakthrough in the adoption and
consolidation of transformer-based models in advanced NLP applications.
Harnessing the power of the BERT language model for multi-class classification tasks
in KNIME Analytics Platform is extremely simple. This is possible thanks to
the Redfield BERT Nodes extension, which bundles up the complexity of this
transformer-based model in easy-to-configure nodes that you can drag and drop from
the KNIME Community Hub. For
marketers and other data
professionals, this means
implementing cutting-edge solutions
without writing a single line of code.
BERT nodes in the Redfield extension.
The BERT Classification Learner node uses the selected BERT model and adds three
predefined neural network layers: a GlobalAveragePooling, a Dropout, and a Dense
layer. Adding these layers adapts the selected model to a multiclass classification
task.
The configurations of this node are very simple. All we need to do is select the column
containing the tweet texts, the class column with the sentiment annotation, and the
maximum length of a sequence after tokenization. In the “Advanced” tab, we can then
decide to adjust the number of epochs, the training and validation batch size, and the
choice of the network optimizer.
Notice the box “Fine tune BERT.” If checked, the pretrained BERT model will be trained
along with the additional classifier stacked on top. As a result, fine-tuning BERT takes
longer, but we can expect better performance.
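For readers curious about what this looks like in code, here is a hedged Hugging Face transformers sketch rather than the Redfield nodes themselves; its built-in classification head differs from the pooling/dropout/dense stack described above, and the toy texts, labels, and label encoding are assumptions.

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

# Rough analogue of the "Fine tune BERT" checkbox:
# False -> freeze the encoder and train only the classification head; True -> train everything
model.bert.trainable = False

enc = tokenizer(["great flight, lovely crew", "luggage lost again, no response"],
                padding=True, truncation=True, max_length=128, return_tensors="tf")
labels = tf.constant([2, 0])   # toy labels: 2 = positive, 0 = negative (assumed encoding)

model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.fit(dict(enc), labels, epochs=1, batch_size=2)
```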
The BERT Predictor node applies the trained model to unseen tweets in the test set. Configuring this node follows the paradigm of many predictor nodes in KNIME Analytics Platform: it’s sufficient to select the column for which we want the model to generate a prediction. Additionally, we can decide to append individual class probabilities, or set a custom probability threshold for determining predictions in the “Multi-label” tab.
Dashboard of unlabeled tweets and their predicted sentiments per date. A quick inspection confirms the high
performance of this predictor.
Consumer Behavior
In this chapter, we will devise ways to monitor the behavior of our customers and get
insights into their decision-making process. We will show how to retrieve information
about website usage and traffic, identify the risk of customers stopping buying our
products, and give proper credit to marketing channels and touchpoints. More specifically, we will look at use cases for querying Google Analytics, predicting customer churn, and marketing attribution modeling.
Querying Google Analytics
The KNIME Google Connectors extension provides nodes to connect to and interact with Google resources, like Google Sheets, Google Drive, and Google Analytics. Google Analytics is a set of services provided by Google to investigate traffic and activity on a website you own. This is important: the website must belong to you. The service gives insight into how visitors are using your site. In this section, we want to show how to use the KNIME Google Connectors to integrate Google Analytics into your workflow, connect to the Google Analytics API, and retrieve the number of pageviews and entrances for new users coming to our web property.
• Go to https://analytics.google.com/analytics/web
• Within the Analytics account, create one or more web properties referring to the
URL of the pages you want to analyze (use the Universal Analytics property).
• The blue square port passes authentication for a Google account connection.
• The red square port provides an Analytics account connection for a specific web
property.
• The familiar black triangle port exposes the extracted measures for the web
traffic.
Once the node is executed successfully, the connection to the Google Analytics API
for the selected web property is established.
The dimensions and metrics can be selected in the top right part of the configuration
window. There, in the Settings tab, two dropdown menus show the categories and the
related set of metrics and dimensions.
Dimensions are classes such as full referrer, session count, keywords, country,
operating system, and much more. Metrics are aggregations of data such as page
views, users, new users, bounces, AdSense revenues, etc. on the selected dimensions.
Dimensions and metrics can be added to the query via the “Add” button on the right
and they will appear in the respective list on the left. Use the arrow buttons to decide
their order in the output table, the “+” button to add more dimensions and metrics, and
the “X” button to remove them.
For instance, if “operating system” and “browser” are specified as dimensions and
“users” as metric, then each value represents the sum of users for the given
combination of operating system and browser. For country as dimension and page
load time as metric, the resulting value is the average page load time for users of that
country.
From the “Geo Network” category, we selected and added to the query dimensions
“continent” and “country” and from the “Users” category the metric “New Users”. This
translates into extracting the number of new users (metric) coming to our web property
from each continent and country (dimensions).
Such dimensions and metrics apply to the full extent of the web traffic: all users, all views, all likes, and so on. The data domain, however, can be restricted either via Segments or via Filters.
Define a Segment
Segments filter the data before the metrics are calculated. The corresponding
dropdown menu in the node configuration window offers predefined segments to
choose from, e.g., new users, returning users, paid traffic, mobile traffic, android traffic,
etc. We selected “mobile traffic”, which means extracting the number of new users
(metric) coming to our web property from each continent and country (dimensions) via
mobile phone.
At the same time, you can restrict the metric domain by setting up a filter based on the
selected dimensions. A filter rule restricts the data after the calculation of the metrics.
Possible operations include less than, greater than, regex matches and many more.
They can also be combined with a logical AND or OR. For a full list of available
operations and details about the syntax please see the node description or the “Google
Analytics developer documentation”.
We introduced “Continent==Europe”, which means extracting the number of new users
(metric) coming to our web property from each continent and country (dimensions) via
mobile phone and exporting only the numbers related to Europe.
Sort Results
Sorts the results by the selected dimension or metric. The sort order can be changed
to descending by prepending a dash.
Specifies the time frame for the returned data. Both start and end date are inclusive.
We introduced the start date 2020-08-22 and the end date 2021-08-22, which means extracting the number of new users (metric) coming to our web property from each continent and country (dimensions) via mobile phone between August 22, 2020 and August 22, 2021 (both dates included), and exporting only the numbers related to Europe.
Start Index
The API limits one query to a maximum of 10,000 rows. To retrieve more rows, the start index parameter can be used as a pagination mechanism.
Max Results sets the maximum number of rows that should be returned; the maximum value is 10,000. For more details about the parameters and settings, see the “Google Analytics developer documentation”.
The result of this configuration are the top 100 referrals that brought new users to your
website. The dimensions “source” and “referralPath” contain the source addresses and
the path of the page from which the new users came to your website. The metric “new
users” results in the number of new users for every referring page. By sorting by “new
users” in descending order and limiting the max results to 100, we get the 100 most
relevant referrals.
The results of this configuration are the most relevant topics for the selected date
(here the month of June 2014) ascertained by total number of pageviews. The
dimension “pagePath” is used to list all topics of the web site (it is a forum, all forum
topics). The dimension “pageTitle” contains the corresponding topic name.
• The metric “pageviews” counts the number of views for each page.
• The specified filters remove all pages with fewer than 100 views, as well as all pages that are not under the forum page.
• The result is then sorted descending by page views to get the most relevant
pages first.
• The specified start and end date keeps only the page views for the selected time.
Predict Customer Churn
Authors: Francisco Villarroel Ordenes, LUISS Guido Carli University & Rosaria Silipo, KNIME
Workflows on the KNIME Community Hub: Training a Churn Predictor and Deploying a Churn Predictor.
While we are not sure which data analytics task is the oldest, prediction of customer
churn has certainly been around for a very long time. In customer intelligence, “churn”
refers to a customer who cancels a contract at the moment of renewal. A company’s
CRM is often filled with such data. Identifying which customers are at risk of churning
is vital for a company’s survival. Because of this, churn prediction applications are very
popular, and were among the earliest data analytics applications developed.
Here we propose a classic churn prediction application. The application consists of
two workflows: a training workflow shown in the first figure, in which a supervised
machine learning model is trained on a set of past customers to assign a risk factor to
each one, and a deployment workflow shown in the second figure, in which that trained
model assigns a churn risk to new customers. 1
The training workflow trains a few random forest models to assign churn risk.
Customer data
Customer data usually include demographics (e.g., age, gender), revenues (e.g., sales
volume), perceptions (e.g., brand liking), and behaviors (e.g., purchase frequency).
While “demographics” and “revenues” are easy to define, the definitions of behavioral
and perception variables are not always as straightforward since both depend on the
business case.
For this solution, we rely on a popular simulated telecom customer dataset, available
via Kaggle. In our effort to provide a broader overview of KNIME functionality, we split
the dataset into a CSV file (which contains operational data, such as the number of
calls, minutes spent on the phone, and relative charges) and an Excel file (which lists
the contract characteristics and churn flags, such as whether a contract was
terminated). Each customer can be identified by an area code and phone number. The
dataset contains data for 3,333 customers, who are described through 21 features.
After that, the “Churn” column is converted into string type with the Number to String node to meet the requirement of the upcoming classification algorithm, which expects a nominal target. Note that KNIME offers a series of nodes to manipulate data types (e.g., string to date, or vice versa).
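Outside KNIME, the equivalent data preparation could be sketched in pandas; the file names and key columns below are assumptions based on the description above, not the actual files shipped with the workflow.

```python
import pandas as pd

# File names and key columns are assumptions for illustration
operational = pd.read_csv("calls_data.csv")          # number of calls, minutes, charges, ...
contracts = pd.read_excel("contract_data.xlsx")      # contract characteristics and the Churn flag

customers = operational.merge(contracts, on=["Area Code", "Phone"])
customers["Churn"] = customers["Churn"].astype(str)  # Number to String: churn becomes a nominal class
print(customers.shape)                               # expected: 3,333 customers, 21 features
```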
Before continuing with further preparation steps, it is important to explore the dataset
via visual plots or by calculating its basic statistics. The Data Explorer node (or else
a Statistics node) calculates the average, variance, skewness, kurtosis, and other
basic statistical measures, and at the same time it draws a histogram for each feature
in the dataset. Opening the interactive view of the Data Explorer node reveals that the
churn sample is unbalanced, and that most observations pertain to non-churning
customers (over 85%), as expected. There are typically much fewer churning than non-
churning customers. To address this class imbalance, we use the SMOTE node, which
oversamples the minority class by creating synthetic examples. Notice that the
execution of the SMOTE procedure is very time and resource consuming. It was
possible here because the dataset is quite small.
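As a hedged sketch of the same modeling step with imbalanced-learn and scikit-learn, assuming a numeric feature matrix `X` and a 0/1 churn vector `y` prepared as above; note that SMOTE is applied only to the training portion here.

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# X: numeric feature matrix; y: 0/1 churn labels (assumed inputs)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)   # oversample the churn class

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_res, y_res)
print(accuracy_score(y_test, rf.predict(X_test)))
print(roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))
```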
To evaluate the model, we use metrics such as the overall accuracy and the Area under the Curve (AuC). All these metrics range from 0 to 1; higher values indicate better models. For this example, we obtain a model with 93.8% overall accuracy and 0.89 AuC. Better predictions might be achieved by fine-tuning the settings in the Random Forest Learner node.
The interactive dashboard of the deployment workflow, reporting the churn risk in orange
for all new customers.
Conclusions
We have presented here one of the many possible solutions for churn prediction based
on past customer data. Of course, other solutions are possible, and this one can be
improved.
After oversampling the minority class (churn customers) in the training set using the
SMOTE algorithm, a Random Forest is trained and evaluated on a 5-fold cross-
validation cycle. The best trained Random Forest is then included in the deployment
workflow.
The deployment workflow applies the trained model to new customer data, and
produces a dashboard to illustrate the churn risk for each of the input customers.
Attribution Modeling
Workflow on the KNIME Community Hub: Touch-based, Correlation and Regression, Shapley-based, and
Randomized field experiments
Following the work by de Haan, the workflow is divided into different sections: from the basic touch-based
attribution models to the more complex randomized field experiments.
• Touch-based attribution.
9 de Haan, E. (2022). Attribution Modeling. In: Homburg, C., Klarmann, M., Vomberg, A. (eds) Handbook of Market Research. Springer, Cham. https://doi.org/10.1007/978-3-319-05542-8_39-1.
The customer journey dataset used for the analysis has been artificially generated by
de Haan using R and contains 50k customer journeys, with approximately 25% of the
journeys resulting in a purchase. The length of each journey ranges between 1 and 50
touchpoints. The dataset includes in total eight unique touchpoints, such as banner
impressions, e-mails, website visits, and clicks to search-engine advertising for brand-
and product-related keywords.
We start off by reading in the dataset and performing some simple aggregations to
represent the distribution of the touchpoints. Additionally, we use the Data Explorer
node to compute some descriptive statistics (e.g., mean, median, standard deviation,
min, max, etc.).
To visually explore the dataset, we use the Bar Chart node to plot the most frequent
touchpoints between a company and its (potential) customers. It turns out that for
most customers visiting the website was the most frequent touchpoint. Furthermore,
using the Sunburst Chart node, we observe that about 70% of the total interactions
occurred either via direct visit to the company website or via banner advertising.
Total occurrences per touchpoint. Notice the interactive selection across plots.
Touch-based attribution
Let’s now delve into the core of the workflow and see how we have implemented the
first three basic touch-based attribution models:
• First-touch attribution.
• Last-touch attribution.
• Average-touch attribution.
For the first two models, mapping the attribution is straightforward since the dataset contains the columns “First channel” and “Last channel”. All we need to do is sum up the purchases for these two columns separately and sort the total purchases in descending order. This can easily be done with the GroupBy and Sorter nodes. With last-touch attribution, we conclude that a sale can only be attributed to a channel that
leads to a “Direct visit”. With first-touch attribution, on the other hand, “Banner impression” can also get credit for a conversion.
Attributing credit to a touchpoint with the average attribution method is less intuitive.
We first need to compute the weight of a touchpoint in each single customer journey.
To do that, we divide each touchpoint by the total number of touchpoints in each
customer journey and multiply the result by the “Purchase” column, a dummy variable
indicating if the path to purchase ends with a purchase (1) or not (0). This calculation
is implemented in KNIME using just two nodes: the Math Formula (Multi Column) and
the GroupBy node. Also in this case, direct website visits and banner impressions get
the most credit. The three bar charts below provide a snapshot of the results for each
method.
The bar charts display the touchpoints that get the most credit for a conversion according to the three
different touch-based attribution models.
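A compact pandas sketch of the three touch-based models is shown below. The `journeys` table and the column names (“First channel”, “Last channel”, “Purchase”, “Amount touchpoints”, and the per-touchpoint count columns) are assumptions based on the description above.

```python
import pandas as pd

# journeys: one row per customer journey; column names are assumptions for illustration
touchpoints = ["Banner impression", "Banner click", "Email received", "Direct visit"]  # subset only

first_touch = journeys.groupby("First channel")["Purchase"].sum().sort_values(ascending=False)
last_touch = journeys.groupby("Last channel")["Purchase"].sum().sort_values(ascending=False)

# Average-touch: weight each touchpoint by its share of the journey, credit only converting journeys
weights = journeys[touchpoints].div(journeys["Amount touchpoints"], axis=0)
average_touch = weights.mul(journeys["Purchase"], axis=0).sum().sort_values(ascending=False)
```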
Although the Linear Correlation node already produces a local view, we prefer to
display the matrix with the Heatmap node, which creates an interactive visualization
that can be integrated in the component composite view. We see that “Purchase”
correlates positively (although weakly) with most other variables. The strongest
correlations can be observed with “Direct visit” and “Amount touchpoints”. While
interesting, correlations have to be interpreted with caution since they only tell us
something about the relationships between variables, not about their causality.
Nevertheless, correlation tables are a convenient way to identify soft patterns in the
data, and a popular starting point for more sophisticated analyses.
The correlation matrix above is the result of the Linear Correlation and Heatmap nodes.
We extract the summary table with the logistic regression coefficients returned by the R Snippet node and display it with the Table View node.
In model 1, consistently with the results obtained with last-touch attribution, we
observe that “Direct visit” has a strong and positive estimate, indicating that there is a
higher chance for conversion associated with this touchpoint than with “Banner click”
whenever “Direct visit” is the last touchpoint in a customer’s journey. On the other hand,
“Banner impression” and “Email received” have strongly negative estimates, hence
they don’t directly relate to a conversion. In model 2, “Direct visit” becomes statistically
insignificant in favor of “CLV” and “Relation length”. The underlying reason might be
that direct website visits are more likely to happen among loyal and long-existing
customers, who also have a higher chance of conversion.
Logistic regression coefficients for model 1 and 2 (incl. confidence intervals) when we look at the last
channel used. “Banner click” is used as the reference category.
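Instead of the R Snippet node, a roughly equivalent logistic regression could be fitted in Python with statsmodels. The sketch below covers model 1 only, with the same assumed `journeys` table as above and “Banner click” as the reference category.

```python
import pandas as pd
import statsmodels.api as sm

# journeys: assumed customer journey table (see the earlier sketch)
X = pd.get_dummies(journeys["Last channel"]).drop(columns=["Banner click"])  # reference category
X = sm.add_constant(X).astype(float)

model1 = sm.Logit(journeys["Purchase"].astype(float), X).fit()
print(model1.summary())   # coefficients are interpreted relative to "Banner click"
# Model 2 would add customer covariates such as CLV and relation length to X.
```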
Regression models are more suitable for attribution modeling than touch-based or
linear correlation attribution since we are able to control for several variables at the
same time. However, they still remain unfit to shed light on causal relationships among
touchpoints and channels –for example, in the event that a particular channel or
touchpoint does not occur.
Another approach to the attribution problem that has become more popular than
correlations and regressions in recent years is Shapley values-based attribution. This
approach compares similar customers’ paths to purchase, where the only difference is that some paths do not include a specific touchpoint. Shapley values-based attribution
also allows us to answer the following question: “How would the outcome change if a
specific touchpoint was not included in a specific customer’s journey?”. For example,
let’s take the following path as the starting point to compare similar paths:
Banner impression → Product search → Brand search
The way this approach works is very intuitive. Firstly, all the observations (customer
journeys) that correspond to this path are extracted from the dataset and the average
purchase probability is computed. Afterwards, three sub-paths are obtained by removing one of the three touchpoints at a time. The average conversion probability is then computed for each of these three sub-paths. For the example path above, the sub-paths are:
1. Banner impression → Product search
2. Banner impression → Brand search
3. Product search → Brand search
Implementing this procedure in KNIME is simple and intuitive. We use a series of Rule-based Row Filter nodes to extract the journey of interest and the corresponding sub-paths. Next, we use GroupBy nodes to compute the conversion probability of each path, and with the Math Formula (Multi Column) node we obtain the difference between the purchase probability of the journey of interest and the purchase probability when excluding a focal touchpoint. In this way, we are able to identify the incremental probability of conversion across similar paths.
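For readers who prefer to see the arithmetic spelled out, here is a minimal Python sketch of the same incremental-probability computation, assuming a table with one row per customer journey, a path column listing the ordered touchpoints and a binary purchase column (all names and values are hypothetical):

```python
import pandas as pd

# Hypothetical journey-level data: one row per customer journey, with the ordered
# touchpoints encoded as a single string and a binary purchase outcome.
journeys = pd.DataFrame({
    "path": [
        "Banner impression > Product search > Brand search",
        "Banner impression > Product search",
        "Product search > Brand search",
    ],
    "purchase": [1, 0, 1],
})

full_path = ["Banner impression", "Product search", "Brand search"]

def conversion_rate(df, touchpoints):
    """Average purchase probability of all journeys matching the given path."""
    subset = df[df["path"] == " > ".join(touchpoints)]
    return subset["purchase"].mean() if len(subset) else float("nan")

p_full = conversion_rate(journeys, full_path)

# Remove one touchpoint at a time and measure the drop in conversion probability.
for i, touchpoint in enumerate(full_path):
    sub_path = full_path[:i] + full_path[i + 1:]
    credit = p_full - conversion_rate(journeys, sub_path)
    print(f"Credit for '{touchpoint}': {credit:+.4f}")
```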
In the table below, we can see that in 16.28% of the cases the full path results in a conversion. If we exclude “Banner impression”, only 7.49% of the cases result in a conversion, meaning that leaving out “Banner impression” lowers the conversion probability by 8.79 percentage points. This difference also represents the credit that “Banner impression” should get for driving profitable customer action.
With Shapley value-based attribution, it is indeed possible to explain the contribution of each touchpoint in a particular consumer journey.
The last approach to attribution modeling proposes the use of field experiment data to understand the impact a channel/touchpoint has when one or more of them are excluded for some groups of customers. To do that, consumers are randomly placed into two groups: one group is exposed to the channel/touchpoint of interest (i.e., the treatment group), and one group is not exposed to this channel/touchpoint (i.e., the
control group). The dataset used by the authors in the book contains two different
randomized field experiments.
In the first experiment, the treatment group (80% of the customers) encountered a firm
banner, whereas the control group (the other 20% of the customers) encountered an
unrelated banner, from a charity organization in this case. The fact that the alternative
banner is unrelated to the firm allows us to consider its causal effect on firm
performance and customer behavior as null.
In the second field experiment, flyers are distributed to consumers in randomly
selected regions. The treatment group received the flyer, whereas the control did not.
Due to the random allocation, the experiment setup leads to a 50% split. Because
information about the region in which customers live is available, we know exactly
which customers received the flyer. Hence, this experiment allows us to investigate
the effect of a channel/touchpoint at the individual customer level.
We can furthermore investigate if there are synergy effects between the banner ads
and the flyer. For example, does being exposed to both advertising forms increase the
purchase likelihood relative to the two individual effects (positive synergy)? Or do they
weaken each other since they might be substitutes (negative synergy)? A preliminary step should verify whether consumers who are in the firm’s banner group and who have received a flyer indeed have a higher likelihood of purchasing.
The KNIME implementation follows once again de Haan’s work and produces several
interactive charts to visualize the conversion rate according to the channel/touchpoint
and group of belonging (either the control or the treatment group). Furthermore, we
inspect the synergy effects between variables both visually and by estimating new
logistic regression models. To do so, we rely on KNIME’s native JavaScript-based Views and Logistic Regression Learner nodes, as well as on the R integration for the exact display of plots and regression coefficients.
In the charts below, we can see the difference between customers in the treatment
group (1) vs. the control group (0) for banner advertising (left) and flyer region (right).
We observe that the conversion rate is 28% for the treatment group vs. 17% for the
control group when banner advertising is used, indicating a strong effectiveness on
purchase likelihood. Similarly, when flyers are used, the conversion rate is considerably
higher for the treatment group (33%) than for the control group (18%).
Conversion rates when using a banner (left) or a flyer (right) for different groups: treatment (1), control (0).
Interactive bar charts using the Bar Chart node (top); static error bars using the R View (Table) node
(bottom).
We can also visualize the synergy effect of banner advertising and the flyer. For customers who are not in the firm’s banner group, distributing a flyer increases the purchase likelihood from 13.61% to 20.26%, i.e., an increase in conversion of 6.65 percentage points. For customers who are in the firm’s banner group, distributing a flyer increases the purchase likelihood from 19.70% to 35.91%, i.e., an increase of 16.21 percentage points. In other words, when the firm distributes a flyer, the banner becomes more effective.
More insights about the synergy effects can be obtained from the wealth of additional
plots and regression coefficient tables, which can all be visualized and interacted with
in the final dashboard.
Banner and flyer synergy visualized using the Bar Chart node (left), and the R View (Table) node (right).
Marketing Mix
In this chapter, we will focus on strategies to understand how pricing metrics affect
our business –be it to reduce customer churn, increase market share or drive
profitability. More specifically, we will look at a use case to perform price optimization
using two different approaches.
Pricing Analytics
Workflow on the KNIME Community Hub by STAR COOPERATION: Price optimization: value-based pricing
and regression
• Average revenue per user (ARPU). It is a measure of the revenue generated each
month (or for a different predefined period) from each user. It is calculated by
dividing the total monthly recurring revenue (MRR) by the total number of
customers. It tells whether a chosen pricing strategy suits the market and allows
the company to stay competitive.
• Customer lifetime value (CLV). Together with customer acquisition cost (CAC),
CLV measures whether the investment to drive and keep customers is profitable.
If CAC outweighs CLV, the costs are jeopardizing the revenues.
Based on the insights extracted from the metrics above, the best pricing strategy can
be applied and adjusted. Some of the most commonly used ones include:
• Cost-plus pricing. With this strategy, production costs (i.e., direct material cost,
direct labor cost, overhead costs, etc.) are summed up and a markup percentage is added on top in order to set the final price of the product. This approach can
provide a good starting point, but it is usually not comprehensive enough to
inform a thorough pricing strategy.
• Penetration pricing. This strategy sets the price very low in order to attract new
customers and gain substantial market share quickly. Because pricing low
sacrifices profitability, it is feasible only for a short period of time until market
share is gained.
• Cream pricing. This strategy aims at reaching profitability not by high sales, but
by selling the product at a high price. It is usually used to target early adopters
and for a limited duration to recover the cost of investment of the original
research into the product.
• Value-based pricing. This strategy prices the product based on the value the
product has for the customer and not on its costs of production, under the
condition that the former is considerably higher than the latter. To apply this
strategy, it’s essential to understand customers’ perception, preferences and
alternatives.
We then integrate competition information to discover the need for price adaptation. Finally, we will
optimize prices systematically using either value-based pricing or linear regression.
While a suitable analytical tool is fundamental, pricing expertise is needed to check the
plausibility of new prices and to weigh up different pricing measures.
After a short introduction to Pricing Analytics, let’s now delve into the practical
implementation of a price optimization workflow using KNIME Analytics Platform. The
original workflow was developed by STAR COOPERATION. We'll guide you through a
seamless, codeless solution consisting of four steps:
1. Data preparation and analysis for pricing.
2. Integration of competition information.
3. Price optimization with value-based pricing.
4. Price optimization with linear regression.
To kick things off, we'll start by reading a practice dataset containing product and order
information of an e-commerce shop. To do this, we'll use two types of reader nodes: the Excel Reader node to access product prices and categories, and two separate CSV Reader nodes to retrieve orders spanning different years (2017-2019).
Next, we’ll perform a few data merging and joining operations. With the Concatenate
node, we bring together orders from 2017-2018 and 2019, and we join them with the
product price and category spreadsheet using the Joiner node.
For data transformation, we use the Row Filter node to eliminate irrelevant information,
such as canceled orders and returns. Then, we isolate the year of the order with the
String Manipulation node, and calculate the turnover (sale price x ordered quantity)
using the Math Formula node. We filter out articles with missing turnover information
and use the GroupBy node to compute for each Article ID either the mean or the first
occurrence of each feature (e.g., Tax, Shipping Time, Length, Article Name, etc.).
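For reference, the same preparation steps can be sketched in Python with pandas. The file names, column names, and the status flag below are assumptions standing in for the actual dataset schema, not the workflow's real inputs:

```python
import pandas as pd

# Hypothetical inputs mirroring the workflow: a product master file and two order files.
products = pd.read_excel("products.xlsx")                 # article_id, sale_price, category, ...
orders = pd.concat([pd.read_csv("orders_2017_2018.csv"),
                    pd.read_csv("orders_2019.csv")])

# Join orders with product prices/categories and drop cancelled orders and returns.
data = orders.merge(products, on="article_id", how="left")
data = data[~data["status"].isin(["cancelled", "returned"])]

# Isolate the order year and compute the turnover of each order line.
data["order_year"] = pd.to_datetime(data["order_date"]).dt.year
data["turnover"] = data["sale_price"] * data["ordered_quantity"]
data = data.dropna(subset=["turnover"])

# One row per article: mean turnover plus the first occurrence of descriptive features.
per_article = data.groupby("article_id").agg(
    turnover=("turnover", "mean"),
    category=("category", "first"),
    tax=("tax", "first"),
)
```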
With our data fully prepared, we're now almost ready to bring the data to life through
visualization. Before connecting the view nodes, we perform yet another aggregation
by feeding the data into two different Pivoting nodes. The first one will provide us with
the turnover per product category and year, while the second will give us the sales per
year and product category.
Subsequently, in the “Prepare for visualization” metanode, we bring the data in a
suitable shape for the plots and use the Color Manager node to assign colors to each
product category. To display turnover trends over the years (x-axis) for each product category, we employ the Line Plot node. We can see that ironware purchases
reached a peak in 2017, generating a turnover of about 19000 euros. On the contrary,
in 2018 ironware purchases yielded only 1500 euros.
To showcase sales per product category (y-axis) and year (x-axis), we create a grouped
bar chart using the Bar Chart node. Consistent with the line plot above, we observe
high sales for ironware in 2017. However, we can see that despite high sales for
storage in 2018, these generated a fairly low turnover.
The table produced by this joining operation can also serve as a useful source for
creating visualizations and gaining further insights. For instance, we can compare and
visualize how we price the article “screwdriver” relative to our competitors using the
Box Plot node. This box plot provides a nice statistical overview of the article-price
range, where we can observe that the average price of our article is comparatively low
(8.30 euros) and almost half the price of our competitors’ median price (14.5 euros).
Price optimization with value-based pricing requires extensive domain knowledge and
pricing expertise. Identifying which factors make sense for this approach varies
according to the industry and product/service. Likewise, correctly devising a complex
set of rules that best capture a customer’s perceived product value, preferences and
alternatives often requires the manual setting and fine-tuning of thresholds and
conditions.
We start off by employing two separate Math Formula nodes to calculate sales
development as a percentage in 2017-2019 and determine the average competition
price. To handle missing values in sales development, we use the Rule Engine node to
assign a set of values whenever specific user-defined conditions are met.
Next, we employ in parallel a series of Rule Engine nodes to assign scores for sales
development, competitive pressure and product value based on user-defined
thresholds or labels. An overall score is then computed as the weighted sum of the
scores above with importance weights. Thanks to this new metric, we can define new
rules and intervals to assign price adjustments.
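A minimal sketch of this scoring logic in Python; the thresholds, weights, and adjustment intervals are illustrative placeholders that a pricing expert would tune, not the values used in the original workflow:

```python
def sales_development_score(pct_change):
    # Illustrative thresholds for the sales-development score.
    if pct_change >= 0.10:
        return 3
    if pct_change >= 0.0:
        return 2
    return 1

def competition_score(own_price, avg_competitor_price):
    # Higher score = lower competitive pressure (we are cheaper than the market).
    return 3 if own_price < avg_competitor_price else 1

def price_adjustment(overall_score):
    # Map the weighted overall score onto a price change (illustrative intervals).
    if overall_score >= 2.5:
        return 0.05    # +5%
    if overall_score >= 1.5:
        return 0.0     # keep the current price
    return -0.05       # -5%

# Weighted sum of the individual scores (the weights are assumptions).
weights = {"sales": 0.4, "competition": 0.4, "value": 0.2}
overall = (weights["sales"] * sales_development_score(0.12)
           + weights["competition"] * competition_score(8.30, 14.50)
           + weights["value"] * 2)
new_price = 8.30 * (1 + price_adjustment(overall))
print(round(overall, 2), round(new_price, 2))
```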
Following this, we wrap a series of Math Formula nodes in a component to calculate new sales prices, review the contribution margin, and obtain new turnover figures. The
results of this update are aggregated by summing the values of the ordered quantity,
current and new turnover for each product category. From here, we can calculate the
turnover development as a percentage to gain further insights into the data.
The second approach to price optimization relies on a simple linear regression analysis
to predict future sale prices. Unlike value-based pricing, this approach allows for a
greater degree of automation and removes several manual and fine-tuning steps.
Since our dataset contains time information (i.e., Order Year), we cannot use the
Partitioning node to split our dataset into training and test set, as doing so would cause
a data leakage problem. Rather, we sort Order Year in ascending order and use the
Rule-based Row Splitter node to divide our data into a training set for 2017-2018 and
a test set for 2019.
We then feed the training set into the Linear Regression Learner node to train our
statistical model, and we apply the latter to test data using the Regression Predictor
in order to make sale price predictions.
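A minimal sketch of the same time-based split and regression in Python, assuming a prepared table with an order_year column, a few numeric predictors, and sale_price as the target (all file and column names are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.read_csv("prepared_pricing_data.csv")   # hypothetical prepared table

# Time-based split: no random partitioning, to avoid leaking future information.
train = data[data["order_year"] <= 2018]
test = data[data["order_year"] == 2019]

features = ["ordered_quantity", "competitor_price", "contribution_margin"]  # assumed predictors
model = LinearRegression().fit(train[features], train["sale_price"])

# Apply the trained model to the 2019 test set to predict sale prices.
test = test.assign(predicted_price=model.predict(test[features]))
print(test[["sale_price", "predicted_price"]].head())
```

The key design choice mirrored here is that the split is driven by the time column, not by random sampling, so the model is always evaluated on orders that lie strictly after its training period.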
Once the predictions have been output, we can proceed as in the previous approach
and compute the new contribution margin and turnover figures. Finally, as a last step
in this workflow, we aggregate the results to inspect current vs. new turnover by
product group, and express turnover development as a percentage. Using regression
analysis for price optimization, we should expect a turnover increase of 10% for all our
product categories.
As a final remark, it’s worth noting that this example can be expanded to achieve higher robustness and reliability by inspecting in advance which independent variable(s) best explain current sale prices.
Customer Valuation
In this chapter, we will understand how to capture and properly account for the value each
customer has to our business by monitoring a few key indicators. More specifically,
we will look at use cases to measure customer lifetime value, and compute recency,
frequency and monetary value scores.
Customer Lifetime Value
Workflow on the KNIME Community Hub: SAP ECC Customer Life Time Value Analysis
• Improvement comes from measurement. Measuring CLV and breaking down its
various components empowers businesses with new valuable insights that can
be employed to adopt ad-hoc strategies around pricing, sales, advertising, and
customer retention with a goal of continuously reducing costs and increasing
profit.
The total cost of acquiring and serving the customer is the sum of the various costs incurred over the course of the customer relationship, both to acquire the customer and to maintain the relationship.
Although there are many variations of CLV, starting with the basic formula has the
following advantages:
1. It contains the key ingredients of the CLV method.
2. It is easier to understand and implement.
3. With our example workflow, it provides a template to build on.
Other, more advanced formulas/techniques, such as traditional and predictive methods, provide different approaches to the CLV calculation. For example, if your
customer revenues don’t stay flat year on year, and you need to factor in changes that
happen across the customer lifetime, the traditional version of the formula takes rate
of discount into consideration and provides a more detailed understanding of how CLV
can change over the years:
CLV = Gross margin * (Retention rate / (1 + Discount rate - Retention rate))
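As a small worked example, the traditional formula can be written as a one-line function; the input values below are purely illustrative:

```python
def traditional_clv(gross_margin, retention_rate, discount_rate):
    """CLV = Gross margin * (Retention rate / (1 + Discount rate - Retention rate))."""
    return gross_margin * (retention_rate / (1 + discount_rate - retention_rate))

# Illustrative numbers: 500 euros gross margin, 80% retention, 10% discount rate.
print(traditional_clv(gross_margin=500, retention_rate=0.8, discount_rate=0.1))
# 500 * (0.8 / 0.3) = ~1333.33 euros
```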
There are many examples of CLV helping companies achieve greater sales revenue growth and higher profitability. One of the most well-known case studies is the success of Amazon Prime. Amazon’s own study found that their Kindle owners spend approximately 40% more per year on Amazon purchases, compared to other
customers. As a result, Amazon paid close attention to CLV with the development of
Amazon Prime. By doing so, Amazon understood how to invest and exploit their most
profitable customer segments. Amazon Prime’s growth is what has been most
impressive. They have managed to convert millions of customers into loyal
subscribers at a very fast rate with a much higher than average spend of $1400 per
year.
Using KNIME Analytics Platform and the DVW KNIME Connector node for SAP, we will
show how a sales/account manager can easily extract the total revenue of a set of
customers from an SAP ERP system (e.g., SAP ECC, SAP S/4HANA), calculate their
CLVs and segment customers into different bands by value and lifetime (e.g., 30+
years and 1m+ as platinum).
After a short theoretical intro to CLV, let’s now move to the core of this section: the
analytical implementation. We will walk you through a codeless solution with KNIME
built around three main steps:
1. Extraction of sales and customer data from SAP.
2. Process extracted data and calculate CLV.
3. Format and output the data.
Note. This workflow requires the DVW KNIME Connector for SAP. For more details
and to request a trial, visit www.dvwanalytics.com.
The DVW KNIME Connector is an Application Connector for SAP and works with all core SAP systems, as well as with analytics applications that support the OData v4 standard.
Our data transformation involves the creation of new constant value columns using
the Column Expressions node, and the renaming and exclusion of some of them with
the Table Manipulator node. The final data table is displayed below.
SAP CLV analysis workflow – dynamic input for the xCS Table Read tool.
To use the xCS Table Data Read tool to extract SAP sales data, drag an SAP Executor
node onto the canvas.
First, we configure the Basic tab of the KCS SAP Executor with the following steps:
1. Select SAP Table Data Read from the SAP Tool drop-down menu.
2. Select the appropriate SAP system from the SAP Systems drop-down menu.
3. Enter your SAP User name and Password.
We then configure the Parameters tab of the KCS SAP Executor with the following
steps:
1. Enter VBAK (Sales Document: Header Data) in the Selected Table text box.
2. Click the Search button to search and bring back table metadata from SAP.
3. Select relevant fields for extraction.
4. Click on the Save button to save the configuration.
We repeat the same steps for table KNA1, containing general customer data.
xCS SAP Table Tool to extract data from SAP table –KNA1 (General Data in
Customer Master).
Once the sales and customer data have been extracted, we can use various KNIME nodes to work out both how long each customer has been ordering and how much they have been ordering. With the total revenue information, we can then subtract the supplied costs to calculate CLV.
We use two GroupBy nodes to work out the first and last order date for the customer.
A Joiner node is used to combine first and last order date into the same customer
record. Then, the Date&Time Difference node is used to calculate the number of years
the customer has been placing orders.
We then use the Column Expressions node to segment the customer base into platinum/gold/silver/bronze, etc., based on the number of years the customer has been ordering.
Next, we can start calculating CLV by summing up the customer’s revenue with a
GroupBy node. Then a Joiner node is used to bring in the customer cost figures from
the initial input table. The CLV is then calculated below with a Column Expressions
node, applying the formula illustrated above.
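For readers who want to follow the logic outside of KNIME, here is a minimal pandas sketch of the same steps, assuming the extracted orders and cost figures are available as data frames and that the basic CLV formula is total revenue minus total costs; the column names, sample values, and segment bands are illustrative:

```python
import pandas as pd

# Hypothetical extract of SAP sales documents joined with customer cost figures.
orders = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2"],
    "order_date": pd.to_datetime(["2001-03-01", "2021-06-15", "2015-01-10"]),
    "revenue": [1200.0, 900.0, 450.0],
})
costs = pd.DataFrame({"customer_id": ["C1", "C2"], "total_cost": [700.0, 200.0]})

per_customer = orders.groupby("customer_id").agg(
    first_order=("order_date", "min"),
    last_order=("order_date", "max"),
    total_revenue=("revenue", "sum"),
).reset_index()

# Number of years the customer has been placing orders.
per_customer["lifetime_years"] = (
    per_customer["last_order"] - per_customer["first_order"]
).dt.days / 365

# Segment by ordering lifetime (the bands are illustrative, not the workflow's).
def segment(years):
    if years >= 30:
        return "platinum"
    if years >= 20:
        return "gold"
    if years >= 10:
        return "silver"
    return "bronze"

per_customer["segment"] = per_customer["lifetime_years"].apply(segment)

# Basic CLV: total revenue minus the total cost of acquiring and serving the customer.
per_customer = per_customer.merge(costs, on="customer_id")
per_customer["clv"] = per_customer["total_revenue"] - per_customer["total_cost"]
print(per_customer)
```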
The last step of our workflow is about visualizing and reporting results. Using the
Pie/Donut Chart node, we can plot our Customer Lifetime Value Segmentation
Analysis identifying customer groups by the number of years they have been ordering,
or by how much they have been spending on the orders.
In the chart below, we can see that most of the customers have been very loyal to the business, with 88% of them having ordered for 10-20 years or more.
Although loyal, our customers still spend too little (see figure below). Indeed, 44% of
them spend less than 5K; while only 18% spend more than 1 mil. This valuable piece
of information could be used by the Marketing and Sales departments to create ad-
hoc campaigns and promotions.
Finally, we join the 3 sets of data (customer general data, lifetime, and CLV) with the
Joiner nodes and write the data for further processing or reporting to Tableau format
using the Tableau Writer node.
Recency, Frequency and Monetary Value
• Recency. Customers who have purchased from a business recently are more
likely to buy again than customers who have not purchased for a while. To reverse
the situation, businesses may need to nurture non-recent customers with new
promotional offers or even reintroduce the brand.
• Frequency. Customers who often make purchases are more likely to purchase
again than customers who buy infrequently. For frequent buyers, businesses
have the chance to collect a wealth of information to build a comprehensive
overview of purchasing habits and preferences. On the other hand, one-time
customers are much harder to profile. They are good candidates for a customer
satisfaction survey to understand what can be done to improve customer
retention.
• Monetary Value. While all purchases are valuable, customers who spend more
are more likely to buy again than customers who spend less. This third factor
helps understand more clearly the first two letters in the RFM acronym. A recent
customer who is a frequent buyer and makes purchases at a high price point has
the potential to turn into a brand loyalist and secure high revenues for the
business.
To successfully conduct RFM analysis, businesses need to rely on extensive customer
data which must include information about a) the date (or time interval) of the most
recent purchase; b) the number of transactions within a specified time interval or since
a certain date; and c) the total or average sales attributed to the customer. Based on
this data, a scoring system can be devised that assigns Recency, Frequency and
Monetary Value scores based on an arbitrarily decided number of categories. For
example, five or fewer categories can be used to distinguish groups whose purchases
are more or less recent, or at a higher or lower price point.
Once RFM scores are calculated, it’s easy to identify the best customers by ranking
them. The higher a customer ranks, the more likely it is that they will do business again
with a firm. Notice that the order of the attributes in RFM does not necessarily
correspond to the order of their importance in ranking customers. It’s the combination
of all the attributes that defines the importance of a customer.
With these newly acquired insights into the customer base, it’s possible to start
analyzing the characteristics and purchasing behavior of different groups, identify
what distinguishes them from other groups, and address them with relevant offers or
initiatives. RFM analysis has been successfully used, for example, by nonprofit
organizations to target donors, as people who have been the source of contributions
in the past are likely to make additional gifts. The RFM model can be adapted very differently depending on the company’s business needs and thus usually requires an ad-hoc design.
Thanks to the abundance of customer data, businesses can implement automated
solutions to measure customer value and get a better understanding of how to target
different customer groups effectively. Using KNIME Analytics Platform, we will analyze
transactional data to devise a scoring system that clusters customers based on RFM
scores. Next, we’ll enrich the analysis with the addition of historical CLV calculation,
and present customer insights using interactive visualizations.
Now that we have briefly clarified what the RFM analysis is and why it is useful, let’s
explore how we can implement it with a codeless approach using KNIME Analytics
Platform. The idea is to analyze lightweight transactional data, such as orders, using a quite simple RFM model and a popular KPI. To do that, we’ll walk you through a workflow example consisting of three steps:
1. Data ingestion and RFM preparation
2. RFM and historical CLV calculation
3. Visualization of results
We start off by ingesting a practice dataset containing transactional data with the
following information:
• customer = identifies the customer who will be evaluated via RFM and CLV.
Next, we start processing our data to engineer the quantitative factors that constitute
the RFM analysis. As the first step we aggregate transactions at the customer level
using the GroupBy node. For each customer, the node returns details about the last
purchase (i.e., Recency), the number of products purchased (i.e., Frequency), and the
total product value (i.e., Monetary Value). For example, customer A purchased a
volume of 6 products worth 6800 euros, the last of which was bought 15 days ago.
After the results have been combined, an RFM model can be calculated individually for
each customer. The aggregated data table is fed in parallel into three different k-
Means nodes, one for each factor of the RFM model. These clustering nodes are key
to building a system that assigns RFM scores to each customer. Here, the number of
categories is controlled by k, the parameter that in the k-Means algorithm determines
the number of clusters to form. We set k = 3 but some RFM models set k = 5 or higher.
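A minimal Python sketch of this per-factor scoring, assuming a customer-level table with hypothetical recency_days, frequency, and monetary columns; the mapping of cluster centroids to ordinal scores mirrors the sorting step described in the next paragraph:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical customer-level aggregation (output of the GroupBy step).
rfm = pd.DataFrame({
    "customer": list("ABCDEFGH"),
    "recency_days": [15, 3, 2, 5, 4, 60, 90, 7],
    "frequency": [6, 2, 9, 8, 7, 1, 1, 3],
    "monetary": [6800, 900, 9500, 7200, 8100, 120, 80, 1500],
})

def kmeans_score(values, higher_is_better, k=3):
    """Cluster one RFM factor and map the clusters to ordinal scores 1..k."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(values.to_frame())
    # Rank the cluster centroids so that the 'best' cluster gets the highest score.
    order = km.cluster_centers_.ravel().argsort()
    if not higher_is_better:          # e.g. low recency (recent purchase) is better
        order = order[::-1]
    score_of_cluster = {cluster: rank + 1 for rank, cluster in enumerate(order)}
    return pd.Series(km.labels_, index=values.index).map(score_of_cluster)

rfm["R"] = kmeans_score(rfm["recency_days"], higher_is_better=False)
rfm["F"] = kmeans_score(rfm["frequency"], higher_is_better=True)
rfm["M"] = kmeans_score(rfm["monetary"], higher_is_better=True)
rfm["RFM"] = rfm[["R", "F", "M"]].sum(axis=1)   # consolidated RFM score
print(rfm)
```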
For example, for Recency, we sort the centroids of the three clusters formed from
“Min*(days ago)” in descending order, and we use the Math Formula node to append
a new column “Recency_Cluster” that assigns a higher ordinal value to low Recency
(i.e., ten days ago).
With a series of Joiner nodes, we combine our clustered transactions with the scores
of each factor. Our final table now has three new columns: Recency, Frequency and
Monetary Cluster.
To take advantage of the RFM scores, we now have two options. In case we have a
lookup table for each combination of scores, we can use this and attach the customer
segment. This is a valid approach if we want to identify, for example, different Recency
phases of our customers first. In our simple use case, the RFM calculation involves
summing up the cluster scores of each factor in one consolidated RFM score.
At this point, we can adopt the same procedure described above, that is using the k-
Means node to identify clusters, centroids, and assign ordinal values to rank
customers. In this case, we distinguish four clusters, which correspond to the four
customer segments that we aim to identify. The number of clusters can be adjusted
to fit a specific business strategy.
In addition to RFM, we can enrich our analysis with a popular KPI: historical Customer
Lifetime Value (CLV) for each customer. Again, the transactional data serves as input
and is limited to the maximum lifetime of our customers, which varies depending on
the industry (e.g., in the Retail industry, the maximum lifetime of a customer might be
two months only). In the example, we choose to retain the past 6 months (i.e., 183
days) and disregard older orders. The historical CLV is calculated as the sum of the
order value for each customer, and corresponds to the Monetary Value.
Finally, we join the tables containing RFM and CLV values to visualize the extracted
insights.
3. Visualization of results
We can now use some popular visualization nodes available in the KNIME JavaScript
Views extension to present and interact with our customer insights.
We start off visualizing the results of the RFM analysis using the Bar Chart node. We
can immediately see that customers C, D and E ranked the highest in all three factors
within our customer base. These are our most valuable customers, and we should
nurture our relationship with them in order to ensure long-lasting revenues. For
example, we could offer exclusive benefits or discounts. Similarly, customers B, H and
S show high Recency (and even high Monetary Value in one case), which indicates
great potential for transforming these customers into brand loyalists.
On the contrary, customers F, G and J are currently the least valuable for the business.
We should consider creating ad-hoc marketing campaigns and promotions to attract
them back.
Next, we visualize RFM and CLV together using the Scatter Plot and the Conditional
Box Plot nodes. In the first plot, we can observe that our best customers according to
the RFM scores are historically those who have also spent the most (high CLV). The
opposite is also true.
Additionally, thanks to the conditional boxplot, we can see more clearly the lower and
upper bound, as well as the median monetary value for each customer segment.
Interestingly enough, two major findings stand out. First, the third quartile of the green
boxplot is closely approaching the first quartile of the red boxplot. This tells us that
there’s a business opportunity to grasp in order to close the gap and convert those
customers into the most valuable ones. Second, we can observe that in the orange
boxplot there are upper and lower outliers. The former should be targeted with initiatives to encourage spending and frequent purchases, whereas the latter require ad-hoc action to avoid permanent churn.
Besides RFM and CLV, we can also identify the cross-selling potential of our products.
This can help us understand which combination is most demanded in order to market
it effectively and drive sales. Therefore, a Pivoting node is needed to aggregate the
product types purchased by each customer. This results in a table with dummy
variables indicating the purchase of a product with 1, and 0 otherwise. With the Linear
Correlation node, we can then display the correlation of pairwise combinations in a
matrix. Values close to 1 indicate strong correlation and, hence, higher cross-selling
potential. That’s the case, for example, for the products ABB-MAC and AAA-MAC.
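A compact pandas sketch of this cross-selling check, using a few hypothetical transaction lines in place of the real dataset:

```python
import pandas as pd

# Hypothetical transaction lines: which customer bought which product type.
lines = pd.DataFrame({
    "customer": ["A", "A", "B", "B", "C", "C", "D"],
    "product":  ["ABB-MAC", "AAA-MAC", "ABB-MAC", "AAA-MAC", "XTR-001", "ABB-MAC", "XTR-001"],
})

# One row per customer, one dummy column per product (1 = purchased, 0 = not).
baskets = pd.crosstab(lines["customer"], lines["product"]).clip(upper=1)

# Pairwise correlation of the dummy variables: values close to 1 suggest
# products that tend to be bought by the same customers (cross-selling potential).
print(baskets.corr().round(2))
```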
Finally, the calculated customer KPIs can be saved and exported to an Excel sheet for
further analysis.
Data Protection and Privacy
In this chapter, we will delve into the pressing issue of integrating data privacy and
protection practices in our data flows. We will show how to appropriately handle sensitive customer information while preserving valuable insights for analysis. More
specifically, we will look at a use case to anonymize customer data.
Customer Data Anonymization
Almost every organization has to collect some data about its customers, be it for direct
business purposes, e.g., sending out a newsletter, or for getting insights about the
customers, e.g., how the newsletter influences their involvement. Obviously, customer
data contains personal information that is subject to privacy protection.
Data protection and privacy are regulated worldwide at the legislative level. According to UNCTAD, as of today, 71% of countries worldwide have put data protection and privacy legislation in place and a further 9% have drafted it. For example, the European Union’s GDPR (General Data Protection Regulation) stipulates that any organization processing personal data of an EU citizen or resident can use such data extensively and without privacy restrictions only after anonymization. The fines for non-compliance are very high.
Let’s see how we can address the privacy requirements and anonymize customer data
prior to the analysis using KNIME Analytics Platform.
Use case
Let’s take the example of a company dealing with customer data. The company
collects raw data about the customers such as name, email, age, date and country of
birth, income, etc., in some secure location with restricted access, and assigns each
customer a unique customer key.
This data can be useful to business analysts, marketing specialists, or data scientists
to explore customers’ behavior and get valuable insights for the business. However, in
its pure form, this data identifies individual customers and can’t be shared with all the
company’s employees.
The goal is, therefore, to transform the data in such a way that no customer can be
identified but at the same time keep the maximum amount of non-sensitive
information for the analysis. In other words, the goal is to anonymize the data. After
the data are anonymized and risks of re-identification are assessed, the anonymized
data can be loaded to the space where it is available for further analysis, for example,
to a data mart.
Let’s have a look at the raw data collected by the company. A sample of the data is
presented in the table below.
Disclaimer. The dataset is randomly generated and any similarity to real people is
completely coincidental.
Two attributes –“Name” and “Email”– identify a customer directly. These attributes are
called identifying and will be the first attributes to be anonymized. Although this step
is crucial, it is not sufficient.
Additionally, we need to make sure that no unique combination of attributes can
identify a customer. For example, imagine an attacker –a person who tries to identify
a specific person in the dataset– knows that her neighbor is a customer and was born
on 13.05.1987 in Spain and earns 50k euros. If there is no other customer in the data
with the same combination of these three attributes, the attacker can easily identify
the neighbor and then map all the available information, including sensitive details. The
attributes that can form such combinations are called quasi-identifying.
This problem can be exacerbated by outliers. If an attribute value is unique, it can
identify a person even by itself, without other attributes, like, for example, if a person is
unusually old.
In our case, the combination of attributes “Birthday”, “CountryOfBirth”, and
“EstimatedYearlyIncome” is unique for most customers. This introduces the risk of re-
identification and these attributes should, therefore, also be anonymized.
Disclaimer. For the purpose of this work, we selected three attributes of different
types to anonymize in order to demonstrate the different configuration settings.
We didn’t analyze whether those are optimal and sufficient attributes to
anonymize. There exist quantitative and qualitative methods to decide which
attributes should be anonymized and, for each particular dataset, the attributes
should be thoroughly analyzed to define whether they are identifying, quasi-
identifying, sensitive, or insensitive. For the sake of simplicity, in our example we
consider all the remaining attributes insensitive, but in a real scenario, the columns
“City”, “Country”, “Gender”, and “MaritalStatus” can also be considered quasi-
identifying.
Another attribute worth noting is “CustomerKey”. This column also contains unique
values for each customer and can be used to identify the customer if an attacker has
a dictionary mapping the customer keys with their identities. However, the attribute
itself doesn’t allow us to directly identify the person –the key is a simple counter that
doesn’t have any other meaning. We can, therefore, leave it unmodified in the dataset.
Data anonymization
Let’s proceed to data anonymization. To anonymize the data, we will use the nodes
from the community extension Redfield Privacy Extension which is based on the ARX
Java library.
The “Customer Data Anonymization” workflow shown below performs the initial data
anonymization. This workflow reads the raw data, anonymizes the identifying
attributes, creates the hierarchies for the quasi-identifying data, and anonymizes these
data by applying the anonymization model. Next, it assesses the re-identification risks,
and if the assessment is satisfactory, the workflow saves the hierarchies and
anonymization levels that can be reused for the anonymization of new customer data.
The second workflow, “New Customer Data Anonymization - Deployment”, reads the
data for new customers as the data flows in. It anonymizes the identifying attributes
similarly to the first workflow and reuses the hierarchies and anonymization levels
from the first workflow to anonymize quasi-identifying attributes. The whole dataset is
reassessed again for the re-identification risks. If the assessment is satisfactory, the
anonymized data are loaded to the data mart, otherwise a responsible person is
notified. This workflow can be scheduled to execute regularly with KNIME Business
Hub.
We start with the anonymization of the identifying attributes, in our case, “Name” and
“Email”. These attributes contain only unique values and, therefore, aren’t valuable for
analysis. For example, each customer has their own email address which doesn’t
provide any insight. At the same time, these attributes identify the customer directly
and, therefore, shouldn’t be shared with anyone, not even in part.
This makes the anonymization straightforward and easy since we don’t have to look
for a balance between data accuracy and suppression. In our example, we will hash
the values using the Anonymization node.
In the configuration window, we select the identifying columns “Name” and “Email” and
apply salting. Salting is a method that allows us to diversify and randomize the original
values before hashing by concatenating the original values with, for example, some
random number. Salting aims to protect the data from attackers enumerating and
hashing all the possible original values.
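The exact hashing scheme used by the Anonymization node is not shown here; the snippet below only illustrates the general idea of salted hashing in Python, with hypothetical input values:

```python
import hashlib
import secrets

# One random salt for the whole run; it must be kept secret, just like the
# dictionary output mentioned below.
salt = secrets.token_hex(16)

def anonymize(value: str) -> str:
    """Hash an identifying value together with the secret salt."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

dictionary = {}   # original -> hash, to be stored in a protected location only
for name in ["Alice Example", "Bob Example"]:
    dictionary[name] = anonymize(name)
print(dictionary)
```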
After anonymization, names and email addresses are replaced by hash values that
can’t identify a person anymore (see figure below). It is worth noting that the second
output port contains the dictionary that should be protected and not shared.
Hierarchies of different levels (they can be seen in the second output of the
Create Hierarchy node).
This kind of anonymization can be performed using the Create Hierarchy node for each
quasi-identifying attribute and applying the privacy model that will find the optimal levels of anonymization using the Hierarchical Anonymization node. The configuration
of the Create Hierarchy node is different for different data types. In our example, we
create the hierarchies for three columns of three different types: date, double, and
string.
Creating a hierarchy for the attribute of Date&Time data type in the configuration
window of the Create Hierarchy node.
Numeric data
Next, let’s create the hierarchy for the “EstimatedYearlyIncome” column. The
configuration of the Create Hierarchy node is more complicated for the numeric data
but we will show you one trick to simplify it. Follow the steps below to configure the
node:
1. First, we select the column “EstimatedYearlyIncome” and the hierarchy type
“intervals” and click “Next”.
2. In the next window, in the tab “General”, you can choose the aggregate function
for all the groups. We stick to the default “Interval” option.
3. In the tab “Range”, you can specify minimum and maximum values as well as
restrict the boundary values of the groups. We increase the default income range
to the range from 0 to 200k.
8. We add just one more group with size 5. Now, all the groups we created cover our
general range from 0 to 200k.
From top to bottom: Creating a hierarchy for the attribute of numeric data type in the configuration window
of the Create Hierarchy node.
Categorical data
Finally, let’s create the hierarchy for the “CountryOfBirth” column. Follow the steps below to configure the Create Hierarchy node:
1. First, we select the column “CountryOfBirth” and the hierarchy type “ordering” and click “Next”.
2. Next, in the new window, we need to sort the original country values. The values
3. Next, in the new window, we need to sort the original country values. The values
will be included in groups from this sorted list. To create meaningful groups, we
sort them so that the countries located close to each other are close to each other
in the list. Alternatively, you can sort the values in an alphabetic order.
From top to bottom: Creating a hierarchy for the attribute of string data type in the configuration window of
the Create Hierarchy node.
Data anonymization
Now we have the hierarchies created but, as you can see in the first outputs of the Create Hierarchy nodes, the data isn’t transformed yet. By the way, which level of anonymization should we use? This is something we can find out by applying the privacy model to our hierarchies using the Hierarchical Anonymization node.
As an input, this node requires the original data and all the hierarchies connected via
Hierarchy Configuration ports. In the first configuration tab “Columns”, we need to
define the attribute type for each column (see figures below). We define “Birthday”,
“CountryOfBirth”, and “EstimatedYearlyIncome” as quasi-identifying. We define all the
other columns as insensitive, including the two identifying columns that have been
already anonymized. The hierarchies provided via the input port will be used
automatically.
In the “Privacy Models” tab, we need to select the privacy model. Depending on the use
case and the attributes in the data, different privacy models should be used. We will
use the simplest k-anonymity model defined as follows: “A dataset is k-anonymous if
each record cannot be distinguished from at least k-1 other records regarding the
quasi-identifiers”. We use k = 2. This means that in our anonymized dataset, for each
customer, we want at least one other customer to have the same combination of quasi-identifying attribute values.
From top to bottom: Applying a k-anonymity privacy model to the original data
using the created hierarchies in the configuration window of the Hierarchical
Anonymization node.
After the node is executed, the data are anonymized according to the anonymization
levels suggested by the node. However, different combinations of levels can
anonymize the data, and the node allows you to explore all the options and change the
suggested levels in the interactive view. Note that after the anonymization, all the
columns are converted to the String type.
Anonymity evaluation
Now that the data are anonymized, we need to make sure that the risks of re-
identification are acceptable. We can do that using the Anonymity Assessment node.
This node estimates two types of re-identification risks: quasi-identifiers diversity and
attacker models.
We provide the original and the anonymized data to the first and the second input ports,
respectively. In the configuration of the node, we need to select the three quasi-
identifying columns –“Birthday”, “CountryOfBirth”, and “EstimatedYearlyIncome”. We
also need to set the re-identification threshold –the risk threshold which we consider
acceptable. Before defining the acceptable risk, let’s first discuss what is actually the
highest risk.
In general, the individual risk for each person to be re-identified is 1/k, where k = 1 + the number of other people in the dataset who have the same values in the quasi-identifying columns. For example, for the 2-anonymity model where you have at least 2
indistinguishable people, if you pick one, the risk that you pick the one you try to identify
is 50%. Therefore, the highest risk for the 2-anonymity model is 0.5.
We can then restrict the threshold for the acceptable risk to, let’s say, 0.2. This would mean that we consider a personal data record at risk if for this person there are fewer than 4 other people in the dataset with the same values in the three quasi-identifying columns.
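These risk figures are easy to reproduce outside the node; here is a minimal pandas sketch, using hypothetical generalized values for the three quasi-identifying columns:

```python
import pandas as pd

# Hypothetical anonymized data with the three quasi-identifying columns.
df = pd.DataFrame({
    "Birthday": ["[1980-1990)"] * 5 + ["[1990-2000)"] * 3,
    "CountryOfBirth": ["Southern Europe"] * 5 + ["Northern Europe"] * 3,
    "EstimatedYearlyIncome": ["[40000-60000)"] * 5 + ["[60000-80000)"] * 3,
})

quasi_identifiers = ["Birthday", "CountryOfBirth", "EstimatedYearlyIncome"]

# k for each record = size of the group sharing the same quasi-identifier values.
group_size = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
individual_risk = 1 / group_size

print("Highest risk:", individual_risk.max())      # 1 / smallest group
print("Average risk:", individual_risk.mean())     # compare against the threshold, e.g. 0.2
print("Records at risk:", (individual_risk > 0.2).sum())
```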
Now let’s execute the node and explore the output in the second output port:
• The success rate –the weighted average of the individual risks in the dataset, i.e., avg(1/k)– is 0.027, which is very low in general and much lower than the highest risk.
• The highest risk is 0.143 meaning that for each person in the dataset there are at
least 6 other people in the dataset with the same values in three quasi-identifying
columns. We consider this acceptable for the purposes of this work.
Disclaimer. Note that risk thresholds used here are neither universal nor sufficient
for a real use case. They need to be customized depending on the data, the use
case as well as the organization, ethics board, and country requirements.
Deployment
Now, what happens after we have anonymized the data for the first time? In our scenario, new customer data flows in on a regular basis and should be anonymized consistently with the existing data.
We can reuse the hierarchies we created and the selected levels of anonymization for
the new customer data. In the first workflow, we save the hierarchies using the
Hierarchy Writer node for all three attributes.
Next, the Hierarchical Anonymization node generates a few flow variables describing
the applied anonymization and including the selected levels of anonymization. We
transform the variables into a compact table that we also save.
Now, let’s move to the “New Customer Data Anonymization - Deployment” workflow.
Here we read the new customer data as well as the hierarchies and the anonymization
levels we saved earlier. The Hierarchy Reader node requires the new data (with the
original data types) as an input, reads hierarchies for all three attributes, and applies
them to the input data. You can see a hierarchy preview, specific for this new input
data domain in the second output port. Note that all the values are converted to the
String type. We will use these hierarchy previews as a dictionary to replace the original
values in the new data.
Let’s anonymize! First, we hash the identifying columns just as we did in the previous
workflow. Before we apply the hierarchies, we will need to change the data types to
type String. And then, we can simply use the Cell Replacer node to replace the original
values with the respective values from the correct level of anonymization. The correct
level of anonymization is controlled by the flow variable coming from the table with the
anonymization levels that we saved earlier.
Before we load these new anonymized customer data onto our data mart, we need to
make sure that the whole dataset still complies with our anonymity requirements. If at
some point the existing anonymization is not sufficient anymore, we might need to
update the initial anonymization process. In this case we can set up an automated
notification via email.
Disclaimer. Note that this is just one option to deploy the data anonymization
process. Depending on security policies in different organizations and customer
dataset size, other solutions might be possible. For example, one could anonymize
the whole data from scratch each time if the data amount allows doing so.
Other Analytics
In this last chapter, we will show how to work with other data types that are often used
in Marketing Analytics to analyze visual content or map connections: images and
networks, respectively. More specifically, we will look at use cases in image feature
mining and social media network visualization.
Image Feature Mining with Google Vision
Workflow on the KNIME Community Hub: Extraction of Image Labels and Dominant Colors
It's the backbone of autonomous driving, it's at supermarket self-checkouts, and it's
named as one of the trends to power marketing strategies in 2022 (Analytics Insight).
Computer vision is a research field that tries to automate certain visual tasks —e.g.,
classifying, segmenting, or detecting different objects in images. To achieve this,
researchers focus on the automatic extraction of useful information from a single
image or a sequence of images. Where entire teams were previously needed to scan
networks and websites, computer vision techniques analyze visual data automatically,
giving marketers quick insight to create content that will better attract, retain, and
ultimately drive profitable customer action.
There are different approaches to analyzing image data and extracting features. In this
section, we want to walk through an example showing how to integrate and apply a third-
party service, namely Google Cloud Vision API, in KNIME Analytics Platform to detect
label topicality and extract dominant colors in image data for advertising campaigns.
Image feature mining with Google Cloud Vision in KNIME Analytics Platform.
Google Cloud Vision API offers powerful pre-trained machine learning models for a wide range of computer vision tasks to understand images, from image label assignment and property extraction to object and face detection.
It is worth noting that, while it’s very powerful, Google Cloud Vision API is an automated machine learning model, meaning it offers little-to-no human-computer interaction options. A data scientist can only work on the input data and, after feeding it into the machine, has little chance of influencing the final model.
To harness the power of Google Cloud Vision API, we need to first set up a Google
Cloud Platform project and obtain service account credentials. To do so:
1. Sign in to your Google Cloud account. If you're new to Google Cloud, you’ll need
to create an account.
2. Set up a Cloud Console project:
a. Create or select a project.
b. Enable the Vision API for that project.
c. Create a service account.
d. Download a private key as JSON.
e. You can view and manage these resources at any time in the Cloud Console.
Note. When you create a Cloud Console project you have to enter payment details
to use the service(s) called via the API (even for the free trial). If you don't do this,
you'll incur client-side errors.
Wrapping file ingestion in a component makes the process of selection and upload reusable, faster, less prone to error, and consumable on the KNIME Business Hub (figure above).
Inside the “Key upload” component, we use the File Upload Widget node to select the
JSON file from a local directory.
After uploading the private keys, we connect to Google Cloud Vision API and
authenticate the service to start using it. The component “Authentication Google
Vision API” relies on a simple Python script –parameterized with private keys via flow
variables– to generate a JSON web token to send data with optional
signature/encryption to Google Vision API. When using a Python script in your KNIME
workflows, it’s good practice to include a Conda Environment Propagation node to
ensure workflow portability and the automated installation of all required
dependencies, in particular the PyJWT Python library. Once the JSON web token is
generated, it’s passed on outside the component as a flow variable.
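The component’s internal script is not reproduced here; the snippet below is only a sketch of one common way to build such a token with PyJWT from a downloaded service-account key, under the assumption that a self-signed JWT with the Vision API as audience is used (the file path is hypothetical):

```python
import json
import time
import jwt   # PyJWT; RS256 signing also requires the cryptography package

# Load the service-account key downloaded from the Cloud Console (hypothetical path).
with open("service_account_key.json") as f:
    key = json.load(f)

now = int(time.time())
claims = {
    "iss": key["client_email"],
    "sub": key["client_email"],
    "aud": "https://vision.googleapis.com/",   # self-signed JWT audience for the Vision API
    "iat": now,
    "exp": now + 3600,                         # token valid for one hour
}

# Sign the claims with the service account's private key (RS256).
token = jwt.encode(claims, key["private_key"], algorithm="RS256",
                   headers={"kid": key["private_key_id"]})
print(token[:40], "...")
```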
After completing the authentication, we read the file paths of the image data (e.g., shoes, drinks, food) using the List Files/Folders node, and prepare them for the creation of a valid POST request body to call the web service of Google Cloud Vision.
The request body as defined in Google Cloud Vision and KNIME Analytics Platform.
Using the Container Input (JSON) node, we can reproduce the request body structure
in JSON format, specifying the image mining type we are interested in (e.g.,
“IMAGE_PROPERTIES”), the number of max results, and the input image data. The
crucial transformation is the encoding of images as a base64 representation. We can
do that very easily using the Files to Base64 component, which takes a table of file
paths and converts each file to a base64 string.
Wrangling the base64-encoded images back into the JSON request body with
the String Manipulation node, we create a column where each row contains a request
body to extract features for each input image. We are now ready to call the REST API
using the POST Request node.
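For orientation, this is roughly what the same call looks like in plain Python with the requests library; the image path is hypothetical and the token is assumed to come from the authentication step above:

```python
import base64
import requests

token = "..."  # JSON web token or OAuth access token from the authentication step

with open("shoes.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

body = {
    "requests": [{
        "image": {"content": image_b64},
        "features": [
            {"type": "IMAGE_PROPERTIES", "maxResults": 10},
            {"type": "LABEL_DETECTION", "maxResults": 10},
        ],
    }]
}

response = requests.post(
    "https://vision.googleapis.com/v1/images:annotate",
    json=body,
    headers={"Authorization": f"Bearer {token}"},
    timeout=20,   # mirror the extended 20-second timeout used in the workflow
)
print(response.status_code)      # 200 on success
print(response.json().keys())
```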
• In the “Connection Settings” tab, specify the URL of the web service and the
operation to perform on
images: https://vision.googleapis.com/v1/images:annotate.
• Increase the Timeout time from 2 to 20 seconds in order to extend the server
response time to process images.
• In the “Request Body” tab, point the node to the column containing the JSON
request bodies with the encoded images and the image mining types.
• The “Error Handling” tab handles errors gracefully, and by default outputs missing
values if connection problems or client-side or server-side errors arise.
For large image datasets, sending a single POST request can be computationally expensive and overload the REST API. We can adopt two workarounds: feeding data in chunks using the Chunk Loop Start node and/or checking the box “Send large data in chunks” in the configuration of the POST Request node.
If the request is successful, the server returns a 200 HTTP status code and the
response in JSON format.
In the “Image properties” and “Label detection” metanodes, we use a bunch of data
manipulation nodes —such as the JSON to Table, Unpivoting, and Column
Expressions nodes— to parse the JSON response and extract information about
dominant colors and label topicality. In particular, using the Math Formula node, we
compute dominant color percentages by dividing each score value by the sum of all
scores for each image. The Column Expressions node converts RGB colors into the
corresponding HEX encoding.
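A small Python sketch of these two transformations, using a hypothetical slice of the parsed dominantColors response:

```python
def rgb_to_hex(red, green, blue):
    """Convert an RGB triple (0-255) to its HEX encoding, e.g. (255, 0, 0) -> '#FF0000'."""
    return "#{:02X}{:02X}{:02X}".format(int(red), int(green), int(blue))

# Hypothetical slice of the parsed JSON response (dominantColors.colors entries).
colors = [
    {"color": {"red": 200, "green": 30, "blue": 40}, "score": 0.26},
    {"color": {"red": 20, "green": 20, "blue": 20}, "score": 0.74},
]

total_score = sum(c["score"] for c in colors)
for c in colors:
    share = c["score"] / total_score          # dominant color percentage
    print(rgb_to_hex(**c["color"]), f"{share:.1%}")
```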
The conversion into HEX-encoded colors is necessary to enrich the table view with actual color names that are easier to understand for the human user. To do that, we rely on The Color API. This web service can be consumed via a GET Request that identifies each color unequivocally by its HEX encoding. The Color API returns an SVG image containing the color name and image. It’s worth mentioning that the retrieved color names are purely subjective and are called as such by the creators of the API.
Table view containing dominant color percentage in descending order, RGB and HEX encodings, and an SVG column with color names.
Once dominant colors and topic labels have been mined and parsed, content
marketers may benefit greatly from the visualization of those features in a dashboard
where visual elements can be dynamically selected.
The key visual elements of the dashboard are three JavaScript View nodes: the Tile
View node to select the image, the Pie/Donut Chart node to plot label topicality, and
the Generic JavaScript View node to display a Plotly bar chart with the colors and
percentage values of the selected image. The interactivity and dynamism of the
dashboard come from the Refresh Button Widget. This node emits reactivity events
that trigger the re-execution of downstream nodes in a component: by connecting its
variable output port to the nodes we wish to re-execute, we can interact with the input
data without leaving the component’s interactive view and create dynamic
visualizations that make the UI more insightful and enjoyable.
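As a rough sketch of what the Generic JavaScript View node does, the snippet below
draws a Plotly bar chart of dominant color percentages, with each bar filled in its own
HEX color. It assumes the Plotly library has been made available to the node, and the
colorNames, percentages, and hexColors arrays are placeholders: in the actual workflow
these values come from the node’s input table for the currently selected image.

// placeholder data; in the workflow these arrays are read from the input table
var colorNames  = ["Red", "Black", "Beige"];
var percentages = [26, 21, 17];
var hexColors   = ["#B3001B", "#1C1C1C", "#D9C5A0"];

// container for the chart inside the view
var chartDiv = document.createElement("div");
document.body.appendChild(chartDiv);

// one bar per dominant color, filled with the corresponding HEX color
Plotly.newPlot(chartDiv, [{
    x: colorNames,
    y: percentages,
    type: "bar",
    marker: { color: hexColors }
}], {
    title: "Dominant color percentages",
    yaxis: { title: "%" }
});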
For example, content marketers can use this workflow and the resulting dashboard to
identify the two major topic labels in the example image. Perhaps unsurprisingly, “dog”
and “plant” stand out. What is surprising, however, is the most prominent color: red, at
26% dominance. This appears fairly counterintuitive, since red is used only in the ribbon
and hat, whereas other colors, such as black or beige, occupy many more pixels.
Understanding why the underlying model determines that red is the dominant color is
not easy. The official documentation of the Google Cloud Vision API does not provide
much information. On the GitHub repository of Google Cloud services, however, a few
explanations can be found. The assumption that the color annotator blindly looks at
the pixels and assigns a value based on how many pixels have similar colors seems
naive. Rather, the model determines the focus of the image, and the color annotator
assigns the highest score to the color of that focal region. Hence, in the example
image, the hat is identified as the focus of the image and red as the most prominent
color, followed by different shades of black.
Based on this premise, AutoML models for image mining made consumable as web
services via REST APIs have flourished and become popular and powerful alternatives
to self-built solutions. In that sense, the Google Cloud Vision API is probably one of the
most innovative technologies currently available: it has considerably reduced
implementation costs and delivers a fast and scalable alternative. Yet web services
based on AutoML models often have two major drawbacks: they offer little room for
human-machine interaction (e.g., improvement and/or customization), and their
underlying decision-making process remains hard to explain.
While it’s unlikely that these drawbacks will undermine the future success of AutoML
models exposed as REST APIs, it’s important to understand their limitations and
assess the best approach for each data task at hand.
Social Media Network Visualization
Workflow on the KNIME Community Hub: Visualizing Twitter Network with a Chord Diagram
There are two main analytics streams when it comes to social media: the topic and
tone of the conversations and the network of connections. You can learn a lot about a
user from his or her connection network!
Let’s take Twitter, for example. The number of followers is often assumed to be an
index of popularity. Furthermore, the number of retweets quantifies the popularity of a
topic. The number of crossed retweets between two connections indicates the
liveliness and strength of the connection. And there are many more such metrics.
@KNIME on Twitter has more than 6,730 followers (data from August 2021): the
social niche of the KNIME real-life community. How many of them are expert KNIME
users, how many are data scientists, and how many are attentive followers of the
posted content?
Chord diagram visualizing interactions from the top 20 Twitter users around
#knime. Nodes are represented as arcs along the outer circle and connected to
each other via chords. The total number of retweeted tweets defines the size of
the circle portion (the node) assigned to the user. A chord (the connection area)
shows how often a user’s tweets have been retweeted by a specific user, and is
in the retweeter’s color.
Let’s check the top 20 active followers of @KNIME on Twitter and let’s arrange them
on a chord diagram (see figure above). Are you one of them?
A chord diagram is another graphical representation of a graph. The nodes are
represented as arcs along the outer circle and are connected to each other via chords.
The chord diagram displayed above refers to tweets including #knime during the week
from July 26 to August 3, 2021. The number of retweeted tweets defines the size of
the circle portion (the node). Each node/user has been assigned a random color. For
example, @KNIME is olive, @DMR_Rosaria is light orange, and @paolotamag is blue.
Having collected tweets that include #knime, it is not surprising that @KNIME occupies
such a large space on the outer circle.
The number of retweets by another user defines the connection area (chord), which is
then displayed in the color of the retweeter. @DMR_Rosaria is an avid retweeter. She
has retweeted tweets by @KNIME and by KNIME followers disproportionately more
than everybody else and has therefore made light orange the dominant color of this
chart. Moving on from light orange, we can see that the second-biggest retweeter of
KNIME tweets for that week was @paolotamag.
Data access
We access the data by using the Twitter nodes included in the KNIME Twitter API
extension. We gathered the sample data around the hashtag #knime during the week
from July 26 to August 3, 2021. Each record consists of the username, the tweet itself,
the posting date, the number of reactions and retweets, and, if applicable, who
retweeted it.
Let’s build the network of retweeters. A network consists of nodes and edges. The
users represent the nodes, and their relations, i.e., how often user A retweets user B,
are represented by the edges. Let’s build the edges first:
1. We filter out all tweets with no retweets or that consist of auto-retweets only.
2. We count how often each user has retweeted the tweets of every other user.
To clean the data and compute the edges of the network, all you need are two Row
Filter nodes and a GroupBy node.
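Conceptually, the GroupBy step counts retweets per (author, retweeter) pair. The
JavaScript sketch below reproduces that aggregation on a small, hypothetical list of
retweet records; in the workflow the same result is obtained with the GroupBy node, so
the field names and values here are illustrative only.

// hypothetical retweet records: who wrote the tweet and who retweeted it
var retweets = [
    { author: "KNIME", retweeter: "DMR_Rosaria" },
    { author: "KNIME", retweeter: "DMR_Rosaria" },
    { author: "KNIME", retweeter: "paolotamag" }
];

// count retweets per (author, retweeter) pair: these counts are the edge weights
var edges = {};
retweets.forEach(function(r) {
    var key = r.author + " <- " + r.retweeter;
    edges[key] = (edges[key] || 0) + 1;
});
// edges is now { "KNIME <- DMR_Rosaria": 2, "KNIME <- paolotamag": 1 }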
Now we want to build a weighted adjacency matrix of the network, with usernames as
column headers and row IDs, and, in each data cell, the number of times one user
retweeted the tweets of the other. We achieve that with the following steps.
This metanode builds the matrix of interactions between Twitter usernames around #knime.
4. To these user pairs we add the previously computed edges by using a Joiner
node.
5. The Pivoting node then creates the matrix structure from the (username1,
username2, count of retweets) data table.
You can read the matrix like this: “The user named in the Row ID was retweeted n times
by the user named in the column header.”
• The matrix we created is the data input for a Generic JavaScript View node.
• To draw the chord diagram, we need the D3 library, which can be added to the
code of the Generic JavaScript View node.
• The JavaScript code required to draw this chart is relatively simple and is shown here.
// creating the chord layout given the entire matrix of connections.
// svg, width, height, chord, matrix, color, arc, ribbon and the mouse handlers
// are assumed to be defined earlier in the node (see the sketch below).
var g = svg.append("g")
    .attr("transform", "translate(" + width / 2 + "," + height / 2 + ")")
    .datum(chord(matrix));

// creating groups, one for each twitter user.
// each group will have a donut chart segment, ticks and labels.
var group = g.append("g")
    .attr("class", "groups")
    .selectAll("g")
    .data(function(chords) { return chords.groups; })
    .enter().append("g")
    .on("mouseover", mouseover)
    .on("mouseout", mouseout)
    .on("click", click);

// creating the donut chart segments in the groups.
group.append("path")
    .style("fill", function(d) { return color(d.index); })
    .style("stroke", function(d) { return d3.rgb(color(d.index)).darker(); })
    .attr("d", arc)
    .attr("id", function(d) { return "group" + d.index; });

// creating the chords (also called ribbons) connections,
// one for each twitter users pair with at least 1 retweet.
g.append("g")
    .attr("class", "ribbons")
    .selectAll("path")
    .data(function(chords) { return chords; })
    .enter().append("path")
    .attr("d", ribbon)
    .style("fill", function(d) { return color(d.target.index); })
    .style("stroke", function(d) { return d3.rgb(color(d.target.index)).darker(); });
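The snippet above assumes that the chord layout, the arc and ribbon generators, the
color scale, and the SVG element have been created earlier in the node. A minimal
setup along the following lines would work with the D3 v4 API; the sizes, padding, and
color scheme are arbitrary choices, and in the workflow the matrix variable is filled
from the node’s input table rather than hard-coded.

var width = 600, height = 600;
var outerRadius = Math.min(width, height) * 0.5 - 40;
var innerRadius = outerRadius - 20;

// hypothetical 3x3 matrix of retweet counts; row i, column j = how often
// user i was retweeted by user j (in the workflow this comes from the input table)
var matrix = [
    [0, 12, 3],
    [5,  0, 1],
    [2,  4, 0]
];

// chord layout, arc generator for the outer segments, ribbon generator for the chords
var chord = d3.chord().padAngle(0.05).sortSubgroups(d3.descending);
var arc = d3.arc().innerRadius(innerRadius).outerRadius(outerRadius);
var ribbon = d3.ribbon().radius(innerRadius);

// one color per user
var color = d3.scaleOrdinal(d3.schemeCategory10);

// SVG element appended to the body of the view
var svg = d3.select("body").append("svg")
    .attr("width", width)
    .attr("height", height);

// simple hover and click handlers; replace with richer interactivity as needed
function mouseover(d) { d3.select(this).style("opacity", 0.8); }
function mouseout(d)  { d3.select(this).style("opacity", 1.0); }
function click(d)     { /* no-op */ }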
If you would prefer a more traditional way to visualize your results, there are also
standard KNIME nodes for analyzing your social media network in the “classic” and
more formal way. Using these nodes also means that you don’t have to write any
JavaScript. What we want to do is analyze the same network of the 20 most active
followers of @KNIME on Twitter, but this time with the KNIME Network Viewer node.
Top 20 Twitter users around #knime visualized as a network map using the
Network Viewer node. Nodes of the underlying graph are represented as circles
and are connected via arrows. The size of a circle (node) is defined by the total
number of times a user has been retweeted by one of the other users. The size
of an arrow (edge) represents how often one user retweeted another user’s
tweets.
This network map displays the graph with the following key elements: nodes are
represented by a specific shape, size, color, and position. We arbitrarily chose circles
for the shape. Each node is colored and labeled with respect to the user it represents.
A circle’s size depends on the overall number of times the specific user’s tweets have
been retweeted by other users. The position of a node is defined by its degree: the
more input and output connections a node has, the higher its degree and the more
centrally it is displayed on the network map.
Note. For ease of visualization, in the figure above we have manually rearranged the
position of the @KNIME and @DMR_Rosaria nodes to distinguish the edges more
easily. This is why these nodes do not have a central position.
The edges, which connect the nodes, are another key element. They are visualized as
arrows because we are visualizing a directed graph. The direction of an arrow shows
which user has retweeted somebody else’s tweets, while its size depends on the
number of retweets.
Final Book Overview and Next Steps
Authors: Francisco Villarroel Ordenes, LUISS Guido Carli University & Roberto Cadili, KNIME
Meet Your Customers: The Marketing Analytics Collection with KNIME is a book that
aims to amplify the use of marketing analytics methods and processes among
academics, practitioners, and students. Using a low-code, visual programming
interface, we have designed and implemented a wide array of workflows (projects)
that will allow you to tackle different marketing problems. All workflows are free and
open source, and can be downloaded from the “Machine Learning and Marketing”
repository on the KNIME Community Hub.
Let’s recap
The book is divided into seven chapters, which should simplify its adoption in a
Marketing Analytics course or its use as a practice guide for marketers. Below we
summarize the content of each chapter.
In chapter 2, called “Segmentation and Personalization,” we tackled two of the basic
concerns of marketers: How to identify groups of consumers based on their
characteristics, preferences, and behaviors? And how to use that information to
personalize marketing offerings (e.g., communication, products, services)? The
segmentation workflow uses a public dataset from a telecommunications company
and implements k-means clustering to identify groups of consumers based on their
behaviors when contacting customer service (e.g., a call center). The Market Basket
Analysis workflow takes consumer buying patterns (e.g., products bought at a
supermarket by a consumer sample) and uses that information to develop a set of
association rules, which result in recommendations for new consumers sharing
similar buying patterns. Finally, the personalization workflow demonstrates how to
build a recommendation engine on large-scale datasets. In this case, the workflow is
applied to a Netflix database and uses Spark collaborative filtering to give movie
recommendations based on content previously watched by users.
Extracting marketing-relevant information from consumer mindsets was the focus of
chapter 3. Mindsets or feedback metrics are perceptions, attitudes, and emotions
about a product, service, or brand, which have been shown to predict behavioral
outcomes (e.g., conversion rates, sales). The SEO Semantic Keyword Search workflow
demonstrates how to scrape Google SERP (Search Engine Results Page) and Twitter
tweets to obtain keyword suggestions for a website or landing page. The Customer
Experience (CX) workflow applies a topic modeling algorithm (LDA) to online reviews
(e.g., TripAdvisor) to understand which service attributes have the strongest effect on
customer satisfaction (e.g., star rating). The Brand Reputation Tracker workflow shows
how to scrape Twitter data to measure brand reputation using a state-of-the-art
marketing method. In the example, users can measure the brand driver called “Brand”,
which covers the attributes of coolness, excitement, innovativeness, and social
responsibility. The final section comprises a series of workflows to measure consumer
sentiment (i.e., sentiment analysis) from text data (e.g., social media conversations)
using valenced lexicons (i.e., words with a positive or negative connotation), machine
learning (e.g., Decision Tree and XGBoost Tree Ensemble), deep learning (e.g., LSTM
deep neural networks), and transformer models (e.g., BERT).
In chapter 4, we moved on to analytics for describing, understanding, and predicting
consumer behavior. The first workflow, Querying Google Analytics, focuses on
querying data from the most widely used marketing analytics tool: Google Analytics.
This workflow allows users in possession of a Google Analytics account to query the
service and obtain consumer behavioral data relative to a specific website (e.g., page
views, bounce rate, etc.). In the Predicting Customer Churn workflow, we demonstrate
how to use consumer transactional data (e.g., product usage, calls to customer
service, etc.) to develop a machine learning predictor (i.e., Random Forest) of
customer churn. This is particularly useful for subscription-based business models,
where it is crucial to monitor retention and churn. Finally, the third workflow touches
upon Attribution Models, one of the most relevant marketing analytics problems,
concerning the identification of the marketing channel with the largest impact on
conversion rates. This workflow demonstrates how to use methods such as last-
touchpoint attribution, Shapley value, regression-based methods, and field
experiments.
Workflows that deal with marketing mix activities, such as pricing, promotion, place,
and product, are included in chapter 5. Every marketer at some point has to make
decisions concerning these four activities. At the moment, we have included one
workflow concerning pricing analytics. The workflow shows how to apply “value-based
pricing” with a set of rules based on industry information, and “pricing optimization”
with regression models, which involves a greater degree of automation. We expect
future versions of the book to explore other marketing mix activities.
Customer valuation is addressed in chapter 6 and involves the implementation of
two workflows. The first workflow focuses on measuring customer lifetime value
(CLV). This is an important measurement that enables marketers to understand the
profitability of consumers by considering the revenue they are projected to generate
and their cost of acquisition. The second workflow uses the RFM (Recency,
Frequency, and Monetary value) framework to extract past transactional data (e.g.,
average order value) and identify segments of consumers based on their transactional
patterns. This information, together with estimations of CLV, can help organizations
take new retention actions (e.g., bundles or targeted advertising) based on past
customer behavior.
Chapter 7 focuses on Data Protection and Privacy. It is designed to include marketing
analytics tools and processes that help organizations anonymize their data and
comply with increasingly strict privacy regulations. Currently, we have included a
training and a deployment workflow that tackle the data anonymization problem,
where the goal is to transform the data in such a way that no customer can be
identified while keeping the maximum amount of non-sensitive information for the
analysis. For example, the workflows use specific nodes that allow replacing any name
or birthdate with unrelated information that carries a minimal risk of re-identification.
The final chapter of the book, chapter 8, is called “Other Analytics” and includes two
workflows that we could not assign to the previous chapters. The first workflow
focuses on image analytics with the help of Google Cloud Vision. Marketing content is
mostly visual, and we expect content marketers and advertisers to be interested in
measuring and identifying which visual features in images result in greater
engagement. The image mining workflow allows users to extract features such as
color presence, color concentration, and object identification in batches of images.
The second workflow concerns network mining. It deals with the problem of
measuring and understanding relationships within a network of users (e.g., on social
media), and it can help marketers understand how users interact and what the
potential implications of these interactions are for service development.
What’s next?
The chapters and workflows presented in this book form a living repository of projects.
All workflows are open for improvement, so we are happy to receive suggestions and
recommendations on how to adapt them to the ever-evolving marketing analytics
landscape. Thinking about the future, we already have some projects in the pipeline.
We aim to deepen the use of attribution models with more advanced algorithms
(e.g., Markov Chain Monte Carlo) and to augment the understanding of unstructured
data such as video (e.g., TikTok) and audio (e.g., podcasts). As stated earlier, we are
also interested in including additional workflows tackling traditional marketing mix
activities, such as promotion, product, and place. For example, we are interested in
using the geospatial functionalities of KNIME to understand how to inform marketing
decisions using aspects related to the geographic coordinates of stores and
consumers. Finally, we expect to develop future workflows around customer service
and the use of interactive agents such as chatbots. This is a prominent area that could
be of great utility for customer service departments handling thousands of queries on
a daily basis.
We are looking forward to hearing your feedback on the “Machine Learning and
Marketing” repository on the KNIME Community Hub, and we will keep working on the
development of workflows for the community.