Meet Your Customers v4.7 Ebook

Copyright © 2023 by KNIME Press


All rights reserved. This publication is protected by copyright, and permission must be
obtained from the publisher prior to any prohibited reproduction, storage in a retrieval
system, or transmission in any form or by any means, electronic, mechanical,
photocopying, recording or likewise.
This book has been updated for KNIME 4.7.
For information regarding permissions and sales, write to:
KNIME Press
Talacker 50
8001 Zurich
Switzerland
knimepress@knime.com

ISBN: 978-3-9523926-3-8
www.knime.com
Preface
Machine learning promises great value for marketing-related applications. However,
the proliferation of data types, methods, tools, and programming languages hampers
knowledge integration amongst marketing analytics teams, making collaboration and
extraction of actionable insights difficult.
Visual programming tools come to the rescue. Their uncomplicated data flow building process and intuitive UI can help marketing and data teams develop, orchestrate, and deploy advanced machine learning projects in a fashion that is visual, easy to create, and easy to share.
Motivated by the desire to share the technical expertise and knowledge gathered over the years, we have collected in this book our experiences solving some of the most popular challenges in Marketing Analytics using a no-code/low-code approach with KNIME Analytics Platform.
All examples described in this book are the result of a prolific collaboration between
business and academia. We teamed up to create a live repository of Machine Learning
and Marketing solutions tailored to new and expert users, market analysts, students,
data scientists, marketers, researchers, and data analysts. All workflow solutions are
available for free on the KNIME Community Hub.
We will update this book regularly with descriptions and workflows from the newest projects in Marketing Analytics as they become available.
We hope this collection of Marketing Analytics experiences will help foster the growth
of practical data science skills in the next generation of market and data professionals.

Francisco Villarroel Ordenes, LUISS Guido Carli University


Roberto Cadili, KNIME
Table of Contents
MARKETING ANALYTICS WITH KNIME 1

MACHINE LEARNING IN MARKETING ANALYTICS 2


HOW A MARKETER BUILT A DATA APP: NO CODE REQUIRED 10

SEGMENTATION AND PERSONALIZATION 19

CUSTOMER SEGMENTATION 20
MARKET BASKET ANALYSIS WITH THE APRIORI ALGORITHM 33
MOVIE RECOMMENDATIONS WITH SPARK COLLABORATIVE FILTERING 38

CONSUMER MINDSET METRICS 45

IMPROVE SEO WITH SEMANTIC KEYWORD SEARCH 46


EVALUATE CX WITH STARS AND REVIEWS 57
BRAND REPUTATION MEASUREMENT 67
ANALYZE CUSTOMER SENTIMENT 74

CONSUMER BEHAVIOR 103

QUERYING GOOGLE ANALYTICS 104


PREDICT CUSTOMER CHURN 111
ATTRIBUTION MODELING 116

MARKETING MIX 127

PRICING ANALYTICS 128

CUSTOMER VALUATION 136

CUSTOMER LIFETIME VALUE 137


RECENCY, FREQUENCY AND MONETARY VALUE 147


DATA PROTECTION AND PRIVACY 156

CUSTOMER DATA ANONYMIZATION 157

OTHER ANALYTICS 172

IMAGE FEATURE MINING WITH GOOGLE VISION 173


SOCIAL MEDIA NETWORK VISUALIZATION 182

FINAL BOOK OVERVIEW AND NEXT STEPS 188

Marketing Analytics with KNIME

In this chapter, we will provide an introduction to the live repository of Marketing Analytics solutions hosted on the KNIME Community Hub, zoom in on some of the seed applications powered by KNIME Analytics Platform, and report the first-hand experience of a marketing content creator who used this no-code/low-code tool to effortlessly build an interactive Data App.

This chapter includes the articles:

• Machine Learning in Marketing Analytics, p. 2


– Francisco Villarroel Ordenes, LUISS Guido Carli University &
– Rosaria Silipo, KNIME

• How a Marketer Built a Data App: No Code Required, p. 10


– Heather Fyson, KNIME


Machine Learning in Marketing Analytics

Authors: Francisco Villarroel Ordenes, LUISS Guido Carli University & Rosaria Silipo, KNIME

Workflows on KNIME Community Hub: Machine Learning and Marketing

Many businesses are currently expanding their adoption of data science techniques to include machine learning, and marketing analytics is no exception. Anything can be reduced to numbers, including customer behavior and color perception, and therefore anything can be analyzed, modeled, and predicted.

Marketing analytics already involves a wide range of data collection and transformation techniques. Social media and web-driven marketing have given a big push to the digitalization of the space; counting the number of visits, the number of likes, the minutes of viewing, the number of returning customers, and so on is common practice. However, we can move one level up and apply machine learning and statistical algorithms to the available data to get a better picture of not just the current but also the future situation.
Marketers can capitalize on machine learning techniques to analyze large datasets, identify patterns, and perform predictive analytics. Examples include analyzing social media posts to see what customers are saying, mining images and videos for visual insights, or predicting customer churn, to name just three.
In the Machine Learning and Marketing space on the KNIME Community Hub, you will find a number of use cases applying machine learning algorithms to classic marketing problems.
In this section, we will describe the use cases that constituted the seed of the initial Machine Learning and Marketing space, highlighting what is special about each one and the insights it brings.

• Prediction of customer churn

• Measuring sentiment analysis in social media

• Evaluation of customer experience through topic models

• Content marketing and image mining

• Keyword research for search engine optimization


The public workflow repository for marketing analytics solutions on the KNIME Community Hub.

Since its creation in 2021, we have maintained this repository and will continue to do so, updating the existing workflows and adding new ones every time a solution from a new project becomes available.

Note. This solution repository has been designed, implemented, and maintained by
a mixed team of KNIME users and marketing experts from the KNIME Evangelism
Team in Constance (Germany), headed by Rosaria Silipo, and Francisco
Villarroel Ordenes, Professor of Marketing at LUISS Guido Carli University in Rome (Italy).1

1 F. Villarroel Ordenes & R. Silipo, "Machine learning for marketing on the KNIME Hub: The development of a live repository for marketing applications", Journal of Business Research, 137(1):393-410, DOI: 10.1016/j.jbusres.2021.08.036.

Prediction of customer churn


Using existing customer data (e.g., transactional, psychographic, attitudinal), predictive churn models aim to classify customers who have churned or remained, as well as estimate the probability of new customers churning, all in an automated process. If the churn probability is very high and the customer is valuable, the firm might want to take action to prevent this churn.
The “Churn Prediction” subfolder in the Machine Learning and Marketing space on the KNIME Community Hub includes:

• A workflow training an ML classifier (a Random Forest in this case) to distinguish customers who have churned from customers who have stayed in the training set.

• A deployment workflow applying the previously trained model to new customers, estimating their current probability to churn, and displaying the result on a simple dashboard (figure below). A minimal code sketch of this train-and-deploy pattern follows the figure.

The dashboard reporting the churn risk in orange for all new customers.
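The workflows above are entirely no-code. Purely as an illustration of the same train-and-deploy pattern outside KNIME, here is a minimal Python sketch using scikit-learn; the file names, the numeric features, and the "Churn" label column are assumptions, not part of the original solution.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data: one row per customer, numeric features, binary "Churn" label
train = pd.read_csv("training_customers.csv")        # assumed file name
X_train = train.drop(columns=["Churn"])
y_train = train["Churn"]

# Training workflow: fit a Random Forest to distinguish churned from stayed customers
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Deployment workflow: estimate the churn probability for new customers
new_customers = pd.read_csv("new_customers.csv")     # assumed file name, same feature columns
new_customers["churn_probability"] = model.predict_proba(new_customers)[:, 1]
print(new_customers.sort_values("churn_probability", ascending=False).head())
```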

Sentiment analysis
Sentiment is another popular metric used in marketing to evaluate the reactions of
users and customers to a given initiative, product, event, etc. Following the popularity
of this topic, we have dedicated a few solutions to the implementation of a sentiment
evaluator for text documents. Such solutions are contained in the “Sentiment Analysis”
subfolder. All solutions focus on three sentiment classes: positive, negative, and
neutral.


There are two main approaches to the problem of sentiment:

• Lexicon-based. Here, a list of positive and a list of negative words (dictionaries), related to the corpus topics, are compiled, and grammar rules are applied to estimate the polarity of a given text.
• Machine Learning-based. The solutions here don’t rely on rules, but on machine
learning models. Supervised models are trained to distinguish between negative,
positive, and neutral texts and then applied to new texts to estimate their polarity.
Machine-learning-based approaches have become more and more popular, mainly because they bypass the grammar rules that would otherwise need to be hard-coded. Among the machine learning-based solutions, a few options are possible:

• Traditional machine learning algorithms. In this case, texts are transformed into numerical vectors, where each unit represents the presence/absence or the frequency of a given word from the corpus dictionary. After that, traditional machine learning algorithms, such as Random Forest, Support Vector Machine, or Logistic Regression, can be applied to classify the text polarity. Notice that in the vectorization process the order of the words in the text is not preserved (a minimal code sketch of this approach follows the list).
• Deep Learning-based. Deep learning-based solutions are becoming more and
more popular for sentiment analysis, since some deep learning architectures can
exploit the word context (i.e., the sequence history) for better sentiment
estimation. In this case, texts are one-hot encoded into vectors, and the sequence
of such vectors is presented to a neural network, which is trained to recognize the
text polarity. Often, the architecture of the neural network includes a layer of Long
Short-Term Memory units (LSTM), since LSTM performs the task by taking into
account the order of appearance of the input vectors (the words), i.e., by taking
into account the word context.
• Language models. They are also referred to as deep contextualized language
models because they reflect the context-dependent meaning of words. It has been
argued that these methods are more efficient than Recurrent Neural Networks
because they allow parallelized encoding (rather than sequential) of word and sub-
word tokens contingent on their context. Recent language model algorithms are
ULMFiT, BERT, RoBERTa, XLNet, etc. In the Machine Learning repository, we
provide a straightforward implementation of BERT.
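To make the first, traditional option above concrete, here is a small illustrative sketch (scikit-learn, with made-up texts and labels); it is not the KNIME workflow from the repository, just the same idea in code: bag-of-words vectorization followed by a linear classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus with sentiment labels (positive / negative / neutral)
texts = ["great product, works perfectly",
         "terrible support, never again",
         "the parcel arrived on tuesday"]
labels = ["positive", "negative", "neutral"]

# Vectorize the texts (word order is lost), then train a linear classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

# Estimate the polarity of a new, unseen text
print(model.predict(["awful product, poor support"]))
```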


Visualization of tweets with estimated sentiment (red = negative, green = positive, light orange = neutral).

Topic Detection and Customer Experience


Customer experience management and the customer journey are some of the most
popular topics in the marketing industry. Much of the information about customer
experience comes from reviews and feedback, and/or from the star-ranking systems
on websites and social media.
The popularity of topic models has resulted in a continuous development of algorithms, such as Latent Dirichlet Allocation (LDA), Correlated Topic Models (CTM), and Structural Topic Models (STM), among others, all of which have already been applied in business research. LDA is available in the KNIME Text Processing extension as a KNIME native node. The LDA node detects m topics in the whole corpus and describes each one of them using n keywords, m and n being two of the parameters required to run the algorithm.
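For illustration only (the book uses the KNIME LDA node), the sketch below runs LDA with scikit-learn on a few made-up reviews and prints n keywords for each of m topics; the review texts and the values of m and n are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Made-up hotel reviews
reviews = ["friendly front desk and fast check in",
           "room was clean but breakfast was poor",
           "booking on the website was confusing",
           "lovely pool and common areas"]

m_topics, n_keywords = 2, 3          # invented parameter values

# Build the document-term matrix and fit LDA with m topics
vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(reviews)
lda = LatentDirichletAllocation(n_components=m_topics, random_state=0)
lda.fit(doc_term)

# Describe each topic with its n most heavily weighted keywords
terms = vectorizer.get_feature_names_out()
for topic_id, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:n_keywords]]
    print(f"Topic {topic_id}: {top_terms}")
```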
You’ll find an example workflow, showing the usefulness of discovering topics in
reviews, in the subfolder “CX and Topic Models” in the Machine Learning and
Marketing space.
The workflow extracts topics from reviews using the LDA algorithm. After that, it estimates the importance of each topic via the coefficients of a linear regression (implemented with a KNIME native node) and via the coefficients of a polynomial regression (implemented in an R script within the KNIME workflow). It then displays the average number of stars for all topics extracted from the reviews for two different hotels (figure below).
In the bar chart, for instance, we can see that for hotel 2 the topic "Booking interactions" is never mentioned. We can also notice that while hotel 1 gets great reviews for the "Common areas", hotel 2 excels for the "Front desk".


Average number of stars by reviews around one of the 15 detected topics.

Content Marketing and Image Mining


The last ten years have shown an exponential growth of visual data including images
and videos. This growth has resulted in an increasing development of technologies to
classify and extract relevant insight from images. This phenomenon has had an impact
on marketing as well. As both consumers and firms are relying more on pictorials and
videos to communicate, researchers need new processes and methods to analyze this
type of data.
The greater interest in the analysis of visuals and its implications for firm performance
motivated us to develop a workflow that can help with the analysis of visual content.
The workflow takes advantage of Google Cloud Vision services (accessed via POST
Request) to detect labels (e.g., objects, animals, humans) and extract nuanced image
properties, such as color dominance.
A second workflow uses deep learning Convolutional Neural Networks to classify
images of cats vs. dogs. Changing the image dataset and correspondingly adjusting
the network allows you to solve any other image classification task.
Find both workflows in the subfolder “Image Analysis” of the Machine Learning and
Marketing space. The figure below shows the result obtained from the analysis of an
image via Google Cloud Vision services.


Analysis of the image in the top left corner through Google Vision services.

Keyword Research for SEO


It is known that search engines rank web pages according to the presence of specific
keywords or groups of keywords that are conceptually and/or semantically related. In
addition, keywords should be taken from the specialized lingo of experts as well as from the conversational language of neophytes. Popular sources for such keywords are SERPs (Search Engine Result Pages) as well as social media.
In the “SEO” subfolder of the Machine Learning and Marketing space, you’ll find a
workflow for semantic keyword research, which was implemented following the article
“Semantic Keyword Research with KNIME and Social Media Data Mining –
#BrightonSEO 2015” written by Paul Shapiro in 2015.
The upper branch of the workflow connects to Twitter and extracts the latest tweets around a selected hashtag. The lower branch connects to the Google Analytics API and extracts SERPs around a given search term. After that, URLs are isolated, web pages are scraped via GET requests to the Boilerpipe API, and keywords are extracted together with their frequencies.
Keywords include single terms with the highest TF-IDF score, co-occurring terms with the highest co-occurrence frequency, and keywords with the highest score from topics detected via the Latent Dirichlet Allocation (LDA) algorithm.
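As a rough illustration of the first of these keyword types (not the workflow itself), the snippet below ranks single terms by their average TF-IDF score across a set of scraped texts; the documents are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical texts scraped from SERP result pages and tweets
documents = ["cybersecurity threat intelligence for small business",
             "ransomware attacks and cybersecurity best practices",
             "zero trust network security explained"]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)

# Average TF-IDF score of each term across all documents, highest first
scores = tfidf.mean(axis=0).A1
terms = vectorizer.get_feature_names_out()
ranking = sorted(zip(terms, scores), key=lambda pair: pair[1], reverse=True)
print(ranking[:10])
```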
As an example, we searched for tweets and Google SERPs around “cybersecurity”. The
resulting co-occurring keywords are shown in the word cloud in the figure below. If you are working in the field of cybersecurity, then including these words in your web page should increase your page ranking.

Top co-occurring keywords around the topic of cybersecurity using different methods.

Explore Machine Learning and Marketing Examples with KNIME
A mixed team of KNIME users from industry and academia has created, developed, and maintained several machine learning-based solutions for marketing analysts, addressing some of the most interesting problems in marketing analytics: churn prediction, sentiment analysis, topic detection to evaluate customer experience, image mining, keyword research for SEO, etc.
All workflows are available for free in a public repository on the KNIME Community Hub named "Machine Learning and Marketing". They represent a first sketch for solving marketing analytics problems but can, of course, be downloaded and customized according to the user's own business requirements and data specs.


How a Marketer Built a Data App: No Code Required

Author: Heather Fyson, KNIME

Workflow on KNIME Community Hub: Overall Blog Performance Dashboard

I’m a content marketer who writes about data science. My acquaintance with data
science began when I started working as an assistant at the Chair for Bioinformatics
and Information Mining at Konstanz University. Initially, while copywriting papers about pruning and Decision Trees, fuzzy logic, or Bisociative Information Networks, I gradually became more familiar with information mining and increasingly in awe of the data scientists around me.
Since then, I’ve moved into a marketing role, and have written 20+ articles, edited over
1,000, and interviewed dozens of data scientists. But I’ve never done much analysis
myself. I occasionally look at Google Analytics (GA), but I don’t download .csv files,
use Tableau, or even always check the traffic to my blogs. Curious to finally step into
this world of data science myself, I wanted to see how I could use data analytics to
make our content stronger.
KNIME is one of the low-code tools making data analytics accessible to even non-
technical users, like me. So, I decided to set myself a challenge: Build a Data App with
the low-code KNIME Analytics Platform (that would be more helpful than GA).

A shareable Data App to measure web traffic growth


My solution will be an interactive dashboard showing the following metrics:

Month-over-month (MoM) web traffic in %

• Why this is important: It will show how the blog is growing and give insight into
monthly trends. Getting this figure as a percentage lets me compare it more
easily with industry benchmarks. Such insights will help me plan content better.

• Why Google Analytics doesn’t quite hit the mark: In GA, this involves manually
setting time frames for each comparison, and not all metrics can be combined in
custom reports.

• Why my workflow helps: I only need to do the manual work once, configuring the workflow to query GA correctly. And my dashboard can combine data from any level of GA. Even in custom-built GA reports, combinations of data are restricted depending on whether the data is user-level, session-level, or hit-level.

Average time spent and bounce rates

• Why these are important: They should give me an indication of how interesting
our content is. Are people arriving and staying to read, or are they arriving, looking,
and leaving?

• Why Google Analytics doesn’t quite hit the mark: While I can set a custom report
to get me these figures and send a report to my boss to review, the report is static.
She can’t delve deeper if she wants to compare a different time period.

• Why my workflow helps: The dashboard produced by my workflow is interactive, enabling further analysis. I can also share it with my colleagues as a browser-based app.

A dashboard as a browser-based app

The dashboard will be served up as a browser-based Data App, giving my workflow an interface my colleagues and I can use to access and explore the data. I can construct this app within my workflow and share it by uploading it to the KNIME Business Hub. Anyone in my team with the link can access the app independently, exploring the data for further analysis. We won't need to have KNIME running in the background to use it. And in the future, if I need to adjust the underlying workflow, add more metrics, or change a chart, I can do so without bothering my colleagues, who can continue to access the app via the link.

Getting started with KNIME


I was not totally unaware of how to use KNIME, but I had never used it for a real-life
project. So before I started building the workflow, I took advantage of these resources:
1. Self-paced online courses: I took L1 for Data Scientists and L1 for Data Wranglers, which covered most of what I needed to know to wrangle my project: connecting to databases, cleaning and filtering data, the basic concepts of data science, and reporting. At the end, I took the L1 Certification exam (free of charge) to check my knowledge.
2. KNIME Community Hub has hundreds of blueprints, which meant I didn’t have to
start from scratch. I could explore and download marketing analytics workflows
contributed by other marketers and business analysts, and get ideas for my own
workflow.


3. Maarit Widmann, my mentor! I'm lucky enough to work with her directly, but she
regularly teaches courses and her peers run Data Connects, where you can
directly ask data scientists questions and discuss projects you’re working on.

From an outline to a complete workflow


Maarit suggested working out what I needed to do and creating these steps:
1. Access Google Analytics and get the data.
2. Explore the data to remove articles that conceal trends in the blog.
3. Get the data into the right format for calculations.
4. Calculate metrics and display them in a dashboard.

These four steps ultimately translated into different sections of my workflow. I’ll
describe now how I got there, what was easy, and what stumbling blocks I
encountered.

MoM blog performance workflow to connect to Google Analytics, remove outlier articles and produce an
interactive dashboard showing blog traffic, MoM growth, time spent on blog, and bounce rates.

1. Connect to Google Analytics


There’s a lot of information on the Internet about APIs and why they are useful to
marketers. It’s how platforms like Facebook, Twitter, & Co make data available to
applications in our martech stack. Connecting to them sounds a little scary, so I was a
bit concerned about this first step.
KNIME has a Google Authentication (API Key) node for this, which lets you connect to
various Google services. Before I could start configuring this node, I had to create a project on Google Cloud Console. I found some useful instructions on how to do this
in related editorial content on the KNIME Blog (e.g., Querying Google Analytics), which
I could adapt for my purposes.
However, I was able to benefit from using a so-called "component". A component is a group of nodes (i.e., a sub-workflow) that encapsulates and abstracts the functionality of a logical block. Components serve a similar purpose as nodes, but let you bundle functionality for sharing and reusing. In addition, components can have their own configuration dialog and custom interactive views. One of my colleagues already had a workflow that connects to Google's API. She preconfigured some nodes with the right settings to connect to the API and GA, and wrapped them together into a component. All I had to do was insert this component into my workflow, and I was ready to go.
After connecting to Google’s API, I needed two more nodes to connect with GA (the
Google Analytics Connection node) and fetch the data I needed (the Google Analytics
Query node). To specify the metrics and dimensions for a query, you have to know the
terms. I kept this overview of the names of dimensions and metrics on hand while
doing this.
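The two KNIME nodes hide the underlying REST calls. Purely for illustration, here is a hedged sketch of what such a query looks like in Python against the (Universal Analytics) Reporting API v4; the key file, view ID, and the chosen metrics and dimensions are placeholders and would need to match your own GA setup.

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Placeholder service-account credentials for read-only Analytics access
creds = service_account.Credentials.from_service_account_file(
    "service-account-key.json",          # assumed key file
    scopes=["https://www.googleapis.com/auth/analytics.readonly"])
analytics = build("analyticsreporting", "v4", credentials=creds)

# One report request: unique pageviews, time on page, and bounce rate per month
response = analytics.reports().batchGet(body={
    "reportRequests": [{
        "viewId": "123456789",           # placeholder GA view ID
        "dateRanges": [{"startDate": "2022-01-01", "endDate": "2022-12-31"}],
        "metrics": [{"expression": "ga:uniquePageviews"},
                    {"expression": "ga:avgTimeOnPage"},
                    {"expression": "ga:bounceRate"}],
        "dimensions": [{"name": "ga:yearMonth"}],
    }]
}).execute()

# Print one line per month: dimension values followed by the metric values
for row in response["reports"][0]["data"]["rows"]:
    print(row["dimensions"], row["metrics"][0]["values"])
```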

2. Remove articles that hide trends


Articles that perform well over a long time and one-off well-performing pieces are
outliers and conceal the actual trends on a blog. To avoid distorting the picture, I had
to remove them from my analysis. Checking the data makes much more sense than
relying on my gut feeling.
Exploring the data in Google Analytics is very time-consuming; even just knowing what to include takes work. It's much easier to maintain data tables about the best-performing pieces, which you can then keep checking in on. You can then decide whether these potential outliers should be included in or excluded from your KPI.
To explore the data, I used a Data App developed by a colleague. It automatically
collects data from Google Analytics once a week and shows blog views over time. On
the left (see figure below), I can spot the outliers. Clicking the respective curve gives
me the name of the article, which I can analyze further for scroll-depth (right). A good
scroll-depth indicates that the article is not just being clicked on, but also read.


Data App for blog performance over time (left). I can select any article and get the scroll-depth for that
article (right), here showing the data for an article on sentiment analysis.

I used this app to identify five well-performing articles to remove from my analysis, and
I then manually listed them in a Google Sheet. My workflow could now be configured
to access this sheet (with the Google Sheets node), take whichever articles are listed
there, and remove them. The Reference Row Filter node performs this task. Finding
this node was my first major stumbling block. Searching the Node Repository for
anything to do with “Row” helped me find it, but it took a lot of trial and error.

Finding the Reference Row Filter node in the Node Repository and adding it to the section of the workflow
that accesses a Google Sheet to get the current list of outliers to remove from the analysis.
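In code, the Reference Row Filter step is essentially an anti-join. A minimal pandas sketch, with made-up file and column names:

```python
import pandas as pd

# Blog traffic data queried from GA and the manually maintained outlier list
traffic = pd.read_csv("blog_traffic.csv")            # assumed export of the GA query
outliers = pd.read_csv("outlier_articles.csv")       # assumed export of the Google Sheet

# Keep only rows whose page path is NOT in the outlier list
filtered = traffic[~traffic["pagePath"].isin(outliers["pagePath"])]
print(f"Removed {len(traffic) - len(filtered)} outlier rows")
```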

Tip. Setting the search box in the Node Repository to enable fuzzy searches was a
useful hint I got on the Forum to make finding things easier. You get all the results
for the word you enter, not just specific names.

3. Checking the progress of my data


As the data about the articles flows through each node in the workflow, I can check how it's looking. This helps me work out whether I've done something wrong, or even right! A right-click on the Reference Row Filter node and selecting "Filtered table" shows me all the articles with my list of outliers removed. I can see that the date column is a string. Maarit pointed out that I wouldn't be able to set time frames in my interactive dashboard if the format of that column stayed as a string.

Checking the progress of my data as it flows through each node in the workflow. Here, a right-click to open
Filtered Table shows me all my articles with outliers removed.

So I had to do a bit of so-called "data processing" and convert the date column (a string) into a date format, enabling me to enter a given time frame for later analysis.

4. Calculate the metrics for my Data App


Now to tackle the part I had been dreading: finding, and more importantly configuring, the nodes to calculate the MoM comparison as a percentage and enable the app to fetch data for a given time frame.

Calculate MoM Growth

I found a formula to calculate MoM growth online (I’m not so hot at math). Searching
the Node Repository for the respective node was time-consuming since I used the
wrong search terms (“MoM growth” and “Metrics”). Finally, the word “Calculate”
brought up the Math Formula node.
Translating “Subtract the first month from the second month, then divide that by the last
month’s total, then multiply the result by 100” into a mathematical expression was
tricky. I had vague memories of formulas from school, but I needed help to configure
the node.


Searching by "calculate" gave me a list of related nodes, including the Math Formula node.

Throughout my challenge, I kept forgetting that I had to tell the workflow each step of
the process. For example, I discovered I needed another Math Formula node to convert
the average time in seconds to minutes. I also stumbled when working out how to enter
MoM growth percentage “one row down” in my table. Such a simple thing, but it was
hard to know what to search for in the Node Repository. My mentors came to the
rescue and told me about the Lag Column node to enter values “lagged” by one row.

Table view showing MoM growth in %, sum of unique page views, and average bounce rates and time on
page (mins).
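For illustration only (the workflow does this with the Math Formula and Lag Column nodes), the same calculation in pandas, on a made-up monthly table:

```python
import pandas as pd

# Hypothetical monthly totals of unique page views
monthly = pd.DataFrame({
    "month": ["2022-01", "2022-02", "2022-03"],
    "unique_pageviews": [10500, 11800, 11200],
})

# Lag Column equivalent: shift the previous month's value one row down
monthly["previous"] = monthly["unique_pageviews"].shift(1)

# MoM growth in %: (current - previous) / previous * 100
monthly["mom_growth_pct"] = (
    (monthly["unique_pageviews"] - monthly["previous"]) / monthly["previous"] * 100
)
print(monthly)
```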

Add interactive Data App

Working out how to set the workflow to perform the analysis on any given time period
was the most complicated step. How could I even begin to explain to the workflow
which reference date it should use as a basis? I learned about variables: how to set
them up, and how to configure nodes to take variables as input. To be honest, I found
this really hard. The reference date problem was ultimately resolved when I was able
to copy a colleague’s component that works out “today’s date.”
I found a solution for inserting interactive fields to set time frames by exploring the wonderful world of widgets. Designing the layout of a Data App is explained clearly in the "Cheat Sheet: Components with KNIME Analytics Platform". I liked the fact that I could preview all my layout designs by right-clicking "Interactive View."

The dashboard in my Data App showing a line plot of overall blog performance by unique views and a table of month-on-month growth.

A shareable Data App and lessons learned


After three weeks of working on this project alongside my normal day-to-day tasks, my
shareable browser-based Data App was ready.
My main stumbling block was “codeless does not mean mathless.” A lack of
background in math made it harder to know what I needed to perform tasks. First of
all, I didn’t think of names such as “Math Formula” to search for the right node in the
repository. Secondly, it’s been a long time since I did any math in school, and I had
forgotten things like how to order mathematical operations properly.


Developing a more logical approach to the required steps will help prevent stumbles in
future, too. I frequently left out steps, then didn’t understand why the workflow wasn’t
doing what I wanted. Maarit’s initial advice to break down my project into steps was
spot-on. I realize now that going forward, splitting things into even smaller steps will
help.
Non-technical marketers might shy away from learning a data science tool, fearing it
will be too time-consuming, but a low-code environment helps here. It’s intuitive. If I
needed to leave the project for a few days, it was easy to pick it up again later, with all
my progress documented visually.

Shareable components were my friends, not just for getting non-technical people like
me started, but also for easier team collaboration. Components can be built to bundle
any repeatable operation. Standardized logging operations, ETL clean-up tasks, or
model API interfaces can be wrapped into components, shared, and reused within the
team.

Segmentation and Personalization

In this chapter, we will understand who our customers are, whether they can be clustered into similar groups, what they usually purchase, and what products we can recommend to best meet their needs. More specifically, we will look at use cases for customer segmentation and market basket analysis, and build a recommendation engine.

This chapter includes the articles:

• Customer Segmentation, p. 20
– Elisabeth Richter, KNIME

• Market Basket Analysis with the Apriori Algorithm, p. 33


– Rosaria Silipo, KNIME

• Movie Recommendations with Spark Collaborative Filtering, p. 38


– Rosaria Silipo, KNIME


Customer Segmentation

Author: Elisabeth Richter, KNIME

Workflows on the KNIME Community Hub: Basic Customer Segmentation and Customer Segmentation

Customer segmentation, also known as market segmentation, is a technique to optimize a company's marketing strategy. The idea is simple: group a company's target audience based on certain criteria, such as age, geography, or income, with the overall goal of maximizing customer lifetime value by better targeting different customers.
Customer segmentation is a valuable process applied in various fields. It helps
companies earn a greater market share, detect their most valuable customers, and
identify the most effective ways to reach those customers. In this section, we want to
give you an overview of the different customer segmentation types and techniques and
then walk you through how to build a browser-based Data App. The customer
segmentation app enables the marketing analyst to investigate the different customer
segments. They can inject their own domain expertise so that meaningful insight can
be directly recorded and shared easily with others in the Data App.

Customer segmentation types


Segmenting customers depends on the industry and the company's size, as well as the marketing analyst's expertise and domain knowledge. Different types of customer segmentation exist:

Geographic segmentation

Splitting customers based on their geographic locations, as they might have different
needs in different areas. For example, segmenting them based on country or state, or
the characteristics of an area (e.g., rural vs. urban vs. suburban).

Demographic segmentation

Splitting up customers based on features like age, sex, marital status, or income. This
is a rather basic form of segmentation, using easily accessible information. This is,
hence, one of the most common forms of customer segmentation.


Psychographic segmentation

Splitting up customers based on mental and emotional characteristics: interests, values, lifestyles, etc. These characteristics provide insights into why customers purchase the product.

Behavioral segmentation

Splitting up customers based on their behaviors, i.e., how they respond and interact
with the brand. Such criteria include loyalty, shopping habits (e.g., the dropout rate on
a website), or the frequency of product purchases.

Customer segmentation techniques


Regardless of what type of customer segmentation is desired, the applicable techniques depend on one's level of expertise and on domain and target knowledge. Roughly, customer segmentation can be divided into three categories:

Rule-based segmentation

Customers are segmented into groups based on manually designed rules. This usually
involves domain and target knowledge, and thus requires at least one business expert.
For example, segmenting customers based on their purchase history as “first-time
customers,” “occasional customers,” “frequent customers,” or “inactive customers” is
highly interpretable, and an expert will know which customer counts in which group. A
drawback is that this type of segmentation is not portable to other analyses. So, with
a new goal, new knowledge, or new data, the whole rule system needs to be
redesigned. Implementing rule-based segmentation in KNIME Analytics Platform can
be done using the Rule Engine node.
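Outside KNIME, the same rule-based idea can be sketched in a few lines of Python; the thresholds and column names below are invented purely for illustration:

```python
import pandas as pd
import numpy as np

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "purchases_last_year": [0, 1, 4, 15],
})

# Manually designed rules, as a business expert might define them (made-up thresholds)
conditions = [
    customers["purchases_last_year"] == 0,
    customers["purchases_last_year"] == 1,
    customers["purchases_last_year"].between(2, 9),
]
labels = ["inactive customer", "first-time customer", "occasional customer"]
customers["segment"] = np.select(conditions, labels, default="frequent customer")
print(customers)
```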

Segmentation using binning

This is simple and easily implemented, binning data based on one or more features.
This does not necessarily require domain knowledge, but some knowledge about the
target is required – i.e., the business goal must be clear. For example, when
considering a clothing line designed for teenagers, the age of the target audience is
clear and non-negotiable. In KNIME Analytics Platform, the Auto-Binner or the Numeric
Binner node can be used.


Segmentation with zero knowledge

If nothing is known about the domain or target, common clustering algorithms can be
applied to segment the data. This is applicable to different use cases. There are many
clustering algorithms in KNIME Analytics Platform, such as k-Means and DBSCAN.

A basic clustering-based customer segmentation


In this section, we want to tackle customer segmentation using clustering, specifically k-Means, as it is one of the most popular clustering algorithms.

The telco dataset

This is a simple use case in which we have some customer data for a telephone
company. The data was originally available on Iain Pardoe’s website, and can be
downloaded from Kaggle. The data set has been split into two files: the contract-
related data (ContractData.csv), which contains information about telco plans, fees,
etc., and the telco operational data (CallsData.xls), which contains information about
call times in different time slots throughout the day and the corresponding amounts paid. Each customer is identified by a unique combination of their phone number
(“Phone”) and its corresponding area code (“Area Code”).

The customer segmentation workflow

As we do not hold any domain or target-specific knowledge, we apply k-Means clustering to segment our data. The corresponding workflow is displayed in the figure below.


Basic workflow for customer segmentation using k-Means Clustering.

After reading and joining the two datasets, the data is preprocessed. The
preprocessing tasks depend on the nature of the data and the business. In our case,
we first join the information about each customer from the two datasets using the
telephone number. Then, we convert the columns “Area Code,” “Churn,” “Int’l Plan,” and
“VMail Plan” from integer to string. The “Area Code” column is a part of the unique
identifier, which is why we don’t include it as an input column for clustering. The other
columns are excluded because they are categorical values, to which the k-Means algorithm is not applicable since categorical variables have no natural notion of distance. Lastly, we normalize all remaining numerical columns.

Note. It is usually recommended that you normalize the data before clustering,
especially when dealing with attributes with vastly different scales.

Other preprocessing tasks could be added, such as aggregation or discretization via binning. We did not do this here for the sake of simplicity.
After reading and preprocessing the data, we reach the crucial part of the workflow:
the clustering node. The k-Means node allows us to segment the data into k clusters.
The number of clusters, k, must be defined beforehand in the configuration dialog of
the node. Here we set k = 10. The node has two output ports: the first outputting the
data and its assigned cluster; the second the cluster centers.
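For illustration only (the book uses the Normalizer and k-Means nodes), a minimal scikit-learn sketch of the same two steps, min-max normalization followed by k-Means with k = 10; the file name and column handling are assumptions:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Joined telco data; keep only the numeric input columns (no IDs, no categorical columns)
data = pd.read_csv("telco_joined.csv")                      # assumed file name
numeric = data.select_dtypes("number").drop(columns=["Area Code"], errors="ignore")

# Normalize to [0, 1], then cluster into k = 10 segments
scaled = MinMaxScaler().fit_transform(numeric)
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(scaled)

data["cluster"] = kmeans.labels_                            # first output: cluster assignments
centers = pd.DataFrame(kmeans.cluster_centers_, columns=numeric.columns)
print(centers)                                              # second output: cluster centers
```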
Lastly, with the help of the “Cluster Viz” component, we visualize both the clustered
data and the cluster centers (see figure below). This step is optional, but visualizing
the data often helps detect underlying patterns and better understand customer
segmentation. Inside the “Cluster Viz” component there is a workflow that first de-
normalizes the data back to its original range (the Denormalizer (PMML) node), then
assigns a color to each cluster (the Color Manager and Color Appender nodes), and
finally visualizes the clustered data colored by cluster, with the cluster centers in one
scatter plot each (the Scatter Plot node). To access the interactive view of our component, right-click the component and select "Interactive View: Cluster Viz."
Because our component encapsulates two scatter plots, our interactive view also
contains two scatter plots.

The sub-workflow, encapsulated by the Cluster Viz component. This component de-normalizes the clustered
data back to its original range, and visualizes the clustered data colored by cluster, with the cluster centers in
one scatter plot each.

Investigating the customer segments

The composite view of the component contains two scatter plots: the visualization of
the telco data, colored by cluster, and the prototypes (i.e., centers) of each cluster. In
the figure below, the view of the second scatterplot is displayed. It shows the attributes
“VMail Message” (the number of voicemail messages) on the X-axis and “CustServ
Calls” (the number of customer service calls) on the Y-axis. From there, some
implications can be drawn.

• The group of data points on the right shows the prototypes of the clusters of customers who use voicemail, and the group of data points on the left shows the prototypes of the clusters of customers who don't use voicemail.

• The data points in the top left and top right corners show the cluster
representatives of those customers who complain a lot. On average, they call
customer service almost twice a day.


The visualization of the cluster analysis for k = 10, plotting "VMail Message" vs. "CustServ Calls." The scatter plot shows the center of each cluster.

We can now easily change the view of the scatter plot by changing the axes of the plot
directly from the interactive view without changing the node settings. In the interactive
view, we can click the settings button in the top-right corner (list icon) and then change
the X and Y columns as we wish.

The view of the scatter plot can easily be changed directly from the interactive view without changing the
node settings.

The resulting plot when changing the axes to "Day Mins" (minutes during the day) vs. "Night Mins" (minutes during the night) is shown below. Here, we can distinguish between customers who have significantly higher call minutes during the day (the Day Callers) and those who have significantly higher call minutes during the night (the Night Callers). The cluster representative in the top-right corner indicates the group of
customers who call a lot during the day as well as at night (the Always Caller).

The visualization of the cluster analysis for k = 10, plotting “Day Mins” vs. “Night Mins.” The scatter plot
shows the cluster center of each cluster.

Et voilà, after only a handful of nodes, we have already completed a basic customer segmentation task. Without any particular domain or target knowledge, we were able to gain some meaningful insights about our customers.

Interacting with customer segments as a Data App


The previous customer segmentation workflow works fine and produces interesting results. However, we might want to go one step further and add some interactivity to the application, so that it can be called from a web browser and domain experts can add notes and to-do tasks.
Involving a domain expert is frequently beneficial, as they have valuable knowledge
about the data and the business case. With the help of an interactive implementation
of the customer segmentation workflow from a web browser, the domain expert can
be guided through all phases of the analysis, and can immediately interact with the
analysis and the results. This allows the domain expert to gain deeper insights on the
detected customer segments, which improves the quality and interpretability of the
customer segmentation. In addition, by allowing the domain expert to add notes and
annotations during each step of the execution, meaningful insights can be directly
recorded and easily passed on to others.


Calling the application from a web browser

In order to call the workflow from a web browser, we need to connect to KNIME
Business Hub. One of the most useful features of KNIME Business Hub is the
possibility to deploy workflows as browser-based applications. It allows interaction
with the workflow at predetermined points, or the adjustment of certain parameters.
When executing the workflow, users are guided through each step using a sequence
of interactive webpages. Each composite view of the deployed workflow becomes a
webpage. The following section introduces a Data App deployed on the KNIME Business Hub that makes use of several Widget nodes.

Introducing more touchpoints in the Data App


Implementing the workflow via a Data App provides more touchpoints between the
domain expert and the analysis. The Data App enables:
1. Customizing the parameters of the customer segmentation.
2. Visualizing the cluster results and changing the views of the scatter plot.
3. Injecting expert knowledge by adding notes and comments.
Let’s see how these touchpoints are implemented. The workflow deployed to the
KNIME Business Hub is shown below. It is an extended version of the basic example
provided above.

The extended customer segmentation workflow from the previous example. With the help of some Widget
nodes wrapped inside the components, it enables immediate integration of knowledge and direct interaction
with the segmentation result.

1. Set the customer segmentation parameters

With the help of the Integer Widget and Column Filter Widget node, the user can define
the parameters of the clustering: the number of clusters, k, and the attributes used as input columns. The two nodes are wrapped inside the "Define Cluster Parameters" component.

On the KNIME Business Hub, the user can execute the customer segmentation workflow as a Data App. In
the first step, the number of clusters as well as the columns used for clustering must be defined. This page
is the composite view of the “Define Cluster Parameters” component.

Once the customer segmentation parameters are defined, the k-Means clustering is
performed. The necessary nodes for clustering are wrapped inside the “Customer
Segmentation” metanode, following the “Define Cluster Parameters” component.

2. Visualize the cluster results

Once the data has been clustered, the results are displayed on the next webpage. The
“Display Cluster Result” component in the workflow (see figure above) contains the
nodes responsible for visualizing the results. The composite view of the component
contains three scatter plots: the cluster centers (top left), the PCA-reduced telco data
colored by cluster (top right), and the non-reduced telco data colored by cluster
(bottom). Also on the KNIME Business Hub, the views of the scatter plot can be easily changed by tweaking the X and Y axes, as described earlier.


The clustering results for k = 10 using all attributes as input columns. The scatter plots show the attribute
“Day Mins” against “Night Mins”. The PCA clusters plot shows the data when reduced to two dimensions.


3. Inject expert knowledge: Add notes and comments to each customer segment

Once the data is clustered, the workflow iterates over all clusters, presenting a cluster-
wise visualization and providing the possibility to add comments and annotations for
each customer segment. These cluster-wise visualizations are implemented in the
“Label Cluster” component inside the group loop block. Here the expert analyst can
take advantage of their knowledge and annotate the clusters accordingly.


At the bottom, you can give each customer group a unique name. In addition, you can add a more detailed
description of the customers belonging to that group.

After the group labels and annotations have been assigned, the clusters are updated accordingly. This last webpage of the Data App refers to the "Displayed Label Clusters" component of the workflow. This dashboard shows not only the cluster centers and the colored scatter plot, but also the cluster statistics. Note that each customer group now has a meaningful name that clearly distinguishes it from the others. A more detailed description of the customer group from the annotation box was added as a "Cluster Annotation" column.

Cluster statistics. The table shows for each customer group its
coverage, mean distance, standard deviation, and skewness.

Share valuable insight from interactive customer segmentation
Here we walked through how to build a Data App that enables you to do customer segmentation without any prior domain or target knowledge, all from a web browser! However, the implications derived from the segmentation are quite meaningful.
For one, it is clearly observable that there are two groups of highly complaining customers, of which one uses voicemail and the other doesn't. Hence, having a voicemail plan or not is probably not the reason for the complaints.
Another valuable insight might be that the customers who mainly call during the
daytime are also those who call customer service many times. An implication could be
that more disturbances happen during the day, which the telco company could further
investigate. In contrast, customers who distribute their calls equally throughout the
day don’t call customer service as often.


Market Basket Analysis with the Apriori Algorithm

Author: Rosaria Silipo, KNIME

Workflows on the KNIME Community Hub: Market Basket Analysis: Building Association Rules and Market
Basket Analysis: Apply Association Rules

A market basket analysis or recommendation engine is what is behind the recommendations we get when we go shopping online or receive targeted advertising. The underlying engine collects information about people's habits and knows that, for example, if people buy pasta and wine, they are usually also interested in pasta sauces. In this section, we will build an engine for market basket analysis using one of the many available association rule algorithms.

What we need
The required dataset contains examples of past shopping baskets with shopping items (products) in them. If products are identified via product IDs, then a shopping basket is a series of product IDs that looks like:
<Prod_ID1, Prod_ID2, Prod_ID3, ….>

The dataset used here was artificially generated with the “Shopping Basket
Generation” workflow available for download from the KNIME Community Hub. This
workflow generates a set of Gaussian random basket IDs and fills them with a set of
Gaussian random product IDs. After performing a few adjustments on the a priori probabilities, our artificial shopping basket dataset is ready to use.2
This dataset consists of two KNIME tables: one containing the transaction data (i.e., the sequences of product IDs in imaginary baskets) and one containing product info (i.e., product ID, name, and price). The sequence of product IDs (the basket) is output as a string value, concatenating many substrings (the product IDs).

2 M. Adä, M. Berthold, "The New Iris Data: Modular Data Generators", SIGKDD, 2010.


Workflow to build association rules with the Apriori algorithm
A typical goal when applying market basket analysis is to produce a set of association
rules in the following form:
IF {pasta, wine, garlic} THEN pasta-sauce

The first part of the rule is known as the "antecedent", and the second part as the "consequent". A few measures, such as support, confidence, and lift, define how reliable each rule is. The most famous algorithm for generating these rules is the Apriori algorithm.3
The central part in building a recommendation engine is the Association Rule Learner (Borgelt) node, which implements the Apriori algorithm in either the traditional3 or the Borgelt4 version. The Borgelt implementation offers a few performance improvements over the traditional algorithm. The produced association rule set, however, remains the same. Both Association Rule Learner nodes work on a collection of product IDs.

A workflow to train association rules using Borgelt's variation of the Apriori algorithm.

3 R. Agrawal and R. Srikant, Proc. 20th Int. Conf. on Very Large Databases (VLDB 1994, Santiago de Chile), 487-499, Morgan Kaufmann, San Mateo, CA, USA, 1994.
4 "Find Frequent Item Sets and Association Rules with the Apriori Algorithm", C. Borgelt's home page: http://www.borgelt.net/doc/apriori/apriori.html.


A collection is a particular data cell type that assembles several data cells together. There are many ways of producing a collection data cell from other data cells. The Cell Splitter node, for example, generates collection-type columns when the configuration setting "as set (remove duplicates)" is enabled. We use a Cell Splitter node to split the basket strings into product ID substrings, setting the space as the delimiter character. The product ID substrings are then assembled together and output in a collection column to feed the Association Rule Learner node.
After running on a dataset with past shopping basket examples, the Association Rule
Learner node produces a number of rules. Each rule includes a collection of product
IDs as antecedent, one product ID as consequent, and a few quality measures, such as
support, confidence, and lift.
In Borgelt's implementation of the Apriori algorithm, three support measures are available for each rule. If A is the antecedent and C is the consequent, then:

Body Set Support = support(A) = # items/transactions containing A
Head Set Support = support(C) = # items/transactions containing C
Item Set Support = support(A ∪ C) = # items/transactions containing A and C

Item Set Support tells us how often antecedent A and consequent C are found together in an item set in the whole dataset. However, the same antecedent can produce a number of different consequents. So, another measure of the rule quality is how often antecedent A produces consequent C among all possible consequents. This is the Rule Confidence.

Rule Confidence = support(A ∪ C) / support(A)

One more quality measure, the Rule Lift, tells us how precise this rule is compared to just the a priori probability of consequent C.

Rule Lift = Rule Confidence(A → C) / Rule Confidence(∅ → C)

∅ is the whole dataset and support(∅) is the number of items/transactions in the dataset.
You can make your association rule engine larger or smaller, restrictive or tolerant, by
changing a few threshold values in the Association Rule Learner configuration settings,
like the “minimum set size”, the “minimum rule confidence”, and the “minimum
support” referring to the minimum Item Set Support value.
We also associate a potential revenue to each rule as:

Revenue = price of consequent product x rule item set support
Based on this set of association rules, we can say that if a customer buys wine, pasta, and garlic (antecedent), then usually (as usually as the support says) they also buy pasta sauce (consequent); we can trust this statement with the confidence percentage that comes with the rule.
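For illustration only (the book builds the rules with the Association Rule Learner node), here is a sketch of the same pipeline in Python using the mlxtend library, on a few made-up baskets; it yields the same kind of rule table with support, confidence, and lift:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Made-up baskets: space-separated product names, as in the basket strings
baskets = ["pasta wine garlic pasta-sauce",
           "pasta wine pasta-sauce",
           "bread butter",
           "pasta garlic pasta-sauce"]
transactions = [b.split(" ") for b in baskets]      # Cell Splitter equivalent

# One-hot encode the transactions and mine frequent item sets
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit_transform(transactions), columns=encoder.columns_)
frequent = apriori(onehot, min_support=0.5, use_colnames=True)

# Derive rules with support, confidence, and lift columns
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```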


After some preprocessing to add the product names and prices to the plain product
IDs, the association rule engine with its antecedents and consequents is saved in the
form of a KNIME table file.

Deployment
Let’s move away now from the dataset with past examples of shopping baskets and
into real life. Customer X enters the shop and buys pasta and wine. Are there any other
products we can recommend?
The second workflow prepared for this use case takes a real-life customer basket and
looks for the closest antecedent among all antecedents in the association rule set. The
central node of this workflow is the Subset Matcher node.

Extracting top recommended items for current basket. Here the Subset Matcher node explores all rule
antecedents to find the appropriate match with the current basket items.

The Subset Matcher node takes two collection columns as input: the antecedents in
the rule set (top input port) and the content of the current shopping basket (lower input
port). It then matches the current basket item set with all possible subsets in the rule
antecedent item sets. The output table contains pairs of matching cells: the current
shopping basket and the totally or partially matching antecedents from the rule set.
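As a rough illustration of what this matching amounts to (a simplified sketch, not the node's implementation), one can keep the rules whose antecedent is fully contained in the current basket, and separately those with at least a partial overlap:

    # A simplified, illustrative re-creation of the matching step; rules are made up.
    rules = [
        {"antecedent": {"wine", "pasta"}, "consequent": "pasta-sauce", "confidence": 0.8},
        {"antecedent": {"cookies"},       "consequent": "lobster",     "confidence": 0.4},
        {"antecedent": {"bread", "milk"}, "consequent": "butter",      "confidence": 0.6},
    ]

    current_basket = {"pasta", "wine"}

    # Total match: the whole antecedent is contained in the basket.
    # Partial match: at least one antecedent item appears in the basket.
    total_matches = [r for r in rules if r["antecedent"] <= current_basket]
    partial_matches = [r for r in rules if r["antecedent"] & current_basket]

    for r in total_matches:
        print(f"recommend {r['consequent']} (confidence {r['confidence']:.0%})")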


By joining the matching antecedents with the rest of the corresponding association rule
‒i.e., with consequent, support, confidence, and lift‒ we obtain the products that could
be recommended to customer X, each one with its rule’s confidence, support, revenue,
and lift. Only the top two consequents, in terms of highest item set support (renamed as
rule support), highest confidence, and highest revenue, are retained.
Finally, a short report displays the total price for the current basket and two
recommendations, from the two top consequents.
Final Recommendations on a Report.
In our case, the recommended product is always lobster, associated once with cookies
and once with shrimps. While shrimps and lobster are typically common-sense advice,
cookies and lobster seem to belong to a more hidden niche of food experts!


Movie Recommendations with Spark Collaborative Filtering

Author: Rosaria Silipo, KNIME

Workflow on the KNIME Community Hub: Movie Recommendation Engine with Spark Collaborative Filtering

Collaborative Filtering (CF) based on the Alternating Least Squares (ALS) technique5 is
another algorithm used to generate recommendations. CF makes automatic predictions
(filtering) about the interests of a user by collecting preferences from many other users
(collaborating). The underlying assumption of the collaborative filtering approach is that
if a person A has the same opinion as person B on an issue, A is more likely to have B’s
opinion on a different issue than that of a randomly chosen person. This algorithm
gained a lot of traction in the data science community after it was used by the team that
won the Netflix prize.
The algorithm has also been implemented in Spark MLlib6, with the aim of fast execution
even on very large datasets. KNIME Analytics Platform with its Big Data Extensions
offers the CF algorithm in the Spark Collaborative Filtering Learner (MLlib) node. We will
use it, in this section, to recommend movies to a new user. This use case is a KNIME
implementation of the Collaborative Filtering solution originally provided by Infofarm.

What we need

A general dataset with movie ratings by users

For this use case, we used the large MovieLens dataset. This dataset contains many
different files all related to movies and movie ratings. We’ll use the files “ratings.csv”
and “movies.csv”.
The dataset in the file “ratings.csv” contains 20M movie ratings by circa 130K users and
it is organized as: “userID”, “movieID”, “rating”, “timestamp”. Each row contains the rating
to each movie –identified by “movieID”– by one of the users –identified by “userID”.
The dataset in the file “movies.csv” contains circa 27K movies, organized as: “movieID”,
“title”, “genre”.

5 Y. Koren, R. Bell, C. Volinsky, “Matrix Factorization Techniques for Recommender Systems“, in Computer Journal, Volume 42, Issue 8, August 2009, Pages 30-37: https://dl.acm.org/citation.cfm?id=1608614.
6 “Collaborative Filtering. RDD based API”, the Spark MLlib implementation: http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html.

Note. The Movie Recommendation Engine with Spark Collaborative Filtering workflow
available on the KNIME Community Hub is built using the CSV Reader, Row Sampling
and Table to Spark nodes to read only 2.5% of
the “ratings.csv” dataset. Reading a subset of the dataset allows for
straightforward execution of the workflow and avoids incurring a Java Heap Space
Error for users with limited RAM capacity on their local machines. If you wish to
use the entire dataset for better results, it is advisable to rely on the much
faster CSV to Spark node and, if necessary, increase the Java Heap Space for
KNIME.

Movie preferences by current user

The idea of the ALS algorithm is to find other users in the training set with preferences
similar to the current, selected user. Recommendations for the current user are then
created based on the preferences of such similar profiles. This means that we need a
profile for the current user to match the profiles of other existing users in the training
set.
Let’s suppose that you are the current user, with assigned userID=999999. It is likely
that the MovieLens dataset has no data about your movie preferences. Thus, in order
to issue some movie recommendations, we would first need to build your movie
preference profile. So, we will start the workflow by asking you to rate 20 movies,
randomly extracted from the movie list in the “movies.csv” file. Rating ranges between
0 and 5 (0 – horrible movie; 5 – fantastic movie). You can use the rating -1 if you have not
seen the proposed movie. Movies with a rating of -1 will be removed from the list;
movies with any other rating will become training set material.
The web page below is the result of a Text Output Widget node and a Table Editor
node displayed via a component interactive view. Your rating can be manually inserted
in the last column to the right.


Interviewing the current user (userID = 999999) about his/her movie ratings. We need
this information to create the current user profile and to
match it with profiles of other users available in the training set.
Preferences from similar users can provide recommendations for our
current user.

A Spark Context

The CF-ALS algorithm has been implemented in KNIME Analytics Platform via the
Spark Collaborative Filtering Learner (MLlib) node. This node belongs to the KNIME
Extension for Apache Spark, which needs to be installed on your KNIME Analytics
Platform to run this use case.
The Spark Collaborative Filtering Learner node executes within a Spark context, which
means that you also need a big data platform and a Spark context to run this use case.
This is usually a showstopper due to the difficulty and potential cost of installing a big
data platform, especially if the project is just a proof of concept: installing a big data
platform is a complex operation and might not be worth it for just a prototype workflow,
while installing it on the cloud might also carry additional unforeseeable costs.


Note. Version 3.6 or higher (currently 4.7) of KNIME Analytics Platform and KNIME
Extension for Local Big Data Environments include a precious node: the Create
Local Big Data Environment node. This node creates a simple but complete local
big data environment with Apache Spark, Apache Hive and Apache HDFS, and does
not require any further software installation. While it may not provide the desired
scalability and performance, it is useful for prototyping and offline development.

The Create Local Big Data Environment node has no input port, since
it needs no input data, and produces three output objects:

• A red database port to connect to a local Hive instance.

• A blue HDFS connection port to connect to the underlying HDFS system.

• A gray Spark port to connect to the Spark context.


By default, the local Spark, Hive and HDFS instances will be disposed of when the Spark
context is destroyed or when KNIME Analytics Platform closes. In this case, even if the
workflow has been saved with the “executed” status, intermediate results of the Spark
nodes will be lost.

Configuration window of Create Local Big Data Environment node.


The “Settings” tab includes: actions to perform “On dispose”; SQL
support; and File System settings. In the “Advanced” tab: Custom
Spark settings; Existing Spark context; Hive settings.


The configuration window of the Create Local Big Data Environment node includes a
frame with options related to the “on dispose” action.

• “Destroy Spark Context” destroys the Spark context and all allocated resources;
this is the most destructive, but cleanest, option.

• “Delete Spark DataFrames” deletes the intermediate results of the Spark nodes
in the workflow and keeps the Spark context open to be reused.

• “Do nothing” keeps both the Spark DataFrames and context alive. If you save the
already executed workflow and reopen it later, you can still access the
intermediate results of the Spark nodes within. This is the most conservative
option, but also keeps space and memory busy on the execution machine.
Option number 2 is set as default, as a compromise between resource consumption
and reuse.

Workflow to build the recommendation engine with Collaborative Filtering
In this workflow, we use the Spark MLlib implementation of the collaborative filtering
algorithm, in which users and products are described by a small set of latent factors.
These latent factors can then be used to predict the missing entries in the dataset.
Spark MLlib uses the Alternating Least Squares (ALS) algorithm for the matrix
factorization, to learn the latent factors.
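For reference, the same MLlib algorithm can also be reached directly from PySpark. The sketch below is only an illustration: the column names, data split, and hyperparameters are assumptions, not the exact settings used in the KNIME workflow.

    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS
    from pyspark.ml.evaluation import RegressionEvaluator

    spark = SparkSession.builder.appName("movie-recs").getOrCreate()

    # ratings.csv with columns userId, movieId, rating, timestamp (assumed schema)
    ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)
    train, test = ratings.randomSplit([0.8, 0.2], seed=42)

    als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
              rank=10, maxIter=10, regParam=0.1,
              coldStartStrategy="drop")        # drop users/items unseen during training
    model = als.fit(train)

    predictions = model.transform(test)
    rmse = RegressionEvaluator(metricName="rmse", labelCol="rating",
                               predictionCol="prediction").evaluate(predictions)
    print(f"RMSE on the test set: {rmse:.2f}")

    # Top-10 recommendations per user (recommendForUserSubset works for a single user)
    top10 = model.recommendForAllUsers(10)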

Note. It is necessary that movie preferences of the current user are part of the
training set. This is why we ask the current user to rate 20 random movies, in order
to get a sample of his/her preferences.

The Collaborative Filtering technique is implemented and trained in the Spark
Collaborative Filtering Learner node, which runs on a Spark cluster. At its input port,
the node receives a number of records with product, user, and corresponding rating.
At the output port, it produces the recommendation model and the predicted ratings
for all input data rows, including user and object.

Note. The matrix factorization model output by the node contains references to the
Spark DataFrames/RDDs used in execution and thus is not self-contained. The
referenced Spark DataFrames/RDDs are deleted, like for any other Spark nodes,
when the node is reset or the workflow is closed. Therefore, the model cannot be
reused in another context in another workflow.

Like the KNIME native Numeric Scorer node, the Spark Numeric Scorer node
calculates a number of numeric error metrics between the original values –in this case
the ratings– and the predicted values. Ratings range between 0 and 5, as the number
of stars assigned by a user to a movie. Predicted ratings try to predict the original
ratings between 0 and 5.
The error metrics on the test set show a mean absolute error of 0.6 and a root mean
squared error of 0.8. Basically, predicted ratings deviate from the original ratings +/- 0.6,
which is close enough for our recommendation purpose.
The original movie rating dataset was split into a training set and a test set. The
training set was used to build the recommendations with a Spark Collaborative
Filtering Learner (MLlib) node and the test set to evaluate their quality with a generic
Spark Predictor (MLlib) node followed by a Spark Numeric Scorer node.
Numerical error metrics calculated on the original movie ratings and the predicted
movie ratings with a Spark Numeric Scorer node.

Deployment
We previously asked the current user to rate 20 randomly chosen movies. These
ratings were added to the training set. Using a generic Spark Predictor node, we now
estimate the ratings of our current user (ID=999999) on all remaining unrated movies.
Movies are then sorted by predicted ratings and the top 10 are recommended to the
current user on a web page on the KNIME Business Hub.

Final list of top 10 recommended movies based on my earlier ratings of 20 randomly selected movies.


Since I volunteered to be the current user for this experiment, based on my ratings of
20 randomly selected movies, I got back a list of 10 recommended movies shown
below. I haven’t seen most of them. However, some of them I do know and appreciate.
I will now add “watch recommended movies” on my list of things to do for the next
month.

Note. Please note that this is one of the rare cases where training and deployment
are included in the same workflow.

The Collaborative Filtering model produced by the Spark Collaborative Filtering Learner
node is not self-contained but depends on the Spark Data Frame/RDDs used during
training execution and, therefore, cannot be reused later in a separate deployment
workflow.
The Collaborative Filtering algorithm is not computationally heavy and does not take
long to execute. So, including the training phase in the deployment workflow does not
noticeably hinder recommendation performance. However, if recommendation
performance is indeed a problem, the workflow could be partially executed on KNIME
Analytics Platform, or as a Data App on the KNIME Business Hub, until the collaborative
filtering model is trained and then the rest of the workflow can be executed on demand
for each existing user in the training set.

This workflow asks the new user to rate 20 randomly selected movies via web browser, trains a Collaborative
Filtering model with this data, evaluates the model performance via some numeric error metrics, and finally
proposes a list of the top 10 recommended movies based on the previously provided ratings.

Consumer Mindset Metrics

In this chapter, we gauge how our customers perceive and experience our products
and services. We will start off by introducing techniques to improve SEO and make our
business relevant and easy to reach on the web. We will then analyze customers’
reviews and sentiment to learn about their opinions, perceptions and mindsets
regarding our products and brand. More specifically, we will look at use cases in SEO,
customer experience evaluation, brand reputation measurement and sentiment
analysis.

This chapter includes the articles:

• Improve SEO with Semantic Keyword Search, p. 46 – Ali Asghar Marvi, KNIME
• Evaluate CX with Stars and Reviews, p. 57 – Francisco Villarroel Ordenes, LUISS Guido Carli University; Rosaria Silipo, KNIME
• Brand Reputation Measurement, p. 67 – Francisco Villarroel Ordenes, LUISS Guido Carli University; Konstantin Pikal, LUISS Guido Carli University
• Analyze Customer Sentiment, p. 74
  o 1. Lexicon-based sentiment analysis, p. 75 – Aline Bessa, KNIME
  o 2. Machine learning-based sentiment analysis, p. 81 – Kilian Thiel, KNIME; Lada Rudnitckaia, KNIME
  o 3. Deep learning-based sentiment analysis, p. 87 – Aline Bessa, KNIME
  o 4. Transformer-based language models for sentiment analysis, p. 97 – Roberto Cadili, KNIME


Improve SEO with Semantic Keyword Search

Author: Ali Asghar Marvi, KNIME

Workflow on the KNIME Community Hub: Search Engine Optimization (SEO) with Verified Components

Semantic search techniques with unstructured data are becoming more and more
common in most search engines. In the context of search engine optimization (SEO),
a semantic keyword search aims to find the most relevant keywords for a given search
query. Keywords can include frequently occurring single words, as well as words
considered in context, like co-occurring words or synonyms of the current keywords.
Modern search engines, like Google and Bing, can perform semantic searches,
incorporating some understanding of natural language and of how keywords relate to
each other for better search results.

Semantic search.

As Paul Shapiro explains in “Semantic Keyword Research with KNIME and Social
Media Data Mining”, a semantic search can be carried out in two ways: with structured
or unstructured data.
A structured semantic search takes advantage of the schema markup —the semantic
vocabulary provided by the user and incorporated in the form of metadata in the HTML
of a website. Such metadata will be displayed in the Search Engine Results Page
(SERP). In the figure below, you can see an example snippet of a SERP as returned by
Google. Notice how it shows the operating system support, the latest stable release,
the link to a GitHub repository, and other metadata from the found webpage. Creating
metadata in HTML syntax can be done with schema.org, a collaborative work by
Google, Bing, Yahoo, and Yandex, used to create appropriate semantic markups for
websites. Relevant metadata helps to improve the webpage ranking in search engines.
Meanwhile, unstructured semantic searches use machine learning and natural
language processing to analyze text in webpages for better SERP ranking. The more
relevant the text to the search, the higher the web position in the SERP list.


A SERP snippet following a search for “KNIME” on Google.

While structured searches on metadata are present in all search engines, techniques
with unstructured data are now also becoming more common. There are two places
where searches happen: search engines (of course) and social media. We want to
explore the results of search queries on SERP and social media, and learn from the text
in the top-performing pages.

Search by scraping text from Search Engine Results Page (SERP)
Up until 2013, search engines like Google used to check the frequency of words in
webpages to figure out their relevance. This led people to add random keywords to
either the markup or the webpage text to improve their ranks in Google Search. In 2013,
Google Search introduced the Hummingbird algorithm. This algorithm tries to
understand a user's intent in real time. For instance, if a user writes “beer” in the search
box, the intent is not that clear. Is the user seeking to buy beer, looking for a beer shop,
looking for beer preparation details, or something else? Search engines will show all
possible results concerning “beer.” However, if a person writes “beer bar,” the engine
will show listings for nearby pubs, restaurants, cafes, etc. In the latter query, Google
understood the user's intent better. Hence this algorithm also displays Google Map
results on SERP.
With the Hummingbird algorithm, multiple topics and the relevant keywords are
associated with the search query. Thus, the webpages ranking at the top in SERP
contain such keywords and such topics. If we want our webpage to perform better with
SEO, we might learn from the top-ranking pages how to shape our text, for example
which single keywords and which pairs of words to use. The right collection of
keywords from similar websites can help tremendously with the page ranking in search
results.
In this article, we will pay particular attention to texts in top ranking pages for a given
search query, with the goal to extract meaningful keywords and key pairs of words.


Search by scraping text from links on social media


While scraping text from SERP may be one way to extract keywords, scraping text from
links on social media might be a secondary source. Opinions and ideas shared by users
on these networks really help expand one’s horizon for collecting the right keywords.
In the context of digital marketing, social media conversations are used to engage with
visitors to make them stay on a website or to get them to perform an action —for
example, getting a visitor on an ecommerce website to make a purchase. This can be
optimized by improving the conversion rate measure (in this example, the number of
purchases made divided by the number of visitors). This is called conversion rate
optimization (CRO).
Social media activities often use webpage links for calls to action, and they are heavily
optimized on conversion rate. Therefore, analyzing social media text and the text of
linked webpages can help in the process of keyword extraction, especially those
keywords most “appealing” to visitors.

Three techniques to extract keywords


Based on the previous observations, we build an application which would perform a
search query on a web search engine (Google) and a social media platform (Twitter),
scrape the texts from the links appearing in SERP and tweets, and finally extract the
keywords.
We use three different keyword extraction techniques:
1. Term co-occurrences. The most frequently occurring pairs of words are
extracted.
2. Latent Dirichlet Allocation (LDA). An algorithm which clusters documents in a
corpus into topics, with each topic described by the set of relevant associated
keywords.
3. Term Frequency-Inverse Document Frequency (TF-IDF). The importance of a
single keyword in a document is measured by the term frequency (TF), while how
specific that keyword is across the corpus is measured by the inverse document
frequency (IDF). The product of the two metrics (TF-IDF) is a good measure of how
relevant the word is to identify that document, and only that document, within the
corpus. The highest TF-IDF-ranking words are then extracted (see the sketch after
this list).
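A minimal scikit-learn sketch of this third technique (an illustration on a toy corpus, not the verified component's implementation) could look as follows:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy corpus standing in for the scraped page texts.
    pages = [
        "data science courses and machine learning specializations",
        "learn data science with books and online programs",
        "machine learning models for business insight",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(pages)           # documents x terms sparse matrix

    # For each term, keep its maximum TF-IDF value across the corpus and rank terms by it.
    max_scores = tfidf.max(axis=0).toarray().ravel()
    terms = vectorizer.get_feature_names_out()
    top = sorted(zip(terms, max_scores), key=lambda t: t[1], reverse=True)[:10]

    for term, score in top:
        print(f"{term}: {score:.3f}")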


How to implement a keyword search application for SEO


Our “Keyword Search Application” consists of four steps:
1. Run search queries on search engines and social media, and extract the links to
the top-ranking webpages.
2. Scrape text from the identified top-ranking webpages.
3. Extract keywords from the scraped texts.
4. Summarize the resulting keywords in a Data App.

Step 1: URL extraction

The first step consists of extracting the URL for the top-ranking pages around a given
hashtag or keyword. For this step, we create two very useful verified components:
Google URLs Extractor and Twitter URLs Extractor.
Verified components to extract URLs from Google search results page and tweets.

Step 2: From Google SERP and from tweets

From Google SERP


The “Google URLs Extractor” component performs a search on Google Custom Search
Engine (CSE) via a web service. It sends a Google API key, the search query, and the
engine ID, and receives in return a maximum number of SERPs in JSON format. From
this JSON response, the component extracts the web links for all found pages and
returns them together with text snippets, SERP title, and status code. The Google
search can be refined through some extra configuration settings of the components:
a specific language or location for the search result pages (see figure below).
Just like there is no free lunch, Google just provides the refreshments but not the main
meal. The free version of Google Custom Search Engine (CSE) web service can return
only 10 SERPs on consecutive calls, which means a maximum of 100 search results.
A premium CSE web service obviously solves this problem.
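For reference, the same Custom Search JSON API can be called directly over HTTP; a minimal sketch with Python's requests library (the API key and engine ID are placeholders you must supply) is shown below:

    import requests

    API_KEY = "YOUR_GOOGLE_API_KEY"        # placeholder
    ENGINE_ID = "YOUR_SEARCH_ENGINE_ID"    # placeholder (the cx parameter)

    params = {
        "key": API_KEY,
        "cx": ENGINE_ID,
        "q": "data science",   # the search query
        "num": 10,             # results per call (max 10 on the free tier)
        "start": 1,            # 1-based index of the first result, for paging
    }
    response = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
    response.raise_for_status()

    for item in response.json().get("items", []):
        print(item["title"], "->", item["link"])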


To create the Google API key and the search engine ID, refer to the Google
Custom Search JSON API
documentation page or the Google
Cloud Console.
To begin, you will need the API key,
which is a way to identify the Google
client. This key can be obtained from the
“Custom Search JSON API” page. If this
link doesn’t work, then a similar key can
be created in Google Cloud Console.
Next, the authentication parameter
needed is Search Engine ID, a search
engine created by a user with certain
settings.

Configuration window of the Google URLs Extractor component.

Getting the search engine ID for the configuration dialogue of the Google URLs Extractor component.


From tweets
Here we use the Twitter URLs
Extractor component. This component
extracts tweets from Twitter around a
given search query and then extracts the
URLs in them. We could have also
chosen to consume Facebook, LinkedIn,
or any other social media site.
In case you are wondering why we opt
for the URLs and not the main text
bodies of the tweets, we personally
believe that relying on text from linked
URLs should provide more information
than what is contained in a simple short
tweet.
To run the component, you need a Twitter Developer account and the associated
access parameters. You can get all of that from the Twitter Developer Portal.
Twitter URLs Extractor component configuration window.
get all of that from the Twitter Developer Portal.
Additional search filters to apply in the query, like filtering out retweets or hashtags,
are described in the Twitter Search Operator documentation page. The component’s
output includes the URLs, tweet, tweet ID, user, username, and user ID.

Step 3: Scraping text from URLs

With the URLs from both SERP and tweets, the application must proceed with scraping
their texts to get the corpus for keyword extraction. For this step, we have another
verified component, the “Web Text Scraper”. This component connects to an external
web library —BoilerPipe— to extract the text from the input URL. It then filters out
unneeded HTML tags and boilerplate text (headers, menus, navigation blocks, span
texts, all the unneeded content outside the core text).
KNIME verified component for scraping meaningful text from web pages.
The details about algorithms used in Boilerpipe and its performance optimization are
mentioned in the research paper “Boilerplate Detection using Shallow Text Features”.
This component takes a column of URLs as input and produces a column with the texts
in String format as output. Only valid HTML content is supported —all other associated
scripts are skipped. With long lists of URLs to be crawled, this component can become
slow, because identifying the boilerplates for each URL is time-consuming.
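If you want to reproduce this scraping step in plain Python, a comparable extraction can be sketched with the trafilatura library; note that this is a different boilerplate-removal library than the BoilerPipe used by the component and is shown only as an illustration:

    import trafilatura

    urls = [
        "https://www.knime.com/",   # example URLs; replace with the ones from the previous steps
        "https://en.wikipedia.org/wiki/Data_science",
    ]

    texts = []
    for url in urls:
        downloaded = trafilatura.fetch_url(url)      # raw HTML, or None on failure
        if downloaded is None:
            continue
        # extract() strips navigation, headers, footers and other boilerplate
        main_text = trafilatura.extract(downloaded)
        if main_text:
            texts.append(main_text)

    print(f"Scraped {len(texts)} documents")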

Step 4: Summarize results in a Data App

Keyword search component


Once the document corpus has been created, the application converts it into
Document format using the Strings To Document node. The resulting Document
column becomes the input for the “Keyword Search” component. This component
implements the three techniques described previously (term co-occurrence, topic
extraction using LDA, and TF-IDF) to extract the keywords.
The figure below shows the configuration dialog of the “Keyword Search” component.
This includes the selection of the Document column, a few parameters for the LDA
algorithm (i.e., the number of topics and keywords, along with the Alpha and Beta
parameter values), the IDF measure for the TF-IDF approach, and a flag to enable
default text processing. For further understanding of Topic Extraction using LDA,
please read “Topic Extraction: Optimizing the Number of Topics with the Elbow
Method”.

Configuration dialogue of Keyword Search component.


This component has three outputs:


1. Terms (nouns, adjectives, and verbs) along with their weights as produced by the
LDA algorithm.
2. Terms (nouns, adjectives, and verbs) of co-occurring pairs, along with the
occurrence counts.
3. Top tagged terms (nouns, adjectives, and verbs) with the highest TF-IDF values.
Each of these output tables is sorted in descending order according to their respective
weights/frequencies/values.
For our last step, these outputs have been visualized:

1. Top 10 LDA terms with a bar chart.


2. Co-occurring terms with a network graph, where each node is a term, and the
counts of documents are the edges.
3. Top terms identified by the highest TF-IDF values with a Tag Cloud, where the
word size is defined by the maximum TF-IDF value for each term.

Workflow identifying URLs from SERP and tweets, scraping texts from URLs, and extracting and visualizing
keywords from scraped texts.

An example: Querying for documents around “Data Science”


As an example, let’s query Google and Twitter about the term “Data Science.” The
keywords —alone or in combination— extracted from the top-ranking pages found in
both SERP and tweets lead to the charts and maps reported in the following three
figures. Let’s see what those keywords are and how they relate to each other.


Keywords from LDA

The top 10 topic-related keywords, sorted by their LDA weight, are reported in the bar
chart below. The words “data” and “science” have the highest weight, which makes
sense but is also obvious. The next highly meaningful keywords are “machine” and
“learning,” followed by “course.” The reference to “machine learning” is clear, while
“course” is probably due to the recent blossoming of educational material on this
subject. Adding a reference to machine learning and a course or two to your data
science page might improve its ranking.

Top 10 terms returned from LDA, sorted in descending order by their weights.

Co-occurring keywords

However, in the network graph below, we observe a different set of keywords. “Data”
and “science” still represent the core of the keyword set, as they are located at the
center of the graph and have the highest degree. “Machine” and “learning” are also
present again, and this time we learn that they are connected to each other. In the same
way, “program” is connected to “courses,” as those words are often found together in
highly ranked pages about “data science.”
We can learn as much from this graph about the words that don’t occur together. For
example, “program” only co-occurs with “science” and “courses,” while “skills” and
“course” only co-occur with “data” and not with “science,” “machine,” “learning,” or
“business.” It thus becomes clear which pairs of words you should include in your data
science page.


Network graph of term co-occurrence, wherein each unique term is a node and
document co-occurrence counts are edges. However, weights on the edges are
ignored to just visualize related terms.

Keywords from a word cloud

Lastly, if we look at single keywords in the Tag Cloud, we can see that the terms
“master,” “insight,” “book,” and “specialization” appear most prominently. This makes
sense, since the web contains a large number of offers for specialized courses with
universities, online platforms, and books. In the word cloud, you can also see the
names of institutes, publishers, and universities offering such learning options. Only
on a second instance do you see the words “data,” “science,” “machine,” “learning,”
“philosophy,” “model,” and so on.

Tag Cloud for the terms with highest TF-IDF measure.


Single keywords seem to better describe one aspect of the “data science” community,
i.e., teaching, while co-occurring keywords and topics best extract the content of the
“data science” discipline.
Notice that the multifaceted approach to visualizing keywords that we have adopted
in this project allows for an exploration of the results from different perspectives.

Low code to enhance SEO


In order to improve SEO by learning from the top-performing pages for a given search
query, we have built a low-code tool that can help marketers extract the most
frequently used keywords, in pairs or in isolation, from the top-ranking pages around
a given search on whatever topic they choose.
The application identifies the URLs of top-performing tweets and pages in SERP,
extracts texts from those URLs, and extracts keywords from the scraped texts via the
LDA algorithm, the TF-IDF metric, and word pair co-occurrences. Of course, to make it
run, the user’s own authentication credentials for the Google and Twitter accounts are
required.


Evaluate CX with Stars and Reviews

Authors: Francisco Villarroel Ordenes, LUISS Guido Carli University & Rosaria Silipo, KNIME

Workflow on the KNIME Community Hub: Topic Models from Reviews

A valuable customer experience (CX) seems to be at the heart of all the businesses we
happen to visit. After using a website, booking system, restaurant, or hotel, we are often
(always?) asked to rate the experience. We often do, and we also often wonder what,
if anything, the business is going to do with my rating. The rating embeds an overall
summary, but stars alone cannot say much about an (un)pleasant experience, or what
needs to improve in the customer journey.

Extract useful insight from stars and reviews to improve CX


To better understand which parts of my customer journey succeeded or failed, a
business would need to match my stars with my text review, if I left any. Then, from
the review, they can start learning about my, or any other, customer journey. The review
can contain relevant information about the context (e.g., weather conditions),
touchpoints (e.g., the reception desk), and attributes (e.g., friendliness) of my
experience. For more information about the touchpoints, qualities, and context
framework (TCQ), please refer to the work by De Keyser et al.7
In practice, using customer reviews to learn about the customer journey requires using
star ratings together with the text data and text mining methods. First a business
should match the star ratings with the corresponding reviews. So far, so good; this is
not hard to do. If they do not already come together in the dataset, a join operation on
the customer ID and visit ID should be enough. See, for example, the TripAdvisor
dataset containing star rankings and reviews related to a specific hotel stay.

7De Keyser, A., Verleye, K., Lemon, K. N., Keiningham, T. L., & Klaus, P. (2020). Moving the Customer
Experience Field Forward: Introducing the Touchpoints, Context, Qualities (TCQ) Nomenclature. Journal of
Service Research, 23(4), 433–455. https://doi.org/10.1177/1094670520928390.


The TripAdvisor dataset containing the reviews and associated star rankings related to a specific hotel stay.

This is not enough, though. We need to identify which touchpoints, context, and
qualities might have been critical in the customer experience. For example, this
customer gave only 2 stars for their visit to “Hotel_1,” with this explanation:

An example of a review associated with a 2-star rating from the dataset (review ID 201855460).

Lots of text, huh? The review points out a crucial element of this experience: This is a
frequent customer, with 100+ visits, and it seems that the key reasons for the
complaint relate to the front desk being non-responsive and room service failing with
the daily cleaning. In this case, the low star rating seems to truly reflect a negative
experience that relates to a frequent customer with several failures of two touchpoints
(reception and room service). Firms and researchers can perform this type of analysis
at an aggregated level using all reviews posted by customers.
In this work, we are going to show how to extract useful information from the pairs
“(review text, star ranking)” that customers leave when describing their experience at
a hotel (but the same could be applied to any kind of business with the appropriate
categories). We are going to do that by considering reviews for two hotels from the
TripAdvisor dataset to show customer journey differences.
The TripAdvisor dataset contains customer experience evaluations for 2,580 visits to
Hotel_1 and 2,437 visits to Hotel_2. Each evaluation contains the hotel name (Hotel_1
vs Hotel_2), the number of reviews provided in total by an author, the number of
“helpful” votes the author got, the number of “helpful” votes each specific review got,
the date each review was left, and the associated number of stars.

Analysis of customer experience feedback


Our analysis consists of four steps:
1. Prepare the data for a topic modeling algorithm.
2. Optimize the number of topics for the topic modeling algorithm.
3. Visually investigate the relations between a low number of stars and a journey’s
steps.
4. Numerically investigate such relations again via the coefficients of a linear
regression model.


The workflow we implemented to discover the relation between the steps in a customer journey and a poor
star ranking.

1. Prepare the data for a topic modeling algorithm

Starting from the top left square in the solution workflow, the first steps implement the
text preprocessing operations for the reviews. These are the classic text preprocessing
steps: transform the review texts into Document objects; clean out punctuation,
standard stop words, short words such as prepositions, and numbers; reduce all words
to lowercase and to their stem; and remove infrequent words.
Then, instead of using single words, we focus on bigrams. In the English language,
bigrams and trigrams store more information than single words. In the “N-grams”
metanode, we collect all bigrams in the review dataset, and we keep only the 20 most
frequent ones. “Front desk” is the most frequent bigram found in the reviews, which
already makes us suspect that the “front desk” is present in many reviews, for good or
bad. Finally, all bigrams are identified and tagged as such in the review Document
objects via a recursive loop.
Note that this listing of the 20 most frequent bigrams is arbitrary. It could have been
larger; it could have been smaller. We choose 20 as a compromise between the light
weight of the application and a sufficient number of bigrams for the next data analysis.
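Conceptually, the bigram counting performed inside the “N-grams” metanode can be sketched in a few lines of Python with scikit-learn (an illustration on toy reviews, not the metanode's implementation):

    from sklearn.feature_extraction.text import CountVectorizer

    # Toy reviews standing in for the preprocessed TripAdvisor texts.
    reviews = [
        "the front desk staff was friendly and helpful",
        "front desk never answered and room service skipped the daily cleaning",
        "great room service and friendly front desk",
    ]

    # ngram_range=(2, 2) counts only two-word sequences (bigrams).
    vectorizer = CountVectorizer(ngram_range=(2, 2))
    counts = vectorizer.fit_transform(reviews)

    totals = counts.sum(axis=0).A1                 # total count of each bigram in the corpus
    bigrams = vectorizer.get_feature_names_out()
    top20 = sorted(zip(bigrams, totals), key=lambda t: t[1], reverse=True)[:20]

    for bigram, count in top20:
        print(f"{bigram}: {count}")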


The top 20 most frequent bigrams in the dataset.

The last step is implemented in the “Filter Reviews with less than 10 words” metanode
and this is what it does: It counts the words in each review so as to remove the shorter
reviews, those with under 10 words.
We are now left with 432 reviews (246 for Hotel_1, 186 for Hotel_2) and each review
has been reduced to its minimal terms. Remember the review with the ID 201855460?
This is what remains after all preprocessing steps:

The remaining text in review 201855460 after the text processing phase.

The processed review documents are now ready for the topic extraction algorithm.

2. Optimize the number of topics for the topic modeling algorithm

Topic models are consistently applied in research to learn about customer


experiences. Their popularity has prompted continuous developments of algorithms
such as Latent Dirichlet Allocation (LDA), Correlated Topic Models (CTM), and
Structural Topic Models (STM). The KNIME Textprocessing extension offers an
implementation of the LDA algorithm. The other algorithms, though non-natively
available, can be also introduced into a workflow via the KNIME R Integration.
Latent Dirichlet Allocation (LDA) is a generative statistical model, whereby each text
Document of a collection is modeled as a finite mixture over an underlying set of
topics, and each topic can be described by its most frequent words. In that context,
LDA is often used for topic modeling.
The KNIME node that implements LDA is the Topic Extractor (Parallel LDA). The node
takes a number of Documents at its input port and produces the same Document table
with topic probabilities and the assigned topic at its top output port, the words related
to each topic with the corresponding weight at the middle output port, and the iteration
statistics at the lowest port. The configuration window allows you to set a number of
parameters, of which the most important ones for the algorithm itself are the number
of topics, the number of keywords to describe each topic, and the parameters alpha
and beta. Parameter alpha sets the prior weight of each topic in a document.
Parameter beta sets the prior weight of each keyword in a topic. A small alpha (e.g.,
0.1) produces a sparse topic distribution —that is, fewer prominent topics for each
document. A small beta (e.g., 0.001) produces a sparse keyword distribution —that is,
fewer prominent keywords to describe each topic.
As in every machine learning algorithm, there is no a priori recipe to select the best
hyperparameters. Of course, the number of topics k has to be manageable if we want
to represent them in a bar chart. It must be lower than 100, but between 1 and 100,
there are still a lot of options. Alpha can be taken empirically as 50/k, where k is the
number of topics.
The number of keywords and the value of parameter beta are less critical, since we will
not use that information in the final visualization. Hence, we use 10 keywords per topic
and beta = 0.01, as we found this to work in previous experiments.
The other configuration settings define the length and speed of the algorithm
execution. We set 1,000 iterations on 16 parallel threads.
We are left with the choice of the best k. We start off by running the LDA algorithm on
a number of different k to calculate the perplexity as 2^(-Log likelihood) for each k from
the last iteration of the LDA algorithm, and to visualize the perplexity values against
the number of topics k with a line plot. Initially, we run this loop for k = 2, 4, 6, 8, 10, 15,
20, 25, 30, 40, 50, 60, 80. The perplexity plot shows that the only useful range for k is
up to 20. Thus, we focus on that range, and we run the loop again for k = 13, 14, 15, 16,
17. The perplexity plot in the figure below shows a minimum at k = 15, and this is the
value we will use for the number of topics.
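The same “sweep k and compare perplexity” idea can be sketched in Python with scikit-learn's LDA implementation (an illustration only; the KNIME node uses its own parallel LDA, and the toy corpus below is far too small to be meaningful):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    reviews = ["front desk never answered the phone",
               "room service skipped the daily cleaning",
               "great location and friendly staff",
               "breakfast was excellent and the pool was clean"]

    X = CountVectorizer(stop_words="english").fit_transform(reviews)

    # Evaluate a range of candidate topic numbers and keep the one with lowest perplexity.
    for k in (2, 3, 4):
        lda = LatentDirichletAllocation(n_components=k,
                                        doc_topic_prior=50 / k,   # alpha = 50/k, as in the text
                                        topic_word_prior=0.01,    # beta
                                        random_state=42)
        lda.fit(X)
        print(f"k={k}: perplexity={lda.perplexity(X):.1f}")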


Perplexity plot from the last iteration of the LDA algorithm for different values of
number of topics k.

3. Relation between stars and customer journey: Visual investigation

Let’s move now to square #3 in the workflow, the one in the lower left corner. Here we
extract 15 topics from the reviews in the dataset using the Topic Extractor (Parallel
LDA) node. After concatenating the keywords for each topic via a GroupBy node, we
report the topic descriptions in the table below, to which we attach our own
interpretation/identification of the touchpoint, context, or quality of the customer
journey.

topic_0 (Check in): day, pool, breakfast, check, kid, rate, accommodate, arrive, perfect, roof
topic_1 (Overall experience): night, downtown, door, morn, weekend, found, leave, time, comfortable, minute
topic_2 (Room features): bed, time, king, suite, upgrade, sites, review, guest, enjoy, outside
topic_3 (Location 1 measures): call, check, told, front-desk, manager, sheet, name, phone, left, person
topic_4 (Booking Interactions): floor, main, tower, city, build, street, review, walk-distance, convention, business
topic_5 (Location 2 food): nice, walk, location, lobby, city, love, clean, bit, walk-distance, coffee
topic_6 (Free features): bathroom, shower, water, front-desk, floor, towel, cleane, lot, hair, definite
topic_7 (Front desk): convention-center, downtown, market, comfortable, convenient, reade, eat, starbuck, center, connect
topic_8 (Car Park Entrance): staff, friend, clean, location, recommend, little, helpful, time, night, excellent
topic_9 (Food Service): restaurant, location, walk, clean, close, bar, service, food, looke, property
topic_10 (Bathroom): day, car, lobby, arrive, key, wait, line, check, elevator, available
topic_11 (Location 3 Market): park, block, street, garage, lot, breakfast, visit, night, able, car
topic_12 (Staff): try, time, breakfast, food, dinner, service, wait, lock, complaint, door
topic_13 (Common areas): locate, station, phil, helpful, train, trip, airport, reade-terminal-market, visite, near
topic_14 (Business features): conference, service, wifi, include, free, check-in, people, times, issues, issue
Topics discovered by the LDA algorithm in the review dataset, with our interpretation of the corresponding touchpoint, context, or quality in parentheses.

Finally, the “Topic Analysis” component displays the average number of stars
calculated on all reviews assigned to the same topic for Hotel_1 on the left and Hotel_2
on the right. Just a few comments:

• The overall experience at Hotel_2 is generally rated lower than the overall
experience at Hotel_1. This can also be seen by comparing the other single bars
in the chart.

• For example, the check-in experience at Hotel_1 (4 stars on average) is definitely
superior to the check-in experience at Hotel_2 (only 2 stars on average).

• Hotel_2 should definitely work to improve its bathrooms.

• Reviews about booking interactions are not available for Hotel_2.


And so on. Just by pairing the average star number and the review topic, this bar chart
can visually indicate to us where the failures might be found in the customer journeys
for both hotels.

Average number of stars for each assigned topic for Hotel_1 and Hotel_2.

The other component, named “Topic View,” also shows a bar chart, displaying the
average score for the assigned topics for each hotel.

4. Relation between stars and journey steps: Linear regression coefficients

In this final part of the workflow, we will identify which customer experience topics are
predictors of the star ratings (both good and bad). We’ll train a linear regression model
to predict the star rating based on each topic score for that review and the hotel name.
To do so, we select a “baseline” topic that is not included as a predictor in the
regression, because otherwise we would be violating one of the regression properties
related to multicollinearity. Then we will interpret the coefficients of the topics
(predictors) in comparison with the baseline (topic 0 = Check In).
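A minimal sketch of this regression step with statsmodels is shown below; the file and column names are hypothetical stand-ins for the table produced by the workflow, with topic_0 left out as the baseline:

    import pandas as pd
    import statsmodels.api as sm

    # Assumed layout: one row per review with its star rating, a hotel dummy,
    # and the LDA topic scores topic_0 ... topic_14.
    df = pd.read_csv("reviews_with_topic_scores.csv")   # hypothetical file name

    predictors = ["hotel_2_dummy"] + [f"topic_{i}" for i in range(1, 15)]  # topic_0 = baseline
    X = sm.add_constant(df[predictors])
    y = df["stars"]

    model = sm.OLS(y, X).fit()
    print(model.summary())    # coefficients and p-values per topic, relative to topic_0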
Based on the coefficient values, we first notice that the variable Hotel_2 is non-
significant (p>0.1), which indicates that the hotels do not differ significantly in their
star rating evaluations. The journey touchpoints called “common areas” (Coeff = 4.04,
p < 0.05) and “car park entrance” (Coeff = 3.89, p < 0.05) are the ones with a more
positive association with the star rating. On the other hand, the topics “location around
museums” (Coeff = -3.95, p < 0.05) and “free features” (Coeff = -2.63, p = 0.09) are
associated with more negative star ratings.


Conclusions on stars and reviews


We hope we have shown you how to understand the customer experience and journey
by pairing review topics and star-based evaluations. In order to reach this conclusion,
we had to:

• Import the reviews.

• Preprocess the texts.

• Extract the best number of topics for the dataset.

• Pair a review topic with a customer journey step.

• And visually and numerically inspect the customer journey topics with the star
rating.

To show the power of this approach, we used a comparison between customer
experience evaluations for two hotels.


Brand Reputation Measurement

Authors: Francisco Villarroel Ordenes & Konstantin Pikal, LUISS Guido Carli University

Workflow on the KNIME Community Hub: Brand Reputation Tracker

“Your brand is what others say about you when you are not in the room”. This famous
quote, attributed to Jeff Bezos, Amazon’s founder, is one of our favorites. It is not
without criticism, but it is an apt way to approach the measurement of branding.
Building a brand is on every marketer’s task description. But how do you do it? And
foremost: How do you measure what you have built (and hopefully are still building)?
How do you analyze what your stakeholders (e.g., customers, media, competitors)
have to say about you? We could turn our heads and point them towards social media,
one of the channels where people “are talking about us”.

The Brand Reputation Tracker


A research team led by Prof. Ronald Rust of the University of Maryland has developed
a new way to analyze social media to understand a Brand’s reputation. Different from
other famous Brand Metrics, like BAV or Interbrand, this tracker uses real-time social
media data and measures the occurrence of three important drivers of Brand
reputation: Value, Brand and Relationships. 8 Inspired by Rust et al.’s work, we will
construct an interpretable tracker with a codeless approach using KNIME Analytics
Platform. In this section, we focus on the driver Brand and its sub-drivers: Cool,
Exciting, Innovative, and Social Responsibility.

The Brand Reputation Tracker workflow on the KNIME Community Hub.

8Rust, R. T., Rand, W., Huang, M.-H., Stephen, A. T., Brooks, G., & Chabuk, T. (2021). Real-Time Brand
Reputation Tracking Using Social Media. Journal of Marketing, 85(4), 21–
43. https://doi.org/10.1177/0022242921995173.


What the Brand Reputation workflow does


1. Get tweets via Twitter API.
2. Clean Twitter data.
3. Prepare the tweets for text-processing.
4. Tag the text based on Brand Reputation dictionaries.
5. Calculate Brand Reputation scores.
6. Visualize Brand Reputation over time.

1. Get tweets via Twitter API

First, we need some data to use the brand reputation tracker on. Therefore, we suggest
you get developer access to the Twitter API. It sounds complicated, but it really isn’t.
It just gives you access to tweets that you will have to text-mine later on.

How to get your Twitter API


Firstly, we need to get access to Twitter data. Luckily enough, Twitter offers an
API (Application Programming Interface), where we can get the data from. Now it is
free, but it might become a paid service soon. First, you will have to sign-up for a
Twitter account. If you already have one, you can skip this step. It will ask you for some
basic information and you will need to verify your email address. Afterwards, you will
be asked to configure a couple of things for your developer account: your country (in
our case, Italy), and your use case (if you are following a course, select “Student”).
Now you will also need to verify your account with a phone number. Make sure you do
that. If not, you will not have access to the Developer Tools at Twitter. Finally, it will ask
you to agree to the “Developer agreement & policy”.

Set-up the Twitter API


In your Twitter developer portal, you will have to get four things: the API key, the API
secret, the Access Token and the Access Token Secret. Those are the credentials that
you will need to be able to connect to Twitter and retrieve tweets.

Twitter API Connector node


By right-clicking on the node and clicking on “configure” you will be able to access the
configuration window of the node. Please add your personal Twitter credentials (API
key, API secret, Access Token and Access Token Secret), as you find them in your
Twitter developer account.


Insert your personal keys from Twitter (those displayed here are for demonstration purposes only).

Twitter Search node


Next, we will have to get tweets that were written around a certain brand. The way we
do it is by using the Twitter handle. In our example, we will use “@amazon”. This gives
us access to the tweets that mention Amazon. The number of rows that you can set
has to do with the access level of your Twitter API. For example, the Twitter v2 rate
limits are 900 Tweets per look-up, after which you will have to wait 15 minutes. In other
words, the max number of tweets that you can get in one go ranges between 15k and
16k. Note that we excluded user profile images because they would slow down the
execution. In case you are interested in profile images, just add them to the field
selection.
Configuring the Twitter Search node.
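If you prefer scripting this step, a rough equivalent of the search can be sketched with the tweepy library and the Twitter v2 recent-search endpoint; the bearer token is a placeholder, the query operators anticipate the cleaning step described next, and access levels and rate limits may differ from those described above:

    import tweepy

    client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")   # placeholder credential

    # -is:retweet and lang:en reproduce the cleaning step (no retweets, English only).
    query = "@amazon -is:retweet lang:en"

    tweets = client.search_recent_tweets(
        query=query,
        max_results=100,                      # per-request maximum for this endpoint
        tweet_fields=["created_at", "lang"],
    )

    for tweet in tweets.data or []:
        print(tweet.created_at, tweet.text[:80])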

2. Clean Twitter data

Before we start analyzing the data, we will have to do some cleaning: or better, the
workflow is doing the cleaning for us. It excludes retweets and filters only tweets in
English (this is important because our text-mining dictionary is only in English).

Cleaning the Tweets.


3. Prepare tweets for text processing

Now that we have cleaned our data, the most important part begins. We are going to
work with tweet texts and extract insights using the KNIME Textprocessing extension.

Preprocessing
We start off with pre-processing. For this, we first must convert Strings (e.g., text on
tweets) back into Document data type.
We need the Document data type to be able to perform text-mining operations in
KNIME. In the first step, we stem all our words to make them easier to interpret for the
machine. For example, “exciting” becomes “excit” and “inspiring” becomes “inspir”.
After KNIME has done this for us, our documents are fed into the Dictionary Tagger
node.

4. Tag the text based on Brand Reputation dictionaries

The Dictionary Tagger consists of two inputs: a dictionary, including all the relevant
words; and a tagger, where we specify what tag applies to which word. For example,
“trendi” and “hip” are part of the positive “Cool-dictionary”, whereas “ancient” and
“lame” are part of the negative “Cool-dictionary”. The tagger uses the dictionaries to
tag the document as follows: When the tagger finds a word in the document that is
also in the dictionary, e.g., “modern”, it tags that word with the corresponding tag (FTB-
A).
You might have noticed that in the paragraph above we used a tag type called “FTB”
and its values (e.g., A). This is because KNIME does not have a custom tagger for brand
reputation drivers. Next, we create a “bag of words”. A bag of words is simply a list of
all single words occurring in the dataset.
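Conceptually, the tagging boils down to matching the stemmed tweet words against the sub-driver dictionaries and counting the hits; a small Python sketch with made-up mini-dictionaries (purely for illustration) follows:

    from collections import Counter

    # Tiny made-up dictionaries; the real ones are much larger.
    dictionaries = {
        "Cool_Positive": {"trendi", "hip", "modern"},
        "Cool_Negative": {"ancient", "lame"},
    }

    tweets = [
        "amazon feels so modern and trendi lately",
        "their website looks ancient and lame",
    ]

    counts_per_tweet = []
    for text in tweets:
        words = text.lower().split()          # stand-in for the stemming/preprocessing step
        counts = Counter()
        for tag, vocabulary in dictionaries.items():
            counts[tag] = sum(word in vocabulary for word in words)
        counts_per_tweet.append(counts)

    print(counts_per_tweet)
    # [Counter({'Cool_Positive': 2, 'Cool_Negative': 0}), Counter({'Cool_Negative': 2, 'Cool_Positive': 0})]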

The KNIME workflow for Tagging. The “cool” dictionary: Positive words on
the left, negative words on the right.


5. Calculate Brand Reputation scores

We now reconvert our tags to strings, and we only keep the words in our document that have been tagged by our
dictionaries. The reason for this is that way we have less
data to process. After filtering out the words that do not
have any tags using the Row Filter node, we use the TF
node to count the occurrences of each term in the
document. This will give us a document where we see the
count of a specific term in any tweet. If you look closely,
you will see that every tag/tweet combination has its own
row. This means that if we have two tags in a tweet, we
will have two rows. We will later use a Pivoting node to
sum up tweets and tags.
To group our data by time (in our case by months, but this depends on the data that
you have collected –the workflow on the Hub aggregates data by day and hour), we
must extract date and time fields. We manipulate the data in such a way that we end
up having different tag frequencies in the columns and time info in the rows.
In the upper branch, we group by the time dimension we want, e.g., months, taking the
mean of timestamp. In the lower branch, we group by time, taking as the pivot the sum
of the tag-counts. In the Joiner node, we join the tables by time.
After this, we also handle any missing values by fixing them to the value “0”.
It is worth mentioning that the column names correspond to the tag values that we
used during the dictionary tagging process (e.g., A, ADV, etc.). Therefore, we rename
the columns with the names of the brand sub-drivers (cool, exciting, innovative, social
responsibility).

Construct Operationalization. Term frequencies and missing values.


Net and average scores


When you look at the table, you will see that there are positive and negative columns
for each sub-driver. Using a series of Math Formula nodes, we subtract the negative
column from the positive column for each sub-driver. In this way, we obtain the net
scores. After that, we take the net scores and average them over the four brand sub-
drivers. In this way, we obtain the “Brand Driver” average. If we inspect the output table,
we will now have five columns: “Cool Net” (which is obtained by subtracting
“Cool_Negative” from “Cool_Positive”), “Innovative Net”, “Exciting Net”, “Soc. Resp Net”
and, as stated before, the “Brand Driver”, which is the average of those four attributes.
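In pandas terms, the same arithmetic can be sketched as follows (a hypothetical illustration; the column names are simplified versions of those in the workflow):

    import pandas as pd

    # One row per time bucket (e.g., month) with positive/negative counts per sub-driver.
    df = pd.DataFrame({
        "Cool_Positive": [12, 9], "Cool_Negative": [3, 5],
        "Exciting_Positive": [7, 4], "Exciting_Negative": [2, 6],
        "Innovative_Positive": [10, 8], "Innovative_Negative": [1, 2],
        "SocResp_Positive": [5, 11], "SocResp_Negative": [4, 3],
    })

    for driver in ["Cool", "Exciting", "Innovative", "SocResp"]:
        df[f"{driver} Net"] = df[f"{driver}_Positive"] - df[f"{driver}_Negative"]

    net_cols = [c for c in df.columns if c.endswith(" Net")]
    df["Brand Driver"] = df[net_cols].mean(axis=1)

    print(df[net_cols + ["Brand Driver"]])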

Net scores of the Brand sub-drivers.

6. Visualize Brand Reputation over time

Finally, we normalize all our values to make it easier to understand
changes in time and across drivers (in case you add new drivers, such
as the “Relationship driver” and its sub-drivers). To visualize the
evolution of the selected sub-drivers over time, we use the Line Plot
node. We need to make sure to choose the time dimension on the x-axis and our drivers
(depending on how detailed we want our analysis to be) on the y-axis. If you look at our
example, you can see that Amazon’s Brand has been perceived as less innovative and
exciting throughout the year, while the overall perception of its social responsibility
seems to have improved throughout 2022.

Example visualization of Amazon with the four Brand sub-drivers in 2022.


Automate brand reputation analysis with no-code


Measuring brand reputation is not a trivial task. While most well-established metrics
get the job done, they usually fail to do so in real-time. In this section, we introduced a
brand reputation tracker that uses real-time social media data to measure the
occurrence and trends of one major Brand reputation driver: Brand and its four sub-
drivers, i.e., Cool, Exciting, Innovative, and Social Responsibility.
To ensure a fully automated and transparent process, we relied on KNIME capabilities
to connect, search, and retrieve Twitter data around a chosen brand without a single
line of code. Likewise, the no-code steps to process tweet texts, assign tags and
visualize how brand reputation changes over time can be reused and extended
conveniently beyond the scope of our example.


Analyze Customer Sentiment

Sentiment analysis of free-text documents, also known as opinion mining, is a data mining task that relies on Natural Language Processing (NLP) techniques to
automatically determine whether a text leaves a positive, negative, or neutral
impression. Sentiment analysis is often used by marketers and researchers to analyze
customer feedback in online reviews or on social media platforms, gauge opinions
around a new product launch, evaluate brand reputation, compare themselves to
competitors, provide proofs of concept, measure marketing/PR efforts, detect or
predict potential PR crises, or identify influencers.
Working with natural languages to extract actionable insights is, however, not trivial.
They are articulated and interdependent “organisms”, which pose a number of
challenges for automatically understanding and processing them. From phrasing
ambiguities to misspellings, words with multiple meanings, phrases with multiple
intentions, and more, designing a sophisticated approach that accounts for context,
variations in meaning, and the full complexity of a message is an arduous task.
There are different approaches to analyzing the sentiment of texts. In this section, we
collected the work of different authors to illustrate four well-established approaches
in the data science community:
1. Lexicon-based sentiment analysis
2. Machine learning-based sentiment analysis
3. Deep learning-based sentiment analysis
4. Transformer-based language models for sentiment analysis


1. Lexicon-based sentiment analysis

Author: Aline Bessa, KNIME

Workflows on the KNIME Community Hub: Building Sentiment Predictor - Lexicon Based and Deploying
Sentiment Analysis Predictive Model - Lexicon Based Approach

Before purchasing a product, people often search for reviews online to help them
decide if they want to buy it. These reviews usually contain expressions that carry so-
called emotional valence, such as “great” (positive valence) or “terrible” (negative
valence), leaving readers with a positive or negative impression.
In lexicon-based sentiment analysis, words in texts are labeled as positive or negative
(and sometimes as neutral) with the help of a so-called “valence dictionary”. Take the
phrase: “Good people sometimes have bad days”. A valence dictionary would label the
word “Good” as positive; the word “bad” as negative; and possibly the other words as
neutral.
Once each word in the text is labeled, we can derive an overall sentiment score by
counting the numbers of positive and negative words and combining these values
mathematically. A popular formula to calculate the sentiment score (StSc) is:

StSc = (number of positive words − number of negative words) / (total number of words)
If the sentiment score is negative, the text is classified as negative. It follows that a
positive score means a positive text, and a score of zero means the text is classified
as neutral.
Note that in the lexicon-based approach we don’t use machine learning models: The
overall sentiment of the text is determined on-the-fly, depending only on the dictionary
that is used for labeling word valence.
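Because no model is involved, the whole approach fits in a few lines of code. The sketch below uses tiny, made-up word lists rather than the MPQA lists the workflow relies on, and exists only to make the formula concrete.

```python
# A minimal sketch of lexicon-based scoring, assuming tiny illustrative word lists
# (the actual workflow uses the much larger MPQA positive/negative lists).
POSITIVE = {"good", "great", "pretty", "better"}
NEGATIVE = {"bad", "terrible", "worst"}

def sentiment_score(text):
    words = text.lower().split()
    n_pos = sum(w in POSITIVE for w in words)
    n_neg = sum(w in NEGATIVE for w in words)
    # StSc = (positive - negative) / total number of words
    return (n_pos - n_neg) / len(words) if words else 0.0

def classify(text):
    score = sentiment_score(text)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(classify("Good people sometimes have bad days"))  # neutral (one positive, one negative)
print(classify("great flight, great crew"))             # positive
```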

Tip. Valence dictionaries are language-dependent and are usually available at the
linguistic department of national universities. As an example, consider these
valence dictionaries for:
Chinese (http://compling.hss.ntu.edu.sg/omw/)
Thai (http://compling.hss.ntu.edu.sg/omw/)

The workflow we now want to walk through involved building a lexicon-based predictor
for sentiment analysis. We used a Kaggle dataset containing over 14K customer
reviews on six US airlines.


Each review is a tweet annotated as positive, negative, or neutral by contributors. One of our goals is to verify how closely our sentiment scores match the sentiment determined by these contributors. This will give us an idea of how promising and efficient this approach is. Let’s now walk through the different parts of the workflow.

Preprocessing steps

Before building our sentiment predictor, we need to clean up and preprocess our data a little. We start by removing duplicate tweets from the dataset with the Duplicate Row Filter node. To analyze the tweets, we now need to convert their content and the contributor-annotated overall sentiment of the remaining tweets into documents using the Strings To Document node.

Preprocessing steps for our lexicon-based analyzer.

Tip. Document is the required data type for most text mining tasks in KNIME, and
the best way to inspect it is with the Document Viewer node.

These documents contain all the information we need for our lexicon-based analyzer,
so we can now exclude all the other columns from the processed dataset with
the Column Filter node.

Tagging words as positive or negative

We can now move on and use a valence dictionary to label all the words in the
documents we have created for each tweet. We used an English dictionary from
the MPQA Opinion Corpus which contains two lists: one list of positive words and one
list of negative words.

Tip. Alternative formulas for sentiment scores calculate the frequencies of neutral
words, either using a specific neutral list or by tagging any word that is neither
positive nor negative as neutral.

Each word in each document is now compared against the two lists and assigned a sentiment tag. We do this by using two instances of the Dictionary Tagger node. The goal here is to ensure that sentiment-laden words are marked as such and then to process the documents again keeping only those words that were tagged (with the Tag Filter node).

Tagging words in tweets as positive or negative based on an MPQA dictionary.


Note that since we’re not tagging neutral words here, all words that are not marked as
either positive or negative are removed in this step.
Now, the total number of words per tweet, which we need to calculate the sentiment
scores (see formula above), is equivalent to the sum of positive and negative words.

Counting numbers of positive and negative words

What we now want to do is generate sentiment scores for each tweet. Using our filtered
lists of tagged words, we can determine how many positive and negative words are
present in each tweet.
We start this process by creating bags of words for each tweet with the Bag Of Words
Creator node. This node creates a long table that contains all the words from our
preprocessed documents, placing each one into a single row.
Next, we count the frequency of each tagged word in each tweet with the TF node. This
node can be configured to use integers or weighted values, relative to the total number
of words in each document. Since tweets are very short, using relative frequencies
(weighted values) is not likely to offer any additional normalization advantage for the
frequency calculation. For this reason, we use integers to represent the words’
absolute frequencies.
We then extract the sentiment of the words in each tweet (positive or negative) with the Tags to String node, and finally calculate the overall numbers of positive and negative words per document by summing their frequencies with the Pivoting node. For consistency, if a tweet does not have any negative or positive words at all, we set the corresponding number to zero with the Missing Value node. For better readability, this process is encapsulated in a metanode in the workflow.

Tagging words in tweets as positive or negative based on an MPQA dictionary. These nodes are wrapped in the Number of Positive and Negative Words per Tweet metanode.

Calculating sentiment scores

We’re now ready to start the fun part and calculate the sentiment score (StSc) for each tweet. We get our sentiment score by calculating the difference between the numbers of positive and negative words, divided by their sum (see formula for StSc above) with the Math Formula node. Based on this value, the Rule Engine node decides whether the tweet has positive or negative sentiment.

Determining the sentiment of tweets by calculating sentiment scores (StSc).


Evaluating our lexicon-based predictor

Since the tweets were annotated by actual contributors as positive, negative, or neutral,
we have gold data against which we can compare the lexicon-based predictions.
Remember that StSc > 0 corresponds to a positive sentiment; StSc = 0, to a neutral
sentiment; and StSc < 0, to a negative sentiment. These three categories can be seen
as classes, and we can easily frame our prediction task as a classification problem.
To perform this comparison, we start by setting the column containing the
contributors’ annotations as the target column for classification, using the Category
to Class node. Next, we use the Scorer node to compare the values in this column
against the lexicon-based predictions.
It turns out that the accuracy of this approach is rather low: the sentiment of only 43%
of the tweets was classified correctly. This approach performed especially badly for
neutral tweets, probably because both the formula for sentiment scores and the
dictionary we used do not handle neutral words directly. Most tweets, however, were
annotated as negative or positive, and the lexicon-based predictor performed slightly
better for the former category (F1-scores of 53% and 42%, respectively).
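The Scorer node computes exactly the kind of metrics shown above. If you prefer to see the evaluation step in code, here is a small scikit-learn sketch with made-up gold labels and predictions, offered only to illustrate accuracy and per-class F1.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical gold annotations and lexicon-based predictions for a handful of tweets
gold = ["negative", "negative", "positive", "neutral", "positive", "negative"]
pred = ["negative", "positive", "positive", "positive", "neutral",  "negative"]

print(f"Accuracy: {accuracy_score(gold, pred):.2f}")
print(classification_report(gold, pred, zero_division=0))  # per-class precision, recall, F1
```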

Deploying and visualizing a lexicon-based predictor

After assessing the performance of our predictor, we implement a second workflow to show how our predictive model could be deployed on unlabeled data. The mechanics
of both workflows are very similar, but there are a few key differences.
First, this deployment workflow implements a component named Tweet Extraction,
which includes the Twitter API Connector node –to connect to the Twitter API– and
the Twitter Search node to query the API for tweets with a given hashtag. We used
configuration nodes inside the component to enable users to enter their Twitter
credentials and specific search query. Configuration nodes within the component
create the configuration dialogue of the components for the Twitter credentials and
the search query. By default, the Twitter API returns the tweets from last week, along
with data about the tweet, the author, the time of tweeting, the author’s profile image,
the number of followers, and the tweet ID.
This process is followed by some post-processing that will help improve visualizations
of our data further down the line. On the one hand we want to remove retweets as they
can become excessive and impair the legibility of visualizations – and on the other
hand – we want to format the tweets’ timestamps to make the visualizations less
cluttered.


Structure of the Tweet Extraction component.

The process of tagging words in tweets, deriving sentiment scores and finally
predicting their sentiments is no different from what we described for the first
workflow. However, since we do not have labels for the tweets here, we can only
assess the performance of the lexicon-based predictor subjectively, relying on our own
judgment.

Dashboard to visualize and assess performance

To help us assess the performance, we implemented a composite visualization that combines (1) the tweets; (2) a word cloud in which the tweets appear with sizes that
correspond to their frequency; and (3) a bar chart with the number of tweets per
sentiment per date. This visualization is interactive. Users can click different bars in
the bar chart to modify the content selection of the word cloud, for example.
In terms of structure, the workflow uses the Document Data Extractor node to retrieve
all tweet information stored in the Document column, and the Joiner node to join the
profile image back to the tweet. Next, a dashboard produces the word cloud, the bar
chart, and a table with all extracted tweets.
Now let’s take a closer look at how this predictor works in practice. Consider at first
the content of the following tweet:
@VirginAmerica I <3 pretty graphics. so much better than minimal iconography. :D

The words “pretty” and “better” were tagged by our predictor as positive; no words were
tagged as negative. This led to StSc = 1, meaning the tweet was then classified as
having positive sentiment –something with which most annotators may agree. Let’s
now take a look at a different example, for which our predictor fails:


@AmericanAir @SKelchlin What kind of response is this from an airline? "When they can?". How about an apology.

The word “kind” was tagged as positive, even though it does not correspond to a
positive adjective in this context, and no words were tagged as negative. Consequently,
the tweet was classified as positive even though it in fact corresponds to a complaint.

Dashboard of unlabeled tweets and their predicted sentiments per date. A quick inspection confirms the low
performance of this predictor, assessed quantitatively during training.

These examples illustrate a few limitations of the lexicon-based approach: it does not
take the context around words into consideration, nor is it powerful enough to handle
homonyms, such as “kind”.

Sentiment predictor: Insightful for baseline analysis

Lexicon-based sentiment analysis is an easy approach to implement and can be customized without much effort. The formula for calculating sentiment scores could,
for example, be adjusted to include frequencies of neutral words and then verified to
see if this has a positive impact on performance. Results are also very easy to interpret,
as tracking down the calculation of sentiment scores and classification is
straightforward.
Despite its low performance, a lexicon-based sentiment predictor is insightful for
preliminary, baseline analysis. It provides analysts with insights at a very low cost and
saves them a lot of time otherwise spent analyzing data in spreadsheets manually.


2. Machine learning-based sentiment analysis

Authors: Kilian Thiel & Lada Rudnitckaia, KNIME

Workflow on the KNIME Community Hub: Sentiment Analysis (Classification) of Documents

In this section, we go beyond the basic, dictionary-based solution to sentiment analysis and raise the complexity bar a bit. We will show you how to automatically
assign predefined sentiment labels to documents, using the KNIME
Textprocessing extension in combination with the traditional KNIME Learner-Predictor
construct for supervised machine learning.
In a supervised machine learning paradigm, we train one or multiple mining algorithms
(e.g., Decision Tree, Random Forest, Support Vector Machines, Neural Networks, etc.)
using a labelled training dataset to classify data (e.g., a sentiment label) or predict
outcomes (e.g., the average income of students) accurately. As input labelled data is
fed into the model, this learns automatically by adjusting its weights until it has been
fitted appropriately. The trained model is then applied to the test dataset to output
predictions. Finally, we assess the model’s performance and decide whether it is
satisfactory or more training is required.
To train our supervised predictor, a set of 2000 documents has been sampled from the
training set of the Large Movie Review Dataset v1.0. The Large Movie Review Dataset
v1.0 contains 50000 English movie reviews along with their associated sentiment
labels "positive" and "negative". We sampled 1000 documents of the positive group
and 1000 documents of the negative group. The goal here is to assign the correct
sentiment label to each document.

This workflow pre-processes the text documents, creates document vectors, and trains two machine learning models to assign sentiment labels to the documents.


Note. The workflow that we’ll illustrate in this section is a slight variation of the
workflows stored in the Machine Learning and Marketing repository: Building a
Sentiment Analysis Predictive Model - Supervised Machine Learning and Deploying
a Sentiment Analysis Predictive Model - Supervised Machine Learning. Nevertheless,
the underlying rationale and the use of the Learner-Predictor construct for
supervised learning is the same.

The workflow starts with the CSV Reader node, which reads a CSV file containing the review texts, their associated sentiment labels, the IMDb URL of the corresponding movie, and their index in the Large Movie Review Dataset v1.0. The important columns are the text and the sentiment. In the first metanode, "Document Creation", Document cells are
created from the string cells, using the Strings To Document node. The sentiment
labels are stored in the category field of each document to remain available for later
tasks; and all columns, except the Document column, are filtered out.
The output of the first metanode "Document Creation" is a data table with only one
column containing the Document cells.

Preprocess the texts

The textual data is preprocessed by various nodes provided by the KNIME Textprocessing extension. All preprocessing steps are applied in the second
metanode "Preprocessing", as shown in the figure below.

The “Preprocessing” metanode performs text-specific data preprocessing.

First, punctuation marks are removed by the Punctuation Erasure node, then numbers
and stop words are filtered, and all terms are converted to lowercase. After that, the
stem is extracted from each word using the Snowball Stemmer node. Indeed, the
words “selection”, “selecting” and “to select” refer to the same lexical concept and
carry the same information in a document classification or topic detection context.
Besides English texts, the Snowball Stemmer node can be applied to texts in various languages, e.g., German, French, Italian, Spanish, etc. The node uses the Snowball stemming library.
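To make the preprocessing chain more tangible, here is a short Python sketch of the same steps (punctuation erasure, number and stop-word filtering, lowercasing, Snowball stemming) using NLTK's Snowball stemmer. The stop-word list is a tiny illustrative one, not the list the KNIME nodes use.

```python
import re
from nltk.stem.snowball import SnowballStemmer  # pip install nltk

STOP_WORDS = {"the", "a", "and", "to", "of", "is", "it"}  # tiny illustrative list
stemmer = SnowballStemmer("english")

def preprocess(text):
    text = re.sub(r"[^\w\s]", " ", text)                  # erase punctuation
    tokens = [t.lower() for t in text.split()]            # convert to lowercase
    tokens = [t for t in tokens if not t.isdigit()]       # filter numbers
    tokens = [t for t in tokens if t not in STOP_WORDS]   # filter stop words
    return [stemmer.stem(t) for t in tokens]              # extract the stems

print(preprocess("Selecting the best movies of 2010 is a waste of time!"))
# ['select', 'best', 'movi', 'wast', 'time']
```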

Extract features and create document vectors

After all this preprocessing, we reach the central point of the analysis which is to
extract the terms to use as components of the document vectors and as input features
to the classification model.


To create the document vectors for the texts, first we create their bag of words using
the Bag Of Words Creator node; then we feed the data tables containing the bag of
words into the Document Vector node. The Document Vector node will consider all
terms contained in the bag of words to create the corresponding document vector.
Notice that a text is made of words, while a document –which is a text enriched with additional information such as category or authors– contains terms, i.e., words enriched with additional information such as grammar, gender, or stem.

How to avoid performance issues when texts are long?

Since texts are quite long, the corresponding bags of words can contain many words,
the corresponding document vector can have a very high dimensionality (too many
components), and the classification algorithm can suffer in terms of speed
performance. However, not all words in a text document are equally important.
A common practice is to filter out all least informative words and keep only the most
significant ones. A good measure of word importance can be indicated by the number
of occurrences of words in each single document as well as in the whole dataset.
Based on this consideration, after the bag of words has been created, we filter out all
terms that occur in less than 20 documents inside the dataset. Within a GroupBy node,
we group by terms and count all unique documents containing a term at least once.
The output is a list of terms with the number of documents in which they occur.
We filter this list of terms to keep only those terms with a number of documents greater
than 20, and then we filter the terms in each bag of words accordingly, with
the Reference Row Filter node. In this way, we reduce the feature space from 22379
distinct words to 1499. This feature extraction process is part of the "Preprocessing"
metanode and can be seen in the figure below.

Feature extraction –creating and filtering the bags of words.

We set the minimum number of documents to 20 since we assume that a term has to
occur in at least 1% of all documents (20 of 2000) in order to represent a useful feature
for classification. This is a rule of thumb and of course can be optimized.


Document vectors are now created, based on these extracted words (features).
Document vectors are numerical representations of documents. Here each word in the
dictionary becomes a component of the vector and for each document assumes a
numerical value, that can be 0/1 (0 absence, 1 presence of word in document) or a
measure of the word importance within the document (e.g., word scores or
frequencies). The Document Vector node allows for the creation of bit vectors (0/1) or
numerical vectors. As numerical values, previously calculated word scores or
frequencies can be used, e.g., by the TF or IDF nodes. In our case bit vectors are used.
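Both ideas, bit document vectors and dropping rare terms by document frequency, can be sketched with scikit-learn's CountVectorizer. This is a stand-in for the KNIME nodes, not their implementation; the toy reviews below use a minimum document frequency of 1 instead of the 20 used in the workflow.

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["a truly great movie", "great plot, great cast", "a bad, boring waste of time"]

# binary=True -> bit vectors (1 = term present, 0 = absent);
# min_df drops terms occurring in fewer than N documents (20 in the workflow,
# 1 here because the toy corpus is tiny)
vectorizer = CountVectorizer(binary=True, min_df=1)
doc_vectors = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())  # the retained vocabulary (vector components)
print(doc_vectors.toarray())               # one 0/1 document vector per review
```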

Use supervised mining algorithm for classification

For classification we can use any of the traditional supervised mining algorithms
available in KNIME Analytics Platform (e.g., Decision Tree, Random Forest, Support
Vector Machines, Neural Networks, and much more).
As in all supervised mining algorithms we need a target variable. In our example, the
target is the sentiment label, stored in the document category. Therefore, the target or
class column is extracted from the documents and appended as a string column, using
the Category To Class node. Based on the category, a color is assigned to each
document by the Color Manager node. Documents with the label "positive" are colored
green, documents with the label "negative" are colored red.
As classification algorithms we used a Decision Tree Learner and a XGBoost Tree
Ensemble Learner nodes applied to a training (70%) and test set (30%), randomly
partitioned from the original dataset. The accuracy of the Decision Tree is 91.3%,
whereas the accuracy of the XGBoost Tree Ensemble is 92.0%. Both values are
obtained using the Scorer node. The corresponding ROC curves are presented below.

ROC Curve for the Decision Tree and the XGBoost models.
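The workflow itself is entirely no-code, but the same Learner-Predictor-Scorer logic can be sketched in a few lines of scikit-learn. The corpus and labels below are made up for illustration and do not reproduce the movie-review results reported above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy labeled reviews standing in for the 2000 sampled movie reviews
texts = ["great film, great acting", "a bad waste of time", "wonderful and moving",
         "boring, bad plot", "great story", "terrible film, bad acting"] * 20
labels = ["positive", "negative", "positive", "negative", "positive", "negative"] * 20

# Bit document vectors (min_df=20 in the real workflow; 1 here for the toy corpus)
X = CountVectorizer(binary=True, min_df=1).fit_transform(texts)

# 70/30 split, train the tree (Learner), predict (Predictor), score (Scorer)
X_train, X_test, y_train, y_test = train_test_split(X, labels, train_size=0.7, random_state=0)
tree = DecisionTreeClassifier().fit(X_train, y_train)
print(f"Decision tree accuracy: {accuracy_score(y_test, tree.predict(X_test)):.3f}")
```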


While performing slightly better, the XGBoost Tree Ensemble doesn’t have the
interpretability that the Decision Tree can provide. The next figure displays the first two
levels of the Decision Tree. The most discriminative terms with respect to the
separation of the two classes are "bad", "wast", and "film". If the term "bad" occurs in a
document, it is likely to have a negative sentiment. If "bad" does not occur but "wast"
(stem of waste), it is again likely to be a negative document, and so on.

Decision Tree view allows us to investigate which features, in our case words or their
stems, contribute the most to separating documents in the two sentiment classes.


How to improve classification

What we’ve shown in this section is just a quick tutorial on how to approach sentiment
classification with supervised machine learning algorithms. Of course, much more
than that can be done on much larger datasets and on much more complex sentences.
There are three broader ways we can improve this classification:

• Use a larger dataset.


• Improve data preparation.
• Improve the models, either by optimizing the trained ones or by using different
architectures.
The preprocessing chain could be optimized for better cleaning and transformations
especially in terms of the adopted classification algorithm. For example, instead of bit
vectors, numerical vectors could also be created by the Document Vector node.
Furthermore n-gram features could be used in addition to single words to consider
negations, such as "not good" or "not bad".
Fine tuning the hyper-parameters can also improve the performance of the trained
models.
Other classification learners, such as the Tree Ensemble Learner, Naive Bayes
Learner, or SVM Learner can be applied as well.
Finally, deep learning methods, based on neural networks, especially Recurrent Neural
Networks (RNN) and Long Short-Term Memory (LSTM) networks, have been
successfully applied to sentiment analysis. Some neural layers can extract the text
features by projecting them on a lower dimensionality space; some neural
architectures, such as LSTM-based networks, can take word order and context into
account. This can help detect the sentiment of more complex sentences based on the
surrounding context.


3. Deep learning-based sentiment analysis

Author: Aline Bessa, KNIME

Workflows on the KNIME Community Hub: Building a Sentiment Analysis Predictive Model - Deep Learning
using an RNN and Deploying a Sentiment Analysis Predictive Model - Deep Learning using an Recurrent
Neural Network (RNN)

The ability to automatically analyze customer feedback helps businesses process huge amounts of unstructured data quickly, efficiently, and cost effectively. Feedback usually comes in the form of text data, and to leverage it fully we need to make sure that our models are advanced enough to accurately capture the complexity of the message.
This means that we need models that are able to map long sequences and
disambiguate words according to context.
Traditional algorithms and simple neural networks are not well suited for this task,
since it is unclear how their reasoning about previous events could be used –if at all–
to inform later ones. On the other hand, deep learning techniques, such as Recurrent
Neural Networks (RNNs), have proven to be superior in analyzing and representing
sequential data (i.e., language structures) because they have node connections that
form a graph along a sequence of data, mimicking memory and allowing context
information to persist.
RNNs are also a great candidate for sentiment analysis of text because they are
capable of processing variable-length input sequences, such as words and sentences.
In fact, RNNs have shown great success in many sequence-based problems, such as
text classification and time series analysis.
Most of this success, however, is linked to a very special kind of RNN, LSTM-based
Neural Networks. The performance of the original RNNs decreases as the gap between
previous relevant information and the place where it is needed grows. Imagine you are
trying to predict the last word in the sentence “The clouds are in the sky”. Given this
short context window, it is extremely likely that the last word is going to be sky. In such
cases, where the gap between the context and the prediction is small, RNNs perform
consistently well. As this gap grows, which is very common in more complex texts, it
is better to use LSTMs. The reason why LSTMs are so good at keeping up with larger
gaps has to do with their design: just like traditional RNNs, they also have a graph-like
structure with repeating modules, but LSTMs are more complex and tackle longer-term
dependencies much better.
The goal of our work is to build a predictor for sentiments (positive, negative, or
neutral) in US airline reviews. To train and evaluate the predictor, we use
a Kaggle dataset with over 14K customer reviews in the format of tweets, which are


annotated as positive, negative, or neutral by contributors. The closer the predicted sentiments are to the contributors’ annotations, the more efficient our predictor is.

Preliminary steps – KNIME Keras and Python integration

In order to execute the workflows that we are going to discuss, you will need to:

• Install the KNIME Deep Learning - Keras Integration extension.

• Use the Conda Environment Propagation node for the workflows in this tutorial,
which ensures the existence of a Conda environment with all the needed
packages. An alternative to using this node is setting up your own Python
integration to use a Conda environment with all packages.

Note. These workflows were tailored for Windows. If you execute them on another
system, you may have to adapt the environment of the Conda Environment
Propagation node.

How to define the neural network architecture

We start our solution by defining a neural network architecture to work with. In particular, we need to decide which Keras layers to use and in what order.

Keras Input Layer


In KNIME, the first layer of any Keras neural network, which receives the input
instances, is represented with the Keras Input Layer node. In our application, the input
instances are tweets. This layer requires a specification of the input shape, but since
our instances are tweets with varying numbers of words (that is, with varying shapes),
we set the parameter shape as “?”. This allows the network to handle different
sequence lengths.

Keras Embedding Layer (and why we need one)


After an instance is processed in the input layer, we send it to the Keras Embedding
Layer node, which encodes the tweets into same-sized representations (embeddings).
Working with same-sized sequences is going to be important when we get to the
training part of our workflow. In general, RNNs can handle sequences with different
lengths, but during training they must have the same length.
To understand what this layer does, let’s first discuss a very naive way of representing
tweets as embeddings. The tweets in this project use a vocabulary with a large number
X of words, and we could easily create an X-sized vector to represent each tweet, such
that each index of the vector would correspond to a word in the vocabulary. In this
setting, we could fill these vectors as follows:

88
Consumer Mindset Metrics
Analyze Customer Sentiment

• For each word that is present in a tweet, set its corresponding entry in the tweet’s
X-sized vector as 1.

• Set all the other vector entries as 0.


This is a very traditional way of encoding text, and leads to same-sized representations
for the tweets regardless of how many words they actually contain.
Enter the embedding layer! It receives these X-sized vectors as input and embeds them
into much smaller and denser representations, which are still unique –that is, two
different tweets will never be embedded in the same way. Here, we configure the Keras
Embedding Layer node by setting two parameters: the input dimension and the output
dimension. The input dimension corresponds to the total number of words present in
the set of tweets –that is, the vocabulary– plus one. We will explain how to determine
this value later on in this article, but it basically depends on the native encoding of your
operating system. In Windows, the vocabulary size for our tweet dataset is calculated
as 30125, so we set the input dimension as 30126. The output dimension corresponds
to the size of the embeddings, and we set it as 128. This means that regardless of how
many words each tweet originally has, it will be represented as an embedding (you can
see it as a vector) of size 128. A massive economy in how much space is used!
The larger the embeddings are, the slower your workflow will be, but you also do not
want to use a size that is very small if you have a large vocabulary and many tweets:
for our application, size 128 hits the spot.

LSTM Layer
We connect the Keras Embedding Layer node to the star of our neural network:
the Keras LSTM Layer node. When configuring this layer, we need to set its number
of units. The more units this layer has, the more context is kept – but the workflow
execution also becomes slower. Since we are working with short text in this application
(tweets), 256 units suffice.

Dense Layer
Finally, we connect the Keras LSTM Layer node to a Keras Dense Layer node that
represents the output of our neural network. We set the activation function of this layer
as Softmax to output the sentiment class probabilities (positive, negative, or neutral)
for each tweet. The use of Softmax in the output layer is very appropriate for multiclass
classification tasks, with the number of units being equal to the number of classes.
Since we have 3 classes, we set the number of units in the layer as 3.


The network architecture that we use for this Sentiment Analysis task.
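The same four-layer architecture can be written down directly in Keras. The sketch below mirrors the nodes described above (variable-length input, a 30126-to-128 embedding, an LSTM with 256 units, and a 3-unit Softmax output); it is an illustration in plain code, not what the KNIME Keras nodes execute internally.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Input of variable length (shape "?" in the Keras Input Layer node)
inputs = layers.Input(shape=(None,), dtype="int32")

# Embedding: vocabulary size + 1 (30126 on Windows in this example) -> 128-dimensional vectors
x = layers.Embedding(input_dim=30126, output_dim=128)(inputs)

# LSTM layer with 256 units to keep track of context
x = layers.LSTM(256)(x)

# Dense output layer: 3 units (positive, negative, neutral) with Softmax
outputs = layers.Dense(3, activation="softmax")(x)

model = keras.Model(inputs, outputs)
model.summary()
```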

Preprocess the data before building the predictor

Before building our sentiment predictor, we need to preprocess our data. After reading
our dataset with the CSV Reader node, we associate each word in the tweets’
vocabulary with an index. In this step, which is implemented in the Index Encoding and
Zero Padding metanode, we break the tweets into words (tokens, more specifically)
with the Cell Splitter node.
Here is where the native encoding of your operating system may make a difference in
the number of words you end up with, leading to a larger or smaller vocabulary. After
you execute this step, you may have to update the parameter input dimension in
the Keras Embedding Layer to make sure that it is equal to the number of words
extracted in this metanode.
Since it is important to work with same-length tweets in the training phase of our
predictor, we also add zeros to the end of their encodings in the Index Encoding and
Zero Padding metanode, so that they all end up with the same length. This approach is
known as zero padding.

Structure of the Index Encoding and Zero Padding metanode.
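Index encoding and zero padding are easiest to grasp with a tiny example. The Keras preprocessing utilities below illustrate the idea; the KNIME metanode performs the same steps with the Cell Splitter and related nodes, and the two example tweets are made up.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tweets = ["I love pretty graphics", "luggage lost and no response at all"]

# Index encoding: map every word in the vocabulary to an integer index
tokenizer = Tokenizer()
tokenizer.fit_on_texts(tweets)
encoded = tokenizer.texts_to_sequences(tweets)

# Zero padding: append zeros so all sequences share the same length
padded = pad_sequences(encoded, padding="post")
print(padded)
# [[ 1  2  3  4  0  0  0]
#  [ 5  6  7  8  9 10 11]]
```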

Already outside of the metanode, the next step is to use the Table Writer node to save
the index encoding for the deployment phase. Note that this encoding can be seen as
a dictionary that maps words into indices.
The zero-padded tweets output by the metanode also come with their annotated
sentiment classes, which also need to be encoded as numerical indices for the training
phase. To do so, we use the Category to Number and the Create Collection
Column nodes. The Category to Number node also generates an encoding model with


a mapping between sentiment class and index. We save this model for later use with
the PMML Writer node.
Finally, we use the Partitioning node to separate 80% of the processed tweets for
training, and 20% of them for evaluation.

Train and apply the neural network

With our neural network architecture set up, and with our dataset preprocessed, we
can now move on to training the neural network.
To do so, we use the Keras Network Learner node, which takes the defined network
and the data as input. This node has four tabs, and for now we will focus on the first
three: the Input Data, the Target Data, and the Options tabs.
In the Input Data tab, we include the zero-padded tweets (column “ColumnValues”) as
input. In the Target Data tab, we use column Class as our target and set
the conversion parameter as “From Collection of Number (integer) to One-Hot-Tensor”
because our network needs a sequence of one-hot-vectors for learning.
In the Target Data tab, we also have to choose a loss function. Since this is a multiclass
classification problem, we use the loss function “Categorical Cross Entropy”.
In the Options tab, we can define our training parameters, such as the number
of epochs for training, the training batch size, and the option to shuffle data before
each epoch. For our application, 50 epochs, a training batch size of 128, and the
shuffling option led to a good performance.
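For readers curious about the equivalent training call in plain Keras, here is a compact sketch of these settings (categorical cross entropy, batch size 128, shuffling before each epoch). The data is random dummy data standing in for the zero-padded, one-hot-encoded tweets, and only two epochs are run to keep the toy example quick.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Dummy stand-ins for the real data: 1000 zero-padded index sequences of length 30
# and one-hot encoded sentiment classes (3 classes).
X_train = np.random.randint(0, 30126, size=(1000, 30))
y_train = keras.utils.to_categorical(np.random.randint(0, 3, size=1000), num_classes=3)

model = keras.Sequential([
    layers.Embedding(input_dim=30126, output_dim=128),
    layers.LSTM(256),
    layers.Dense(3, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",   # multiclass classification loss
              metrics=["accuracy"])

model.fit(X_train, y_train,
          epochs=2,          # 50 in the workflow; 2 here to keep the toy run short
          batch_size=128,    # training batch size
          shuffle=True)      # shuffle data before each epoch
```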
The Keras Network Learner node also has an interactive view that shows you how the
learning evolves over the epochs –that is, how the loss function values drop and how
the accuracy increases over time. If you are satisfied with the performance of your
training before it reaches its end, you can click the “Stop Learning” button.


One of the tabs of the Keras Network Learner node’s interactive view, showing how accuracy increases over
time.

We set the Python environment for the Keras Network Learner node, and for the nodes
downstream, with the Conda Environment Propagation node. This is a great way of
encapsulating all the Python dependencies for this workflow, making it very portable.
Alternatively, you can set up your own Python integration to use a Conda environment
with all the packages required.
After the training is complete, we save the generated model with the Keras Network
Writer node. In parallel, we use the Keras Network Executor node to get the class
probabilities for each tweet in the test set. Recall that we obtain class probabilities
here because we use Softmax as the activation function of our network’s output layer.
We show how to extract class
predictions from these probabilities
next.

Training and applying our neural network model.


Evaluate the network predictions

To evaluate how well our LSTM-based predictor works, we must first obtain actual
predictions from our class probabilities. There are different approaches to this task,
and in this application, we choose to always predict the class with the highest
probability. We do so because this approach is easy to implement and interpret,
besides always producing the same results given a certain class probability
distribution.
The class extraction takes place in the Extract Prediction metanode. First, the Many to One node is used to generate a column that contains the class names with the highest probabilities for each test tweet. We post-process the class names a bit with the Column Expressions node, and then map the class indices into positive, negative, or neutral using the encoding model we built in the preprocessing part of the workflow.

Structure of the Extract Prediction metanode.
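Picking the class with the highest probability is a one-line operation in code. The NumPy sketch below uses hypothetical Softmax outputs and an illustrative index-to-label mapping; the actual mapping comes from the encoding model saved during preprocessing.

```python
import numpy as np

# Hypothetical Softmax outputs for three test tweets (columns: class indices 0, 1, 2)
probabilities = np.array([
    [0.10, 0.80, 0.10],
    [0.70, 0.20, 0.10],
    [0.25, 0.30, 0.45],
])

# Mapping produced earlier by the class encoding (illustrative order)
index_to_label = {0: "negative", 1: "neutral", 2: "positive"}

predicted = [index_to_label[i] for i in probabilities.argmax(axis=1)]
print(predicted)  # ['neutral', 'negative', 'positive']
```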
With the predicted classes at hand, we use the Scorer node to compare them against
the annotated labels (gold data) in each tweet. It turns out that the accuracy of this
approach is slightly above 74%: significantly better than what we obtained with
a lexicon-based approach.
Note that our accuracy value, here 74%, can change from one execution to another
because the data partitioning uses stratified random sampling.
Interestingly, the isolated performance for negative tweets –which correspond to 63%
of the test data– was much better than that (F1-score of 85%). Our predictor learned
very discernible and generalizable patterns for negative tweets during training.
However, patterns for neutral and positive tweets were not as clear, leading to an
imbalance in performance for these classes (F1-scores of 46% and 61%, respectively).
Perhaps if we had more training data for classes neutral and positive, or if their
sentiment patterns were as clear as those in negative tweets, we would obtain a better
overall performance.

Deploy and visualize an LSTM-based predictor

After evaluating our predictor over some annotated test data, we implement a second
workflow to show how our LSTM-based model could be deployed on unlabeled data.
First, this deployment workflow re-uses the Tweet Extraction component introduced in
the lexicon-based sentiment analysis. This component enables users to enter their
Twitter credentials and specify a search query. The component in turn connects with
the Twitter API, which returns the tweets from last week, along with data about the


tweet, the author, the time of tweeting, the author’s profile image, the number of
followers, and the tweet ID.
The next step is to read the dictionary created in the training workflow and use it to
encode the words in the newly extracted tweets. The "Index Encoding Based on
Dictionary" metanode breaks the tweets into words and performs the encoding.
In parallel, we again set up a Python environment with the Conda Environment
Propagation node, and connect it to the Keras Network Reader node to read the model
created in the training workflow. This model is then fed into the Keras Network
Executor node along with the encoded tweets, so that sentiment predictions can be
generated for them.
We then read the encoding for sentiment classes created in the training workflow and
send both encoding and predictions to the Extract Prediction metanode. Similar to
what occurs in the training workflow, this metanode predicts the class with the highest
probability for each tweet.
Here we do not have annotations or labels for the tweets: we are extracting them on-
the-fly using the Twitter API. For this reason, we can only verify how well our predictor
is doing subjectively.
To help us in this task, we implement a dashboard that is very similar to the one in our
lexicon-based sentiment analysis project. It combines (1) the extracted tweets; (2) a
word cloud in which the tweets appear with sizes that correspond to their frequency;
and (3) a bar chart with the number of tweets per sentiment per date.
In terms of structure, we use the Joiner node to combine the tweets’ content and their
predicted sentiments with some of their additional fields –including profile images.
Next, we send this information to the Visualization component, which implements the
dashboard.

Note. This component is almost identical to the visualization one discussed in our
lexicon-based sentiment analysis post. Both components just differ a bit in how
they process words and sentiments for the tag cloud, since they receive slightly
different inputs.


Dashboard of unlabeled tweets and their predicted sentiments per date.

A quick inspection of the tweets’ content in the dashboard also suggests that our
predictor is relatively good at identifying negative tweets. An example is the following
tweet, which is clearly a complaint and got correctly classified (negative):
@AmericanAir’s standards are what these days? Luggage lost for over 48 hours and no response AT ALL. All good as long as there’s a customer service standard which there isn’t.

This is aligned with the performance metrics we calculated in the training workflow.
The bar chart also suggests that most tweets correspond to negative comments or
reviews. Although it is hard to verify if this is really the case, this distribution of
sentiment is compatible with the one in the Kaggle dataset we used to create our
model. In other words, this adds to the hypothesis that airline reviews on Twitter tend
to be negative.
Finally, the tweets below give us an opportunity to discuss the limitations of our sentiment analysis. The most common tweet, which is the largest one in the tag cloud, is most likely spam that got classified as positive:
@AmericanAir Who wants to earn over £5000 weekly from bitcoin mining? You can do it all by yourself right from the comfort of your home without stress. For more information WhatsApp +447516977835.

Although this tweet addresses an American airline company, it does not correspond to
a review. This is indicative of how hard it is to isolate high quality data for a task
through simple Twitter API queries.
Note that our model did capture the positivity of the tweet: after all, it offers an “interesting opportunity” to make money in a comfortable and stress-free way (many positive and pleasant words). However, this positivity is not of the type we see in good


reviews of products or companies —human brains quickly understand that this tweet
is too forcedly positive, and likely fake. Our model would need much improvement to
discern these linguistic nuances.

LSTM-based predictor: Better for negative tweets

LSTM-based sentiment analysis is relatively simple to implement and deploy with KNIME. The predictor we discuss in this section performs significantly better than
a lexicon-based one, especially for negative reviews. This probably happens because
(1) the training data for the model contains many more negative reviews, biasing the
learning; and (2) the language patterns in negative reviews are clearer and thus easier
to be learned. To an extent, we could improve the performance of our predictor by
improving the quality and amount of training data.
Adding a dashboard to the deployment workflow was once again useful. Through
visualizations and at a low cost, we could gather insights on how our predictor works
for unlabeled tweets, and even detect a potential data quality issue with spam. This is
exactly the type of resource that saves data analysts a lot of time in the long run.


4. Transformer-based language models for sentiment analysis

Author: Roberto Cadili, KNIME

Workflows on the KNIME Community Hub: Building Sentiment Predictor - BERT and Deploying a Sentiment
Analysis Predictive Model - BERT

In recent years, large transformer-based language models have taken the data
analytics community by storm, obtaining state-of-the-art performances in a wide range
of tasks that are typically very hard, such as human-like text generation, question-
answering, caption creation, speech recognition, etc. These language models, which
rely on advanced deep learning architectures, are also referred to as deep
contextualized language models, because they are exceptionally good at mapping and
leveraging the context-dependent meaning of words to return meaningful predictions.
But there’s no such thing as a free lunch. While amazingly powerful, these models require humongous computational resources and data to be effectively trained. So much so that they are usually developed and released only by the world’s top tech labs and companies.
One of the most ground-breaking examples in the space is BERT. It stands
for Bidirectional Encoder Representations from Transformers and is a deep neural
network architecture built on the latest advances in deep learning for NLP. It was
released in 2018 by Google, and achieved State-Of-The-Art (SOTA) performance in
multiple natural language understanding (NLU) benchmarks. These days, other
transformer-based models, such as GPT-3 or PaLM, have outperformed BERT.
Nevertheless, BERT represents a fundamental breakthrough in the adoption and
consolidation of transformer-based models in advanced NLP applications.

Use BERT with KNIME

Harnessing the power of the BERT language model for multi-class classification tasks
in KNIME Analytics Platform is extremely simple. This is possible thanks to
the Redfield BERT Nodes extension, which bundles up the complexity of this
transformer-based model in easy-to-configure nodes that you can drag and drop from
the KNIME Community Hub. For marketers and other data professionals, this means implementing cutting-edge solutions without writing a single line of code.

BERT nodes in the Redfield extension.


Train a sentiment predictor of US airline reviews

We want to build and train a sentiment predictor of US airline reviews, capitalizing on the sophistication of language representation provided by BERT. For this supervised
task, we used an annotated Twitter US Airline Sentiment dataset with 14,640 entries
available for free on Kaggle. Among other things, the dataset contains information
about the tweetID, the name of the airline, the text of the tweet, the username, and the
sentiment annotation –positive, negative, or neutral.
To build our classifier, we ingest the data, perform minimal preprocessing operations
(i.e., duplicate removal and low casing), and partition the dataset in trainset (80%) and
testset (20%). Next, we rely on the BERT nodes to train and apply our classifier.
The BERT Model Selector node allows us to download the pretrained BERT models
available on TensorFlow Hub and Hugging Face, then store them in a local directory.
We can choose among several model options, considering the size of the transformer
architecture, the language the model was trained in (“eng” for English, “zh” for Chinese,
or “multi” for the 100 languages with the largest Wikipedia pages), or whether the
model is sensitive to cased text. A detailed list of all available models is provided on
the TensorFlow model Hub. For our predictor, we select the uncased English BERT with
a base transformer architecture.

BERT model selection.

The BERT Classification Learner node uses the selected BERT model and adds three
predefined neural network layers: a GlobalAveragePooling, a Dropout, and a Dense
layer. Adding these layers adapts the selected model to a multiclass classification
task.
The configurations of this node are very simple. All we need to do is select the column
containing the tweet texts, the class column with the sentiment annotation, and the
maximum length of a sequence after tokenization. In the “Advanced” tab, we can then
decide to adjust the number of epochs, the training and validation batch size, and the
choice of the network optimizer.


Notice the box “Fine tune BERT.” If checked, the pretrained BERT model will be trained
along with the additional classifier stacked on top. As a result, fine-tuning BERT takes
longer, but we can expect better performance.

Configuration of the BERT Classification Learner node.
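For comparison, the same idea, a pretrained uncased English BERT with a classification head for three sentiment classes, can be sketched with the Hugging Face transformers library. This is not what the Redfield nodes run internally; the model name, the example tweets, and the use of a randomly initialized head (to be fine-tuned afterwards) are assumptions made for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pretrained uncased English BERT with a fresh 3-class classification head on top
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

tweets = ["Luggage lost for over 48 hours and no response at all",
          "Great crew, smooth flight, thank you!"]

# Tokenize with truncation to a maximum sequence length (128 here)
inputs = tokenizer(tweets, padding=True, truncation=True, max_length=128, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
probabilities = torch.softmax(logits, dim=-1)   # class probabilities per tweet
print(probabilities)
# The classification head is randomly initialized at this point: fine-tuning on the
# labeled tweets (e.g., with the transformers Trainer API) is what turns this into
# an actual sentiment predictor.
```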

The BERT Predictor node: This node applies the trained model to unseen tweets in the
testset. Configuring this node follows the paradigm of many predictor nodes in KNIME
Analytics Platform. It’s sufficient to select the column for which we want the model to
generate a prediction. Additionally, we can decide to append individual class
probabilities, or set a custom probability threshold for determining predictions in the
“Multi-label” tab.

Configuration of the BERT Predictor node.


Upon full execution of the BERT Predictor node, we evaluate the performance of the model using the Scorer node, and save it for deployment. The figure below shows that the multiclass sentiment classifier powered by BERT obtained 83% accuracy. Increasing the number of epochs during training is likely to result in even better performance.

Evaluation of the sentiment classifier with BERT.

Building a sentiment predictor using BERT nodes.

Deploy the sentiment predictor on new tweets

After assessing the performance of our predictor, we implement a second workflow to show how our predictive model could be deployed on unlabeled data. The mechanics
of both workflows are very similar.
We collected new tweets around @AmericanAir for a week using the KNIME Twitter
Connectors nodes, then wrapped them in the “Tweet Extraction” component.
Next, we replicate the minimal text preprocessing steps of the training workflow,
import the trained BERT model, and apply it to new unlabeled tweet data using the
BERT Predictor node.
Given the lack of sentiment labels in the newly collected tweets, we can only assess
the performance of the BERT-based predictor subjectively, relying on our own
judgment.


Deploying a sentiment predictor on unlabeled tweets using BERT nodes.

Create a dashboard to visualize and assess performance

To help us assess the performance, we created a dynamic and interactive dashboard that combines (1) the tweets, (2) a word cloud with color-coded tweets whose size
depends on their usage frequency, and (3) a bar chart with the number of tweets per
sentiment per date. Users can click different bars in the bar chart or tweets in the word
cloud to modify the content selection.
Taking a look at the figure below, the chart indicates that the number of tweets with
negative sentiments is considerably larger than those with neutral and positive
sentiments over the week. Similarly, the word cloud is mostly colored red. If we
examine the actual tweets, we can confirm the high performance of the BERT-based predictor assessed during training. It is worth noticing that BERT is
able to correctly model dependencies and context information for fairly long tweets,
leading to accurate sentiment predictions.


Dashboard of unlabeled tweets and their predicted sentiments per date. A quick inspection confirms the high
performance of this predictor.

Higher complexity for better language modeling


BERT-based sentiment analysis is a formidable way to gain valuable insights and
accurate predictions. The BERT nodes in KNIME remove the technical complexity of
implementing this architecture, and let everyone harness its power with just a few
clicks.
While BERT’s underlying transformer architecture, with over 110 million parameters,
makes it fairly obscure to interpret and computationally expensive to train, its ability to
effectively model the complexity of natural languages ensures consistently superior
performance on a number of NLP applications compared to other deep learning
architectures for sequential data or traditional machine learning algorithms. The
tradeoff between complexity and accuracy is here to stay.

Consumer Behavior

In this chapter, we will devise ways to monitor the behavior of our customers and get
insights into their decision-making process. We will show how to retrieve information
about website usage and traffic, identify the risk of customers stopping buying our
products, and give proper credits to marketing channels and touchpoints. More
specifically, we will look at use cases to query Google Analytics, predict customer churn, and build marketing attribution models.

This chapter includes the articles:

• Querying Google Analytics, p. 104


– Rosaria Silipo, KNIME

• Predict Customer Churn, p. 111


– Francisco Villarroel Ordenes, LUISS Guido Carli University
– Rosaria Silipo, KNIME

• Attribution Modeling, p. 116


– Anthony Ballerino, Sapienza University of Rome


Querying Google Analytics

Author: Rosaria Silipo, KNIME

The KNIME Google Connectors extension provides nodes to connect and interact with
Google resources, like Google Sheets, Google Drive, and Google Analytics. Google
Analytics is a set of services provided by Google to investigate traffic and activity on a
website of your property. This is important: the website must belong to you. This
service can be used to give insight into how visitors to your website are using your site.

In this section, we want to show how to use the KNIME Google Connectors to integrate
Google Analytics into your workflow, connect to the Google Analytics API, and recover
the number of pageviews and entrances for new users coming to our web property.

Set up a Google Analytics account


To connect to the Google Analytics API, you need a Google Analytics account. A
Google Analytics account is organized hierarchically. Any Google account –for
example, a Gmail account– can be used to create a Google Analytics account. Each
Google Analytics account refers to a number of web properties. In turn, each web
property is assigned a number of reporting views. On the support.google.com page
you'll find a more detailed description of how Google Analytics accounts are
organized: “Understand the Analytics Account Structure”.

• Log in to Google with a Google account (a Gmail account, for example).

• Go to https://analytics.google.com/analytics/web

• Create an Analytics Account.

• Within the Analytics account, create one or more web properties referring to the
URL of the pages you want to analyze (use the Universal Analytics property).

Connect to Google Analytics in KNIME


On the KNIME side, from your KNIME workflow you will need to:

• Provide authentication on Google to access your Analytics account with the Google Authentication node.

• Select the web property you would like to analyze with the Google Analytics Connection node.

• Run a Google Analytics query to extract the numbers about the web traffic with the Google Analytics Query node.

The three nodes needed to extract measures of web traffic on your web property.

Notice the three ports involved in this data flow:

• The blue square port passes authentication for a Google account connection.

• The red square port provides an Analytics account connection for a specific web
property.

• The familiar black triangle port exposes the extracted measures for the web
traffic.

Provide Google authentication


First, you need to log in to Google with your generic Google account using the Google Authentication node. This node is quite simple to use.
In the node configuration window, first define whether you want to keep your authentication key in memory, in a file, or within the node. Of course, if you save the key within the node and then export the workflow for other people to use, the key will travel with the workflow. So, be careful!
Now select the scope of this Google Authentication. In this case, it is to establish a connection with Google Analytics. We ticked the appropriate checkbox.
Finally, click the “Sign in with Google” button in the top left corner of the node configuration window.

Configuration window of the Google Authentication node.
Notice that Google Authentication is performed on the Google site and no password
or other account information will remain with the node besides the key.


Establish a Google Analytics connection


Next, the Google Analytics Connection node establishes the connection to Google
Analytics. Open the node dialog to specify the Google Analytics account, web property,
profile and profile ID.
If a Google Analytics account is associated with your Google account and if the Google Analytics account already contains web properties, then the Analytics account, web property list, and related data are automatically loaded in the configuration window of the Google Analytics Connection node. Then, you just have to select the web property of interest for your report.

Once the node is executed successfully, the connection to the Google Analytics API
for the selected web property is established.

Configuration window of the Google Analytics Connection node.

Define query with Google Analytics Query


Finally, you can use the Google Analytics Query node to define the query and load the
results from Google Analytics into your KNIME workflow.
The query as well as all query parameters are specified in the configuration window of
the Google Analytics Query node.


Configuration window of the Google Analytics Query node.

Select Dimensions and Metrics

The dimensions and metrics can be selected in the top right part of the configuration
window. There, in the Settings tab, two dropdown menus show the categories and the
related set of metrics and dimensions.
Dimensions are classes such as full referrer, session count, keywords, country,
operating system, and much more. Metrics are aggregations of data such as page
views, users, new users, bounces, AdSense revenues, etc. on the selected dimensions.
Dimensions and metrics can be added to the query via the “Add” button on the right
and they will appear in the respective list on the left. Use the arrow buttons to decide
their order in the output table, the “+” button to add more dimensions and metrics, and
the “X” button to remove them.
For instance, if “operating system” and “browser” are specified as dimensions and
“users” as metric, then each value represents the sum of users for the given
combination of operating system and browser. For country as dimension and page
load time as metric, the resulting value is the average page load time for users of that
country.
From the “Geo Network” category, we selected and added to the query dimensions
“continent” and “country” and from the “Users” category the metric “New Users”. This
translates into extracting the number of new users (metric) coming to our web property
from each continent and country (dimensions).


Such dimensions and metrics apply to the full basis of the web traffic, all users, all
views, all likes, and so on. The data domain, however, can be restricted either via
Segments or via Filters.

Define a Segment

Segments filter the data before the metrics are calculated. The corresponding
dropdown menu in the node configuration window offers predefined segments to
choose from, e.g., new users, returning users, paid traffic, mobile traffic, android traffic,
etc. We selected “mobile traffic”, which means extracting the number of new users
(metric) coming to our web property from each continent and country (dimensions) via
mobile phone.

Set up a Filter based on Dimensions

At the same time, you can restrict the metric domain by setting up a filter based on the
selected dimensions. A filter rule restricts the data after the calculation of the metrics.
Possible operations include less than, greater than, regex matches and many more.
They can also be combined with a logical AND or OR. For a full list of available
operations and details about the syntax please see the node description or the “Google
Analytics developer documentation”.
We introduced “Continent==Europe”, which means extracting the number of new users
(metric) coming to our web property from each continent and country (dimensions) via
mobile phone and exporting only the numbers related to Europe.

Sort Results

Sorts the results by the selected dimension or metric. The sort order can be changed
to descending by prepending a dash.

Specify Start and End Date

Specifies the time frame for the returned data. Both start and end date are inclusive.
We introduced start date 2020-08-22 and end date 2021-08-22, which means
extracting the number of new users (metric) coming to our web property from each
continent and country (dimensions) via mobile phone between August 22, 2020 and
August 22, 2021 included, and exporting only the numbers related to Europe.


Start Index

The API limits one query to a maximum of 10000 rows. To retrieve more rows, the start index parameter can be used as a pagination mechanism.

Set Max Results

Max Results sets the maximum number of rows that should be returned. The
maximum value is 10000. For more details about the parameters and settings see
the “Google Analytics developer documentation”.
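For readers who want to see what such a query looks like outside of KNIME, the sketch below sends a comparable request to the legacy Core Reporting API (v3) for Universal Analytics using Python. It is purely illustrative: the view ID and OAuth access token are placeholders, and the optional mobile-traffic segment is omitted here (it would be passed via the additional “segment” parameter).

```python
import requests

# Placeholders: supply your own reporting view (profile) ID and OAuth 2.0 token
VIEW_ID = "ga:XXXXXXXX"
ACCESS_TOKEN = "<oauth2-access-token>"

params = {
    "ids": VIEW_ID,
    "start-date": "2020-08-22",         # both dates are inclusive
    "end-date": "2021-08-22",
    "metrics": "ga:newUsers",           # metric: new users
    "dimensions": "ga:continent,ga:country",
    "filters": "ga:continent==Europe",  # keep only the numbers related to Europe
    "sort": "-ga:newUsers",             # descending order, as in the node dialog
    "start-index": 1,                   # pagination offset
    "max-results": 10000,               # API maximum per request
}

response = requests.get(
    "https://www.googleapis.com/analytics/v3/data/ga",
    params=params,
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
response.raise_for_status()
for row in response.json().get("rows", []):
    print(row)  # [continent, country, newUsers]
```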

Example 1: Top 100 referring pages to your web site

Configuration dialog to provide the top 100 referrals that brought new users to your website.

The results of this configuration are the top 100 referrals that brought new users to your
website. The dimensions “source” and “referralPath” contain the source addresses and
the path of the page from which the new users came to your website. The metric “new
users” results in the number of new users for every referring page. By sorting by “new
users” in descending order and limiting the max results to 100, we get the 100 most
relevant referrals.


Example 2: Top visited topic pages

Configuration to find out the most relevant topics for a specific month, in this case June 2014.

The results of this configuration are the most relevant topics for the selected date (here the month of June 2014), determined by the total number of pageviews. The dimension “pagePath” is used to list all topics of the website (the site is a forum, so the topics are forum topics). The dimension “pageTitle” contains the corresponding topic name.

• The metric “pageviews” counts the number of views for each page.

• The specified filters remove all pages that had fewer than 100 views and all pages that are not under the forum page.

• The result is then sorted in descending order by pageviews to get the most relevant pages first.

• The specified start and end date keep only the pageviews for the selected time frame.


Predict Customer Churn

Authors: Francisco Villarroel Ordenes, LUISS Guido Carli University & Rosaria Silipo, KNIME

Workflows on the KNIME Community Hub: Training a Churn Predictor and Deploying a Churn Predictor.

While we are not sure which data analytics task is the oldest, prediction of customer
churn has certainly been around for a very long time. In customer intelligence, “churn”
refers to a customer who cancels a contract at the moment of renewal. A company’s
CRM is often filled with such data. Identifying which customers are at risk of churning
is vital for a company’s survival. Because of this, churn prediction applications are very
popular, and were among the earliest data analytics applications developed.
Here we propose a classic churn prediction application. The application consists of
two workflows: a training workflow shown in the first figure, in which a supervised
machine learning model is trained on a set of past customers to assign a risk factor to
each one, and a deployment workflow shown in the second figure, in which that trained
model assigns a churn risk to new customers.

The training workflow trains a few random forest models to assign churn risk.


The deployment workflow calculates the churn risk of new customers.

Customer data
Customer data usually include demographics (e.g., age, gender), revenues (e.g., sales
volume), perceptions (e.g., brand liking), and behaviors (e.g., purchase frequency).
While “demographics” and “revenues” are easy to define, the definitions of behavioral
and perception variables are not always as straightforward since both depend on the
business case.
For this solution, we rely on a popular simulated telecom customer dataset, available
via Kaggle. In our effort to provide a broader overview of KNIME functionality, we split
the dataset into a CSV file (which contains operational data, such as the number of
calls, minutes spent on the phone, and relative charges) and an Excel file (which lists
the contract characteristics and churn flags, such as whether a contract was
terminated). Each customer can be identified by an area code and phone number. The
dataset contains data for 3,333 customers, who are described through 21 features.

Preparing and exploring the customer data


The training workflow in the figure above starts by reading the data. Files with data can
be dragged and dropped into the KNIME workflow. In this case we’re using XLS and
CSV files, but KNIME supports almost any kind of file (e.g., parquet, JSON).
The second step consists of joining the data from the same customers in the two files,
using their telephone numbers and area codes as keys. In the Joiner node, users can
specify the type of join (inner, right, left, full).


After that, the “Churn” column is converted into string type with the Number to String node, since the upcoming classification algorithm requires a nominal target. Note that KNIME offers a series of nodes to manipulate data types (e.g., string to date, or vice-versa).
Before continuing with further preparation steps, it is important to explore the dataset
via visual plots or by calculating its basic statistics. The Data Explorer node (or else
a Statistics node) calculates the average, variance, skewness, kurtosis, and other
basic statistical measures, and at the same time it draws a histogram for each feature
in the dataset. Opening the interactive view of the Data Explorer node reveals that the
churn sample is unbalanced, and that most observations pertain to non-churning
customers (over 85%), as expected. There are typically far fewer churning than non-churning customers. To address this class imbalance, we use the SMOTE node, which oversamples the minority class by creating synthetic examples. Notice that the execution of the SMOTE procedure is very time- and resource-consuming. It is feasible here because the dataset is quite small.

Training and testing the customer churn predictor


For a classification algorithm, we choose the Random Forest, implemented by the Random Forest Learner node, with 5 as the minimum node size (the minimum number of observations per node) and 100 trees. The minimum node size controls the depth of the decision trees, while a larger number of trees reduces the variance of the ensemble. Any other supervised machine learning algorithm would have also worked, from a simple Decision Tree to a Neural Network. The Random Forest is chosen for illustrative purposes, as it offers a good compromise between complexity and performance.
A Random Forest Predictor node, which relies on the trained model to predict patterns
in the testing data, follows. The predictions produced by the node will be consumed by
an evaluator node, like a Scorer node or an ROC node, to estimate the quality of the
trained model.
In order to increase the reliability of the model quality estimation, the whole learner-
predictor block was inserted into a cross-validation loop, starting with an X-
Partitioner node, and ending with an X-Aggregator node. The cross-validation loop
was set to run a 5-fold validation. This means it divided the dataset into five equal
parts, and in each iteration, it used four parts for training (80% of the data) and one
part for testing (20% of the data). The X-Aggregator node collects all predictions on
the test data from all five iterations.
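For readers who prefer to see the recipe in code, here is a rough scikit-learn/imbalanced-learn equivalent of the training branch (SMOTE oversampling, a 100-tree random forest, 5-fold cross-validation). It is only an approximation of what the KNIME nodes do: the file and column names are assumptions, and SMOTE is applied inside the cross-validation pipeline (i.e., only on the training folds) rather than once on the whole training set.

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical input: the joined customer table with a binary "Churn" column
data = pd.read_csv("telecom_churn.csv")
X = pd.get_dummies(data.drop(columns=["Churn"]))   # one-hot encode categoricals
y = data["Churn"].astype(int)                      # scikit-learn accepts 0/1 directly

# SMOTE + random forest; min_samples_leaf loosely mirrors the "minimum node size" setting
model = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("rf", RandomForestClassifier(n_estimators=100, min_samples_leaf=5,
                                  random_state=42)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pred = cross_val_predict(model, X, y, cv=cv)
proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]

print("Accuracy:", accuracy_score(y, pred))
print("AuC:", roc_auc_score(y, proba))
```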
The Scorer node matches the random forest predictions with the original churn values
from the dataset and assesses model quality using evaluation metrics such as
accuracy, precision, recall, F-measure, and Cohen’s Kappa. The ROC node builds and
displays the ROC curve associated with the predictions and then calculates the Area under the Curve (AuC) as a metric for model quality. All these metrics range from 0 to
1; higher values indicate better models. For this example, we obtain a model with 93.8%
overall accuracy and 0.89 AuC. Better predictions might be achieved by fine-tuning the
settings in the Random Forest Learner node.

Deploying the customer churn predictor


Once the model has been trained and evaluated and the researcher is satisfied with its predictive accuracy, it can be applied to new, real-world customer data for actual churn prediction. This is the task of the deployment workflow.
The best-trained model —which in this case turned out to be the one from the last
cross-validation iteration— is read (Model Reader node), and data from new customers
are acquired (CSV Reader node). A Random Forest Predictor node applies the trained
model to the new data and produces the probability of churn and the final churn
predictions for all input customers. The workflow concludes with a composite view,
produced with the “Churn Visualization” component.
The composite view of the “Churn Visualization” component shows the churn risks, as bars and as tiles, for the five new customers read from the CSV file. It predicts that four customers will not churn (blue), and one will (orange). All the items in a composite view are connected; selecting a tile prompts the selection of the corresponding bar in the chart, and vice-versa.

The interactive dashboard of the deployment workflow, reporting the churn risk in orange
for all new customers.
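The deployment step has an equally small code counterpart. The sketch below assumes the fitted pipeline from the previous snippet was persisted with joblib after training; the file names and the column-alignment step are illustrative.

```python
import joblib
import pandas as pd

# Assumes the trained pipeline was saved earlier, e.g. joblib.dump(model, "churn_model.joblib")
model = joblib.load("churn_model.joblib")

new_customers = pd.read_csv("new_customers.csv")   # hypothetical file of new customers
X_new = pd.get_dummies(new_customers)
# In practice, align the dummy-encoded columns with those seen during training, e.g.:
# X_new = X_new.reindex(columns=training_columns, fill_value=0)

churn_risk = model.predict_proba(X_new)[:, 1]
print(new_customers.assign(churn_risk=churn_risk))
```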


Conclusions
We have presented here one of the many possible solutions for churn prediction based
on past customer data. Of course, other solutions are possible, and this one can be
improved.
After oversampling the minority class (churn customers) in the training set using the
SMOTE algorithm, a Random Forest is trained and evaluated on a 5-fold cross-
validation cycle. The best trained Random Forest is then included in the deployment
workflow.
The deployment workflow applies the trained model to new customer data, and
produces a dashboard to illustrate the churn risk for each of the input customers.


Attribution Modeling

Author: Anthony Ballerino, Sapienza University of Rome

Workflow on the KNIME Community Hub: Touch-based, Correlation and Regression, Shapley-based, and
Randomized field experiments

Attribution modeling is the practice of allocating appropriate relevance to each marketing touchpoint across all online and off-line channels and mapping these to
monetarily relevant events (e.g., a purchase) within a customer journey. A touchpoint
is the moment where a customer interacts with a business, e.g., browsing the e-
ecommerce website, clicking on a banner, etc. A channel is the medium through which
this interaction happens, e.g., radio, social media, etc.
For example, should a business that wants to advertise its product to secure high sales invest in advertising (i.e., the touchpoint) on television (i.e., the channel), in distributing flyers, or in paying for banner displays? Is a combination of these actions more effective, or is one of them alone the key to success? Answering these questions can be challenging because consumer purchasing behavior is subject to constant evolutions in the socio-cultural and technological environment.
Using attribution modeling, however, marketers can understand which (combination
of) touchpoint(s) and channel(s) play a key role in leading to a desired customer action
–typically, a conversion or a purchase– and best allocate resources for advertising
campaigns. Attribution modeling is also valuable to measure the incremental effect of
each touchpoint and predict the probability of driving profitable customer action based
on the customer’s path. Lastly, this approach can also serve as a budget and effort
rationalization tool, as it helps marketers identify touchpoints and channels that are
less effective, redundant, or even driving customers away.
There exist many different types of attribution models, and although all take channels
and touchpoints into account, the way each of them weighs those channels and
touchpoints differs. Some of the most common include:

• First-touch attribution. It gives all the credit to the first touchpoint/channel a customer comes into contact with before a purchase is made.

• Last-touch attribution. It gives all the credit to the last touchpoint/channel a customer comes into contact with before a purchase is made.

• Linear attribution. It gives equal credit to all touchpoints/channels a customer comes into contact with before a purchase is made.

• Position-based attribution. It gives credit to specific touchpoints/channels in the conversion path, typically the first and last.

• Time-decay attribution. It gives credit to all of the touchpoints/channels a customer comes into contact with before a purchase is made and also considers the time that each touchpoint occurred. The touchpoints that happened closest to the time of conversion are weighted most heavily.
In this section, we will use KNIME Analytics Platform and the nodes of the KNIME R integration to build a low-code solution that marketers can use to measure the effects of different channels and allocate resources accordingly. After exploring customer journey data both visually and with descriptive statistics, the workflow replicates the first part of the chapter "Attribution Modeling" by de Haan 9 and implements several attribution methods: from the most basic models (i.e., first- and last-touch attribution) to more complex ones (i.e., Shapley values-based attribution and randomized field experiments). Finally, we conclude the workflow with an interactive dashboard that reports the results for each model.

Four approaches to attribute profitable customer action

Following the work by de Haan, the workflow is divided into different sections: from the basic touch-based
attribution models to the more complex randomized field experiments.

The KNIME workflow to determine the marketing touchpoint(s) a customer has encountered before making a purchase implements four attribution models presented
in ascending order of complexity:

• Touch-based attribution.

• Correlation and regression attribution.

• Shapley values-based attribution.

• Randomized field experiments.

9 de Haan, E. (2022). Attribution Modeling. In: Homburg, C., Klarmann, M., Vomberg, A. (eds) Handbook of Market Research. Springer, Cham. https://doi.org/10.1007/978-3-319-05542-8_39-1.


Data ingestion and exploration

The customer journey dataset used for the analysis has been artificially generated by
de Haan using R and contains 50k customer journeys, with approximately 25% of the
journeys resulting in a purchase. The length of each journey ranges between 1 and 50
touchpoints. The dataset includes in total eight unique touchpoints, such as banner
impressions, e-mails, website visits, and clicks to search-engine advertising for brand-
and product-related keywords.
We start off by reading in the dataset and performing some simple aggregations to
represent the distribution of the touchpoints. Additionally, we use the Data Explorer
node to compute some descriptive statistics (e.g., mean, median, standard deviation,
min, max, etc.).
To visually explore the dataset, we use the Bar Chart node to plot the most frequent
touchpoints between a company and its (potential) customers. It turns out that for
most customers visiting the website was the most frequent touchpoint. Furthermore,
using the Sunburst Chart node, we observe that about 70% of the total interactions
occurred either via direct visit to the company website or via banner advertising.

Total occurrences per touchpoint. Notice the interactive selection across plots.

Touch-based attribution

Let’s now delve into the core of the workflow and see how we have implemented the
first three basic touch-based attribution models:

• First-touch attribution.

• Last-touch attribution.

• Average-touch attribution.
For the first two models, mapping the attribution is straightforward since the dataset contains the columns “First channel” and “Last channel”. All we need to do is sum up the purchases for these two columns separately, and sort the total purchases in descending order. This can be easily done with the GroupBy and Sorter nodes. With last-touch attribution, we conclude that a sale can only be attributed to a channel that leads to a “Direct visit”. On the other hand, with first-touch attribution, “Banner impression” can also get credit for a conversion.
Attributing credit to a touchpoint with the average attribution method is less intuitive.
We first need to compute the weight of a touchpoint in each single customer journey.
To do that, we divide each touchpoint by the total number of touchpoints in each
customer journey and multiply the result by the “Purchase” column, a dummy variable
indicating if the path to purchase ends with a purchase (1) or not (0). This calculation
is implemented in KNIME using just two nodes: the Math Formula (Multi Column) and
the GroupBy node. Also in this case, direct website visits and banner impressions get
the most credit. The three bar charts below provide a snapshot of the results for each
method.

The bar charts display the touchpoints that get the most credit for a conversion according to the three
different touch-based attribution models.
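As a rough code-level illustration of the three touch-based models, the pandas sketch below assumes one row per journey with columns such as “First channel”, “Last channel”, “Purchase”, “Amount touchpoints” and one count column per touchpoint; the file and column names are assumptions based on the description above, not the exact schema of de Haan’s dataset.

```python
import pandas as pd

journeys = pd.read_csv("customer_journeys.csv")    # hypothetical file name
touchpoints = ["Banner impression", "Banner click", "Email received",
               "Direct visit", "Product search", "Brand search"]   # assumed subset

# First- and last-touch attribution: total purchases per first/last channel
first_touch = journeys.groupby("First channel")["Purchase"].sum().sort_values(ascending=False)
last_touch = journeys.groupby("Last channel")["Purchase"].sum().sort_values(ascending=False)

# Average-touch attribution: weight each touchpoint by its share of the journey,
# counted only for journeys that end in a purchase
weights = journeys[touchpoints].div(journeys["Amount touchpoints"], axis=0)
average_touch = weights.mul(journeys["Purchase"], axis=0).sum().sort_values(ascending=False)

print(first_touch, last_touch, average_touch, sep="\n\n")
```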

Correlation and regression attribution

Another, more statistics-driven way of determining attribution is by looking at correlations or estimating a logistic regression model. In such a way, we can explicitly
relate the touchpoints a customer has come into contact with to the dependent
variables of interest, e.g., a purchase.
We begin by creating a correlation matrix with the Linear Correlation node. By
inspecting the matrix, we can easily observe statistical relationships between the
variables, and identify those that are highly correlated with our variable of interest,
“Purchase”.


Although the Linear Correlation node already produces a local view, we prefer to
display the matrix with the Heatmap node, which creates an interactive visualization
that can be integrated in the component composite view. We see that “Purchase”
correlates positively (although weakly) with most other variables. The strongest
correlations can be observed with “Direct visit” and “Amount touchpoints”. While
interesting, correlations have to be interpreted with caution since they only tell us
something about the relationships between variables, not about their causality.
Nevertheless, correlation tables are a convenient way to identify soft patterns in the
data, and a popular starting point for more sophisticated analyses.

The correlation matrix above is the result of the Linear Correlation and Heatmap
node.

A sounder way of investigating the associations between variables is by estimating a regression model. The advantage of using a regression model over correlations is the
possibility to control for additional variables simultaneously, which takes us a step
closer to finding causal relationships. Since the response variable “Purchase” is binary,
a logistic regression is a very suitable choice. To predict purchase likelihood, we build
two models. Model 1 only looks at the last channel used, and “Banner click” is here the
reference category, meaning that the interpretation of the parameters is relative to this
touchpoint. In addition to the above, in Model 2 we also control for some customer
specific variables, such as customer lifetime value (CLV) and “Relation length”.
In KNIME Analytics Platform, we have the option to use the Logistic Regression
Learner node or to copy the R script provided by de Haan in the R Snippet node. For
model 1, we implement both options in parallel, whereas for model 2 we opt for the R
script. Additionally, we ensure script portability with the Conda Environment
Propagation node. For the sake of visual consistency with de Haan’s work, we format the summary table with the logistic regression coefficients returned by the R Snippet
node and display it with the Table View node.
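A Python alternative to the R Snippet, for readers who want to reproduce Model 1 outside of KNIME, could look like the statsmodels sketch below. The column names and the dummy label for the reference category “Banner click” are assumptions that mirror the description above.

```python
import pandas as pd
import statsmodels.api as sm

journeys = pd.read_csv("customer_journeys.csv")    # hypothetical file name

# Dummy-code the last channel used and drop the assumed reference category
last_channel = pd.get_dummies(journeys["Last channel"], prefix="last")
last_channel = last_channel.drop(columns=["last_Banner click"]).astype(float)

X = sm.add_constant(last_channel)
y = journeys["Purchase"]

model1 = sm.Logit(y, X).fit()
print(model1.summary())    # coefficient estimates, standard errors, z-scores
print(model1.conf_int())   # confidence intervals, as reported in de Haan's tables
```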
In model 1, consistently with the results obtained with last-touch attribution, we
observe that “Direct visit” has a strong and positive estimate, indicating that there is a
higher chance for conversion associated with this touchpoint than with “Banner click”
whenever “Direct visit” is the last touchpoint in a customer’s journey. On the other hand,
“Banner impression” and “Email received” have strongly negative estimates, hence
they don’t directly relate to a conversion. In model 2, “Direct visit” becomes statistically
insignificant in favor of “CLV” and “Relation length”. The underlying reason might be
that direct website visits are more likely to happen among loyal and long-existing
customers, who also have a higher chance of conversion.

Logistic regression coefficients for model 1 and 2 (incl. confidence intervals) when we look at the last
channel used. “Banner click” is used as the reference category.

Note. The Logistic Regression Learner node outputs coefficient estimates, standard errors, and z-scores. The code in the R Snippet outputs a table containing
coefficient estimates with their significance level (either “*”, “**” or “***”), and
confidence intervals of the regression coefficients in parentheses.


Regression models are more suitable for attribution modeling than touch-based or
linear correlation attribution since we are able to control for several variables at the
same time. However, they still remain unfit to shed light on causal relationships among
touchpoints and channels –for example, in the event that a particular channel or
touchpoint does not occur.

Shapley values-based attribution

Another approach to the attribution problem that has become more popular than correlations and regressions in recent years is Shapley values-based attribution. This approach compares similar customer paths to purchase, with the only difference being that in some paths a specific touchpoint is not included. Shapley values-based attribution also allows us to answer the following question: “How would the outcome change if a specific touchpoint was not included in a specific customer’s journey?”. For example, let’s take the following path as the starting point to compare similar paths:
Banner impression → Product search → Brand search

The way this approach works is very intuitive. Firstly, all the observations (customer journeys) that correspond to this path are extracted from the dataset and the average purchase probability is computed. Afterwards, three sub-paths are obtained by removing one of the three touchpoints at a time. The average conversion probability is then computed for each of these three sub-paths. For
the example path above, the sub-paths are:
1. Banner impression → Product search
2. Banner impression → Brand search
3. Product search → Brand search
Implementing this procedure is really simple and intuitive in KNIME. We use a series
of Rule-based Row Filter nodes to extract the journey of interest and the
corresponding sub-paths. Next, we use the GroupBy nodes to compute the conversion
probability of each path, and with the Math Formula (Multi Column) we obtain the
difference between the purchase probability of the journey of interest and the purchase
probability when excluding a focal touchpoint. In this way, we are able to identify the
incremental probability of conversion across similar paths.
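The same comparison can be mimicked in a few lines of pandas, assuming each journey stores its ordered touchpoints in a single “Path” column (e.g., “Banner impression > Product search > Brand search”) next to the binary “Purchase” column; both the encoding and the column names are assumptions made for illustration.

```python
import pandas as pd

journeys = pd.read_csv("customer_journeys.csv")    # hypothetical file name
full_path = ["Banner impression", "Product search", "Brand search"]

def conversion_rate(path):
    """Average purchase probability of all journeys matching the given path."""
    mask = journeys["Path"] == " > ".join(path)
    return journeys.loc[mask, "Purchase"].mean()

base_rate = conversion_rate(full_path)
for i, touchpoint in enumerate(full_path):
    sub_path = full_path[:i] + full_path[i + 1:]           # leave out one touchpoint
    incremental = base_rate - conversion_rate(sub_path)    # credit for that touchpoint
    print(f"{touchpoint}: {incremental:+.2%} incremental conversion probability")
```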


Computing Shapley-value based attribution in KNIME Analytics Platform.

In the table below, we can see that in 16.28% of the cases the full path results in a conversion. If we exclude “Banner impression”, the path results in a conversion in only 7.49% of the cases, meaning that whenever we leave out “Banner impression” the conversion probability drops by 8.79 percentage points. This difference also represents the credit that “Banner impression” should get for driving profitable customer action.
With Shapley-value based attribution, it is indeed possible to explain the contribution
of each touchpoint in a particular consumer journey.

The table shows the incremental effect of each touchpoint in the customer path to purchase.

Randomized field experiments

The last approach to attribution modeling proposes the use of field experiment data to
understand the impact a channel/touchpoint has whenever one or multiple of them are
excluded for some groups of customers. To do that, consumers are randomly placed
into two groups: one group is exposed to the channel/touchpoint of interest (i.e., the
treatment group), and one group is not exposed to this channel/touchpoint (i.e., the control group). The dataset used by the authors in the book contains two different
randomized field experiments.
In the first experiment, the treatment group (80% of the customers) encountered a firm
banner, whereas the control group (the other 20% of the customers) encountered an
unrelated banner, from a charity organization in this case. The fact that the alternative
banner is unrelated to the firm allows us to consider its causal effect on firm
performance and customer behavior as null.
In the second field experiment, flyers are distributed to consumers in randomly
selected regions. The treatment group received the flyer, whereas the control did not.
Due to the random allocation, the experiment setup leads to a 50% split. Because
information about the region in which customers live is available, we know exactly
which customers received the flyer. Hence, this experiment allows us to investigate
the effect of a channel/touchpoint at the individual customer level.
We can furthermore investigate if there are synergy effects between the banner ads
and the flyer. For example, does being exposed to both advertising forms increase the
purchase likelihood relative to the two individual effects (positive synergy)? Or do they
weaken each other since they might be substitutes (negative synergy)? A preliminary step should verify whether consumers who are in the firm’s banner group and who have received a flyer indeed have a higher likelihood of purchasing.
The KNIME implementation follows once again de Haan’s work and produces several
interactive charts to visualize the conversion rate according to the channel/touchpoint
and group of belonging (either the control or the treatment group). Furthermore, we
inspect the synergy effects between variables both visually and by estimating new
logistic regression models. To do so, we rely both on KNIME’s native JavaScript-based
Views and Logistic Regression Learner nodes, as well as on the R integration for the
sake of exact display of plots and regression coefficients.
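In code, the group comparison and the synergy check can be approximated with a groupby and a logistic regression with an interaction term. The binary columns “Banner”, “Flyer” and “Purchase” below are illustrative stand-ins for the dataset’s actual treatment indicators.

```python
import pandas as pd
import statsmodels.formula.api as smf

experiment = pd.read_csv("field_experiment.csv")   # hypothetical file name

# Conversion rate for treatment (1) vs. control (0) in each experiment
print(experiment.groupby("Banner")["Purchase"].mean())
print(experiment.groupby("Flyer")["Purchase"].mean())

# Logistic regression with an interaction term to test for synergy effects
synergy_model = smf.logit("Purchase ~ Banner * Flyer", data=experiment).fit()
print(synergy_model.summary())
```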
In the charts below, we can see the difference between customers in the treatment
group (1) vs. the control group (0) for banner advertising (left) and flyer region (right).
We observe that the conversion rate is 28% for the treatment group vs. 17% for the
control group when banner advertising is used, indicating a strong effectiveness on
purchase likelihood. Similarly, when flyers are used, the conversion rate is considerably
higher for the treatment group (33%) than for the control group (18%).


Conversion rates when using a banner (left) or a flyer (right) for different groups: treatment (1), control (0).
Interactive bar charts using the Bar Chart node (top); static error bars using the R View (Table) node
(bottom).

We can also visualize the synergy effect of banner advertising and the flyer. For customers who are not in the firm’s banner group, distributing a flyer increases the purchase likelihood from 13.61% to 20.26%, i.e., an increase in conversion of 6.65 percentage points. For customers who are in the firm’s banner group, distributing a flyer increases the purchase likelihood from 19.70% to 35.91%, i.e., an increase in conversion of 16.21 percentage points. In other words, when the firm distributes a flyer, the banner becomes more effective.
More insights about the synergy effects can be obtained from the wealth of additional
plots and regression coefficient tables, which can all be visualized and interacted with
in the final dashboard.


Banner and flyer synergy visualized using the Bar Chart node (left), and the R View (Table) node (right).

A low-code approach to power attribution modeling


Attribution modeling is an important field of research for marketers, as it allows them
to give the correct credit to the contribution made by each channel/touchpoint in a
customer’s journey to purchase.
In this section, we described how we replicated the original analysis conducted by de
Haan using a low-code approach with KNIME. This open-source and free tool allowed
us to build a workflow that implements in a user-friendly way four different attribution
models with increasing complexity.
The results are presented in a comprehensive and interactive dashboard to enable
marketers to easily understand which channel/touchpoint combination contributes the
most to conversion, and to allocate budget more effectively.

Marketing Mix

In this chapter, we will focus on strategies to understand how pricing metrics affect
our business –be it to reduce customer churn, increase market share or drive
profitability. More specifically, we will look at a use case to perform price optimization
using two different approaches.

This chapter includes the article:

• Pricing Analytics, p. 128


– Anil Özer, KNIME
– Roberto Cadili, KNIME


Pricing Analytics

Authors: Anil Özer & Roberto Cadili, KNIME

Workflow on the KNIME Community Hub by STAR COOPERATION: Price optimization: value-based pricing
and regression

Pricing analytics is defined as the combination of tools and metrics used to understand how pricing metrics affect business, determine revenue and profitability at
specific price points, and optimize a business’ pricing strategy for maximum revenue.
For example, leveraging data from consumer behavior and market research, a
company can use pricing analytics to determine the optimal price for a product or
service that ensures higher product margins, reduces customer churn, and increases
market share within a market.
Companies across every business sector and industry vertical –from manufacturing
and distribution to retail and e-commerce– can benefit from pricing analytics to
improve profitability. Some of the key reasons why pricing analytics helps companies
increase profits and their potential to grow include: the acquisition of new customer
insights, the identification of quick pricing wins, the recognition of which pricing
strategy is most effective, the optimization of pricing for value and the improvement
of operational efficiency. On average, a price increase of 1% leads to a profit increase
of approximately 9% (STAR Cooperation project results). Furthermore, pricing
measures require low investments and show fast results.
Although pricing is an inexpensive measure, making pricing decisions is not trivial: they are both highly sensitive and complex. To ensure effective pricing, decisions should take three perspectives into account: company, customer, and competition. To operationalize that, companies should track some key metrics:

• Willingness to pay. It refers to the maximum amount a customer is ready to pay for a product/service. This metric helps understand a product’s perceived value, support competitive benchmarking, and assess whether substantial development effort is yielding the desired augmented product value.

• Feature value. It measures which features are more or less important to customers relative to other features. This metric helps prioritize development efforts and better understand willingness to pay.

• Average revenue per user (ARPU). It is a measure of the revenue generated each
month (or for a different predefined period) from each user. It is calculated by
dividing the total monthly recurring revenue (MRR) by the total number of customers. It tells whether a chosen pricing strategy suits the market and allows
the company to stay competitive.

• Customer lifetime value (CLV). Together with customer acquisition cost (CAC),
CLV measures whether the investment to drive and keep customers is profitable.
If CAC outweighs CLV, the costs are jeopardizing the revenues.
Based on the insights extracted from the metrics above, the best pricing strategy can
be applied and adjusted. Some of the most commonly used ones include:

• Cost-plus pricing. With this strategy, production costs (i.e., direct material cost,
direct labor cost, overhead costs, etc.) are summed up and added to a markup
percentage in order to set the final price of the product. This approach can
provide a good starting point, but it is usually not comprehensive enough to
inform a thorough pricing strategy.

• Penetration pricing. This strategy sets the price very low in order to attract new
customers and gain substantial market share quickly. Because pricing low
sacrifices profitability, it is feasible only for a short period of time until market
share is gained.

• Cream pricing. This strategy aims at reaching profitability not by high sales, but
by selling the product at a high price. It is usually used to target early adopters
and for a limited duration to recover the cost of investment of the original
research into the product.

• Value-based pricing. This strategy prices the product based on the value the
product has for the customer and not on its costs of production, under the
condition that the former is considerably higher than the latter. To apply this
strategy, it’s essential to understand customers’ perception, preferences and
alternatives.

• Time-based pricing. This strategy looks at market fluctuations and customer data (e.g., past purchases, amount spent, geographical location, etc.) to dynamically adjust the prices of identical goods to correspond to a customer's willingness to pay.
A notable example of pricing analytics is Netflix. This streaming service has been using
pricing analytics for years to stay on top of the video streaming market competition. It
regularly collects data from various sources and uses this data to inform its pricing
decisions. By analyzing both customer behavior and competitor pricing, Netflix
understands how price changes impact its customers and adjusts its pricing suitably.
Thanks to the abundance of data, businesses can go beyond traditional cost-plus
pricing methods and analyze prices in a more complex and nuanced manner to make
informed pricing decisions. Using KNIME Analytics Platform, we will merge, group,
transform and visualize product, price, and sales data to understand which factors
influence sales performance on a granular level. Next, we will integrate and compare competition information to discover the need for price adaptation. Finally, we will
optimize prices systematically using either value-based pricing or linear regression.
While a suitable analytical tool is fundamental, pricing expertise is needed to check the
plausibility of new prices and to weigh up different pricing measures.

Four steps to a long-lasting increase in sales and profit

Price optimization: value-based pricing and regression workflow.

After a short introduction to Pricing Analytics, let’s now delve into the practical
implementation of a price optimization workflow using KNIME Analytics Platform. The
original workflow was developed by STAR COOPERATION. We'll guide you through a
seamless, codeless solution comprised of four steps:
1. Data preparation and analysis for pricing.
2. Integration of competition information.
3. Price optimization with value-based pricing.
4. Price optimization with linear regression.

1. Data preparation and analysis for pricing

To kick things off, we’ll start by reading a practice dataset containing product and order information of an e-commerce shop. To do this, we’ll use two distinct types of reader nodes: the Excel Reader node to access product prices and categories, and two separate CSV Reader nodes to retrieve orders spanning different years (2017-2019).
Next, we’ll perform a few data merging and joining operations. With the Concatenate
node, we bring together orders from 2017-2018 and 2019, and we join them with the
product price and category spreadsheet using the Joiner node.
For data transformation, we use the Row Filter node to eliminate irrelevant information,
such as canceled orders and returns. Then, we isolate the year of the order with the
String Manipulation node, and calculate the turnover (sale price x ordered quantity)
using the Math Formula node. We filter out articles with missing turnover information and use the GroupBy node to compute for each Article ID either the mean or the first occurrence of each feature (e.g., Tax, Shipping Time, Length, Article Name, etc.).
With our data fully prepared, we're now almost ready to bring the data to life through
visualization. Before connecting the view nodes, we perform yet another aggregation
by feeding the data into two different Pivoting nodes. The first one will provide us with
the turnover per product category and year, while the second will give us the sales per
year and product category.
Subsequently, in the “Prepare for visualization” metanode, we bring the data into a suitable shape for the plots and use the Color Manager node to assign colors to each product category. To display turnover trends for each product category (x-axis) over the years (y-axis), we employ the Line Plot node. We can see that ironware purchases reached a peak in 2017, generating a turnover of about 19000 euros. In contrast, in 2018 ironware purchases yielded only 1500 euros.

Turnover trends for each product category and year.

To showcase sales per product category (y-axis) and year (x-axis), we create a grouped
bar chart using the Bar Chart node. Consistent with the line plot above, we observe
high sales for ironware in 2017. However, we can see that despite high sales for
storage in 2018, these generated a fairly low turnover.


Sales per year and product category.

2. Integration of competition information

To gather information on competitors, we built a web crawling system that collects product prices from ten different competitors and stores them in an Excel file. With the
Excel Reader node, we can easily ingest it in our KNIME workflow.
With the data already neatly prepared and transformed, we can effortlessly join
information about our competitors with the aggregated product-order dataset. The
resulting table provides a much broader overview of the market in which our company
operates and will establish the basis for the applications of value-based pricing and
price optimization with linear regression.

Compare pricing of the article “screwdriver” relative to ten competitors.


The table produced by this joining operation can also serve as a useful source for creating visualizations and gaining further insights. For instance, we can compare and visualize how we price the article “screwdriver” relative to our competitors using the Box Plot node. This box plot provides a nice statistical overview of the article-price range, where we can observe that the average price of our article is comparatively low (8.30 euros), almost half of our competitors’ median price (14.5 euros).

3. Price optimization with value-based pricing

Price optimization with value-based pricing requires extensive domain knowledge and
pricing expertise. Identifying which factors make sense for this approach varies
according to the industry and product/service. Likewise, correctly devising a complex
set of rules that best capture a customer’s perceived product value, preferences and
alternatives often requires the manual setting and fine-tuning of thresholds and
conditions.
We start off by employing two separate Math Formula nodes to calculate sales
development as a percentage in 2017-2019 and determine the average competition
price. To handle missing values in sales development, we use the Rule Engine node to
assign a set of values whenever specific user-defined conditions are met.
Next, we employ in parallel a series of Rule Engine nodes to assign scores for sales
development, competitive pressure and product value based on user-defined
thresholds or labels. An overall score is then computed as the weighted sum of the
scores above with importance weights. Thanks to this new metric, we can define new
rules and intervals to assign price adjustments.
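To make the scoring logic more concrete, here is a small numpy/pandas sketch of a weighted scoring scheme of the kind described above. All thresholds, weights and column names are purely hypothetical; in the actual workflow they are set and fine-tuned by a pricing expert in the Rule Engine and Math Formula nodes.

```python
import numpy as np
import pandas as pd

products = pd.read_csv("products_with_competition.csv")   # hypothetical file name

# Hypothetical rule-based scores (1 = unfavorable, 3 = favorable)
sales_score = np.select(
    [products["SalesDevelopmentPct"] > 10, products["SalesDevelopmentPct"] > 0],
    [3, 2], default=1)
competition_score = np.select(
    [products["SalePrice"] < 0.9 * products["AvgCompetitionPrice"],
     products["SalePrice"] <= products["AvgCompetitionPrice"]],
    [3, 2], default=1)

# Overall score as a weighted sum of the individual scores (hypothetical weights)
overall = 0.6 * sales_score + 0.4 * competition_score

# Map the overall score to a price adjustment and compute new prices and turnover
adjustment = np.select([overall >= 2.5, overall >= 1.5], [1.05, 1.02], default=0.98)
products["NewSalePrice"] = products["SalePrice"] * adjustment
products["NewTurnover"] = products["NewSalePrice"] * products["OrderedQuantity"]
print(products[["SalePrice", "NewSalePrice", "NewTurnover"]].head())
```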
Following this, we wrap in a component a series of Math Formula nodes to calculate
new sales prices, review the contribution margin, and obtain new turnover figures. The
results of this update are aggregated by summing the values of the ordered quantity,
current and new turnover for each product category. From here, we can calculate the
turnover development as a percentage to gain further insights into the data.

Implementation of value-based pricing.

With the exception of ironware, the value-based pricing that we implemented should ensure turnover increases between 0.13% and 5.33% depending on the product category.

Resulting turnover increases by product category.

4. Price optimization with linear regression

The second approach to price optimization relies on a simple linear regression analysis
to predict future sale prices. Unlike value-based pricing, this approach allows for a
greater degree of automation and removes several manual and fine-tuning steps.
Since our dataset contains time information (i.e., Order Year), we cannot use the
Partitioning node to split our dataset into training and test set, as doing so would cause
a data leakage problem. Rather, we sort Order Year in ascending order and use the
Rule-based Row Splitter node to divide our data into a training set for 2017-2018 and
a test set for 2019.

We then feed the training set into the Linear Regression Learner node to train our
statistical model, and we apply the latter to test data using the Regression Predictor
in order to make sale price predictions.
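In plain Python, the time-aware split and the regression step could be sketched roughly as follows; the predictor and target column names are assumptions, since the exact feature set depends on the prepared dataset.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

orders = pd.read_csv("prepared_orders.csv")        # hypothetical file name
features = ["OrderedQuantity", "AvgCompetitionPrice", "Tax"]   # assumed predictors

# Time-based split to avoid data leakage: train on 2017-2018, test on 2019
train = orders[orders["OrderYear"] <= 2018]
test = orders[orders["OrderYear"] == 2019]

model = LinearRegression().fit(train[features], train["SalePrice"])
test = test.assign(PredictedSalePrice=model.predict(test[features]))

# New turnover figures based on the predicted sale prices
test["NewTurnover"] = test["PredictedSalePrice"] * test["OrderedQuantity"]
print(test[["PredictedSalePrice", "NewTurnover"]].head())
```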
Once the predictions have been output, we can proceed as in the previous approach
and compute the new contribution margin and turnover figures. Finally, as a last step
in this workflow, we aggregate the results to inspect current vs. new turnover by
product group, and express turnover development as a percentage. Using regression
analysis for price optimization, we should expect a turnover increase of 10% for all our
product categories.
As a final remark, it’s worth noting that this example can be expanded to achieve higher robustness and reliability by inspecting in advance which independent variable(s) explain current sale prices.


Implementation of price optimization with linear regression.

Resulting turnover increases by product category.

Pricing analytics to lift off your business


With this article, we successfully explored how to conduct pricing analytics utilizing
KNIME Analytics Platform and a no-code approach.
We started off by ingesting product prices and categories data as well as order data
from various years. After a process of transformation, aggregation and visualization,
we merged our product data with competitor information.
Finally, we used two separate methods for price optimization: value-based pricing and
linear regression. The former required a considerable amount of domain expertise and
manual tuning, whereas the latter lent itself more easily to automation. Both methods
offered insightful results and could help make well-informed decisions to increase
profitability and potential for growth.

Customer Valuation

In this chapter, we will understand how to capture and properly account for the value each customer has to our business by monitoring a few key indicators. More specifically,
we will look at use cases to measure customer lifetime value, and compute recency,
frequency and monetary value scores.

This chapter includes the articles:

• Customer Lifetime Value, p. 137


– Jie Zhao, DVW Analytics

• Recency, Frequency and Monetary Value, p. 147


– Thomas Hopf, DER Touristik
– Alina Chrysa Nikolaidou, DER Touristik


Customer Lifetime Value

Author: Jie Zhao, DVW Analytics

Workflow on the KNIME Community Hub: SAP ECC Customer Life Time Value Analysis

Customer Lifetime Value (CLV) is a measurement of how valuable a customer is to your company, not just on a purchase-by-purchase basis but across the whole relationship. It’s an important metric for any sales team, as understanding the sales potential of existing customers and increasing their value is a great way to drive sales revenue growth.
The importance of CLV is widely recognized by many successful companies across
many different industries as it gives Marketing and Sales departments alike the ability
to segment customers by the cost of acquisition and revenue into different profitability
groups. This allows businesses to focus marketing and sales budget on the right
customer segment which can yield the best bang for your buck. Here are a few key
reasons to track and use CLV:

• Improvement comes from measurement. Measuring CLV and breaking down its
various components empowers businesses with new valuable insights that can
be employed to adopt ad-hoc strategies around pricing, sales, advertising, and
customer retention with a goal of continuously reducing costs and increasing
profit.

• Better customer acquisition. Knowing which types of customers generate higher profits, businesses can better allocate resources to increase or decrease budget spending that maximizes profitability and continues to attract the right types of customers.

• Better forecasting. CLV forecasts help businesses make forward-looking decisions around inventory, staffing, production capacity and other costs. Without a forecast, there exists the risk of overspending in areas that don’t require it or underspending in areas that are crucial to keep up with demand.
There exist different ways of calculating CLV. One basic yet effective formula is:
CLV = Customer revenue per year × Relationship in years − Total costs of acquiring and serving the customer
Customer revenue per year refers to the total annual spend of the customer. This
information can be obtained from the SAP sales order tables.
Relationship in years refers to the total trading period of the customer in years.


Total costs of acquiring and serving the customer is the sum of the various costs
incurred during the customer relation in terms of acquiring the customer and
maintaining it.
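As a quick worked example of the basic formula, the snippet below computes CLV for two made-up customers; the figures are invented purely for illustration.

```python
def clv(revenue_per_year: float, relationship_years: float, total_costs: float) -> float:
    """Basic CLV: annual revenue times relationship length, minus acquisition/serving costs."""
    return revenue_per_year * relationship_years - total_costs

# Invented example figures
print(clv(revenue_per_year=12_000, relationship_years=8, total_costs=15_000))  # 81000.0
print(clv(revenue_per_year=3_500, relationship_years=2, total_costs=4_000))    # 3000.0
```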
Although there are many variations of CLV, starting with the basic formula has the following advantages:
1. It contains the key ingredients of the CLV method.
2. It is easier to understand and implement.
3. With our example workflow, it provides a template to build on.
Other, more advanced formulas and techniques, such as traditional and predictive methods, provide different approaches to the CLV calculation. For example, if your customer revenues don’t stay flat year on year, and you need to factor in changes that happen across the customer lifetime, the traditional version of the formula takes the rate of discount into consideration and provides a more detailed understanding of how CLV can change over the years:
CLV = Gross margin × (Retention rate / (1 + Rate of discount − Retention rate))
There are many examples of CLV in helping companies to achieve greater success in
sales revenue growth and higher profitability. One of the most well-known case studies
is the success of Amazon Prime. Amazon’s own study found their Kindle owners spend
approximately 40% more per year buying stuff from Amazon, compared to other
customers. As a result, Amazon paid close attention to CLV with the development of
Amazon Prime. By doing so, Amazon understood how to invest and exploit their most
profitable customer segments. Amazon Prime’s growth is what has been most
impressive. They have managed to convert millions of customers into loyal
subscribers at a very fast rate with a much higher than average spend of $1400 per
year.

CLV workflow chart outputs.

Using KNIME Analytics Platform and the DVW KNIME Connector node for SAP, we will
show how a sales/account manager can easily extract the total revenue of a set of
customers from an SAP ERP system (e.g., SAP ECC, SAP S/4HANA), calculate their
CLVs and segment customers into different bands by value and lifetime (e.g., 30+
years and 1m+ as platinum).


Hands-on CLV analytics in three steps with KNIME connector for SAP

SAP Customer lifetime value analysis workflow.

After a short theoretical intro to CLV, let’s now move to the core of this section: the
analytical implementation. We will walk you through a codeless solution with KNIME
built around three main steps:
1. Extract of sales and customer data from SAP.
2. Process extracted data and calculate CLV.
3. Format and output the data.

Note. This workflow requires the DVW KNIME Connector for SAP. For more details
and to request a trial, visit www.dvwanalytics.com.

1. Extract of sales and customer data from SAP

We start off with the Table Creator node, which is used to provide a list of customer numbers that a sales manager handles along with their associated costs for SAP sales data extraction, and CLV calculation later on.

SAP CLV analysis workflow –customer and cost input.

Next, we transform the data in the format which the xCS Table Data Read tool requires for dynamic input. xCS is DVW Cross-Application Connector for SAP and works with all core SAP systems, as well as with analytics applications that support the OData v4 standard.
analytics applications that support the OData v4 standard.
Our data transformation involves the creation of new constant value columns using
the Column Expressions node, and the renaming and exclusion of some of them with
the Table Manipulator node. The final data table is displayed below.

SAP CLV analysis workflow – dynamic input for the xCS Table Data Read tool.

To use the xCS Table Data Read tool to extract SAP sales data, drag a KCS SAP Executor node onto the canvas.

KCS SAP Executor node.

First, we configure the Basic tab of the KCS SAP Executor with the following steps:
1. Select SAP Table Data Read from the SAP Tool drop-down menu.
2. Select the appropriate SAP system from the SAP Systems drop-down menu.
3. Enter your SAP User name and Password.


xCS SAP Table Tool Basic tab.

We then configure the Parameters tab of the KCS SAP Executor with the following
steps:
1. Enter VBAK (Sales Document: Header Data) in the Selected Table text box.
2. Click the Search button to search and bring back table metadata from SAP.
3. Select relevant fields for extraction.
4. Click on the Save button to save the configuration.

xCS SAP Table Tool Parameters tab.

We repeat the same steps for table KNA1, containing general customer data.


xCS SAP Table Tool to extract data from SAP table –KNA1 (General Data in
Customer Master).

2. Process extracted data and calculate CLV

Once the sales and customer data have been extracted, we can use various KNIME nodes to work out both how long each customer has been ordering and how much they have been ordering. With the total revenue information, we can also deduct the supplied costs to calculate CLV.
We use two GroupBy nodes to work out the first and last order date for the customer.

GroupBy node to get the first and last order date.


A Joiner node is used to combine first and last order date into the same customer
record. Then, the Date&Time Difference node is used to calculate the number of years
the customer has been placing orders.

Date&Time Difference node is used to calculate the number of years the customer has been placing orders.

We then use the Column Expressions node to segment the customer base into platinum/gold/silver/bronze, etc., based on the number of years the customer has been ordering.

Column Expressions node to segment the customer base.
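For reference, the logic of these nodes (the two GroupBy nodes, the Date&Time Difference node, and the Column Expressions node) can be sketched in a few lines of pandas; the column names, sample dates, and band thresholds below are illustrative assumptions, not the values used in the SAP workflow.

import pandas as pd

# Hypothetical order data: one row per sales document.
orders = pd.DataFrame({
    "customer": ["A", "A", "B"],
    "order_date": pd.to_datetime(["1995-03-01", "2022-06-15", "2018-01-10"]),
})

# First and last order date per customer, then the relationship length in years.
lifetime = orders.groupby("customer")["order_date"].agg(first="min", last="max")
lifetime["years"] = (lifetime["last"] - lifetime["first"]).dt.days / 365.25

# Band customers by how long they have been ordering (thresholds are illustrative).
lifetime["lifetime_band"] = pd.cut(lifetime["years"],
                                   bins=[0, 10, 20, 30, float("inf")],
                                   labels=["bronze", "silver", "gold", "platinum"],
                                   right=False)
print(lifetime)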


Next, we can start calculating CLV by summing up the customer's revenue with a GroupBy node. Then a Joiner node is used to bring in the customer cost figures from the initial input table. The CLV is then calculated with a Column Expressions node, applying the formula illustrated above.

Column Expressions node to calculate CLV.

With a new Column Expressions node, we segment the customer base into platinum/gold/silver/bronze, etc., based on the CLV value (e.g., > 1m as platinum).

Column Expressions node to segment the customer base with CLV.
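The CLV step itself then boils down to "total revenue minus the supplied costs", followed by the same kind of banding on the CLV value; below is a rough pandas sketch with hypothetical figures and thresholds that mirrors the basic approach described above.

import pandas as pd

# Hypothetical inputs: net order values extracted from SAP and the costs from the Table Creator.
revenue = pd.DataFrame({"customer": ["A", "A", "B"],
                        "net_value": [700000.0, 550000.0, 12000.0]})
costs = pd.DataFrame({"customer": ["A", "B"], "total_cost": [150000.0, 4000.0]})

# Sum revenue per customer (GroupBy), join the costs (Joiner), deduct them to get CLV.
clv = revenue.groupby("customer", as_index=False)["net_value"].sum().merge(costs, on="customer")
clv["clv"] = clv["net_value"] - clv["total_cost"]

# Segment by CLV value, e.g., > 1m as platinum (thresholds are illustrative).
clv["clv_band"] = pd.cut(clv["clv"],
                         bins=[0, 5_000, 100_000, 1_000_000, float("inf")],
                         labels=["bronze", "silver", "gold", "platinum"])
print(clv)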


3. Format and output the data

The last step of our workflow is about visualizing and reporting results. Using the
Pie/Donut Chart node, we can plot our Customer Lifetime Value Segmentation Analysis, identifying customer groups by the number of years they have been ordering, or by how much they have been spending on the orders.
In the chart below, we can see that most of the customers have been very loyal to the business, with 88% of them having been ordering for 10-20 years or more.

Customer lifetime segmentation analysis by no. of years they have been ordering.

Although loyal, our customers still spend too little (see figure below). Indeed, 44% of them spend less than 5K, while only 18% spend more than 1 million. This valuable piece
of information could be used by the Marketing and Sales departments to create ad-
hoc campaigns and promotions.

Customer lifetime value segmentation analysis by value of orders.


Finally, we join the 3 sets of data (customer general data, lifetime, and CLV) with the
Joiner nodes and write the data for further processing or reporting to Tableau format
using the Tableau Writer node.

Data output to Tableau.

Easy CLV for every business


In this work, we showed how to perform a CLV analysis using KNIME Analytics
Platform and the KNIME Connector for SAP easily and without a single line of code.
After accessing customer data using the KCS SAP Executor, we aggregated and
transformed it to be able to segment customers by the number of years they have been
placing orders, and to calculate CLV to determine customer groups by the value of the
orders.
Finally, we displayed the results using pie charts and exported the output table to
Tableau for further processing or reporting.


Recency, Frequency and Monetary Value

Authors: Thomas Hopf & Alina Chrysa Nikolaidou, DER Touristik

Workflow on the KNIME Community Hub: RFM-Score and Customer KPI

Recency, Frequency and Monetary Value (RFM) is an empirical method used to quantitatively analyze and measure customer value. The underlying idea is to segment
a company’s pool of customers based on their habits, such as purchasing behavior,
browsing history, or prior campaign response patterns. The resulting customer
segments are neatly ranked from most valuable to least valuable, and can be used as
a customer scoring system. Essentially, the RFM method backs up the marketing
saying that “80% of sales comes from 20% of the customers.”
Using RFM analysis, businesses can design ad-hoc campaigns to improve low-scoring customers, identify and retain high-scoring ones with dedicated offers, and, in general, reasonably predict how likely (or unlikely) it is that a customer will purchase again. The operationalization of this method relies on three quantitative factors:

• Recency. Customers who have purchased from a business recently are more
likely to buy again than customers who have not purchased for a while. To reverse
the situation, businesses may need to nurture non-recent customers with new
promotional offers or even reintroduce the brand.

• Frequency. Customers who often make purchases are more likely to purchase
again than customers who buy infrequently. For frequent buyers, businesses
have the chance to collect a wealth of information to build a comprehensive
overview of purchasing habits and preferences. On the other hand, one-time
customers are much harder to profile. They are good candidates for a customer
satisfaction survey to understand what can be done to improve customer
retention.

• Monetary Value. While all purchases are valuable, customers who spend more
are more likely to buy again than customers who spend less. This third factor
helps understand more clearly the first two letters in the RFM acronym. A recent
customer who is a frequent buyer and makes purchases at a high price point has


the potential to turn into a brand loyalist and secure high revenues for the
business.
To successfully conduct RFM analysis, businesses need to rely on extensive customer
data which must include information about a) the date (or time interval) of the most
recent purchase; b) the number of transactions within a specified time interval or since
a certain date; and c) the total or average sales attributed to the customer. Based on
this data, a scoring system can be devised that assigns Recency, Frequency and
Monetary Value scores based on an arbitrarily decided number of categories. For
example, five or fewer categories can be used to distinguish groups whose purchases
are more or less recent, or at a higher or lower price point.
Once RFM scores are calculated, it’s easy to identify the best customers by ranking
them. The higher a customer ranks, the more likely it is that they will do business again
with a firm. Notice that the order of the attributes in RFM does not necessarily
correspond to the order of their importance in ranking customers. It’s the combination
of all the attributes that defines the importance of a customer.
With these newly acquired insights into the customer base, it’s possible to start
analyzing the characteristics and purchasing behavior of different groups, identify
what distinguishes them from other groups, and address them with relevant offers or
initiatives. RFM analysis has been successfully used, for example, by nonprofit
organizations to target donors, as people who have been the source of contributions
in the past are likely to make additional gifts. The RFM model can be adjusted very differently depending on the company's business needs, and thus usually requires an ad-hoc design.
Thanks to the abundance of customer data, businesses can implement automated
solutions to measure customer value and get a better understanding of how to target
different customer groups effectively. Using KNIME Analytics Platform, we will analyze
transactional data to devise a scoring system that clusters customers based on RFM
scores. Next, we’ll enrich the analysis with the addition of historical CLV calculation,
and present customer insights using interactive visualizations.


Customer segmentation using a simple RFM model

Customer segmentation via RFM model workflow.

Now that we have briefly clarified what the RFM analysis is and why it is useful, let’s
explore how we can implement it with a codeless approach using KNIME Analytics
Platform. The idea is to analyze light-weight transactional data, such as orders, using
a quite simple RFM model and a popular KPI. To do that, we’ll walk you through a
workflow example comprised of three steps:
1. Data ingestion and RFM preparation
2. RFM and historical CLV calculation
3. Visualization of results

1. Data ingestion and RFM preparation

We start off by ingesting a practice dataset containing transactional data with the
following information:

• customer = identifies the customer who will be evaluated via RFM and CLV.

• product id = identifies the product that was ordered.

• product = the product name corresponding to the product id.

• order value = corresponds to the price of the product.

• days ago = number of days since the order. This information might also be represented as a timestamp of the time of the order.

Input dataset containing transactional data.


Next, we start processing our data to engineer the quantitative factors that constitute
the RFM analysis. As the first step we aggregate transactions at the customer level
using the GroupBy node. For each customer, the node returns details about the last
purchase (i.e., Recency), the number of products purchased (i.e., Frequency), and the
total product value (i.e., Monetary Value). For example, customer A purchased a
volume of 6 products worth 6800 euros, the last of which was bought 15 days ago.

Aggregated transactions at the customer level.
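As a point of reference outside of KNIME, the same customer-level aggregation could be expressed in pandas roughly as follows; the column names follow the practice dataset, while the values are illustrative.

import pandas as pd

# Practice-style transactional data (illustrative values).
orders = pd.DataFrame({
    "customer": ["A", "A", "B", "C"],
    "order value": [1200.0, 5600.0, 300.0, 2100.0],
    "days ago": [15, 90, 40, 7],
})

# One row per customer: Recency, Frequency, and Monetary Value.
rfm = orders.groupby("customer").agg(
    recency=("days ago", "min"),      # days since the most recent order
    frequency=("customer", "size"),   # number of orders
    monetary=("order value", "sum"),  # total order value
)
print(rfm)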

After the results have been combined, an RFM model can be calculated individually for
each customer. The aggregated data table is fed in parallel into three different k-
Means nodes, one for each factor of the RFM model. These clustering nodes are key
to building a system that assigns RFM scores to each customer. Here, the number of
categories is controlled by k, the parameter that in the k-Means algorithm determines
the number of clusters to form. We set k = 3 but some RFM models set k = 5 or higher.

RFM preparation for each factor using the k-Means algorithm.

Once successfully executed, each k-Means node returns two different outputs: a table with cluster labels assigned to each transaction, and a second table with cluster centroids. The second table is sorted in ascending or descending order according to the factor, and an ordinal value (0-2) is given to each centroid.

A higher ordinal value is assigned to customers who have made purchases for the last time ten days ago.


For example, for Recency, we sort the centroids of the three clusters formed from "Min*(days ago)" in descending order, and we use the Math Formula node to append a new column "Recency_Cluster" that assigns a higher ordinal value to more recent purchases (i.e., fewer days ago, such as ten days ago).
With a series of Joiner nodes, we combine our clustered transactions with the scores
of each factor. Our final table now has three new columns: Recency, Frequency and
Monetary Cluster.

Clustered transactions with Recency, Frequency and Monetary Value scores.
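A rough equivalent of this scoring step, written with scikit-learn's KMeans instead of the three k-Means nodes, is sketched below; it builds its own small table of aggregated factors and simply ranks the cluster centroids to turn cluster labels into ordinal scores (0-2). Treat it as an illustration of the idea rather than a reproduction of the workflow's exact configuration.

import pandas as pd
from sklearn.cluster import KMeans

# Aggregated RFM factors for a handful of customers (illustrative values).
rfm = pd.DataFrame({"recency": [15, 90, 40, 5, 120, 60],
                    "frequency": [6, 1, 3, 8, 2, 4],
                    "monetary": [6800.0, 300.0, 1500.0, 9200.0, 450.0, 2100.0]},
                   index=list("ABCDEF"))

def ordinal_score(values, k=3, higher_is_better=True):
    # Cluster one factor with k-Means, then rank the centroids to obtain ordinal scores 0..k-1.
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(values.to_frame())
    order = pd.Series(km.cluster_centers_.ravel()).rank(ascending=higher_is_better).astype(int) - 1
    return pd.Series(km.labels_, index=values.index).map(order)

# Recency: fewer days ago is better; Frequency and Monetary Value: higher is better.
rfm["Recency_Cluster"] = ordinal_score(rfm["recency"], higher_is_better=False)
rfm["Frequency_Cluster"] = ordinal_score(rfm["frequency"])
rfm["Monetary_Cluster"] = ordinal_score(rfm["monetary"])
print(rfm)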

2. RFM and historical CLV calculation

To take advantage of the RFM scores, we now have two options. In case we have a
lookup table for each combination of scores, we can use this and attach the customer
segment. This is a valid approach if we want to identify, for example, different Recency
phases of our customers first. In our simple use case, the RFM calculation involves
summing up the cluster scores of each factor in one consolidated RFM score.
At this point, we can adopt the same procedure described above, that is using the k-
Means node to identify clusters, centroids, and assign ordinal values to rank
customers. In this case, we distinguish four clusters, which correspond to the four
customer segments that we aim to identify. The number of clusters can be adjusted
to fit a specific business strategy.
In addition to RFM, we can enrich our analysis with a popular KPI: historical Customer
Lifetime Value (CLV) for each customer. Again, the transactional data serves as input
and is limited to the maximum lifetime of our customers, which varies depending on
the industry (e.g., in the Retail industry, the maximum lifetime of a customer might be
two months only). In the example, we choose to retain the past 6 months (i.e., 183


days) and disregard older orders. The historical CLV is calculated as the sum of the
order value for each customer, and corresponds to the Monetary Value.
Finally, we join the tables containing RFM and CLV values to visualize the extracted
insights.

Calculation of RFM and CLV for each customer based on the available transactions.
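Continuing the sketch above, the consolidated score is simply the sum of the three cluster scores, optionally re-clustered into four segments, and the historical CLV is the total order value within an assumed 183-day window; the transaction table here is again illustrative.

import pandas as pd
from sklearn.cluster import KMeans

# Consolidated RFM score (0-6 when k = 3 per factor), then four customer segments.
rfm["RFM_Score"] = rfm[["Recency_Cluster", "Frequency_Cluster", "Monetary_Cluster"]].sum(axis=1)
rfm["Segment"] = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(rfm[["RFM_Score"]])

# Historical CLV: total order value per customer over the last 183 days (assumed window).
orders = pd.DataFrame({"customer": ["A", "A", "B"],
                       "order value": [1200.0, 5600.0, 300.0],
                       "days ago": [15, 250, 40]})
clv = orders[orders["days ago"] <= 183].groupby("customer")["order value"].sum()
rfm = rfm.join(clv.rename("historical_CLV"))  # customers without recent orders stay empty
print(rfm)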

3. Visualization of results

We can now use some popular visualization nodes available in the KNIME JavaScript
Views extension to present and interact with our customer insights.
We start off by visualizing the results of the RFM analysis using the Bar Chart node. We
can immediately see that customers C, D and E ranked the highest in all three factors
within our customer base. These are our most valuable customers, and we should
nurture our relationship with them in order to ensure long-lasting revenues. For
example, we could offer exclusive benefits or discounts. Similarly, customers B, H and
S show high Recency (and even high Monetary Value in one case), which indicates
great potential for transforming these customers into brand loyalists.
On the contrary, customers F, G and J are currently the least valuable for the business.
We should consider creating ad-hoc marketing campaigns and promotions to attract
them back.


Customer breakdown according to RFM scores.

Next, we visualize RFM and CLV together using the Scatter Plot and the Conditional
Box Plot nodes. In the first plot, we can observe that our best customers according to
the RFM scores are historically those who have also spent the most (high CLV). The
opposite is also true.
Additionally, thanks to the conditional boxplot, we can see more clearly the lower and
upper bound, as well as the median monetary value for each customer segment.
Interestingly enough, two major findings stand out. First, the third quartile of the green
boxplot is closely approaching the first quartile of the red boxplot. This tells us that
there’s a business opportunity to grasp in order to close the gap and convert those
customers into the most valuable ones. Second, we can observe that in the orange
boxplot there are upper and lower outliers. The first should be targeted with initiatives
to encourage spending and frequent purchases, whereas the second requires ad-hoc
action to avoid permanent churn.


Plotting RFM scores against historical CLV.

Besides RFM and CLV, we can also identify the cross-selling potential of our products. This can help us understand which combination is most in demand in order to market it effectively and drive sales. Therefore, a Pivoting node is needed to aggregate the
product types purchased by each customer. This results in a table with dummy
variables indicating the purchase of a product with 1, and 0 otherwise. With the Linear
Correlation node, we can then display the correlation of pairwise combinations in a
matrix. Values close to 1 indicate strong correlation and, hence, higher cross-selling
potential. That’s the case, for example, for the products ABB-MAC and AAA-MAC.
Finally, the calculated customer KPIs can be saved and exported to an Excel sheet for
further analysis.

Identifying cross-selling potential across products.
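The dummy-variable pivot and the pairwise correlation behind this view can be sketched in pandas as follows; the product codes and purchases are illustrative.

import pandas as pd

# Illustrative purchases: one row per purchased product.
purchases = pd.DataFrame({
    "customer": ["A", "A", "B", "C", "C", "C"],
    "product": ["ABB-MAC", "AAA-MAC", "AAA-MAC", "ABB-MAC", "AAA-MAC", "XYZ-PC"],
})

# Pivot to dummy variables: 1 if the customer bought the product, 0 otherwise.
dummies = pd.crosstab(purchases["customer"], purchases["product"]).clip(upper=1)

# Pairwise correlation between products; values close to 1 hint at cross-selling potential.
print(dummies.corr())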


RFM to improve and nurture customer relationships


In this section, we successfully explored how to conduct customer segmentation with
RFM scores utilizing KNIME Analytics Platform and a no-code approach.
We started off by ingesting transactional data, containing details about product types
and order value for each customer. With the help of the k-Means algorithm, we built an
RFM scoring system to cluster customers and get a better understanding of their
purchasing behavior. Additionally, we enriched our analysis with historical CLV
calculation. Finally, we displayed our findings with light-weight visualizations.
While simple, our analysis provided valuable and actionable insights to design ad-hoc
campaigns and initiatives that aim at improving and nurturing the relationship with our
customer base.

Data Protection and Privacy

In this chapter, we will delve into the pressing issue of integrating data privacy and protection practices into our data flows. We will show how to appropriately handle sensitive customer information while preserving valuable insights for analysis. More specifically, we will look at a use case to anonymize customer data.

This chapter includes the article:

• Customer Data Anonymization, p. 157


– Lada Rudnitckaia, KNIME


Customer Data Anonymization

Author: Lada Rudnitckaia, KNIME

Workflows on the KNIME Community Hub: Customer Data Anonymization

Almost every organization has to collect some data about its customers, be it for direct
business purposes, e.g., sending out a newsletter, or for getting insights about the
customers, e.g., how the newsletter influences their involvement. Obviously, customer
data contains personal information that is subject to privacy protection.

Data protection and privacy are regulated worldwide at the legislative level. According to UNCTAD, as of today, 71% of countries worldwide have put data protection and privacy legislation in place, and another 9% have drafted it. For example, the European Union's GDPR (General Data Protection Regulation) stipulates that any organization processing personal data of an EU citizen or resident can use such data extensively and without privacy restrictions only after anonymization. The fines for non-compliance are very high.
Let’s see how we can address the privacy requirements and anonymize customer data
prior to the analysis using KNIME Analytics Platform.

Use case
Let’s take the example of a company dealing with customer data. The company
collects raw data about the customers such as name, email, age, date and country of
birth, income, etc., in some secure location with restricted access, and assigns each
customer a unique customer key.
This data can be useful to business analysts, marketing specialists, or data scientists
to explore customers’ behavior and get valuable insights for the business. However, in
its pure form, this data identifies individual customers and can’t be shared with all the
company’s employees.
The goal is, therefore, to transform the data in such a way that no customer can be
identified but at the same time keep the maximum amount of non-sensitive
information for the analysis. In other words, the goal is to anonymize the data. After
the data are anonymized and risks of re-identification are assessed, the anonymized
data can be loaded to the space where it is available for further analysis, for example,
to a data mart.


Exploring the raw data

Let’s have a look at the raw data collected by the company. A sample of the data is
presented in the table below.

Raw customer data.

Disclaimer. The dataset is randomly generated and any similarity to real people is
completely coincidental.

Two attributes –”Name” and “Email”– identify a customer directly. These attributes are
called identifying and will be the first attributes to be anonymized. Although this step
is crucial, it is not sufficient.
Additionally, we need to make sure that no unique combination of attributes can
identify a customer. For example, imagine an attacker –a person who tries to identify
a specific person in the dataset– knows that her neighbor is a customer and was born
on 13.05.1987 in Spain and earns 50k euros. If there is no other customer in the data
with the same combination of these three attributes, the attacker can easily identify
the neighbor and then map all the available information, including sensitive details. The
attributes that can form such combinations are called quasi-identifying.
This problem can be exacerbated by outliers. If an attribute value is unique, it can
identify a person even by itself, without other attributes, like, for example, if a person is
unusually old.
In our case, the combination of attributes “Birthday”, “CountryOfBirth”, and
“EstimatedYearlyIncome” is unique for most customers. This introduces the risk of re-
identification and these attributes should, therefore, also be anonymized.


Disclaimer. For the purpose of this work, we selected three attributes of different
types to anonymize in order to demonstrate the different configuration settings.
We didn’t analyze whether those are optimal and sufficient attributes to
anonymize. There exist quantitative and qualitative methods to decide which
attributes should be anonymized and, for each particular dataset, the attributes
should be thoroughly analyzed to define whether they are identifying, quasi-
identifying, sensitive, or insensitive. For the sake of simplicity, in our example we
consider all the remaining attributes insensitive, but in a real scenario, the columns
“City”, “Country”, “Gender”, and “MaritalStatus” can also be considered quasi-
identifying.

Another attribute worth noting is “CustomerKey”. This column also contains unique
values for each customer and can be used to identify the customer if an attacker has
a dictionary mapping the customer keys with their identities. However, the attribute
itself doesn't allow direct identification of the person – the key is a simple counter that doesn't have any other meaning. We can, therefore, leave it unmodified in the dataset.

Data anonymization
Let’s proceed to data anonymization. To anonymize the data, we will use the nodes
from the community extension Redfield Privacy Extension which is based on the ARX
Java library.
The “Customer Data Anonymization” workflow shown below performs the initial data
anonymization. This workflow reads the raw data, anonymizes the identifying
attributes, creates the hierarchies for the quasi-identifying data, and anonymizes these
data by applying the anonymization model. Next, it assesses the re-identification risks,
and if the assessment is satisfactory, the workflow saves the hierarchies and
anonymization levels that can be reused for the anonymization of new customer data.

The Customer Data Anonymization workflow.

The second workflow, “New Customer Data Anonymization - Deployment”, reads the
data for new customers as the data flows in. It anonymizes the identifying attributes
similarly to the first workflow and reuses the hierarchies and anonymization levels


from the first workflow to anonymize quasi-identifying attributes. The whole dataset is
reassessed again for the re-identification risks. If the assessment is satisfactory, the
anonymized data are loaded to the data mart, otherwise a responsible person is
notified. This workflow can be scheduled to execute regularly with KNIME Business
Hub.

The New Customer Data Anonymization - Deployment workflow.

Anonymizing the identifying attributes

We start with the anonymization of the identifying attributes, in our case, “Name” and
“Email”. These attributes contain only unique values and, therefore, aren’t valuable for
analysis. For example, each customer has their own email address which doesn’t
provide any insight. At the same time, these attributes identify the customer directly
and, therefore, shouldn’t be shared with anyone, not even in part.
This makes the anonymization straightforward and easy since we don’t have to look
for a balance between data accuracy and suppression. In our example, we will hash
the values using the Anonymization node.

In the configuration window, we select the identifying columns “Name” and “Email” and
apply salting. Salting is a method that diversifies and randomizes the original values before hashing by concatenating them with, for example, some random number. Salting aims to protect the data from attackers enumerating and hashing all the possible original values.
After anonymization, names and email addresses are replaced by hash values that
can’t identify a person anymore (see figure below). It is worth noting that the second
output port contains the dictionary that should be protected and not shared.


Input and output of the Anonymization node.
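Conceptually, salted hashing works as in the short sketch below; the salt handling and the choice of SHA-256 are assumptions made for illustration, not a description of the Anonymization node's internal implementation.

import hashlib
import secrets
import pandas as pd

customers = pd.DataFrame({"Name": ["Jane Roe", "John Doe"],
                          "Email": ["jane@example.com", "john@example.com"]})

# A random salt diversifies the values, so attackers cannot simply pre-hash
# every plausible name or e-mail address and look the hashes up.
salt = secrets.token_hex(16)

def salted_hash(value):
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

# Keep the resulting mapping (the dictionary) protected and never share it with the data.
dictionary = {col: {v: salted_hash(v) for v in customers[col]} for col in ["Name", "Email"]}
for col in ["Name", "Email"]:
    customers[col] = customers[col].map(dictionary[col])
print(customers)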

Anonymizing the quasi-identifying attributes

Unlike "Name" and "Email", the columns "Birthday", "CountryOfBirth", and "EstimatedYearlyIncome" are valuable and provide insights about the customers. Do people from Spain tend to prefer some particular product? Does this age group subscribe to a newsletter? For a person with a high income, is it worth advertising an expensive product? It would be a waste to suppress this information completely.
To preserve the maximum amount of information without introducing privacy risks, we will generalize the values in these columns so that it is not possible to uniquely identify a person using this attribute combination. Generalization means defining hierarchies – complex binning rules with multiple layers that go from the original data to less and less accurate values, and finally to completely suppressed data.

Hierarchies of different levels (they can be seen in the second output of the
Create Hierarchy node).

This kind of anonymization can be performed by using the Create Hierarchy node for each quasi-identifying attribute and then applying, with the Hierarchical Anonymization node, the privacy model that will find the optimal levels of generalization. The configuration of the Create Hierarchy node is different for different data types. In our example, we create the hierarchies for three columns of three different types: date, double, and string.
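To make the idea of hierarchy levels concrete, the sketch below shows what a generalization hierarchy for "Birthday" could look like conceptually, going from the original date to month/year, to year, and finally to a fully suppressed value; the level names and formats are illustrative.

import pandas as pd

# Illustrative generalization hierarchy for "Birthday": each level is less precise.
birthdays = pd.to_datetime(pd.Series(["1987-05-13", "1990-11-02"], name="Birthday"))

hierarchy = pd.DataFrame({
    "level_0_original": birthdays.dt.strftime("%d.%m.%Y"),   # e.g., 13.05.1987
    "level_1_month_year": birthdays.dt.strftime("%m/%Y"),    # e.g., 05/1987
    "level_2_year": birthdays.dt.strftime("%Y"),              # e.g., 1987
    "level_3_suppressed": "*",                                # completely suppressed
})
print(hierarchy)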


Date & Time data


Let’s create a hierarchy for the first attribute –“Birthday”. In the configuration window
of the Create Hierarchy node (see figure below), we first select the column and the
hierarchy type and click “Next”. Then, we select two levels of generalization –
month/year and year. This means that two levels between the original data and
completely suppressed data will be created: a month and a year, or a year instead of
the original birth date.
To take the outliers –too young or too old people– into account, we set the minimum
and maximum thresholds. The values above or below the thresholds won’t be specified
precisely.

Creating a hierarchy for the attribute of Date&Time data type in the configuration
window of the Create Hierarchy node.

Numeric data
Next, let’s create the hierarchy for the “EstimatedYearlyIncome” column. The
configuration of the Create Hierarchy node is more complicated for the numeric data
but we will show you one trick to simplify it. Follow the steps below to configure the
node:
1. First, we select the column “EstimatedYearlyIncome” and the hierarchy type
“intervals” and click “Next”.
2. In the next window, in the tab “General”, you can choose the aggregate function
for all the groups. We stick to the default “Interval” option.
3. In the tab “Range”, you can specify minimum and maximum values as well as
restrict the boundary values of the groups. We increase the default income range
to the range from 0 to 200k.


4. Click on the 1st available group…


5. …And change its minimum and maximum values. Note that the range you create
here, in our case 20k, will be the range of all the groups, if you create them
automatically, as we suggest in the next step.
6. And here is the trick –do not create the next group on the same level manually!
Instead, create a group on a new level…
7. … and increase its size, for example, to 5. You can now see that 4 additional
groups are created on the 1st level, all with the range of 20k. The group on the
2nd level then has the range of 100k.

8. We add just one more group with size 5. Now, all the groups we created cover our
general range from 0 to 200k.


From top to bottom: Creating a hierarchy for the attribute of numeric data type in the configuration window
of the Create Hierarchy node.

Categorical data
Finally, let's create the hierarchy for the "CountryOfBirth" column. Follow the steps below to configure the Create Hierarchy node:
1. First, we select the column "CountryOfBirth" and the hierarchy type "ordering" and click "Next".
2. Next, in the new window, we need to sort the original country values. The values will be included in groups from this sorted list. To create meaningful groups, we sort them so that the countries located close to each other are close to each other in the list. Alternatively, you can sort the values in an alphabetic order.


3. Click on the 1st interval…
4. …And increase its size. We increase its size to 3 since we want the first three countries from the sorted list to be in this 1st group.
5. There is no trick here, so we continue to create the groups and increase their sizes manually.
6. Once we are done with creating the groups of the 1st level, we can add a new level…
7. …And manually increase the size of the groups to generalize even more.


From top to bottom: Creating a hierarchy for the attribute of string data type in the configuration window of
the Create Hierarchy node.

Data anonymization
Now, we have hierarchies created but as you can see in the 1st outputs of the Create
Hierarchy nodes, the data isn’t transformed yet. By the way, which level of the
anonymization should we use? This is something we can find out by applying the
privacy model to our hierarchies using the Hierarchical Anonymization node.
As an input, this node requires the original data and all the hierarchies connected via
Hierarchy Configuration ports. In the first configuration tab “Columns”, we need to
define the attribute type for each column (see figures below). We define “Birthday”,
“CountryOfBirth”, and “EstimatedYearlyIncome” as quasi-identifying. We define all the
other columns as insensitive, including the two identifying columns that have been
already anonymized. The hierarchies provided via the input port will be used
automatically.
In the “Privacy Models” tab, we need to select the privacy model. Depending on the use
case and the attributes in the data, different privacy models should be used. We will
use the simplest k-anonymity model defined as follows: “A dataset is k-anonymous if
each record cannot be distinguished from at least k-1 other records regarding the
quasi-identifiers”. We use k = 2. This means that in our anonymized dataset, for each
customer, we want at least one other customer to have the same combination of "Birthday", "CountryOfBirth", and "EstimatedYearlyIncome" attribute values. Note that k = 2 is a very weak requirement for a real use case.
We set all the other settings to default ones but many additional settings are available
here and allow you to control the anonymization according to your needs and
requirements.


From top to bottom: Applying a k-anonymity privacy model to the original data
using the created hierarchies in the configuration window of the Hierarchical
Anonymization node.

After the node is executed, the data are anonymized according to the anonymization
levels suggested by the node. However, different combinations of levels can
anonymize the data, and the node allows you to explore all the options and change the
suggested levels in the interactive view. Note that after the anonymization, all the
columns are converted to the String type.


From top to bottom: Exploring different options for anonymization levels combinations in the interactive view of the Hierarchical Anonymization node.

Anonymity evaluation

Now that the data are anonymized, we need to make sure that the risks of re-
identification are acceptable. We can do that using the Anonymity Assessment node.
This node estimates two types of re-identification risks: quasi-identifiers diversity and
attacker models.
We provide the original and the anonymized data to the first and the second input ports,
respectively. In the configuration of the node, we need to select the three quasi-
identifying columns –“Birthday”, “CountryOfBirth”, and “EstimatedYearlyIncome”. We
also need to set the re-identification threshold –the risk threshold which we consider
acceptable. Before defining the acceptable risk, let’s first discuss what is actually the
highest risk.
In general, the individual risk for each person to be re-identified is 1/k, where k = 1 +
the number of people in the dataset who have the same values in the quasi-identifying
columns. For example, for the 2-anonymity model where you have at least 2
indistinguishable people, if you pick one, the risk that you pick the one you try to identify
is 50%. Therefore, the highest risk for the 2-anonymity model is 0.5.
We can then restrict the threshold for the acceptable risk to, let's say, 0.2. This would mean that we consider a personal data record at risk if for this person there are fewer than 4 other people in the dataset with the same values in the three quasi-identifying columns.
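The arithmetic of this risk can be written down in a few lines; here is a sketch assuming the three quasi-identifying columns of our example and the 0.2 threshold mentioned above (the Anonymity Assessment node additionally covers attacker models, which this sketch does not).

import pandas as pd

QUASI_IDENTIFIERS = ["Birthday", "CountryOfBirth", "EstimatedYearlyIncome"]

def reidentification_risk(anonymized, threshold=0.2):
    # Individual risk = 1 / k, where k is the size of the group sharing the same quasi-identifiers.
    k = anonymized.groupby(QUASI_IDENTIFIERS)[QUASI_IDENTIFIERS[0]].transform("size")
    risk = 1.0 / k
    return pd.Series({
        "records_at_risk": int((risk > threshold).sum()),
        "success_rate": risk.mean(),   # average individual risk, i.e., avg(1/k)
        "highest_risk": risk.max(),
    })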
Now let’s execute the node and explore the output in the second output port:


• There are no records at risk.

• The success rate – the weighted average of the individual risks in the dataset, i.e., avg(1/k) – is 0.027, which is very low in general and much lower than the highest risk.

• The highest risk is 0.143 meaning that for each person in the dataset there are at
least 6 other people in the dataset with the same values in three quasi-identifying
columns. We consider this acceptable for the purposes of this work.

Disclaimer. Note that risk thresholds used here are neither universal nor sufficient
for a real use case. They need to be customized depending on the data, the use
case as well as the organization, ethics board, and country requirements.

Deployment

Now, what happens after we have anonymized the data for the first time? In our scenario, new customer data flows in on a regular basis and should be anonymized consistently with the existing data.
We can reuse the hierarchies we created and the selected levels of anonymization for
the new customer data. In the first workflow, we save the hierarchies using the
Hierarchy Writer node for all three attributes.
Next, the Hierarchical Anonymization node generates a few flow variables describing
the applied anonymization and including the selected levels of anonymization. We
transform the variables into a compact table that we also save.
Now, let’s move to the “New Customer Data Anonymization - Deployment” workflow.
Here we read the new customer data as well as the hierarchies and the anonymization
levels we saved earlier. The Hierarchy Reader node requires the new data (with the
original data types) as an input, reads hierarchies for all three attributes, and applies
them to the input data. You can see a hierarchy preview, specific for this new input
data domain in the second output port. Note that all the values are converted to the
String type. We will use these hierarchy previews as a dictionary to replace the original
values in the new data.
Let’s anonymize! First, we hash the identifying columns just as we did in the previous
workflow. Before we apply the hierarchies, we will need to change the data types to
type String. And then, we can simply use the Cell Replacer node to replace the original
values with the respective values from the correct level of anonymization. The correct
level of anonymization is controlled by the flow variable coming from the table with the
anonymization levels that we saved earlier.
Before we load these new anonymized customer data onto our data mart, we need to
make sure that the whole dataset still complies with our anonymity requirements. If at


some point the existing anonymization is not sufficient anymore, we might need to
update the initial anonymization process. In this case we can set up an automated
notification via email.

Disclaimer. Note that this is just one option to deploy the data anonymization
process. Depending on security policies in different organizations and customer
dataset size, other solutions might be possible. For example, one could anonymize
the whole data from scratch each time if the data amount allows doing so.

Keep your data safe with low code


In this work, we introduced the importance of data protection and privacy using the example of customer data. We discussed the balance between data accuracy and suppression, i.e., which attributes are identifying and should be anonymized completely, and which attributes are safe and valuable to keep for analysis, at least in part.
We then demonstrated how to anonymize identifying and quasi-identifying attributes
of different data types and evaluate the results of anonymization with respect to the
number of personal records at risk, average and the highest risks.
Finally, in addition to the initial customer data anonymization, we showed how to
deploy the data anonymization solution and apply it to the new customer data that
comes in on a regular basis and needs to be automatically anonymized and loaded for
further analysis without privacy risks.
It is important to note that the models, parameters, and risk thresholds used here are
not universal and most likely not sufficient for a real use case. They need to be
customized depending on the data, the use case as well as the organization, ethics
board, and country requirements. Besides, some parameters can be optimized for the
downstream tasks, i.e., one can find the optimal parameters to both protect the data
and preserve maximum information to get insights from it.
Moreover, we demonstrated only the k-anonymity model. While the k-anonymity model protects the data from certain types of attacks, it does not address the possibility of retrieving sensitive information about people. Other models exist to protect the data and can be useful in particular use cases.
The Redfield Privacy Extension provides a rich set of configuration settings that allow
you to customize your data anonymization pipeline w.r.t. models, model parameters,
risk thresholds, data accuracy and suppression restrictions, as well as come up with
the deployment solution that works best for the needs and requirements in your
particular use case.

Other Analytics

In this last chapter, we will show how to work with other data types that are often used
in Marketing Analytics to analyze visual content or map connections: images and
networks, respectively. More specifically, we will look at use cases in image feature
mining and social media network visualization.

This chapter includes the articles:

• Image Feature Mining with Google Vision, p. 173


– Roberto Cadili, KNIME

• Social Media Network Visualization, p. 182


– Paolo Tamagnini, KNIME


Image Feature Mining with Google Vision

Author: Roberto Cadili, KNIME

Workflow on the KNIME Community Hub: Extraction of Image Labels and Dominant Colors

It's the backbone of autonomous driving, it's at supermarket self-checkouts, and it's
named as one of the trends to power marketing strategies in 2022 (Analytics Insight).
Computer vision is a research field that tries to automate certain visual tasks —e.g.,
classifying, segmenting, or detecting different objects in images. To achieve this,
researchers focus on the automatic extraction of useful information from a single
image or a sequence of images. Where entire teams were previously needed to scan
networks and websites, computer vision techniques analyze visual data automatically,
giving marketers quick insight to create content that will better attract, retain, and
ultimately drive profitable customer action.
There are different approaches to analyzing image data and extracting features. In this
section, we want to walk through an example showing how to integrate and apply a third-
party service, namely Google Cloud Vision API, in KNIME Analytics Platform to detect
label topicality and extract dominant colors in image data for advertising campaigns.

Image feature mining with Google Cloud Vision in KNIME Analytics Platform.

Extract and visualize color dominance and label topicality


In the last decade, Google has consolidated its undisputed position as a leading
company in the development of cutting-edge AI technology, research, cloud-based
solutions, and on-demand web analytics services that automate complex processes
for businesses. One such service is Google Cloud Vision API. It allows users to harness


the power of pre-trained machine learning models for a wide range of computer vision tasks to understand images, from image label assignment and property extraction to object and face detection.
It is worth noting that, while it's very powerful, Google Cloud Vision API is an automated machine learning model, meaning it offers little-to-no human-computer interaction options. A data scientist can only work on the input data; after feeding it into the machine, they have little chance of influencing the final model.

1. Connect to Google Cloud Vision API

To harness the power of Google Cloud Vision API, we need to first set up a Google
Cloud Platform project and obtain service account credentials. To do so:
1. Sign in to your Google Cloud account. If you're new to Google Cloud, you’ll need
to create an account.
2. Set up a Cloud Console project:
a. Create or select a project.
b. Enable the Vision API for that project.
c. Create a service account.
d. Download a private key as JSON.
e. You can view and manage these resources at any time in the Cloud Console.

Note. When you create a Cloud Console project you have to enter payment details
to use the service(s) called via the API (even for the free trial). If you don't do this,
you'll incur client-side errors.

To connect to a number of Google services in KNIME Analytics Platform, it is usually enough to use the nodes of the KNIME Google Connectors extension. Thanks to the Google Authentication or Google Authentication (API Key) nodes, you can connect to Google Analytics, Google Storage, Google Sheets, Google Drive, Google Cloud Storage, and Google BigQuery. These nodes, however, do not currently allow you to connect to Google Cloud Vision API. To solve this problem, we can easily create shareable and reusable components to upload private keys as a JSON file and connect to and authenticate Google Cloud Vision.

Components used to upload private keys in JSON format and authenticate to Google Cloud Vision.
The “Key upload” component allows you to select a JSON file from a local directory,
upload it onto KNIME Analytics Platform, and parse it to collect leaves. Although for
most applications it is sufficient to copy and paste the keys in the required node,


wrapping file ingestion in a component makes the process of selection and upload
reusable, faster, less prone to error, and consumable on the KNIME Business Hub
(figure above).

Inside the “Key upload” component, we use the File Upload Widget node to select the
JSON file from a local directory.

After uploading the private keys, we connect to Google Cloud Vision API and
authenticate the service to start using it. The component “Authentication Google
Vision API” relies on a simple Python script –parameterized with private keys via flow
variables– to generate a JSON web token to send data with optional
signature/encryption to Google Vision API. When using a Python script in your KNIME
workflows, it’s good practice to include a Conda Environment Propagation node to
ensure workflow portability and the automated installation of all required
dependencies, in particular the PyJWT Python library. Once the JSON web token is
generated, it’s passed on outside the component as a flow variable.

Use a simple Python script to generate a JSON web token.
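The component's script is not reproduced in the book, but a self-signed JSON web token of this kind could be generated with PyJWT roughly as follows; the claim set and the placeholder credentials are assumptions for illustration, so check Google's current service-account documentation before relying on them.

import time
import jwt  # PyJWT (with the "cryptography" extra for RS256 signing)

# Values taken from the downloaded service-account JSON key (placeholders here).
client_email = "my-service-account@my-project.iam.gserviceaccount.com"
private_key = "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n"
private_key_id = "abc123"

now = int(time.time())
payload = {
    "iss": client_email,
    "sub": client_email,
    "aud": "https://vision.googleapis.com/",  # the API the token is intended for
    "iat": now,
    "exp": now + 3600,                        # tokens are short-lived (at most one hour)
}

# RS256-signed token, later sent as "Authorization: Bearer <token>" in the request header.
token = jwt.encode(payload, private_key, algorithm="RS256", headers={"kid": private_key_id})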


2. Ingest and encode image data

After completing the authentication, we read the file paths of image data (e.g. shoes,
drinks, food) using the List Files/Folders node, and prepare it for the creation of a valid
POST request body to call the webservice of Google Cloud Vision.

The request body as defined in Google Cloud Vision and KNIME Analytics Platform.

Using the Container Input (JSON) node, we can reproduce the request body structure
in JSON format, specifying the image mining type we are interested in (e.g.,
“IMAGE_PROPERTIES”), the number of max results, and the input image data. The
crucial transformation is the encoding of images as a base64 representation. We can
do that very easily using the Files to Base64 component, which takes a table of file
paths and converts each file to a base64 string.
Wrangling the base64-encoded images back into the JSON request body with
the String Manipulation node, we create a column where each row contains a request
body to extract features for each input image. We are now ready to call the REST API
using the POST Request node.
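Outside of KNIME, the same request could be assembled and sent with the requests library as sketched below; the file name is hypothetical, token is the JSON web token from the authentication step, and the body mirrors the structure shown above.

import base64
import requests

# Encode one image as base64 and wrap it in the request body structure shown above.
with open("shoe.jpg", "rb") as f:  # hypothetical local image
    image_b64 = base64.b64encode(f.read()).decode("ascii")

body = {"requests": [{
    "image": {"content": image_b64},
    "features": [{"type": "IMAGE_PROPERTIES", "maxResults": 10},
                 {"type": "LABEL_DETECTION", "maxResults": 10}],
}]}

response = requests.post(
    "https://vision.googleapis.com/v1/images:annotate",
    json=body,
    headers={"Authorization": "Bearer " + token},  # token generated in the previous step
    timeout=20,
)
print(response.status_code)
annotations = response.json()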

3. Extract dominant colors and label topicality

As we already mentioned in earlier sections, we want to mine information about label topicality and dominant colors. Using the POST Request node, we can let Google Cloud Vision annotate image data and return a response in JSON format that we can wrangle into the shape we want.
The configuration of the POST Request node is very easy:


• In the “Connection Settings” tab, specify the URL of the web service and the
operation to perform on
images: https://vision.googleapis.com/v1/images:annotate.

• Increase the Timeout from 2 to 20 seconds to give the server more time to process the images.

• In the “Request Body” tab, point the node to the column containing the JSON
request bodies with the encoded images and the image mining types.

• In the “Request Header” tab, pass on the JSON web token.

• The “Error Handling” tab handles errors gracefully, and by default outputs missing
values if connection problems or client-side or server-side errors arise.

Configurations of the POST Request node.

For large image datasets, sending a sole POST request can be computationally
expensive and overload the REST API. We can adopt two workarounds: feeding
data in chunks using the Chunk Loop Start node and/or checking the box “Send
large data in chunks” in the configuration of the POST Request node.

If the request is successful, the server returns a 200 HTTP status code and the
response in JSON format.


Raw JSON response of the POST Request node.

In the “Image properties” and “Label detection” metanodes, we use a bunch of data
manipulation nodes —such as the JSON to Table, Unpivoting, and Column
Expressions nodes— to parse the JSON response and extract information about
dominant colors and label topicality. In particular, using the Math Formula node, we
compute dominant color percentages by dividing each score value by the sum of all
scores for each image. The Column Expressions node converts RGB colors into the
corresponding HEX encoding.
The conversion into HEX-encoded colors is necessary to enrich the table view with actual color names that are easier to understand for the human user. To do that, we rely on The Color API. This web service can be consumed via a GET Request that identifies each color unequivocally by its HEX-encoding. The Color API returns an SVG image containing the color name and image. It's worth mentioning that the retrieved color names are purely subjective and are called as such by the creators of the API.

Table view containing dominant color percentage in descending order, RGB and HEX encodings, and an SVG column with color names.
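The two computations are simple enough to spell out; below is a sketch with made-up RGB scores that divides each score by the total and converts the RGB triple to its HEX code.

# Dominant colors returned for one image as (R, G, B, score) tuples (illustrative values).
colors = [(196, 32, 38, 0.26), (24, 24, 24, 0.21), (222, 203, 180, 0.18)]

total_score = sum(score for *_rgb, score in colors)
for r, g, b, score in colors:
    percentage = 100 * score / total_score            # score divided by the sum of all scores
    hex_code = "#{:02X}{:02X}{:02X}".format(r, g, b)  # RGB converted to its HEX encoding
    print(hex_code, round(percentage, 1), "%")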


Following a similar JSON parsing procedure, topic labels are extracted for each image, and their topicality values are sorted in descending order.

The table shows the list of topic labels and descriptions for the first image.

4. Visualize results with an interactive dashboard

Once dominant colors and topic labels have been mined and parsed, content
marketers may benefit greatly from the visualization of those features in a dashboard
where visual elements can be dynamically selected.
The key visual elements of the dashboard are three JavaScript View nodes: the Tile
View node to select the image, the Pie/Donut Chart node to plot label topicality, and
the Generic JavaScript View node to display a Plotly bar chart with different colors and
percentage values according to the selected image. The essential touch of interactivity
and dynamism in the dashboard is given by the Refresh Button Widget. This node
works by producing a series of reactivity events that trigger the re-execution of
downstream nodes in a component by conveniently connecting the variable output
port to the nodes that the user wishes to re-execute. This means that we can interact
more easily with input data in components without leaving the component interactive
view, and create dynamic visualizations to make the UI even more insightful and enjoyable.
For example, content marketers can use this workflow and the resulting dashboard to
identify the two major topic labels in the example image. Perhaps unsurprisingly, “dog”
and “plant” stand out. What is surprising, however, is the most prominent color: red, at 26% dominance. This appears fairly counterintuitive, since red is used only in the ribbon
and hat, whereas other colors, such as black or beige, occupy many more pixels.


Interactive dashboard of dominant colors and topic labels.

Understanding why the underlying model determines that red is the dominant color is
not easy. The official documentation of Google Cloud Vision API does not provide
much information. On the GitHub repository of Google Cloud services, it is possible to
find a few possible explanations. The assumption that the color annotator blindly looks
at the pixels and assigns a value based on how many pixels have similar colors seems
naive. Rather, the model determines the focus of the image, and the color annotator
assigns the highest score to that color. Hence, in the example image, the hat is
identified as the focus of the image and red as the most prominent color, followed by
different shades of black.

Disclaimer. The validity of these explanations remains largely unverified by the official provider.

AutoML models as web services for ease of consumption


Image feature mining is a complex task that requires abundant (annotated) data and
substantial computational resources to train and deploy versatile machine learning
models that perform well across a wide range of subtasks. Additionally, for most tools,
you still need to overcome the coding barrier.


Based on this premise, AutoML models for image mining made consumable as web
services via REST APIs have flourished and become popular and powerful alternatives
to self-built solutions. In that sense, Google Cloud Vision API is probably one of the
most innovative technologies currently available, and has considerably reduced implementation costs, delivering fast and scalable alternatives. Yet very often, web
services based on AutoML models have two major drawbacks: they offer no room for
human-machine interaction (e.g., improvement and/or customization), and their
underlying decision-making process remains hard to explain.
While it’s unlikely that these drawbacks will undermine the future success of AutoML
models as REST APIs, it’s important to understand their limitations, and assess the
best approach for each data task at hand.


Social Media Network Visualization

Author: Paolo Tamagnini, KNIME

Workflow on the KNIME Community Hub: Visualizing Twitter Network with a Chord Diagram

There are two main analytics streams when it comes to social media: the topic and
tone of the conversations and the network of connections. You can learn a lot about a
user from his or her connection network!
Let’s take Twitter for example. The number of followers is often assumed to be an
index of popularity. Furthermore, the number of retweets quantifies the popularity of a
topic. The number of crossed retweets between two connections indicates the
liveliness and strength of the connection. And there are many more such metrics.
@KNIME on Twitter counts more than 6730 followers (data from August 2021): the
social niche of the KNIME real-life community. How many of them are expert KNIME
users, how many are data scientists, how many are attentive followers of posted
content?

Chord diagram visualizing interactions from the top 20 Twitter users around
#knime. Nodes are represented as arcs along the outer circle and connected to
each other via chords. The total number of retweeted tweets defines the size of
the circle portion (the node) assigned to the user. A chord (the connection area)
shows how often a user’s tweets have been retweeted by a specific user, and is
in the retweeter’s color.


Let’s check the top 20 active followers of @KNIME on Twitter and let’s arrange them
on a chord diagram (see figure above). Are you one of them?
A chord diagram is another graphical representation of a graph. The nodes are
represented as arcs along the outer circle and are connected to each other via chords.
The chord diagram displayed above refers to tweets including #knime during the week
from July 26 to August 3, 2021. The number of retweeted tweets defines
the size of the circle portion (the node). Each node/user has been assigned a random
color. For example, @KNIME is olive, @DMR_Rosaria is light orange, and @paolotamag
is blue.

Having collected tweets that include #knime, it is not surprising that @KNIME occupies such a large space on the outer circle.
The number of retweets by another user defines the connection area (chord), which is then displayed in the color of the retweeter. @DMR_Rosaria is an avid retweeter. She has retweeted the tweets by @KNIME and KNIME followers disproportionately more than everybody else and has therefore made light orange the dominant color of this chart. Moving on from light orange, we can see that the second most active retweeter of KNIME tweets for that week was @paolotamag.

How to build a Chord Diagram in KNIME


We’d now like to show how we built the chord diagram using KNIME Analytics
Platform.

KNIME workflow to visualize social media network on Twitter.


Data access

We access the data by using the Twitter nodes included in the KNIME Twitter API extension. We gathered the sample data around the hashtag #knime during the week from July 26 to August 3, 2021. Each record consists of the username, the tweet itself, the posting date, the number of reactions and retweets and, if applicable, who retweeted it.
Let’s build the network of retweeters. A network contains edges and nodes. The users represent the nodes, and their relations, i.e., how often user A retweets user B, are represented by the edges. Let’s build the edges first:
1. We filter out all tweets with no retweets or that consist of auto-retweets only.
2. We count how often each user has retweeted the tweets of every other user.
To clean the data and compute the edges of the network, all you need are two Row Filter nodes and a GroupBy node.
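
As a rough illustration of what these two nodes compute, here is a minimal JavaScript sketch that performs the same filtering and counting on a small, made-up array of tweet records. The field names (author, retweeter) are assumptions for the example, not the actual column names returned by the Twitter nodes.

// toy sample of tweet records: who wrote the tweet and who (if anybody) retweeted it.
var tweets = [
  { author: "KNIME", retweeter: "DMR_Rosaria" },
  { author: "KNIME", retweeter: "DMR_Rosaria" },
  { author: "KNIME", retweeter: "paolotamag" },
  { author: "DMR_Rosaria", retweeter: null },        // no retweet: filtered out
  { author: "paolotamag", retweeter: "paolotamag" }  // auto-retweet: filtered out
];

// step 1: keep only real retweets (the two Row Filter nodes in the workflow).
var retweets = tweets.filter(function(t) {
  return t.retweeter !== null && t.retweeter !== t.author;
});

// step 2: count retweets per (author, retweeter) pair (the GroupBy node in the workflow).
var edgeCounts = {};
retweets.forEach(function(t) {
  var key = t.author + " -> " + t.retweeter;
  edgeCounts[key] = (edgeCounts[key] || 0) + 1;
});

console.log(edgeCounts); // { "KNIME -> DMR_Rosaria": 2, "KNIME -> paolotamag": 1 }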

The matrix of nodes and interactions

Now we want to build a weighted adjacency matrix of the network, with usernames as column headers and row IDs, and the number of retweets by one username on the tweets of the other in each data cell. We achieve that with the following steps.

This metanode builds the matrix of interactions between Twitter usernames around #knime.

1. We build a comprehensive list of all users (usernames), both tweeting and retweeting, and count the number of times their tweets have been retweeted overall. These numbers will fill the nodes of the network.
2. We narrow our analysis down to the 20 most retweeted users. We sort them in descending order with respect to the number of retweets on their tweets and keep only the top 20.
3. Using a Cross Joiner node, we build the pairs of users. From an original set of twenty users, we end up with 400 different user pairs.


4. To these user pairs we add the previously computed edges by using a Joiner
node.
5. The Pivoting node then creates the matrix structure from the (username1,
username2, count of retweets) data table.
You can read it like this: “The user named in Row ID’s row was retweeted n times by
the user named in the column header’s column.”
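
To make the structure of the Pivoting node’s output explicit, the sketch below derives the same matrix from the edge counts of the previous sketch. The list of top users is shortened to three names for brevity; in the workflow it would contain the 20 most retweeted users.

// assumed inputs: the top users (only three here for brevity) and the counted edges.
var topUsers = ["KNIME", "DMR_Rosaria", "paolotamag"];
var edgeCounts = { "KNIME -> DMR_Rosaria": 2, "KNIME -> paolotamag": 1 };

// matrix[i][j] = how often topUsers[i]'s tweets were retweeted by topUsers[j].
var matrix = topUsers.map(function(author) {
  return topUsers.map(function(retweeter) {
    return edgeCounts[author + " -> " + retweeter] || 0;
  });
});

console.log(matrix); // [[0, 2, 1], [0, 0, 0], [0, 0, 0]]

This square matrix is exactly the data structure that the chord layout consumes in the next step.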

Drawing the Chord Diagram

• The matrix we created is the data input for a Generic JavaScript View node.

• The Generic JavaScript View node draws the chord diagram.

• To draw the chord diagram, we need the D3 library, which can be added to the code in the Generic JavaScript View node.

• The JavaScript code required to draw this chart is relatively simple and is shown here.
// creating the chord layout given the entire matrix of connections.
var g = svg.append("g")
    .attr("transform", "translate(" + width / 2 + "," + height / 2 + ")")
    .datum(chord(matrix));

// creating groups, one for each twitter user.
// each group will have a donut chart segment, ticks and labels.
var group = g.append("g")
    .attr("class", "groups")
  .selectAll("g")
    .data(function(chords) { return chords.groups; })
  .enter().append("g")
    .on("mouseover", mouseover)
    .on("mouseout", mouseout)
    .on("click", click);

// creating the donut chart segments in the groups.
group.append("path")
    .style("fill", function(d) { return color(d.index); })
    .style("stroke", function(d) { return d3.rgb(color(d.index)).darker(); })
    .attr("d", arc)
    .attr("id", function(d) { return "group" + d.index; });

// creating the chords (also called ribbons) connections,
// one for each twitter users pair with at least 1 retweet.
g.append("g")
    .attr("class", "ribbons")
  .selectAll("path")
    .data(function(chords) { return chords; })
  .enter().append("path")
    .attr("d", ribbon)
    .style("fill", function(d) { return color(d.target.index); })
    .style("stroke", function(d) { return d3.rgb(color(d.target.index)).darker(); });
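
The excerpt assumes that a few objects (svg, width, height, chord, arc, ribbon, color, and the mouse event handlers) have been defined earlier in the script. A minimal version of that setup, assuming D3 v4 has been added to the node as described above, could look roughly as follows; the sizes, padding, and color scheme are arbitrary choices for illustration and not necessarily those used in the original workflow.

// dimensions of the drawing area and of the donut ring.
var width = 600, height = 600;
var outerRadius = Math.min(width, height) * 0.5 - 40;
var innerRadius = outerRadius - 25;

// chord layout, arc and ribbon generators, and a categorical color scale (D3 v4 API).
var chord = d3.chord().padAngle(0.05).sortSubgroups(d3.descending);
var arc = d3.arc().innerRadius(innerRadius).outerRadius(outerRadius);
var ribbon = d3.ribbon().radius(innerRadius);
var color = d3.scaleOrdinal(d3.schemeCategory20);

// SVG element the chord diagram is drawn into.
var svg = d3.select("body").append("svg")
    .attr("width", width)
    .attr("height", height);

// placeholder event handlers; the actual view highlights a user's chords on hover.
function mouseover(d) {}
function mouseout(d) {}
function click(d) {}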


More formal network analysis technique

If you would prefer a more traditional method to visualize your results, there are also common KNIME nodes for analyzing your social media network in the “classic” and more formal way. Using these nodes also means that you don’t have to use any JavaScript programming. What we want to do is analyze the same network of the 20 most active followers of @KNIME on Twitter, but this time with the KNIME Network Viewer node.

Top 20 Twitter users around #knime visualized as a network map using the
Network Viewer node. Nodes of the underlying graph are represented as circles
and are connected via arrows. The size of a circle (node) is defined by the total
number of times a user has been retweeted by one of the other users. The size
of an arrow (edge) represents how often one user retweeted another user’s
tweets.

This network map displays the graph with the following key elements: nodes are represented by a specific shape, size, color, and position. We arbitrarily chose circles for the shape. Each node is colored and labeled with respect to the user it represents. The circle’s size depends on the overall number of times the specific user’s tweets have been retweeted by other users. The position of a node is defined by its degree: the more input and output connections a node has, the higher its degree and the more central its placement on the network map.

Note. For ease of visualization, in the figure above, we have manually rearranged
the position of the @KNIME and @DMR_Rosaria nodes to distinguish edges more
accurately. This is why these nodes do not have a centric position.


The other key element is the edges, which connect the nodes. They are visualized as arrows because we are visualizing a directed graph. The direction of an arrow shows which user has retweeted somebody else’s tweets, while its size depends on the number of retweets.

Low-code meets no-code for greater visualizations


In this section, we have shown an alternative to the more traditional network visualization techniques: the chord diagram.
We did this by leveraging the flexibility of KNIME Analytics Platform, which allows you to enrich analyses and visualizations by integrating different scripting languages whenever needed, for example JavaScript in the Generic JavaScript View node.
The JavaScript code required is an adaptation of an existing D3 template, and the crucial parts are displayed above.

Final Book Overview and Next Steps

Authors: Francisco Villarroel Ordenes, LUISS Guido Carli University & Roberto Cadili, KNIME

Meet Your Customers: The Marketing Analytics Collection with KNIME is a book aiming to amplify the use of marketing analytics methods and processes among academics, practitioners, and students. Using a low-code, visual programming interface, we have designed and implemented a wide array of workflows (projects) that will allow you to tackle different marketing problems. All workflows are free and open source, and can be downloaded from the KNIME Community Hub in the “Machine Learning for Marketing” repository.

Let’s recap
The book was divided into seven chapters, which should simplify its adoption in a Marketing Analytics course or as a practice guide for marketers. Below we summarize the content of each chapter.
In chapter 2, called “Segmentation and Personalization,” we tackled two of the basic concerns of marketers: How to identify groups of consumers based on their characteristics, preferences, and behaviors? And how to use that information to personalize marketing offerings (e.g., communication, products, services)? The segmentation workflow uses a public dataset from a telecommunications company and implements k-means clustering to identify groups of consumers based on their behaviors when contacting customer service (e.g., call center). The Market Basket Analysis workflow takes consumer buying patterns (e.g., products bought at a supermarket by a consumer sample) and uses that information to develop a set of association rules, which result in recommendations to new consumers sharing similar buying patterns. Finally, the personalization workflow demonstrates how to build a recommendation engine on large-scale datasets. In this case, the workflow is applied to a Netflix database and uses Spark collaborative filtering to give movie recommendations based on content previously watched by users.
Extracting marketing-relevant information from consumer mindsets was the focus of chapter 3. Mindsets, or feedback metrics, are perceptions, attitudes, and emotions about a product, service, or brand, which have been shown to predict behavioral outcomes (e.g., conversion rates, sales). The SEO Semantic Keyword Search workflow demonstrates how to scrape the Google SERP (Search Engine Results Page) and Twitter


tweets to obtain keyword suggestions for a website or landing page. The Customer Experience (CX) workflow applies a topic model algorithm (LDA) to online reviews (e.g., TripAdvisor) to understand which service attributes have the strongest effect on customer satisfaction (e.g., star rating). The Brand Reputation Tracker workflow shows how to scrape Twitter data to measure brand reputation using a state-of-the-art marketing method. In the example, users can measure the brand driver called “Brand”, which covers the attributes of coolness, excitement, innovativeness, and social responsibility. The final section involves a series of workflows to measure consumer sentiment (i.e., sentiment analysis) from text data (e.g., social media conversations) using valenced lexicons (i.e., words with a positive or negative connotation), machine learning (e.g., Decision Tree and XGBoost Tree Ensemble), deep learning (e.g., LSTM deep neural networks), and transformer models (e.g., BERT).
In chapter 4, we moved on to analytics for describing, understanding, and predicting consumer behavior. The first workflow, called Querying Google Analytics, focuses on querying data from the most widely used marketing analytics tool: Google Analytics. This workflow allows users in possession of a Google Analytics account to query the service and obtain consumer behavioral data relative to a specific website (e.g., page views, bounce rate, etc.). In the Predicting Customer Churn workflow, we demonstrate how to use consumer transactional data (e.g., product usage, calls to customer service, etc.) to develop a machine learning predictor (i.e., Random Forest) of customer churn. This is particularly useful for subscription-based business models, where it is crucial to monitor retention and churn. Finally, the third workflow touches upon Attribution Models, one of the most relevant marketing analytics problems: identifying the marketing channel with the largest impact on conversion rates. This workflow demonstrates how to use methods such as last touchpoint attribution, Shapley value, regression-based methods, and field experiments.
Workflows that deal with marketing mix activities such as pricing, promotion, place, and product are included in chapter 5. Every marketer at some point has to make decisions concerning these four activities. At the moment, we have included one workflow concerning pricing analytics. The workflow shows how to apply “value-based pricing” using a set of rules based on industry information, and “pricing optimization” using regression models, which involves a greater degree of automation. We expect future versions of the book to explore other marketing mix activities.
Customer valuation is addressed in chapter 6, and it involves the implementation of
two workflows. The first workflow focuses on measuring customer lifetime value
(CLV). This is an important measurement that enables marketers to understand the
profitability of consumers by considering the revenue that they are projected to
generate and their cost of acquisition. The second workflow uses the RFM (Recency,
Frequency, and Monetary value) framework to extract past transactional data (e.g.,
average order value) and identify segments of consumers based on their transactional


patterns. This information, together with estimations of CLV, can help organizations take new retention actions (e.g., bundles or targeted advertising) based on past customer behavior.
Chapter 7 focuses on Data Protection and Privacy. It is designed to include marketing analytics tools and processes that help organizations anonymize their data and comply with stricter privacy regulations. Currently, we have included a training and a deployment workflow that tackle the data anonymization problem, where the goal is to transform the data in such a way that no customer can be identified, while at the same time keeping the maximum amount of non-sensitive information for the analysis. For example, the workflows use specific nodes that allow replacing any name or birthdate with unrelated information that carries a minimal risk of re-identification.
The final chapter of the book, chapter 8, is called “Other Analytics” and it includes two workflows that we could not assign to the previous chapters. The first workflow focuses on the use of image analytics with the help of Google Cloud Vision. Marketing content is mostly visual, and we expect content marketers and advertisers to be interested in measuring and identifying which visual features in images result in greater engagement. The image mining workflow allows users to extract features such as color presence, color concentration, and object identification in batches of images. The second workflow concerns network mining. It deals with the problem of measuring and understanding relationships within a network of users (e.g., on social media), and it can help marketers understand how users interact and what the potential implications are for service development.

What’s next?
The chapters and workflows presented in this book are a living repository of projects. All workflows are open for improvements, so we are happy to receive suggestions and recommendations on how to adapt them to the ever-evolving marketing analytics landscape. Thinking about the future, we already have some projects in the pipeline. We are aiming to deepen the use of attribution models with more advanced algorithms (e.g., Markov Chain Monte Carlo) and to augment the understanding of unstructured data such as video (e.g., TikTok) and audio (e.g., podcasts). As stated earlier, we are also interested in including additional workflows tackling traditional marketing mix activities, such as promotion, product, and place. For example, we are interested in using the geospatial functionalities of KNIME to understand how to inform marketing decisions using aspects related to the geographic coordinates of stores and consumers. Finally, we expect to develop future workflows around customer service and the use of interactive agents such as chatbots. This is a prominent area that could be of great utility for customer service departments handling thousands of queries on a daily basis.


We are looking forward to hearing your feedback on the “Machine Learning and
Marketing” repository on the KNIME Community Hub, and we will keep working on the
development of workflows for the community.
