Report - Mohamed Amine Jebari
Building an Insights Stack
20 September 2017
Supervisors’ validation
Mr. Mohamed Heny Selmi
At the very outset of this report, I would like to extend my sincere and heartfelt gratitude to every person who has helped me in this endeavor. Without their active guidance, help, cooperation and encouragement, I would not have made headway in the project.
Firstly, I would like to express my sincere gratitude to my company and university supervisors Mr. Tim Schmitz, Mr. Ashish Karla and Mr. Mohamed Heny Selmi for their continuous support of my final project, their patience, motivation, and immense knowledge of both business and technical fields. Their guidance helped me to improve myself and finish writing this thesis while taking my first real steps into the world of employment. I would never have been able to do all this without their help.
Second, I am also ineffably indebted to my colleagues Mr. Jasper Boskma, Mr. Sebastian Van Lengerich, Mr. Alexander Linewitch, Mr. Alessio Avellan Borgmeyer, Mr. Niklas Henkell and Mr. Ryan Cooley for their conscientious guidance and encouragement in accomplishing this assignment.
I am extremely thankful and pay my gratitude to my family members for their valuable guidance and support in completing this thesis in its present form. They were always keeping an eye on my evolution and finding ways to coach me and get the best out of me.
My sincere thanks also go to the whole tech team, who gave me all the necessary tools and facilities to accomplish my mission and made the road easier for me.
I extend my gratitude to ESPRIT School of Engineering for giving me this opportunity.
Nevertheless, I would like to thank the big family that was around me, including Tunisians, Chinese, French, Americans, Germans, Czechs, Italians, Venezuelans and Turks, who directly or indirectly helped me and encouraged me to complete my thesis, helped me manage my stress and knew how to give me the moral support that I needed.
Any omission in this brief acknowledgement does not mean lack of gratitude.
Abstract
The project is entitled “Building an Insight Stack” and was implemented within the company
The Jodel Ventures GmbH, a product-oriented startup specialized in Social Media.
This work forms part of the graduation project presented to obtain the Engineer's degree in Business Intelligence-ERP at ESPRIT, Private Higher School of Engineering and Technology, Tunis.
Its aim is to understand user behavior and the impact of app changes on the user experience, and to deliver insights within a short amount of time. It was realized with the BI team and covers three main areas of Business Intelligence: data warehousing, process automation and machine learning.
Keywords: insight, product-oriented, user behavior, user experience, BI, data warehouse, process automation, machine learning.
Résumé
This internship was launched as part of the preparation for integration into the professional world. It is a graduation project of the Private Higher School of Engineering and Technology, Tunis (ESPRIT).
The project took place within the company The Jodel Ventures GmbH, which offers a mobile social media solution.
Its main goal is to understand user behavior and the impact of changes made to the application on the user experience, and to deliver analyses quickly. The project was realized with the BI team and covers three main areas of Business Intelligence: data warehousing, process automation and machine learning.
This report is split into seven parts. Chapter 0 gives a general presentation of the company and its field of activity. The second part is devoted to the general context, introducing the company and the business in which I was involved. The third part concentrates on data understanding. The fourth part is about data preparation. The fifth part presents the main theoretical study and the modeling. The sixth part reflects the evaluation of our results. Finally, the last part covers the deployment of our solution in its different aspects.
Chapter 0: General context
In this first chapter, we present The Jodel Ventures GmbH, the company in which this project was carried out, its business market, the services it offers, and the team in which I fulfilled my end-of-studies internship. We also present our solution and how we ended up doing this project (the challenges).
1. Project Context:
1.1. Hosting organism and team:
1.1.1. Hosting company:
The Jodel Ventures GmbH is a startup based in Berlin whose main field of activity is social media. Created in 2015 by Alessio Avellan Borgmeyer, its main mission is to let users instantly engage with the community around them. Its vision is to develop a platform to discover, follow, and participate in the most relevant conversations with people within roughly 10 km, anonymously.
The Jodel Ventures GmbH provides an iOS and Android location-based app that only requires your geolocation to let you interact with people around you by reading, posting and even sharing on other platforms like WhatsApp or Facebook.
The communities are active in more than 10 countries, including Saudi Arabia, and the number of daily active users is above 1,000,000.
1.1.2. Hosting team:
During the realization of this project, I was part of the Business Intelligence team, freshly recruited and built in the first days of 2017. I was also working with all the other departments due to their need for insights and analyses. Here is the general structure of The Jodel Ventures GmbH:
[Organizational chart: the company structure, headed by the Board]
Most of my time was spent in my department (BI) working on different projects that were related either to our own analytics or to other teams' needs. Here is the general diagram of the BI team:
Figure 3: Objectives and Key Results Flow
Every OKR quarter includes a weekly grading of the objectives and key results, usually done first by the individuals, then by the departments, and finally at the company level.
In Jodel's case, at the moment we set department and company OKRs only. We are also willing to have them for individuals to ensure personal development.
As long as 60-70% of the objectives and key results are achieved, the quarter is considered successful. On the other hand, if the final grading is too low or too high, one has to reconsider how every task is being estimated.
1.3. Project Presentation:
Over the past years, Jodel (as a company) has taken some giant steps in the social media world. Although the road is still long and full of obstacles, we can easily say that the growth was quite impressive. But after reaching a certain level of user experience, it becomes difficult to understand what is happening with your users unless you analyze the data.
1.3.1. Process Automation:
Not that they were not analyzing before, but Jodel employees were obliged to wait a considerable amount of time before getting the needed results. Time, however, is not a luxury one can afford: it only takes a few hours or days for a whole new product to rise from nowhere, compete with you and try to take the best out of your market.
The schema above explains the experimentation process. The best way for a company to improve is to run different experiments and fix the KPIs it wants to increase before completely rolling out a new feature. As a matter of fact, Facebook is able to run up to 3000 experiments at the same time to choose the best new features from the ones it is willing to launch. In Jodel's case, analyzing the data was taking a lot of time due to the lack of resources (material and human). To remedy this issue, automation seemed to be the best solution.
The goal of this part of the project is to have all the data needed for analysis stored in a single data warehouse while keeping the integrity of everything that can be extracted. By doing this, we would be able to connect any reporting tool and create monitoring dashboards for all the teams:
1.3.3. Machine Learning:
Last but not least was the concern about the knowledge of our users. The more you know your product (the users), the more you are able to improve it. Starting from that point, we agreed on doing some analysis and chose to carry on with one important question: what is the optimal user?
Answering this main question will surely give us more insights about the consumers of the social app and help us make better decisions than before. This will be the machine learning part of our work.
1.3.4. Summary:
The project fulfills three important objectives that will give new value to the company:
Achieving these points will help Jodel compete with its direct competitors and find the right answers to the right questions.
2. Project Methodology:
2.1. Methodologies:
2.1.1. SEMMA:
SEMMA is a methodology created and developed by the SAS Institute that makes exploration, visualization, selection, transformation and modeling easier and more understandable for users of SAS data mining tools. Here is a graphic representation of this methodology:
Figure 8: SEMMA process flow
• Sample:
Taking a small portion of a large volume of data in order to make data handling more agile. It reduces time and resource costs. By the end of this task, we would have a training set, a validation set, and a test set.
• Explore:
This phase is about finding unexpected trends and anomalies in our data. The exploration can be done either through numbers or through visualizations. The most common techniques are clustering, factor analysis or correspondence analysis.
• Modify:
Modify is about manipulating the data needed for the modeling, changing its format if needed, and removing what is not needed from our data set.
• Model:
After getting our clean, ready data, it is time to apply modeling techniques such as neural networks, decision trees or logistic models.
• Assess:
The last part of this method is to evaluate the modeling, i.e. to check whether it is good. In order to know this, we apply the model to another test set and check the results.
2.2. Choice:
2.2.1. CRISP-DM:
Cross Industry Standard Process for Data Mining is a data mining process model (along with KDD and SEMMA) developed in the late 1990s by a consortium of companies (including SPSS, NCR and DaimlerChrysler) and still considered today the most generic and widely used approach for managing data science projects (source: KDnuggets.com).
The main reasons for using this methodology are that it is truly independent of any tool or technique (unlike SEMMA, which is tied to SAS) and that it supports project documentation, knowledge transfer and training.
It splits any data mining project into six important phases while also allowing iterations (going back and forth between the different steps):
• Business Understanding:
This first step is essentially about understanding the business, the need for the specific project and the
resources that we have. It also includes the risks, the costs, the benefits and finally developing a project
plan according to these variables.
• Data Understanding:
This step is about selecting the data requirements and doing an initial data collection in order to explore it and get an overview of its quality.
• Data Preparation:
The data preparation task is the final selection of the data, acquiring it and doing all the necessary cleaning, formatting and integration. It should also be extended to some transformation and enrichment (for wider possibilities of analysis). It can sometimes be the longest part of a data mining project.
• Modelling:
When reaching the modelling phase, a modelling technique should be selected. The better the previous steps were done, the better this choice will be: it depends on how well you understood the need and the data. The data scientist/analyst also needs to divide the data set into training and testing subsets for evaluation. Finally, one should examine alternative modeling algorithms and parameter settings (gradient descent, etc.).
• Evaluation:
The evaluation is about asking oneself: "Does this result answer my business question?" or "Is this the wanted output?". It also includes approving the model according to some specifications.
• Deployment:
Whether it is an API or a result saved in an Excel sheet or Word document, deployment means creating a report containing the findings of your analysis. In the case of an API, it means planning the deployment on an operational system, such as a server on which it will be executed according to your needs. This step also contains a final review of the project and a plan for the next steps.
In this chapter, along with all the upcoming ones, we will try to always have three subchapters about the three main projects we participated in. Chapter 1 is about understanding the goals of these projects, the success criteria and the resources. In short, it lays the pillars of the next steps of this project.
1. Business Goals:
1.1. Project Background:
Like every new thing one is going to build, it is important to get an overview of the current state. In other words, every engineer has to go through what already exists to know what to do and what to plan. This is what the project background is about.
1.1.1. Datawarehouse project:
Jodel is a company that is evolving quite fast. Therefore, it sometimes happens that problems have to be solved very quickly. As a result, we ended up having data in different sources:
• Postgres Datawarehouse: the first data warehouse created at Jodel by the previous CTO. With its ETL process written in Node.js, it is quite fast and contains the basic interactions of the app (register, post, reply, voteup, votedown). The timeframe of this data warehouse is from 2015 until summer 2017. The data is still not cleaned and a lot of tables are empty because the work was never finished.
• RLT Cache: the third source of data, it is the cache used by the reporting tool of our investors. Containing some precalculated queries used to plot charts, it is the most up-to-date and most directly connected source of data. Whenever a new server-side or even client-side event is added, it is pushed through a specific pipeline to the RLT cache.
1.1.2. Experiment Automation project:
As a company willing to reach a high number of retained users in a specific, restricted timeframe, it is important to run multiple experiments in order to increase the retention and the engagement of the users. This is how an experimentation process works:
[Figure: experimentation process flow, including acceptance by the Head of Product Department and development of the feature]
For a company that wants to evolve fast, these steps can take a considerable amount of time, especially if multiple experiments are run at the same time.
1.1.3. Machine Learning project :
When launching a new feature, or choosing which one to launch, two things that matter for every company are involved: money and employees' time. Therefore, choosing what to change and what to add to the app should be a very wise and calculated step. A wrong change, or one that misses the real needs of the user base, can lead to a complete crash and the end of the app. One of the main examples was Yik Yak¹: it started as an anonymous app, but after gathering a critical mass of users it asked them to make their profiles non-anonymous. After some weeks of struggling, this led to the death of the app and the shutdown of its servers.
Jodel, as a company that is not yet monetizing and has a specific amount of money to spend in a specific amount of time, has to face the same challenges. Willing to always push the app forward and make changes that could lead to higher engagement and retention of the users, knowing what is really critical and important to add is more than urgent.
For now, apart from general metrics like Daily Active Users and retention, there is no real understanding of the actual activity and of the milestones that influence the users.
1.2. Business Goals List:
Jodel needs to fulfill its users' needs for socializing and interacting, and increase their stickiness to the app. To do so, the following should be done:
• Make data more accessible and understandable for everyone inside the company to know where
and how to track the users’ behavior.
• Provide fast analytics for the different departments to get results about the experiments and make decisions.
• Increase the knowledge about the users to know the different trends among them and what to
change exactly to increase their engagement and therefore retention.
1.3. Business Success Criteria:
The business success criteria are what make the company, or the people in charge of a task, say whether they reached their goal or not. In our case, according to the company methodology, the different key results of the objectives are what make a task a success or a failure:
¹ https://www.theverge.com/2017/4/28/15480052/yik-yak-shut-down-anonymous-messaging-app-square
1.3.1. Datawarehouse project:
Technical (Business Intelligence Team): Machine Learning, Analysis, Data Manipulation
• Model with more than 60% accuracy
• Top 20 features influencing the optimal user experience
2. Situation State:
2.1. Inventory of resources:
• Personnel:
o Backend team.
o Product Team.
o Head of Business Intelligence Department (Bachelor and Master in Mathematics).
• Data:
o MongoDB Production Data.
o live access to old datawarehouses and databases.
o Live access to new datawarehouse.
• Computing resources:
o Lenovo, Windows 8.1, i5 processor, 8 GB RAM
o MacBook Air, early 2014, macOS Sierra, i5 processor, 8 GB RAM
o m4.xlarge AWS instance with 16 CPUs
• Software:
o RStudio 3.2.4.
o DataGrip.
o Draw.io.
o Excel.
o AWS.
o Shiny.
As explained above, after the full implementation of the new data warehouse, we would be able to decommission the Postgres data warehouse and the 1-Table Database and save 1787 + 684 - 450 = 1721.307 €/month.
2.2.2. Experiment automation project:
The benefits of the process automation are perhaps not easy to calculate because it created a whole new behavior inside the company. But we can use some assumptions to know what Jodel was spending and how this changed.
Before the automation, Jodel could run a maximum of 2 experiments per week. And due to the lack of resources (1 product analyst), the analysis of 2 experiments was also taking 1 week. Therefore, the monthly cost of analyzing 8 experiments was 2500 euros per month, which is the wage of an analyst.
Coming to our web application, RStudio is free, open-source software that allows you to build this kind of process automation. Moreover, letting the product managers do their own analysis frees the other BI resources. Finally, when creating a Shiny app, R gives you the possibility to use their hosted server, but with a limited number of hours.
Figure 14: R Studio Shiny pricing
At Jodel, we chose to deploy it ourselves on an Amazon EC2 instance. This way, we are able to debug it and scale it according to the needs, with an unlimited number of hours and users.
For performance needs, we are most of the time using an m4.xlarge instance. This setup allows the company to analyze up to 16 experiments per month while saving 1739 €/month.
2.2.3. Machine Learning project:
The machine learning task scheduled in this project is purely investigative. Its output is advice about which new features to implement, so that money is not spent on developers only to roll back all the effort afterwards. Usually, a bad feature costs one week of salary of an Android developer (1K), an iOS developer (1K) and a backend developer (1K), which gives a total of 3K euros. The machine learning task, on the other side, was performed for one month on an RStudio Server installed on an m4.4xlarge Amazon EC2 instance, which costs a total of 213.12 € when used 8 hours per day for 30 days. Adding to this the wage of a data analyst for one month, we end up with a total of 2713.12 € paid only once, giving us a valuable amount of information to guide feature development.
4. Project Plan:
This part of the plan is about describing the different steps that would occur during this project and the
different iterations involved. Moreover, we will talk about the different tools and techniques used to
reach the data mining and business goals.
4.1. Planning schema:
as specified above, the plan that would be used in this presentation is CRISP-DM. Although not all our
tasks are a data mining task, they contain some of the iterations in this methodology, that thanks to its
agile specification, can make us reach our goals easily.
19
4.2. Initial assessment of tools and techniques:
4.2.1. Data warehouse project:
• Draw.io: a free online tool for designing and modeling that lets you export your schemas in different formats like XML, PDF, etc.
• Slack: a collaborative professional platform for discussions and meetings.
• Trello: a collaborative tool for task assignment.
• DataGrip: an IDE from JetBrains for SQL, also used to visualize the database schema.
4.2.2. Process Automation project:
• R and RStudio: R is an open-source, statistics-oriented programming language used for data mining and data analysis. RStudio is an IDE for R.
• Shiny: a web application framework for R that allows you to turn your analyses and designs into an interactive web app.
• Plotly package: the R version of plotly, an online data analytics and visualization tool that allows you to create beautiful visualizations.
• Bootstrap: a free, open-source front-end web framework that contains fonts, buttons and other interface components for designing user-friendly websites.
• Slack: a collaborative professional platform for discussions and meetings.
• Trello: a collaborative tool for task assignment.
• DataGrip: an IDE from JetBrains for SQL, also used to visualize the database schema.
4.2.3. Machine learning project:
• R and RStudio: R is an open-source, statistics-oriented programming language used for data mining and data analysis. RStudio is an IDE for R.
• Slack: a collaborative professional platform for discussions and meetings.
• Trello: a collaborative tool for task assignment.
• DataGrip: an IDE from JetBrains for SQL, also used to visualize the database schema.
Chapter 2: Data Understanding
This chapter goes through the data involved in our project, whether in the data warehouse, the process automation or the machine learning part, its quality and how it can answer our questions.
1. Data Description:
1.1. Data sources:
Here are the different data sources we used to achieve our goals:
• Old Redshift: this database, named internally "Old Redshift", is an analytical database used by the product team to do some simple analysis. Each row represents one event. It runs on Amazon Redshift using the ds2.xlarge instance type of AWS. The data is stored there from September 2016 until May and it is updated with 22 million rows per day.
• Old Postgres: a data warehouse named internally "Old Postgres", the first data warehouse, containing one fact table and multiple dimensions. It has a traditional DW design and runs on PostgreSQL on an m4.xlarge instance. The data is stored there from October 2014 until August. The ETL process was coded in Node.js.
In order to go through all of them, here are the tools used:
• DataGrip.
• RPostgreSQL and R: one of the best-known database interface packages in R. As the name tells, it is a Postgres driver, but it can handle Redshift too.
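To illustrate how this access works in practice, here is a minimal sketch of an RPostgreSQL connection of the kind we used against the Old Redshift. The endpoint, database name, credentials and the one-day filter are placeholders, not the production configuration; only the fact_interaction columns follow the specifications described below.

# Hedged sketch: connecting R to the "Old Redshift" cluster through the
# RPostgreSQL driver (Redshift speaks the Postgres protocol). All connection
# details here are illustrative placeholders.
library(RPostgreSQL)

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv,
                 host     = "old-redshift.example.eu-west-1.redshift.amazonaws.com",  # placeholder endpoint
                 port     = 5439,                       # default Redshift port
                 dbname   = "analytics",                # placeholder database name
                 user     = "bi_reader",
                 password = Sys.getenv("REDSHIFT_PWD"))

# Pull one day of events from the fact table for a quick sanity check
events <- dbGetQuery(con, "
  SELECT interaction_key, user_key, utc_date_key
  FROM fact_interaction
  WHERE utc_date_key >= '2017-05-01' AND utc_date_key < '2017-05-02'
  LIMIT 1000;")

dbDisconnect(con)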
1.2. Data Specifications:
1.2.1. Old Redshift:
• fact_interaction table:
- interaction_key: what interaction did the user do (see the description of events tracked in the DW)
- city_name: city name of a mapped city
- metadata: additional data for some interactions, for instance flag reasons
Interaction / Meaning:
- action.post.reply: Reply
- action.post.create: Post
- action.post.pin: Pin a jodel
§ Administration:
Interaction / Meaning:
§ Me Section:
Interaction / Meaning:
- action.getposts.mine_replied: Go to "My replies"
§ Channels:
Interaction / Meaning:
- action.getposts.channel_combo: Go to "channels"
§ Hashtags:
Interaction / Meaning:
- action.getposts.hashtag_channel_combo: Go to "channels"
- action.getposts.hashtag_combo: Same as for hashtag_channel
§ Hometown Feature:
Interaction Meaning
§ Experiment Assignation:
Interaction Meaning
- interaction_key: what interaction did the user do (see the description of events tracked in the DW)
- lifecycle_key: number of minutes from the registration of the given user to the moment of the interaction
- content_key: unused
- created: unused
- last_edited: unused
• dim_interaction table:
- id: ID of the interaction
• dim_city table:
- name: city name
- processed: unused
• dim_user table:
• dim_content table:
- message: the content itself, pictures stored as a link
- created: empty
• dim_time table:
Figure 19: Old Datawarehouse relations
2. Data Exploration:
2.1. Datawarehouse project:
As we know, one of the main differences between a database and a data warehouse is the purpose: one is purely operational, the other analytical. We also know that every fact table should be related to a specific business case. The exploration part here is about understanding how all these columns relate to the business and how their values are structured.
• Experiment Assignation: while going through the database (Old Redshift), we see that there is a column named experiment that contains comma-separated values.
Figure 20: Experiment column data in Old Redshift
Most of these words may seem like nonsense to the reader, but they are simply the names of specific experiments running for specific users.
• Geohashes: Jodel is a location-based app and therefore needs a system to quantify location so that users see posts according to where they are. For that, the company uses a geocoding system that divides space into grid-shaped buckets. The longer the geohash string, the more precise it is. For business purposes (the radius of every user), Jodel uses 5-character geohashes.
Examples:
u1vu1:
-latitude: 55.59013367
-longitude: 8.17314148
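Since the raw events only store the geohash string, decoding it back to coordinates is a handy sanity check. Below is a minimal sketch in base R of the standard geohash decoding algorithm (no external package assumed); it returns the centre of the cell, and the example coordinates given above fall inside the 5-character cell u1vu1.

# Minimal sketch of geohash decoding in base R: each character encodes 5 bits
# that alternately refine the longitude and latitude intervals.
gh_decode <- function(gh) {
  base32 <- strsplit("0123456789bcdefghjkmnpqrstuvwxyz", "")[[1]]
  lat <- c(-90, 90); lon <- c(-180, 180)
  is_lon <- TRUE
  for (ch in strsplit(tolower(gh), "")[[1]]) {
    bits <- match(ch, base32) - 1
    for (mask in c(16, 8, 4, 2, 1)) {
      bit <- bitwAnd(bits, mask) > 0
      if (is_lon) {
        mid <- mean(lon); if (bit) lon[1] <- mid else lon[2] <- mid
      } else {
        mid <- mean(lat); if (bit) lat[1] <- mid else lat[2] <- mid
      }
      is_lon <- !is_lon
    }
  }
  c(latitude = mean(lat), longitude = mean(lon))   # centre of the cell
}

gh_decode("u1vu1")   # the example point above (55.5901, 8.1731) lies inside this ~5 km cell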
The image above is what we call a funnel. It shows that, depending on where the user is inside the app, he can leave; you are then able to know why and where he dropped off.
2.2. Process Automation project:
• Core actions:
The process automation includes calculating some specific KPIs for the product team. For this, one needs to understand the actions most frequently performed by the users.
2.3. Machine Learning project:
• Retention:
One of the most important metrics for Jodel is the retention. In fact, it really shows who are the users
that really come back to the app after using it. But for the timeframe of comeback, it depends on the
business point of view. Jodel always choose to calculate it monthly.
- Retention (Weekly):
• All over the world: 27%
• DE: 31.69%
• FI: 38.95%
• DK: 35.21%
• NO: 23.78%
• AT: 26.58%
• SE: 29.04%
• IT: 25.58%
• FR: 32.27%
• Happy Ratio:
Another important KPI for Jodel is the happy ratio, calculated as Upvotes / Downvotes. It is used as a proxy to know in what kind of environment the user is growing: upvoting is used when someone likes a post, and downvoting when someone dislikes it. Two analyses were therefore run, for the D1 comeback and the D14 comeback.
Here are some results using a simple correlation between the happy ratio and the number of comebacks:
As these two plots show, there could be a clear trend expressing how the happy ratio and user retention are linked, and, by extension, the upvotes and the downvotes.
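To make this comparison concrete, here is a hedged sketch of how such a correlation can be computed, assuming two illustrative data frames: events (interactions with user_key and interaction_key columns, as in the Old Redshift) and comebacks (user_key and n_comebacks); the vote event names follow the ones used in the old data warehouse.

# Hedged sketch: per-user happy ratio vs number of comebacks.
library(dplyr)

happy <- events %>%
  group_by(user_key) %>%
  summarise(upvotes   = sum(interaction_key == "action.post.voteup"),
            downvotes = sum(interaction_key == "action.post.votedown")) %>%
  filter(downvotes > 0) %>%                      # avoid division by zero
  mutate(happy_ratio = upvotes / downvotes)

joined <- inner_join(happy, comebacks, by = "user_key")
cor(joined$happy_ratio, joined$n_comebacks)      # simple Pearson correlation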
3. Data Quality:
3.1. Data warehouse project:
3.1.1. Client data:
After checking all our different data sources, it clearly appears that the biggest gap is client-side data. Meetings with the other departments showed a high need for it: most of the use cases they mentioned were related to the number of clicks, and where and when these clicks happened.
3.1.2. Experiment Assignation:
Although the data is available in the Old Redshift, the way the experiment names are written makes querying very difficult. Here are some examples of what can be found in the experiment field:
Experiment
hashtag_prompt_android_hashtag_prompt_android;inapp_notis_global;mentioning_repliers_menti
oning_repliers_global;picture_feed_global_picture_feed_global;screen_shot_sharing_screen_shot
_sharing_global;user_profiling_user_profiling_global
flag_reason_change_flag_reason_change_global;inapp_notis_global;mentioning_repliers_mention
ing_repliers_global;picture_feed_global_picture_feed_global
cell_new_design_cell_new_design_GLOBAL;flag_reason_change_flag_reason_change_global;m
ark_repliers_ios_mark_repliers_global;mentioning_repliers_mentioning_repliers_global;picture_f
eed_global_picture_feed_global;pin_main_feed_pin_main_feed_global;reply_in_feed
channels_berlin_old;flag_reason_change_germany
thankajodler_thankajodler_no
As odd as it seems, the experiment field contains 1, 2, 3 or even 7 experiments assigned simultaneously to the users, and querying them can take ages: one has to write queries with LIKE '%name_of_the_experiment%'. As a result, the product employees need to wait 20 to 30 minutes to get 1000 IDs to analyze.
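For illustration, here is a hedged sketch of the kind of query the product team had to run; con is the connection object from the earlier connection sketch, and the experiment name is taken from the examples above. Matching a name inside the concatenated experiment column forces LIKE pattern matching, which is what makes it so slow.

# Hedged sketch: fetching 1000 user IDs assigned to one experiment by
# pattern-matching inside the concatenated experiment column.
ids <- dbGetQuery(con, "
  SELECT DISTINCT user_key
  FROM fact_interaction
  WHERE experiment LIKE '%picture_feed_global%'
  LIMIT 1000;")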
3.1.3. User location inside the app:
Although the data is "available" in the database of the old warehouse, the column in which it is written changes its content from one row to another. To better understand this point, here is a concrete example:
3.2. Experiment Automation project:
3.2.1. core actions:
- Getpostdetails:
The experiment automation is about creating a web app that will automatically calculate specific KPIs for the business teams. Any discrepancy would eventually lead to false values on which they could base their decisions. One of the flaws we found was the naming of the event for reading a post: for a specific period of time, an experiment was run and the event was renamed from action.getposts.details to action.getposts.details_new. As a result, for that period, the same data was split across two different interactions.
- Received replies:
While computing the metrics requested by the business-oriented employees, the web app needs to do some merging and linking between data frames or columns. Therefore, columns that play the same role should share the same format for easier filtering and use. In practice, when searching for who received a reply, or for the ID of the post on which the reply was written, the only result is found in the metadata column in the format parentId:58b69e78ff23cd0f3d40f1af, while a normal content ID looks like 55bo0e78ffq33dyf3d40f12f. That forces any data scientist either to wait ages to get the related content or to do extra manipulations that cost more resources and time.
3.3. Machine learning project:
3.3.1. Location from a user perspective:
Our machine learning part is about finding out what the optimal user is, according to the actions they receive, the actions they do and what is happening in their feed. A user living in Köln sees what is happening in his geohash and in the 26 geohashes around it. Therefore, the most accurate way to see what is going on in a user's feed, what he is seeing, sending and receiving, is to use geohashes and not the city or even the country name.
As a simple example, consider yourself living at the border of Germany. Having your geohash on this border means you see posts from Germany but also from the Netherlands. That said, when a data scientist queries using country_key='DE', he will only see what the user sees from Germany and will miss half of what is happening in the Netherlands (since the structure is one geohash where the user is plus 26 other geohashes around him).
Chapter 3: Data Preparation
After understanding the business Jodel is working on, and getting to know the different data sources and their possibilities, comes the time for data preparation. No modeling or analysis is possible with messy data: before applying any algorithm, loop or statistical method, one should have a clean base to work on. This is the data preparation chapter. For coherence, this chapter is again divided into our three main projects, as the subchapters change every time.
1. Datawarehouse project:
After finding out all these discrepancies between the different data sources, one of the main steps is to
create a taxonomy for the events you are willing to send to the new datawarehouse.
As a reminder, a taxonomy is a categorization of different entities.
1.1. Taxonomy:
1.1.1. Sources:
When sending the report that the data engineers or backend developers will use to create the new data warehouse, it is really important to explain where every event comes from. That said, we tried to identify the different access points that we have:
- RLT: the database of the investor's reporting tool, called RLT stats. It contains almost everything related to our app, as most of the backend engineers are the investor's engineers.
- DW: the Old Redshift specified previously. It is the most recent and most frequently updated database that we can access as a BI team. It contains the events of all the new experiments.
1.1.2. Categories:
As software is nowadays developed in a modular way, the same goes for the data. Thanks to this, one can easily find where every event comes from and how to categorize it. Here are the different categories identified by the BI department:
Category / Meaning:
- First time user: the very first actions that a user does/sees when using the app
- App start: the actions that happen when the user connects to the Jodel community
- Mainfeed: loading and filtering the main feed of the app
- Three dot: using the three-dot menu of the app (deleting, flagging)
- InAppNotification: the events fired when the user clicks on a notification inside the app
- PushNotification: the events fired when the user clicks on a push notification
1.1.3. Events:
In order to have a unified view of everything happening in our mobile app and a good understanding of the different phenomena that can occur, the BI department chose to own the naming of any event that will be fired into the new data warehouse. Therefore, after multiple iterations with the different departments, we collected the most important interactions that would be part of our new data warehouse:
Category Event Client/Server
WelcomeScreenShow Client
WelcomeScreenTapConnect Client
App start
OpenSession Client
LocationPermissionRequestAccept Client
LocationPermissionRequestDeny Client
ProfilingShow Client
ProfilingConfirm Client
App close
CloseSession Client
Channels
ChannelsSearch Client
ChannelsJoin Server
ChannelsTapJoin Client
ChannelsUnjoin Server
ChannelsSelect Client
ChannelsEnter Client
ChannelsLeave Client
Hashtags
HashtagsTapMostCommented Client
HashtagsTapLoudest Client
HashtagsTapNewest Client
Core user actions
Upvote Server
Downvote Server
LoadConversation Server
EnterConversation Client
ViewImage Client
TapPin Client
Pin Server
TapUnpin Client
Unpin Both
PostTapPlus Client
PostTapCancel Client
PostTapCamera Client
PostTapSend Client
Post Server
Reply Server
LeaveConversation Client
TapSharePost Client
SharePost Server
GiveThanks Server
Mainfeed
MainTapNewest Client
MainTapMostCommented Client
MainTapLoudest Client
MainSelect Client
Three dot
Flag Server
DeletePost Server
Moderation
ModerationRegister Server
ModerationRemove Server
ModerationAllow Server
ModerationBlock Server
ModerationSkip Server
ModerationTapRules Client
ModerationUpdateStatus Server
Picture feed
PictureFeedEnter Client
PictureFeedLeave Client
Hometown
HometownSwitch Client
HometownStartSetup Client
HometownConfirmSetup Server
Me section
MeTapPins Client
MeTapReplies Client
MeTapSettings Client
MeTapVotes Client
Administration
BlockUser Server
UnblockUser Server
BlockPost Server
UnblockPost Server
Other
TakeScreenshot Client
LocationFilterTapButton Client
InAppNotification
InAppNotificationView Client
InAppNotificationTap Client
InAppNotificationDismiss Client
PushNotification
PushNotificationTap Client
PushNotificationDismiss Client
PushNotificationView Client
Figure 32: Dimension content (new datawarehouse)
Figure 34: Dimension date (new datawarehouse)
- dim_inApp_Location: used to localize the position of the user inside the app
o id_inapplocation: business key for dim_inapplocation
o entry_point: different entry points inside the app (e.g. Me, Main, Hashtag, Channel, SearchChannels, etc.)
o sorting: type of sorting applied when the action is executed (e.g. Newest, MostCommented, etc.)
o filter: type of filtering applied when the action is executed (e.g. timeNew, timeDay, locationHere, locationClose, etc.)
o conversation: true if it is happening in a conversation view
Figure 36: Dimension in app location (new datawarehouse)
Figure 39: Dimension location (new datawarehouse)
o id_flag : business key of the dimension dim_flag
o flag_source : contains the source of the block (ex. internalModeration, autoSpam,etc)
o flag_reason : contains the reason of the block (ex. Disclosure of personal information,
repost,etc)
o flag_subreason : contains the subreason of the block (ex. subreason 101, subreason
225,etc)
- dim_value : dimension that will be populated with the different values that will occur in a
specific experiment.
o id_value : business key of the dimension value
o value : a string containing a value according to a specific experiment
- dim_experiment : name of the different experiments and their starting and ending dates.
o id_experiment : Business key of the experiment
o name : name of the experiment
o start_date : starting date of the experiment
o end_date : ending date of the experiment
1.2.2. Facts:
- fact_product : first fact of our datawarehouse. This first fact table will be linked to different
dimensions that will give us answers about questions mainly related to the product department.
The design was made in a way so that it could be linked to a reporting tool for further analysis.
o fk_interaction : foreign key linking the fact product to the dimension dim_interaction
o fk_user : foreign key linking the fact product to the dimension dim_user
o fk_date : foreign key linking the fact product to the dimension dim_date
o fk_location : foreign key linking the fact product to the dimension dim_location
o fk_content : foreign key linking the fact product to the dimension dim_content
o fk_inapplocation : foreign key linking the fact product to the dimension
dim_inapp_location
o fk_property : foreign key linking the fact product to the dimension dim_property
o karma: the total number of points the user gets from receiving actions such as votes, replies, blocks, etc.
§ Example: you get +2 karma if you receive an upvote, and -2 karma if you receive a downvote.
o blocked_count: number of times the user has been blocked.
*moderators: users who reached a high amount of karma (the threshold changes from one country to another) and never got banned before. They are able to allow or deny posts.
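As an illustration of how this star design answers product questions, here is a hedged sketch of a query against fact_product, run through the R connection used elsewhere in this report. The foreign keys follow the design above, but the dimension column names (id_interaction, name, date, country) are assumptions about the final DDL, not a verified schema.

# Hedged sketch: daily number of posts per country over the last week,
# by joining fact_product to its dimensions.
daily_posts <- dbGetQuery(con, "
  SELECT d.date    AS day,
         l.country AS country,
         COUNT(*)  AS posts
  FROM fact_product f
  JOIN dim_interaction i ON f.fk_interaction = i.id_interaction
  JOIN dim_date        d ON f.fk_date        = d.id_date
  JOIN dim_location    l ON f.fk_location    = l.id_location
  WHERE i.name = 'Post'
    AND d.date >= CURRENT_DATE - 7
  GROUP BY 1, 2
  ORDER BY 1, 2;")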
- fact_experimentation: fact table answering questions about how experiments are going, without altering the old, clean data. It will be used by the BI team, whose task will be to run A/B tests to orient the business employees toward the best solution.
o fk_exp_interaction : foreign key linking the fact product to the dimension
dim_exp_interaction
o fk_value : foreign key linking the fact product to the dimension dim_value
o fk_user : foreign key linking the fact product to the dimension dim_user
o fk_date : foreign key linking the fact product to the dimension dim_date
o fk_location : foreign key linking the fact product to the dimension dim_location
o fk_content : foreign key linking the fact product to the dimension dim_content
o fk_inapplocation : foreign key linking the fact product to the dimension
dim_inapp_location
o fk_experiment: foreign key linking the fact product to the dimension dim_experiment
o fk_property: foreign key linking the fact product to the dimension dim_property
o count_usage : number of times the new feature was used by the assigned users to the
experiment
1.2.3. Relations:
- Fact_product:
- Fact_moderation:
- Fact_experimentation:
2. Process Automation project:
The process automation part is about building a web application that calculates all the needed KPIs and makes the experiment process automatic. To reach this goal, our app is divided into three main files, following the usual Shiny app architecture:
- ui.r: the file that controls the layout of the app and its appearance. You can also load a CSS file and insert some JS code in it.
- server.r: the file that contains all the instructions needed to build the app. It is also used to program specific actions for specific widgets, buttons, frames, tables, etc.
- helpers.r: the file that contains generic functions called multiple times in the server file.
All the functions and needs that the business employees asked for are coded in these three files. Throughout all the steps explained below, we always tried to keep the code as simple, fast and efficient as possible, so that the data is available at the speed the app's users need.
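To fix ideas before going through the individual modules, here is a minimal sketch of that three-file layout collapsed into one script. The widget ids, labels and the helper function are illustrative, not the production code; they only show how the ui, server and helpers pieces fit together.

# helpers.r ---------------------------------------------------------------
# generic helper reused by the server: daily count of distinct active users
count_daily_active <- function(events) {
  events$day <- as.Date(events$utc_date_key)
  aggregate(user_key ~ day, data = events, FUN = function(u) length(unique(u)))
}

# ui.r --------------------------------------------------------------------
library(shiny)
ui <- fluidPage(
  titlePanel("Experiment dashboard"),
  fileInput("ids", "Upload experiment IDs (csv)"),
  dateRangeInput("period", "Experiment period"),
  actionButton("run", "Calculate KPIs"),
  tableOutput("kpis")
)

# server.r ----------------------------------------------------------------
server <- function(input, output, session) {
  observeEvent(input$run, {
    # data fetching and KPI computation (using the helpers) would happen here
    output$kpis <- renderTable(data.frame(KPI = "DAU", Value = NA))
  })
}

shinyApp(ui, server)   # in the real app the three files are kept separate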
2.1. Experiment parameters:
While running an experiment, the product team needs to choose the users that will be involved and the timeframe. This part of the app gives them the possibility to upload the IDs and pick the starting and ending dates.
• ui.r:
For this purpose, the shiny package provides a component called fileInput that lets you upload a CSV or text file and specify the separator, the quotes, etc.
On our side, we tuned it by adding start date and end date components, the hour specification, and the type of experiment (same IDs, different IDs):
• server.r:
The server-side part of this module reads the IDs from the uploaded file using the fread function from the data.table package, chosen for speed, as it is faster than read.csv. It also uses shQuote, which helps turn the data frame of IDs into a vector of comma-separated, quoted IDs for the SQL query.
The function then takes the dates the user entered in the previous UI and the vector of user IDs built from the text files. This way, by not filtering on country or city, we simply track everything the users did, which is what really matters to us: what they do after being assigned to an experiment. Error handling is included in case the vector of users is empty; in that case, the function just returns a null vector.
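Here is a hedged sketch of that ID handling, assuming the Shiny fileInput is named ids and the date range input period as in the earlier skeleton; fact_product_view is a hypothetical view name standing in for the joined warehouse tables.

# Hedged sketch: read the uploaded IDs with data.table::fread, quote them, and
# build the SQL filter for the experiment timeframe.
library(data.table)

ids_dt  <- fread(input$ids$datapath, header = FALSE)          # uploaded csv/text file
id_list <- paste(shQuote(ids_dt$V1, type = "sh"), collapse = ", ")

query <- sprintf("
  SELECT interaction_key, user_key, utc_date_key
  FROM fact_product_view
  WHERE user_key IN (%s)
    AND utc_date_key BETWEEN '%s' AND '%s';",
  id_list,
  as.character(input$period[1]),
  as.character(input$period[2]))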
o Calculating the KPIS:
By joining the different dimensions of our new data warehouse, we end up with a view similar to the transactional database we had before, but with cleaner and more complete data. Getting our results therefore requires some calculations. The function doing this task takes as a parameter the result of the data gathering, a table called "newdata" in this format:
interaction_key user_key utc_date_key
action.post.create 58hrh957292ndvqk2 2017-05-01 15:08:22.00000
action.post.reply 4730jdurrkt03845sje 2017-05-01 15:08:25.00000
action.post.flag 37391hduen02840sb 2017-05-01 15:08:28.00000
… … …
It also takes the starting and ending dates specified in the first step of the experiment.
Since we fetch all the actions our cohort executes (for other analyses of the app), it is very important to create multiple small data frames by filtering on the interactions the product team wants to track.
The second step of this KPI calculation is to create the data frame that will contain the results of our calculations, by creating a sequence of dates, putting the names of our actions in a small table for the loop, and creating the column names and the number of rows.
Figure 60: Dataframe creation code lines
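The original code screenshot is not reproduced here; below is a hedged sketch of the dataframe creation and KPI loop it describes. newdata is the table shown above; the KPI names, input$period and the choice of tracked actions are illustrative.

# Hedged sketch: one row per day of the experiment, one column per KPI,
# then a loop that fills each (day, KPI) cell.
days    <- seq(as.Date(input$period[1]), as.Date(input$period[2]), by = "day")
actions <- c("DAU", "Posts", "Replies", "Upvotes", "Downvotes")

results <- data.frame(matrix(NA_real_, nrow = length(days), ncol = length(actions)))
names(results) <- actions
results$day    <- days

for (i in seq_along(days)) {
  day_events       <- newdata[as.Date(newdata$utc_date_key) == days[i], ]
  results$DAU[i]   <- length(unique(day_events$user_key))            # daily active users
  results$Posts[i] <- sum(day_events$interaction_key == "action.post.create")
}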
Figure 61: DAU Calculations code lines
Figure 63: New Users difference code lines
The code creates one table with the user IDs and their registration time + 24 hours (users.registrations) and another containing all their interactions (users.interactions.filtered). The rest consists of merging the two tables on the user_key column and then keeping the rows whose utc_date_key values (from the interactions table) are less than the onedayafter values (from the registrations table).
• server.r:
We use the server file to program the behavior of the components; in our case, we link the functions written in the helpers with the components in the UI. We have two options: eventReactive or observeEvent. The difference is that eventReactive creates a reactive value that updates when the triggering expression fires, while observeEvent simply runs a block of code when it fires. As our button already exists in the UI, and to keep a consistent programming style, we chose observeEvent.
Inside it, two important steps have to be fulfilled: fetching the date from the date component, checking whether the user selected a time so that it can be fetched along with the date, and executing the functions explained above for both the experiment and the control group.
Figure 64: Saving the date and time code lines
This first part initializes the date and time variables, as the date component does not give us the time. For that specific reason, we created hour sliders that are only shown if the user wants to add a time to the date.
The renderUI function is responsible for the dynamic part of this component.
The last part, of course, puts the output of our data gathering and KPI calculations into variables for future use:
Figure 69: Same users condition checkbox
o Proportion test:
The proportion test is a statistical test used to see whether the proportions in several groups (in our case 2) are equal. Like any other test, it relies on a null hypothesis H0, which states that the proportions are equal, and an alternative hypothesis H1, which states that the difference between them is significant. In R, the proportion test is called using the function prop.test and is used when the cohorts are not paired, i.e. made of different users.
o McNemar test:
The McNemar test is a proportion test used when we have the same cohort (paired data). It tells us whether the difference between the two proportions is statistically significant.
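For illustration, here is a hedged sketch of both tests on made-up counts; the numbers are purely illustrative, not experiment results.

# prop.test: two independent cohorts (experiment: 180/1000 did the action,
# control: 140/1000 did the action).
prop.test(x = c(180, 140), n = c(1000, 1000))

# mcnemar.test: the same 1000 users measured before and after the change
# (did the action: yes/no); only the discordant cells drive the test.
paired <- matrix(c(300, 120,
                    60, 520),
                 nrow = 2,
                 dimnames = list(before = c("yes", "no"),
                                 after  = c("yes", "no")))
mcnemar.test(paired)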
The second part of this function consists of grouping the number of times an action was done by user ID and counting. For this, one needs to use group_by and summarize, two functions from the dplyr package:
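A hedged sketch of that counting step follows; experiment_data and control_data are illustrative names for the two cohorts' interaction data frames, filtered here on a single example action.

# Hedged sketch: per-user counts of one action for each cohort.
library(dplyr)

x <- experiment_data %>%
  filter(interaction_key == "action.post.create") %>%
  group_by(user_key) %>%
  summarise(count = n())

y <- control_data %>%
  filter(interaction_key == "action.post.create") %>%
  group_by(user_key) %>%
  summarise(count = n())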
Finally, we apply to it our super.t function explained above:
sig = super.t(x$count, y$count, Paired)
Now that we have our significance, we need to fill our table with understandable data. For this, we calculate the means and put a condition on what should be written in the "Change" column: it is written as increase when the experiment mean > the control mean, and as decrease when the experiment mean < the control mean.
The same goes for the "How it is" column: if the p-value is < 0.05 the result is significant, otherwise it is insignificant.
Figure 72: % of DAU doing the action data manipulation code lines
Finally, depending on the type of cohorts (the same users or not), we use either prop.test (not paired, different cohorts) or mcnemar.test (paired, same cohort), explained above.
Note that there is no merging with the uploaded starting IDs, as in this test we do not need the users who did not do the action. The only data manipulation is grouping per user ID and counting.
o Table Rendering:
One of the most important things to consider while writing an R web app that will be part of the BI stack is optimization. As the significance tables are, and will be, used multiple times, and as the users were satisfied with their look, we wanted to make the table formatting and rendering generic. By rendering and formatting, we mean, for example, the colors of the rows according to the significance. Therefore, we created a function called "TableRendering" that we apply to the output of the different significance functions. This function uses a package called DT, which is an interface to the JavaScript library "DataTables". The function takes the data frame as a parameter and uses the datatable command to apply the changes. Using the extensions parameter, you can specify add-ons like the PDF and Excel export buttons that will appear and be linked to your data frame. The formatStyle command allows us to add colors to the rows, and the formatPercentage command limits the number of decimals shown.
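A hedged sketch of such a generic rendering helper is shown below; the column names ("Significance", "PctChange") and color values are illustrative, not the production ones.

# Hedged sketch: generic DT rendering with export buttons, row colors based on
# the significance label, and percentage formatting.
library(DT)
library(magrittr)   # for the pipe

TableRendering <- function(df) {
  datatable(df,
            extensions = "Buttons",
            options = list(dom = "Bfrtip",
                           buttons = c("excel", "pdf"))) %>%
    formatStyle("Significance",
                target = "row",
                backgroundColor = styleEqual(c("significant", "insignificant"),
                                             c("#d4efdf", "#fadbd8"))) %>%
    formatPercentage("PctChange", digits = 2)
}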
• server.r:
The server part of this significance testing is also linked to the KPI calculation button, as we need all the results to be visible at the same time for instant consumption. On the other hand, the different tests showed that the mathematical functions we had been using were weak at handling errors. Here is how we solved this problem:
- if the cohort is the same and the dates are the same, no significance testing is run. This is the case when the product team just needs to fetch the KPIs of a single cohort.
- if not, the significance tests are calculated right after the KPIs.
Figure 77: Plotting per day box
o plotlyOutput: this component handles the result of the plotly charts. By writing the name of the plot used on the server side, it makes it appear in the browser.
• server.r:
On the server side, for this part of the project, we essentially consume the variables in our global environment.
The plotly package provides a function called renderPlotly that is responsible for sending the plotly result to the chart names on the UI side of the app. Here is an example with the multiple parameters of the plotly function:
Figure 80: Plots rendering code lines
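The original code screenshot is not reproduced; here is a hedged sketch of a renderPlotly/plot_ly call of the kind it shows, assumed to run inside server.r. The output name and the results data frame (day and DAU columns from the KPI step) are illustrative.

# Hedged sketch: sending one plotly line chart to the UI component
# plotlyOutput("dau_plot").
library(plotly)

output$dau_plot <- renderPlotly({
  plot_ly(results,
          x = ~day,
          y = ~DAU,
          type = "scatter",
          mode = "lines+markers",
          name = "Daily active users") %>%
    layout(title = "DAU per day",
           xaxis = list(title = "Day"),
           yaxis = list(title = "Users"))
})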
Playing around with these different possibilities leads to different charts.
By adding a z parameter, it is also possible to draw 3D plots for a better understanding and additional insights:
2.5. Retention:
New users, having no preconception about the product, are the ones who can give the best feedback about a new design or feature. Therefore, we use retention to see how sticky they become to the app.
• ui.r:
Designed as another tab of the web app, the retention UI is essentially a data frame containing the different cohorts and the different days, plus an average of the retention and how it is evolving. We also added two radio buttons to let the user choose between the retention of new or existing IDs.
• helpers.r:
o New users:
In order to get the new users' retention, we created a function taking the start date, the end date and the experiment data fetched previously. A filtering is done to get the user creation event, and a data frame is created with the cohorts as rows and the days as columns. Every cohort represents the users that registered on a given day, and the day columns represent the different days of the experiment.
Figure 87: Creating the dataframe for retention code lines
Once all this is ready, we use a for loop to create a temporary variable containing the IDs of the users registered on day x, and another loop inside it to check how many of them were present on each following day.
The rest of the function mainly consists of changing all the NAs to 0 and using the assign() function, which allows us to push multiple variables into the global environment from within a function.
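A hedged sketch of that cohort/day matrix follows; newdata is the interactions table described earlier, start_date and end_date are assumed to be Date objects, and the registration event name is illustrative.

# Hedged sketch: rows are registration cohorts, columns are days, values are
# the share of each cohort seen again on that day.
days       <- seq(start_date, end_date, by = "day")
registered <- newdata[newdata$interaction_key == "action.user.register", ]   # illustrative event name

retention <- matrix(NA_real_, nrow = length(days), ncol = length(days),
                    dimnames = list(cohort = as.character(days),
                                    day    = as.character(days)))

for (i in seq_along(days)) {
  cohort_ids <- unique(registered$user_key[as.Date(registered$utc_date_key) == days[i]])
  if (length(cohort_ids) == 0) next
  for (j in i:length(days)) {
    seen <- unique(newdata$user_key[as.Date(newdata$utc_date_key) == days[j]])
    retention[i, j] <- mean(cohort_ids %in% seen)   # share of the cohort active on day j
  }
}
retention[is.na(retention)] <- 0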
Finally, the weighted average of every retention day is calculated for future plotting.
The resulting table contains a column called day, listing all the days of the experiment, and another column called value, which sums all the percentages of a given column of the cohort retention table and divides the total by the number of rows whose percentage is ≠ 0%.
o Existing users:
While trying to do the same for existing users, we ran into a big obstacle: if we keep using the new users' function, the algorithm counts the same users multiple times, because there is no registration event to separate them into cohorts. As a remedy, we added a fourth parameter to the function, the user IDs. This way, a variable inside the function filters, at every loop, the users already included in the calculation, takes them out of the variable, and goes through the remaining users again:
As can be seen here, we assigned the IDs of the experiment to a variable called "remaining.users". The number of IDs in it decreases after every loop, by keeping only the IDs that are not %in% the users we have already counted.
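A hedged sketch of that filtering trick is shown below; experiment_ids, days and newdata are the illustrative names used in the previous sketches, and R's negated %in% replaces the non-existent %not in% operator.

# Hedged sketch: attribute each existing user to only one cohort by shrinking
# the pool of remaining users after every iteration.
remaining.users <- experiment_ids

for (i in seq_along(days)) {
  seen_today <- unique(newdata$user_key[as.Date(newdata$utc_date_key) == days[i]])
  cohort_ids <- intersect(remaining.users, seen_today)
  # ... fill this cohort's retention row as in the new-users sketch ...
  remaining.users <- remaining.users[!(remaining.users %in% cohort_ids)]
}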
• server.r:
As we are constantly chasing speed with our web app, we made the retention functions query the data warehouse only when the data does not already exist in our global environment. This way, after getting the KPIs, the product team can get the retention of the users in seconds. They can also open the app just to see the retention of the users. In both cases, it is definitely a time-saving condition.
The second condition is about which function to trigger: if the "new users" radio button is chosen, the new users' retention is calculated; otherwise, it is the existing users' retention.
The last part consists of rendering the datatable. To make it readable for the user, we used an extension of DT called "fixedColumns". This extension lets us keep a number of columns fixed when they contain names or values describing the other columns. This is important, especially since the retention calculation can cover a large number of days, more than a normal laptop screen can handle.
Figure 95: Radio buttons for charts dynamic changing
• server.r:
While working on this part, the BI manager prepared a function in the helpers that fetches the clustering results from an S3 bucket and filters the IDs uploaded by the user, so that we only get those involved in the experiment. After getting a general overview of it and checking it closely, here is the output of the function:
As you can clearly see, we have 20 columns: the features used for the clustering, the IDs of the users, the clusters they belong to, and the week and year.
The clustering, in its two different versions (3 clusters and 9 clusters), is explained using the different components shown in the ui file.
For the data frame containing the summary of the different features leading to the clustering, we used the isoweek function, which gives the number of the week in the year and takes a date as a parameter. By writing isoweek(Sys.Date()) - 1, we filter this data frame, which contains the data of the last 4 weeks, to get the information closest to our experiment.
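For illustration, here is a hedged sketch of that filter, using data.table's isoweek(); cluster_summary and its week/year columns are illustrative names for the summary data frame described above (the year-boundary edge case is ignored for brevity).

# Hedged sketch: keep only last ISO week's rows of the clustering summary.
library(data.table)

last_week <- isoweek(Sys.Date()) - 1
this_year <- as.integer(format(Sys.Date(), "%Y"))

summary_last_week <- cluster_summary[cluster_summary$week == last_week &
                                     cluster_summary$year == this_year, ]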
Figure 97: Clustering datatable rendering
This gives us a concrete table containing the means of the different interactions that the users did and received during the last week, plus the cluster they belong to.
The second important chart is the number of users in each cluster. For us, the more users move toward the active cluster (cluster 3), the better. This chart was therefore created to see the general flow of the users over the last 4 weeks: a large number of them going from cluster 1 to 2 or from 2 to 3 means that our experiment was efficient. The plot is drawn with x as the week number, y as the number of users, and the color of the lines changing according to the cindex column, which contains the index of the cluster.
The last part of this chart is to make it dynamic. For that, the Shiny package provides a function called reactive, which creates a dynamic data frame that changes according to the user input and therefore updates the chart with it. As a result, whenever a product employee enters the ID of a user, the chart changes and shows that user's path.
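A hedged sketch of that reactive pattern follows, assumed to run inside server.r with shiny and plotly loaded; the input id, output name and cluster_results data frame (with user_key, week and cindex columns) are illustrative.

# Hedged sketch: a reactive data frame that follows the typed user ID, feeding
# the cluster-path chart.
user_path <- reactive({
  req(input$user_id)                                   # wait until an ID is entered
  cluster_results[cluster_results$user_key == input$user_id, ]
})

output$user_cluster_plot <- renderPlotly({
  plot_ly(user_path(), x = ~week, y = ~cindex,
          type = "scatter", mode = "lines+markers")
})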
Figure 102: User inside cluster flow chart
This same function was used for the last two charts, as they should also change according to the user input. This time, however, we no longer use the ID but the feature whose path we want to see. This gives us a view of how our clusters in general are changing and answers questions like: How was the upvoting in cluster 1 two weeks ago? What is the downvoting mean this week? The chart changes according to the radio buttons called actionTypeuser.
Figure 104: Plot and radio buttons fusion
Figure 105: Registrations pie chart per country
The second step of this data selection is to find how many of these users were still with us after a specific time frame. From a business perspective, the chosen timeframe was 4 weeks later (almost one month); in fact, we saw that in most cases this is when users stop churning at a high rate. Therefore, the period in which we check whether our users are still using the app is from 12 to 18 June 2017. Any unique appearance of an ID in this timeframe is counted as a retained user.
$$\text{Retained users} = \frac{\text{registered IDs still appearing 4 weeks later}}{\text{registered IDs}}$$
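A small sketch of this check (the two vectors are illustrative): an ID counts as retained if it appears at least once in the control window.
registered_ids <- c("a1", "b2", "c3", "d4")
active_ids_4w  <- c("b2", "d4", "z9")   # IDs seen between 12 and 18 June 2017
retained <- registered_ids %in% active_ids_4w
retention_rate <- mean(retained)        # 0.5 in this toy example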
ps: “contentvec” is a vector containing the IDs of the content that have been voted up, voted down or
flagged.
3.1.3. Received engagement:
The received engagement for the users is not very different from the sent engagement. The only changing part of our data fetching is a first step consisting of getting all the content that the users posted (posts and replies, text and image), then using the content IDs to get the actions that contain these IDs.
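A hedged sketch of that two-step selection on a toy actions table (all object and ID names are illustrative):
actions <- data.frame(
  user_key        = c("u1", "u2", "u3"),
  content_key     = c("c1", "c1", "c2"),
  interaction_key = c("action.post.create", "action.post.voteup", "action.post.create")
)
experiment_ids <- "u1"
own_content <- actions$content_key[actions$user_key %in% experiment_ids &
                                     actions$interaction_key %in%
                                       c("action.post.create", "action.post.reply")]
received <- actions[actions$content_key %in% own_content &
                      !(actions$user_key %in% experiment_ids), ]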
3.1.4. Environment:
The last part of our data gathering is the environment data. Every user sees his own feed according to his location, and the activity he sees in it (negativity, positivity) can really influence his stickiness. As a result, we chose to collect data that serves as a proxy for the general engagement, or health, of the environment.
In order to do this, our SQL query for fetching the data focuses only on the geohash-based location. We need to get the geohashes where the registrations happened and fetch the data that helps us compute these values:
Figure 111: Filtering the dataframe
This column will later be used to merge this table with the table of the posts we created, in order to keep only the replies that resulted from other users interacting with ours. As a result, the only way to make the merge possible is to clean this column by keeping only the ID, without the "parentId" string, using this command:
replies$metadata <- sub("parentId:", "", replies$metadata)
Finally, the same columns are kept for future manipulations, namely the interaction_key, user_key and content_key columns.
3.3. Data Construction:
3.3.1. Sent engagement:
The construction part of the sent engagement focuses on how to obtain, from all the data we fetched, a table containing the IDs of the users and the number of times they performed the specific actions we chose. But before doing this, and as explained in previous chapters, we need to differentiate between the actions performed on replies and the actions performed on posts. Therefore, after making this distinction, we end up with a merged table in this format:
content_key interaction_key.x interaction_key.y
579dhfkg93750f action.post.voteup action.post.create
457dbdfnlso379 action.post.flag action.post.reply
… … …
We will use setDT(), a function of the data.table package, together with paste() to append "_reply" or "_post" to interaction_key.x depending on what we have in interaction_key.y. Here, new_engagement is the table described above. After all these modifications, we merge it with the other dataframe containing the other interactions.
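A hedged sketch of this suffixing step with data.table (the toy table mirrors the example above; "new_engagement" is assumed to be the merged table's name):
library(data.table)
new_engagement <- data.frame(
  content_key       = c("579dhfkg93750f", "457dbdfnlso379"),
  interaction_key.x = c("action.post.voteup", "action.post.flag"),
  interaction_key.y = c("action.post.create", "action.post.reply")
)
setDT(new_engagement)
new_engagement[, interaction_key.x := paste0(
  interaction_key.x,
  ifelse(interaction_key.y == "action.post.reply", "_reply", "_post"))]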
This done, we use a loop that goes through the days and gets the users who registered. The loop then goes through the dataframe containing all the actions and filters it by the IDs of these users, with the condition that the rows kept in our new dataframe must have a date within the registration date + 2 days.
As a final step, we end up with a temporary dataframe of the user IDs and the interactions. To shape it according to our needs, we use table(), a function that returns a contingency table with the interaction_key column as columns and the user_key column as rows. All this is transformed using as.data.frame.matrix() and rownames_to_column() to create this result (a short sketch follows the example table below):
user_key action.post.block action.post.reply action.post.voteup_post …
48340fhdfsdosf 1 40 5 …
3489dfhsofsdfn 0 19 8 …
68ndnfe870snb 7 11 0 …
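A small sketch of this pivot on a toy dataframe (the column names mirror the example above):
library(tibble)
tmp <- data.frame(
  user_key        = c("48340fhdfsdosf", "48340fhdfsdosf", "3489dfhsofsdfn"),
  interaction_key = c("action.post.reply", "action.post.reply", "action.post.block")
)
wide <- as.data.frame.matrix(table(tmp$user_key, tmp$interaction_key))
wide <- rownames_to_column(wide, var = "user_key")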
Finally, to merge the actions received on posts and on replies, and since some columns could have the same names (action.post.voteup on posts and action.post.voteup on replies), we add "received" at the start of the column name and "post" or "reply" at the end, according to the dataframe the column comes from:
3.3.3. Environment:
For the environment, we already created a table containing all the geohashes on different days from the registration table. In fact, we can have redundant geohash names but with different days, as the activity of a geohash can change from one day to another. After getting the activity of these locations, the only construction step left is to create columns ready to hold the data and to use filtering, length() or nrow() to calculate our KPIs (a short sketch follows the list below):
- DAU: Daily Active Users in the geohash.
- posts_day: posts on that day in the geohash.
- posts_dau: posts per DAU in the geohash.
- reply_dau: replies per DAU in the geohash.
- reply_post: replies / posts in the geohash.
- upvote_post: upvotes / posts in the geohash.
- upvote_dau: upvotes per DAU in the geohash.
- upvote_reply: upvotes / replies in the geohash.
- upvote_downvote: upvotes / downvotes in the geohash (happy ratio).
- downvote_dau: downvotes per DAU in the geohash.
- downvote_post: downvotes / posts in the geohash.
- downvote_reply: downvotes / replies in the geohash.
- netvotes_dau: (upvotes – downvotes) per DAU in the geohash.
- netvotes_post: (upvotes – downvotes) / posts in the geohash.
- netvotes_reply: (upvotes – downvotes) / replies in the geohash.
- flag_dau: flags per DAU in the geohash.
- flag_post: flags / posts in the geohash.
- flag_reply: flags / replies in the geohash.
- block_dau: blocks per DAU in the geohash.
- block_post: blocks / posts in the geohash.
- block_reply: blocks / replies in the geohash.
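A hedged sketch of this construction for a few of the listed KPIs (the dataframe and its column names are assumptions):
geo_activity <- data.frame(
  geohash     = c("u33db", "u0yjj"),
  day         = as.Date(c("2017-05-15", "2017-05-15")),
  DAU         = c(120, 80),
  posts_day   = c(45, 20),
  replies_day = c(90, 35),
  upvotes_day = c(300, 110)
)
geo_activity$posts_dau  <- geo_activity$posts_day   / geo_activity$DAU
geo_activity$reply_post <- geo_activity$replies_day / geo_activity$posts_day
geo_activity$upvote_dau <- geo_activity$upvotes_day / geo_activity$DAU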
3.4. Data Integration:
3.4.1. Merging:
The last part of our data preparation is integrating the 4 big dataframes created after all our manipulations. We now have a dataframe of the actions sent by the user, a dataframe of the actions received by the user, the environment in which he registered and whether he is retained or not.
For that, we simply used merge() on the sent and received actions by user_key. We then merged the environment data with the users by geohash, and used a table containing the IDs of the retained users to add a column set to 1 if retained and 0 if not. These are our modalities.
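A hedged sketch of this integration step; the object names below simply stand for the four dataframes described above:
user_actions <- merge(sent_actions, received_actions, by = "user_key")
full_df      <- merge(user_actions, environment_kpis, by = "geohash")
full_df$retained <- as.integer(full_df$user_key %in% retained_ids)  # 1 if retained, 0 if not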
3.4.2. Normalization:
One of the pillars of building a good model is normalization. In fact, if the values of the different columns are on very different scales, the training can take ages. This is what we call feature scaling:
Although R offers multiple ways of scaling, we chose to write one small function that does it quite well, and then used lapply() to apply it to all the columns of the data frame:
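A minimal sketch of such a scaling function, assuming min-max scaling (the report does not state the exact formula used):
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
feature_cols <- setdiff(names(full_df), c("user_key", "geohash", "retained"))
full_df[feature_cols] <- lapply(full_df[feature_cols], rescale01)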
Chapter 4: Modeling
After gathering all the needed data, cleaning it and constructing it, the last part of this project is to create the model out of it. This chapter may be one of the shortest ones (as for most data scientists), since the most important work is preparing everything for this goal, which consists of choosing the algorithms, understanding them and applying them over multiple iterations.
1. Modeling technique:
The large community of data analysts and data scientists has provided us with a great number of algorithms, tools and methods that we can use for modeling and extracting answers from the data.
In our case, our main modeling technique is SVM (Support Vector Machine). By definition, it is a discriminative classifier that outputs an optimal hyperplane categorizing the data; the hyperplane is built from a training set. It can also be used for regression (depending on whether the target variable is continuous or discrete).
An SVM can be linear or non-linear. In our case, as we have multiple features, our data is not linearly separable; therefore, it will be a non-linear radial SVM. In fact, by using what we call a kernel, the computation becomes much easier for the algorithm, especially with this high number of features (58).
Equation 4: Radial SVM Equation
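For reference, the standard form of the radial basis (Gaussian) kernel referred to here is:
$$K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^{2}\right)$$
where gamma controls how far the influence of a single training example reaches.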
After doing some research, we consider the best package in R to be e1071. Being an interface to LIBSVM, which is written in C++, it is both intuitive to use and fast.
Our choice of SVM was also based on the fact that it works well in high-dimensional spaces and is memory efficient.
We will also, for academic purposes, use the Random Forest algorithm and compare both results. Many people describe it as a "bootstrapping algorithm with a decision tree model"2.
The explanation is that it executes multiple CART models and creates multiple trees that together form this "forest". At the end, it calculates the importance of each feature and gives it a final weight.
The package used for it in R is randomForest, which implements Breiman's random forest algorithm and is considered among the best by the R user community3.
2 https://www.analyticsvidhya.com/blog/2014/06/introduction-random-forest-simplified/
3 https://www.r-bloggers.com/what-are-the-best-machine-learning-packages-in-r/
Both of these algorithms belong to the supervised learning family. We chose them because we have a variable to predict (whether the user is going to be retained or not) and we want to know which features are the most influential.
2. Testing Design:
After preparing our model, we find ourselves with an important question to answer: is this a good model? And if so, how do we know? For that purpose, we have to choose tools that give us these answers and tell us when to stop modeling; otherwise, we would keep trying multiple combinations of parameters without knowing which ones are the best.
2.1. ROC Curve:
In statistics, the Receiver Operating Characteristic (ROC) curve is a graphical plot that allows you to assess the diagnostic performance of a binary (two-modality) classifier. It is useful even in the case of unbalanced distributions.
This curve will be used in our case to plot the true positive rate against the false positive rate for different settings and for the two different algorithms.
Figure 123: ROC Curve example
Using the confusion matrix, here are the values you can derive from it:
- True Positive Rate: how often our model predicts YES when it is actually YES.
$$\text{True Positive Rate} = \frac{TP}{\text{actual YES}}$$
Equation 5: True Positive Rate
- False Positive Rate: how often our model predicts YES when it is actually NO.
$$\text{False Positive Rate} = \frac{FP}{\text{actual NO}}$$
- Accuracy: how often our model is correct.
$$\text{Accuracy} = \frac{TP + TN}{\text{Total}}$$
Equation 7: Accuracy
- Misclassification Rate: how often our model is incorrect.
$$\text{Misclassification Rate} = \frac{FP + FN}{\text{Total}}$$
- Specificity: how often our model predicts NO when it is actually NO.
$$\text{Specificity} = \frac{TN}{\text{actual NO}}$$
Equation 9: Specificity
- Precision: how often a predicted YES is correct.
$$\text{Precision} = \frac{TP}{\text{predicted YES}}$$
Equation 10: Precision
With "optimal_user_df" being our dataset and "retained" our Y, we have 54% of 0 (not retained) and 45% of 1 (retained). Therefore, our samples should have the same distribution to give us significant results and a meaningful model.
Although it is possible to do it manually, there are nowadays multiple functions that perform this split:
The code above uses the sample() function, which takes n as the total number of observations and draws 60% of it into a variable. The next command extracts those 60% of the rows for the training set and the remaining 40% for the test set.
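A hedged sketch of that split, using a small synthetic dataset in place of the real optimal_user_df:
set.seed(42)
optimal_user_df <- data.frame(
  user_key          = sprintf("user_%03d", 1:500),
  action.post.reply = rpois(500, 3),
  action.oj_upvote  = rpois(500, 5),
  retained          = factor(rbinom(500, 1, 0.45))
)
n         <- nrow(optimal_user_df)
train_idx <- sample(n, size = floor(0.6 * n))
train     <- optimal_user_df[train_idx, ]
test      <- optimal_user_df[-train_idx, ]
prop.table(table(train$retained))   # check the 0/1 distribution of the sample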
If we check the distribution of our modalities after this division, here is what we get: after verifying that it is almost equal to the distribution of the full dataset, we can proceed.
We also need to exclude the columns that are not features, such as the Y or the user key. For that, we built the formula using paste() and as.formula(), excluding these columns and collapsing the others with "+":
We also exclude the received sharing and the received reading, as the user does not get a notification when they happen; therefore, they will not influence his behavior.
The final variable, rf.form, will be used as the formula for the modeling, in this format: "retained ~ action.oj_downvote + action.oj_upvote + action.post.block + …"
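A minimal sketch of this formula construction (the column names come from the synthetic dataset of the previous sketch; the real dataset has 56 features):
feature_cols <- setdiff(names(optimal_user_df), c("user_key", "retained"))
rf.form <- as.formula(paste("retained ~", paste(feature_cols, collapse = " + ")))
rf.form   # retained ~ action.post.reply + action.oj_upvote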
3. Model building:
3.1. Parameter settings:
Due to the growing complexity of the business questions that companies ask and the diversity of the data, it is becoming harder and harder to find a perfect ready-made algorithm. This is why most tools give every data scientist the opportunity to tune the algorithm and make it fit the dataset.
Here is an overview of the parameters we used in our case:
3.1.1. Cost Parameter:
The cost parameter, called C, expresses how much you want to avoid misclassifying each observation in the training set. The higher it is, the smaller the margin of the hyperplane will be; this is very important when your data is non-linear, to avoid false predictions. Conversely, the smaller it is, the bigger the margin of the hyperplane.
In our project, we took 3 different values of C to see how much it influences our results:
- 10
- 0.5
- 2
3.1.2. Kernel Parameter:
Even though we already explained our choice of the Gaussian kernel above, it is important to present the different kernels R gives us access to and their differences:
- Linear kernel: being the fastest one, it only performs well when the data is, as the name explicitly says, linear. The linear kernel is mostly used when the number of features is larger than the number of observations.
- Polynomial kernel: the best performing one for NLP (d = 2); it not only looks at the features but also checks the different combinations between them. It is also commonly used for regression analysis.
- Gaussian kernel (radial basis): being the most used among the kernels, it is applied when the number of observations is higher than the number of features (most of the time the case). It is called a universal kernel4, as it guarantees a predictor with low estimation and approximation errors.
3.1.3. Gamma:
Always going hand in hand with the cost in non-linear kernels, gamma expresses how far the influence of a single training example reaches: the lower it is, the farther it influences; the higher it is, the closer its influence stays. Making it too big can make the algorithm perform badly, and C will not be able to regularize the over-fitting; making it too small means the model will not be able to capture the complexity, or what we call the "shape", of the data.
In our case, we kept the default value of gamma, because our data is very diversified; fixing it to a middle value gives us the possibility to play with the cost.
4 https://www.quora.com/What-is-the-intuition-behind-Gaussian-kernel-in-SVM-How-can-I-visualize-the-transformation-function-ϕ-that-corresponds-to-the-Gaussian-kernel-Why-is-the-Gaussian-kernel-popular
3.1.4. Number of trees:
The number of trees in the Random Forest algorithm defines how many trees the algorithm should grow. If the number of observations is large and the number of trees too small, some observations will be predicted only once, if at all. In our case, we started with a high number of trees, checked when the accuracy becomes stable and then chose the right amount of trees.
The number of trees chosen for this project is therefore 500.
3.2. Model Training:
After going through the understanding of these two algorithms and their tuning, we apply them in practice in order to see the difference between them.
We will be using the e1071 package for the SVM and the randomForest package for the random forest algorithm.
3.2.1. SVM:
Thanks to the participation and contribution of the data science community, implementing the algorithm has become easier and easier, as experts created an interface to LIBSVM exposing the parameters to tune. Here are the different iterations for the SVM and their results:
o Iteration 1 – Cost = 10:
It is important to mention that there was no need to set the "type" parameter in the svm() function: because the Y is a factor, the type is automatically "C-classification". Setting probability to TRUE allows the model to return class probabilities for its predictions.
The total number of observations that were trained is 29126, with 56 features.
The total number of observations that were tested is 19418, with 56 features.
The kernel used is the Gaussian kernel (radial basis).
The cost parameter is 10.
The gamma parameter is 0.01785714.
After training the model for a few minutes, we apply it to the test set to see how well it is performing:
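A hedged sketch of this training step with e1071 (using the toy train/test split defined earlier; the real run used 56 features and the default gamma of about 0.0179):
library(e1071)
svm_fit <- svm(rf.form, data = train,
               kernel      = "radial",   # Gaussian / radial basis kernel
               cost        = 10,         # iteration 1 cost parameter
               probability = TRUE)
svm_pred <- predict(svm_fit, newdata = test)
table(Predicted = svm_pred, Actual = test$retained)   # confusion matrix on the test set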
Here are the top 10 features that are influencing the prediction and their weights:
Feature Weight
action.getposts.details_new 855.8991
received_action.post.votedown_replies 552.8851
received_action.post.reply_post 520.9039
received_action.post.voteup_post 448.3295
received_action.post.voteup_replies 426.9237
action.post.reply 426.2344
action.post.create 355.9180
action.post.voteup_post 304.3408
action.oj_upvote 301.5199
action.post.pin 294.1277
o Iteration 2 – Cost = 0.5:
The total number of observations that were trained is 29126, with 56 features.
The total number of observations that were tested is 19418, with 56 features.
The kernel used is the Gaussian kernel (radial basis).
The cost parameter is 0.5.
The gamma parameter is 0.01785714.
Here are the top 10 features influencing the prediction and their weights:
Feature Weight
action.getposts.details_new 421.7126
received_action.post.votedown_replies 338.4224
received_action.post.reply_post 312.3753
received_action.post.voteup_post 304.1694
action.post.reply 302.7494
received_action.post.voteup_replies 292.9179
action.post.create 260.4468
received_action.post.votedown_post 231.5631
action.post.votedown_post 219.2826
action.oj_upvote 215.4727
action.post.reply 284.43
received_action.post.voteup_post 280.445
received_action.post.voteup_replies 264.7267
env_blocks_reply 254.0478
action.post.pin 218.9994
env_flags_reply 207.8277
3.2.2. Random Forest:
o Iteration 1 – ntrees = 500:
The total number of observations that were trained is 29126, with 56 features.
The total number of observations that were tested is 19418, with 56 features.
The number of trees is ntree = 500.
Setting importance = TRUE gives us the weights of the features.
Here are the TOP 10 features influencing the predictions and their weights:
Feature Weight
action.getposts.details_new 1216.96093
env_reply_post 381.61395
action.post.voteup_post 381.24712
env_downvotes_post 378.22096
env_post_dau 369.92434
env_downvotes_reply 368.02946
env_downvotes_dau 362.92469
env_reply_dau 360.19532
env_upvotes_downvotes 359.67845
env_upvotes_post 352.43667
Table 31: Random Forest Iteration 1 Features Weights
The choice behind the 500 trees is the need to see when the error rate stops diminishing; this gives us our ideal number of trees.
As the plot shows, the error rate becomes roughly constant after about 200 trees. This would be the number of trees for the model.
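A hedged sketch of the equivalent Random Forest training (again on the toy split; the real run used the full feature set):
library(randomForest)
rf_fit <- randomForest(rf.form, data = train, ntree = 500, importance = TRUE)
plot(rf_fit)               # OOB error rate against the number of trees
head(importance(rf_fit))   # per-feature importance ("weights")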
o Iteration 2 – ntrees = 500 / TOP 20 Features:
The total number of observations that were trained is 29126, with 20 features.
The total number of observations that were tested is 19418, with 20 features.
The number of trees is ntree = 500.
Setting importance = TRUE gives us the weights of the features.
Here are the TOP 10 features influencing the predictions and their weights:
Feature Weight
action.getposts.details_new 1945.72978
env_upvotes_downvotes 1408.50373
env_blocks_reply 1340.74778
env_flags_reply 1292.91496
action.post.voteup_post 707.68980
action.post.reply 660.10736
received_action.post.voteup_replies 579.57985
action.post.votedown_post 569.60150
received_action.post.voteup_post 512.70467
action.post.voteup_reply 434.27354
The choice behind the 500 trees is again the need to see when the error rate stops diminishing and what the ideal number of trees would be.
As the plot shows once more, the error rate stabilizes around 200 trees. Therefore, we can consider it a good number of trees.
Chapter 5: Evaluation
The modeling having been done over different iterations, it is time to look at the different results and the output of our analysis. For that, we will use the testing tools discussed before.
1. Results Assessment:
1.1. SVM:
1.1.1. Iteration 1:
• Confusion Matrix:
Predicted 0 Predicted 1
Actual 0 8626 2023
Actual 1 4724 4045
• Rates:
o True Positive Rate: 46%
o False Positive Rate: 18%
o Accuracy: 65%
o Misclassification Rate: 34%
1.1.2. Iteration 2:
• Confusion Matrix:
Predicted 0 Predicted 1
Actual 0 8764 1885
Actual 1 4905 3864
• Rates:
o True Positive Rate: 44%
o False Positive Rate: 17%
o Accuracy: 65%
o Misclassification Rate: 34%
1.1.3. Iteration 3:
• Confusion Matrix:
Predicted 0 Predicted 1
Actual 0 8772 1877
Actual 1 5297 3472
• Rates:
o True Positive Rate: 39%
o False Positive Rate: 17%
o Accuracy: 63%
o Misclassification Rate: 36%
1.2. SVM – TOP 20:
• Confusion Matrix:
Predicted 0 Predicted 1
Actual 0 8592 2057
Actual 1 4716 4053
• Rates:
o True Positive Rate: 46%
o False Positive Rate: 19%
o Accuracy: 65%
o Misclassification Rate: 34%
1.3. Random Forest:
• Confusion Matrix:
Predicted 0 Predicted 1
Actual 0 7576 3037
Actual 1 3672 5097
• Rates:
o True Positive Rate: 58%
o False Positive Rate: 28%
o Accuracy: 65%
o Misclassification Rate: 34%
1.4. Random Forest – TOP 20:
• Confusion Matrix:
Predicted 0 Predicted 1
Actual 0 7743 3835
Actual 1 2837 5003
• Rates:
o True Positive Rate: 56%
o False Positive Rate: 26%
o Accuracy: 65%
o Misclassification Rate: 34%
1.5. SVM VS Random Forest:
1.5.1. All features:
• Rates Comparison:
The accuracy is the main technical and business goal, and it carries an important weight in choosing the model:
SVM Accuracy Rate (65%) = RF Accuracy Rate (65%)
• ROC Curve:
The ROC curve above shows that our SVM curve is closer to the left-hand border of the chart (more accurate). On the other hand, the area under the Random Forest curve is larger, which also reflects accuracy. What we can say is that the two algorithms have traded true positives for false positives and vice versa.
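The report does not state which R package produced the ROC curves; as an illustration, pROC is one way to draw such a comparison from the two fitted toy models:
library(pROC)
svm_prob <- attr(predict(svm_fit, test, probability = TRUE), "probabilities")[, "1"]
rf_prob  <- predict(rf_fit, test, type = "prob")[, "1"]
roc_svm <- roc(test$retained, svm_prob)
roc_rf  <- roc(test$retained, rf_prob)
plot(roc_svm, col = "blue")    # SVM curve
lines(roc_rf, col = "red")     # Random Forest curve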
1.5.2. TOP 20:
• Rates Comparison:
Trying the TOP 20 features aims to see how the algorithms cope with only a third of the features, and which one performs better:
SVM Accuracy Rate (65%) = RF Accuracy Rate (65%)
As a result, we can clearly see that there is no difference in accuracy between the SVM model and the Random Forest model. The only difference worth mentioning is the tuning possibilities, which are much wider in the SVM case.
• ROC Curve:
The second ROC curve shows that here too, both traces have pros and cons: while the SVM plot is closer to the left side of the chart, the Random Forest curve covers more area.
2. Approved Models:
As a result of what we see here, the remaining task is to choose. The second and third iterations of the SVM model are to be rejected, especially since most of our rates decrease as we decrease the cost; iterations 2 and 3 perform poorly compared to the first iteration or to the Random Forest model.
When it comes to comparing the SVM and the Random Forest, for the TOP 20 or for all the features, choosing either of them means choosing a good model to start with. From a business perspective, we opted for the SVM model for the following reasons:
- 65% accuracy (the same as the Random Forest);
- Fewer false positives than the Random Forest: it is crucial for us not to have false positives. As we said before, spending time working on users that the algorithm marked as retained while they are not can make us lose a large amount of time and resources.
3. Next Steps:
After getting all these results, one of our first next steps would be to determine the average amount of activity related to the features with a high weight and try to reproduce it for most of our new and existing users, in order to extract value from the analysis. The second thing to do is to find a proxy for the get_post_details interaction: this event in the database can mean scrolling, refreshing, clicking, etc., so after determining the exact number of get_post_details a user needs in order to be retained, we need to know which concrete action it corresponds to. The last point about the analysis is that, for now, we only took 72 hours of actions and 4 weeks of data; using a longer retention window could give us a more solid model and a more reliable prediction of the behavior of the users.
Chapter 6: Deployment
Whether the model is good or bad, the process automation working or not, or the datawarehouse well designed or not, the last chapter of this report describes where and how these parts of the project were deployed for use by the company.
Figure 134: Welcoming Screen of the Shiny App
This is the starting screen of the app. It is also the first step for the product team when starting an experiment: choosing the sample size.
The second tab of the app is for uploading the IDs and the time frame.
Figure 136: Calculating KPIs and getting the results Screen
The third tab, and one of the two possibilities, is to get the KPIs needed for the experiment.
The fourth tab plots the results of the analysis for better understanding.
Figure 138: Retention Calculation Screen
The second possibility after uploading the IDs is to calculate the retention and plot it in the same frame.
The last tab to present is the user analytics, which a product team member can use to see how the uploaded users behave inside the different clusters created in previous analyses by the BI Team.
2. Datawarehouse project:
2.1. Deployment environment:
The backend engineers had multiple experiences with the different Amazon instances and the previous ETL processes and databases. This is the reason why, this time, they evaluated Amazon's data warehousing service. Using the combination of Amazon Redshift and an S3 bucket, we have a constant backup of the data in S3, the computation is handled through AWS Lambda, and there is the possibility of linking the data warehouse to a NoSQL DB.
As a result, we will have a fully scalable data warehouse that is flexible for today and for our future needs, in case we would like to collect new data.
Given the amount of data we have and the agreement to keep only 6 months of data in the datawarehouse (resource constraints), the chosen cluster generation is ds2.xlarge. With 2 TB of storage and 0.4 GB/s of I/O speed, we assume it will be sufficient in a first stage to fulfill the requirements of our DW.
2.2. First Results:
One of the most impressive experiences one can have is to see his creation come to life in front of his eyes. As owner of the datawarehouse project, I had the opportunity to follow its creation step by step and to test it myself.
Here are the first results:
Figure 141: New Datawarehouse view on Datagrip
Although some tables are still missing, it is a good start, especially since the team is using a new language (Python). The progress is steady but sure.
Figure 144: Content Dimension view on datagrip (New Datawarehouse)
As you can see, some of the dimensions are already built, while some of the facts are still missing or still have configuration problems. Still, it shows that the deployment is ongoing.
General Conclusion
Social media is by far one of the most interesting fields one can work in today. This is because the analyst works to understand the behavior of the user, a real person behind a screen who reacts according to his real self. Moreover, being an analyst at Jodel is even better, because it is an anonymous app. As the CEO, the owner of the idea, explained to me, he opted for this approach to make the user feel comfortable and free in his behavior. As a result of this feature, we find ourselves analyzing people in their true state and getting the best out of it: for them, by providing the best experience, and for us, by making the best choices to improve our app.
On the other hand, having this small amount of information makes it hard to reach all our goals without using proxies. For example, clicking on a post to read it is described in the database by a single event, but that could also mean just opening the post without reading it. Another point we still miss is that short posts can be read directly from the main screen of the app, without any click. To overcome these obstacles, some APIs exist to record the view of the user; we could also create different events according to the amount of time a specific window stays open.
Finally, I think that with all the knowledge acquired during these months and the first insights already gained, the next step will be to go deeper into the field of anonymous apps.
We have all heard stories about Facebook selling personal data or sharing it with other social media for money. Whether they are true or false, users are now afraid of giving away any piece of information and having their private data travel from one website to another. Finding a platform that protects them by simply not asking for an email or a name will change their conception of connecting to communities, and the aim becomes free communication and contribution between human beings. Using some machine learning techniques to predict what they would like or dislike from a content point of view would make the experience even better. In a long-term perspective, this could help us reach what all the philanthropists and good-hearted people dream about: making the world a smaller place.
Appendices
1. Literature Review:
1.1. KPIS:
KPI is the abbreviation of Key Performance Indicator.
It is a quantifiable measure that every company uses to get a good overview of its performance. After fixing the strategic goals, it uses KPIs to evaluate its evolution in all possible fields: finance, marketing, sales, human resources; all these departments should quantify their changes or progress in order to get a complete overview of their evolution and the results of their activities.
There are two main families of KPIs:
• Financial KPIs: net profit, net profit margin, gross profit margin.
• Non-financial KPIs: D57, DAU, MAU.
Over its months of activity, Jodel has chosen different KPIs that reflect the activity of the company and its consequences.
1.2. DAU:
DAU is the abbreviation of Daily Active Users.
As the name says, this KPI describes the number of users who were active on a specific date, and it reflects the stickiness of the users to the product.
As an example, Facebook recorded a total of 1.28 billion DAU in March 2017.
1.3. WAU:
WAU is the abbreviation of Weekly Active Users.
Covering a wider time range, it measures how many users were present each week; a single activity of a user is enough to be counted. It also reflects the performance of a social media product.
1.4. MAU:
MAU is the abbreviation of Monthly Active Users.
Even if its calculation changes from one company to another (Twitter counts users who follow 30 accounts, Facebook counts users who perform core actions like sharing or commenting), everyone agrees on a single definition: the user has to perform an activity within the app.
Figure 148: Instagram MAU Evolution
1.5. DX7:
DX7 is a metric that changes according to the company's needs. "D" stands for day, "X" for a number between 1 and 7, and "7" for the total number of days in a week; it shows how the app is used by the users during a week and is a sort of personalization of the WAU.
A D57 user is a user who uses the app 5 days per week. Having a high number of D57 users for a mobile app or a social media product means that the users are really attached to it.
1.6. Cohort:
In statistics, a cohort is a group of people who share a specific characteristic or who lived through the same experience or event in a specific period of time. Applied to our field of study, it can be registering or commenting on a post.
It can be used in two different ways:
• Prospective cohort study: a cohort study that selects a sample of people according to a specific characteristic and follows them over time to see how a specific change affects them.
• Retrospective cohort study: a cohort study that uses historical data to see how being exposed to a specific factor changed a cohort, and compares it to another cohort that was not influenced by this factor.
1.7. Retention:
Customer retention covers the different activities a company or an organization undertakes to reduce user churn. When a company launches an app or a product, the most important part for it is acquisition, because this is its source of revenue and profit. On the other hand, it would not make sense for a company that aims to exist for a long time to lose all these consumers. To quantify how many users stay with us, everyone tends to use retention, a metric that shows how a cohort evolves over time.
The rows are the different cohorts we analyze. In most cases, an analyst takes the people who registered on a specific day; the example above takes all those who registered per day from 15 to 26 May 2014. The columns are the different days we want to analyze, from registration to day X. The 4th cohort, for example, represents the retention of users who registered in the app on 18 May, from day 1 to day 11; the result shows that only 2.13% of them are still using the product.
Retention is considered one of the main KPIs reflecting the sustainability of the app, and it makes virality last longer. Finally, most well-known social media companies check retention after 90 days: if a cohort subscribes on D1, how many of its members are still with us on D90. At Jodel we usually calculate D30 retention, which is always between 28% and 35%.
Figure 150: Retention of different Social Media
1.9.2. Fact:
A fact table is the center of the star schema and is usually surrounded by dimension tables. It contains the measures of the business you want to analyze (revenue for an e-commerce company, for example) and foreign keys to the dimension tables. It always contains quantitative data and uses the attributes in the dimensions to choose how to analyze the data (monthly, women only, by city).
There are multiple types of fact tables; the most important ones are:
• Cumulative: this type of fact table describes what has happened over a period of time. For example, it may describe the total sales by product, by store and by day. The facts in this type of table are mostly additive facts. The first example presented here is a cumulative fact table.
• Snapshot: this type of fact table describes the state of things at a particular instant in time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.
The quantitative data you find in the fact table is called a measure, and can be:
• Additive measures: the class of fact measures that can be aggregated across all dimensions and their hierarchies. Example: one may add sales across all quarters to obtain the yearly sales.
• Semi-additive measures: the class of fact measures that can be aggregated across all dimensions and their hierarchies except the time dimension. Example: daily balances can be summed across the customer dimension but not across the time dimension.
• Non-additive measures: the class of fact measures that cannot be aggregated across any dimension or hierarchy. Example: facts containing percentages or calculated ratios.
1.10. Taxonomy:
Going back to the source of the word, taxonomy finds its roots in the Greek language and is composed of "order" and "law". To summarize the meanings this word can carry, it is the science of ordering things according to specific parameters. Historically, it was used especially in science and biology to put animals into categories according to their characteristics. Nowadays, taxonomy is any task of collecting entities for further use or classification. Specifically, this technique is used by Jodel to collect the different events (client and server side) and categorize them for the creation of the datawarehouse.
1.11. ETL:
Extract - Transform - Load: a process responsible for pulling data out of different types of sources and pushing it into the datawarehouse. To fulfill this task, it can be done either with tools like Talend and MSBI, or coded from scratch using Python (the "Luigi" ETL framework by Spotify, for example).
• Extract: consists of extracting the data from a source system like an ERP, Google Analytics, a CRM or even text files; the data is consolidated in a staging area or an ODS (according to the business choice) for later modification, to guarantee its integrity.
• Transform: consists of transforming the data coming from the different sources into one unique format; this involves modifications for dates, gender, etc.
These modifications can be:
• cleaning: F for female, NULL to 0
• joining: lookup, merge
• transposing: rows to columns or columns to rows
• calculations: computing measures related to the business
• Load: the loading is the part that sends all this clean and clear data to a new database, which is our final data warehouse output. It really helps to disable all constraints and indexes before the load and to bring them back when it is finished.
1.12. Machine Learning:
As with other notions, machine learning has had different definitions from its creation until today, and data analysts and data scientists are still searching for the perfect one to describe this emerging technology or science. Nevertheless, if you search for it, you will find two main definitions that of course converge to the same meaning. Arthur Samuel described it as "the field of study that gives computers the ability to learn without being explicitly programmed". Tom Mitchell, for his part, gives a more modern definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks T, as measured by P, improves with experience E". In summary, machine learning allows software applications to make more accurate predictions without being explicitly programmed for them; this is done using algorithms that take an input and use statistical and mathematical analysis to produce a prediction as output.
There are two main categories of these algorithms and of learning:
• Supervised learning: requires a dataset with an input and the correct output (training dataset), with of course the idea that the input and output are related. Supervised learning problems can be categorized as "regression" or "classification": regression is about predicting a continuous output (a value between 1 and 100, for example), while classification is about predicting a discrete output (an accept or deny status, for example).
• Unsupervised learning: is used to explore the data and approach problems with an unknown output; in other words, it is used when we have no idea what the results should look like. The other important point is that there is no feedback based on the predictions. We use it to explore the data and understand how it could be structured (how Google groups news articles, for example).