
GRADUATION PROJECT

Submitted for the National Engineering Degree (Diplôme National d'Ingénieur)

Building an Insights Stack

Prepared by: Jebari Mohamed Amine


Supervised by: Schmitz Tim
Kalra Ashish
Med Heny Selmi

20 September 2017
Supervisors’ validation
Mr. Mohamed Heny Selmi

Mr. Tim Schmitz


Acknowledgements

At the very outset of this report, I would like to extend my sincere and heartfelt gratitude to everyone
who has helped me in this endeavor. Without their active guidance, help, cooperation and
encouragement, I would not have made headway in the project.
Firstly, I would like to express my sincere gratitude to my company and university supervisors, Mr.
Tim Schmitz, Mr. Ashish Kalra and Mr. Mohamed Heny Selmi, for their continuous support of
my final project, and for their patience, motivation, and immense knowledge of both the business and
technical fields. Their guidance helped me improve myself and finish writing this thesis while taking my
first real steps into the world of employment. I would never have been able to do all this without their help.
Second, I am also deeply indebted to my colleagues Mr. Jasper Boskma, Mr. Sebastian Van
Lengerich, Mr. Alexander Linewitch, Mr. Alessio Avellan Borgmeyer, Mr. Niklas Henkell and
Mr. Ryan Cooley for their conscientious guidance and encouragement in accomplishing this assignment.
I am extremely thankful to my family members for their valuable guidance and support in
completing this thesis in its present form. They always kept an eye on my progress and found ways
to coach me and get the best out of me.
My sincere thanks also go to the whole tech team, who gave me all the necessary tools and facilities
to accomplish my mission and smoothed the road for me.
I extend my gratitude to ESPRIT School of Engineering for giving me this opportunity.
Finally, I would like to thank the big family that was around me, including Tunisians,
Chinese, French, Americans, Germans, Czechs, Italians, Venezuelans and Turks, who
directly or indirectly helped and encouraged me to complete my thesis, manage my stress, and
gave me the moral support that I needed.
Any omission in this brief acknowledgement does not mean a lack of gratitude.
Abstract
The project is entitled “Building an Insights Stack” and was implemented within the company
The Jodel Ventures GmbH, a product-oriented startup specialized in social media.
This work forms part of the graduation project presented to obtain the Engineer's degree in
Business Intelligence-ERP at ESPRIT, the Private Higher School of Engineering and Technology,
Tunis.
Its aim is to understand user behavior and the effect of changes to the app on the user
experience, and to deliver insights within a short amount of time. It was realized with the
BI team and touches on three main chapters of business intelligence: data warehousing, process
automation and machine learning.

Keywords: insight, product-oriented, user behavior, user experience, BI, data warehouse, process
automation, machine learning.

Résumé
This internship was launched as part of the preparation for integration into the professional world. It
is a final-year project of the Private Higher School of Engineering and Technology, Tunis (ESPRIT).

The project took place within the company The Jodel Ventures GmbH, which offers a mobile solution
(social media).

Its main goal is to understand user behavior and the effect of changes made to the app on the user
experience, and to deliver analyses quickly. The project was realized with the BI team and touches on
three important chapters of business intelligence: data warehousing, process automation and machine
learning.

Keywords: analyses, mobile solution, user behavior, user experience, BI, business intelligence,
data warehousing, automation, process, machine learning.
Table of contents
GENERAL INTRODUCTION .......................................................................................................... 1
CHAPTER 0: GENERAL CONTEXT ........................................................................................................ 3
1. PROJECT CONTEXT: .......................................................................................................................... 3
1.1. Hosting organization and team: ......................................................................................... 3
1.1.1. Hosting company: ...................................................................................................................... 3
1.1.2. Hosting team:............................................................................................................................. 3
1.2. Company methodology - OKRs: .................................................................................................. 4
1.3. Project Presentation: ..................................................................................................................... 4
1.3.1. Process Automation: .................................................................................................................. 5
1.3.2. Datawarehouse Design: ............................................................................................................. 6
1.3.3. Machine Learning:..................................................................................................................... 7
1.3.4. Summary: .................................................................................................................................. 7
2. PROJECT METHODOLOGY:................................................................................................................ 7
2.1. Methodologies: ............................................................................................................................. 7
2.1.1. SEMMA: ................................................................................................................................... 7
2.2. Choice: ......................................................................................................................................... 9
2.2.1. CRISP-DM: ............................................................................................................................... 9
CHAPTER 1: BUSINESS UNDERSTANDING ........................................................................................... 11
1. BUSINESS GOALS: .......................................................................................................................... 11
1.1. Project Background: ................................................................................................................... 11
1.1.1. Datawarehouse project: ........................................................................................................... 11
1.1.2. Experiment Automation project: ............................................................................................. 12
1.1.3. Machine Learning project :...................................................................................................... 13
1.2. Business Goals List: ................................................................................................................... 13
1.3. Business Success Criteria: .......................................................................................................... 13
1.3.1. Datawarehouse project: ........................................................................................................... 14
1.3.2. Experiment automation project: .............................................................................................. 14
1.3.3. Machine Learning project:....................................................................................................... 14
2. SITUATION STATE: ......................................................................................................................... 15
2.1. Inventory of resources: ............................................................................................................... 15
2.2. Costs and benefits: ..................................................................................................................... 15
2.2.1. Data warehouse project : ......................................................................................................... 16
2.2.2. Experiment automation project: .............................................................................................. 16
2.2.3. Machine Learning project:....................................................................................................... 18
3. DATA MINING GOALS: ................................................................................................................... 18
3.1. Business success criteria: ........................................................................................................... 18
3.2. Data mining success criteria:...................................................................................................... 18
4. PROJECT PLAN: .............................................................................................................................. 19
4.1. Planning schema:........................................................................................................................ 19
4.2. Initial assessment of tools and techniques:................................................................................. 20
4.2.1. Data warehouse project: .......................................................................................................... 20
4.2.2. Process Automation project:.................................................................................................... 20
4.2.3. Machine learning project: ........................................................................................................ 20
CHAPTER 2: DATA UNDERSTANDING ................................................................................................. 21
1. DATA DESCRIPTION: ...................................................................................................................... 21
1.1. Data sources: .............................................................................................................................. 21
1.2. Data Specifications:.................................................................................................................... 21
1.2.1. Old Redshift: ........................................................................................................................... 21
1.2.2. Old Postgres: ........................................................................................................................... 26
2. DATA EXPLORATION: ..................................................................................................................... 29
2.1. Datawarehouse project: .............................................................................................................. 29
2.2. Process Automation project: ...................................................................................................... 32
2.3. Machine Learning project: ......................................................................................................... 34
3. DATA QUALITY: ............................................................................................................................. 35
3.1. Data warehouse project: ............................................................................................................. 35
3.1.1. Client data:............................................................................................................................... 35
3.1.2. Experiment Assignation: ......................................................................................................... 35
3.1.3. User location inside the app: ................................................................................................... 36
3.2. Experiment Automation project: ................................................................................................ 37
3.2.1. core actions: ............................................................................................................................. 37
3.3. Machine learning project:........................................................................................... 37
3.3.1. Location from a user perspective: ........................................................................... 37
CHAPTER 3: DATA PREPARATION ...................................................................................................... 38
1. DATAWAREHOUSE PROJECT: .......................................................................................................... 38
1.1. Taxonomy: ................................................................................................................................. 38
1.1.1. Sources: ................................................................................................................................... 38
1.1.2. Categories: ............................................................................................................................... 38
1.1.3. Events: ..................................................................................................................................... 39
1.2. Design: ....................................................................................................................................... 43
1.2.1. Dimensions: ............................................................................................................................. 43
1.2.2. Facts:........................................................................................................................................ 49
1.2.3. Relations: ................................................................................................................................. 52
2. PROCESS AUTOMATION PROJECT: .................................................................................................. 55
2.1. Experiment parameters:.............................................................................................................. 55
2.2. KPIs Calculation: ....................................................................................................................... 57
2.3. Significance Testing: .................................................................................................................. 64
2.4. KPIS Visualizations: .................................................................................................................. 71
2.5. Retention: ................................................................................................................................... 74
2.6. User Analytics: ........................................................................................................................... 79
3. MACHINE LEARNING PROJECT: ....................................................................................................... 84
3.1. Data Selection: ........................................................................................................................... 84
3.1.1. Retained users:......................................................................................................................... 84
3.1.2. Sent engagement:..................................................................................................................... 86
3.1.3. Received engagement: ............................................................................................................. 87
3.1.4. Environment: ........................................................................................................................... 87
3.2. Data Cleaning: ............................................................................................................................ 88
3.2.1. Sent Engagement: .................................................................................................................... 88
3.2.2. Received Engagement: ............................................................................................................ 89
3.3. Data Construction:...................................................................................................................... 90
3.3.1. Sent engagement:..................................................................................................................... 90
3.3.2. Received Engagement: ............................................................................................................ 91
3.3.3. Environment: ........................................................................................................................... 92
3.4. Data Integration: ......................................................................................................................... 92
3.4.1. Merging: .................................................................................................................................. 92
3.4.2. Normalization: ......................................................................................................................... 93
CHAPTER 4: MODELING ..................................................................................................................... 94
1. MODELING TECHNIQUE: ................................................................................................................. 94
2. TESTING DESIGN: ........................................................................................................................... 96
2.1. ROC Curve: ................................................................................................................................ 96
2.2. Confusion Matrix: ...................................................................................................................... 97
2.3. Dataset Division: ........................................................................................................................ 98
3. MODEL BUILDING:........................................................................................................................ 100
3.1. Parameter settings: ................................................................................................................... 100
3.1.1. Cost Parameter: ..................................................................................................................... 100
3.1.2. Kernel Parameter: .................................................................................................................. 101
3.1.3. Gamma: ................................................................................................................................. 101
3.1.4. Number of trees: .................................................................................................................... 102
3.2. Model Training:........................................................................................................................ 102
3.2.1. SVM: ..................................................................................................................................... 102
3.2.2. Random Forest: ..................................................................................................................... 105
CHAPTER 5: EVALUATION ................................................................................................................ 108
1. RESULTS ASSESSMENT: ................................................................................................................ 108
1.1. SVM: ........................................................................................................................................ 108
1.1.1. Iteration 1: ............................................................................................................................. 108
1.1.2. Iteration 2: ............................................................................................................................. 108
1.1.3. Iteration 3: ............................................................................................................................. 109
1.2. SVM – TOP 20:........................................................................................................................ 109
1.3. Random Forest: ........................................................................................................................ 109
1.4. Random Forest – TOP 20: ........................................................................................................ 110
1.5. SVM VS Random Forest: ........................................................................................................ 110
1.5.1. All features: ........................................................................................................................... 110
1.5.2. TOP 20: ................................................................................................................................. 111
2. APPROVED MODELS: .................................................................................................................... 112
3. NEXT STEPS: ................................................................................................................................ 113
CHAPTER 6: DEPLOYMENT ............................................................................................................... 114
1. PROCESS AUTOMATION PROJECT: ................................................................................................ 114
1.1. Deployment environment: ........................................................................................................ 114
1.2. Overview: ................................................................................................................................. 114
1.2.1. Screenshots: ............................................................................................................. 114
2. DATAWAREHOUSE PROJECT: ........................................................................................................ 118
2.1. Deployment environment: ........................................................................................................ 118
2.2. First Results: ............................................................................................................................. 118

GENERAL CONCLUSION............................................................................................................ 121

APPENDICES .................................................................................................................................. 122


Table of figures
Figure 1 : Company Organigram ........................................................................................................... 3
Figure 2: Business Intelligence Department structure ........................................................................... 4
Figure 3: Objectives and Key Results Flow .......................................................................................... 4
Figure 4: Traditional experiment process .............................................................................................. 5
Figure 5: Enhanced experiment process ................................................................................................ 5
Figure 6: Data flow situation ................................................................................................................. 6
Figure 7: Data flow enhancement .......................................................................................................... 6
Figure 8: SEMMA process flow ............................................................................................................ 8
Figure 10: CRISP-DM Phases ............................................................................................................... 9
Figure 11: Methodologies comparison ................................................................................................ 10
Figure 12: Postgres database datagrip view ......................................................................................... 11
Figure 13: Redshift database Datagrip view ........................................................................................ 12
Figure 14: Detailed steps of the experiment process ........................................................................... 12
Figure 15: R Studio Shiny pricing ....................................................................................................... 17
Figure 16: AWS EC2 pricing and types .............................................................................................. 17
Figure 17: M4 instance specifications and types ................................................................................. 17
Figure 18: Project planning steps......................................................................................................... 19
Figure 19: Fact interaction Datagrip view ........................................................................................... 25
Figure 20: Old Datawarehouse relations.............................................................................................. 29
Figure 21: Experiment column data in Old Redshift ........................................................................... 30
Figure 22: Geohashes data in Old Redshift ......................................................................................... 30
Figure 23: Geohash technology structure ............................................................................................ 30
Figure 24: Geohash Example ............................................................................................................... 31
Figure 25: Geohashes distribution by countries................................................................................... 31
Figure 26: Funnel example .................................................................................................................. 32
Figure 27: Users activity inside the app distribution ........................................................................... 32
Figure 28: Users distribution in the clusters ........................................................................................ 33
Figure 29: Happy ratio influence on D1 activity ................................................................................. 34
Figure 30: Happy ratio influence on D14 activity ............................................................................. 35
Figure 31: Old Redshift metadata issue ............................................................................................... 36
Figure 32: Dimension user (new datawarehouse) ................................................................................ 43
Figure 33: Dimension content (new datawarehouse)........................................................................... 44
Figure 34: Dimension interaction (new datawarehouse) ..................................................................... 44
Figure 35: Dimension date (new datawarehouse) ................................................................................ 45
Figure 36: Dimension property (new datawarehouse) ......................................................................... 45
Figure 37: Dimension in app location (new datawarehouse)............................................................... 46
Figure 38: Dimension country (new datawarehouse) .......................................................................... 46
Figure 39: Dimension city (new datawarehouse) ................................................................................ 46
Figure 40: Dimension location (new datawarehouse) ......................................................................... 47
Figure 41: Dimension moderator status (new datawarehouse) ............................................................ 47
Figure 42: Dimension moderator decision (new datawarehouse)........................................................ 47
Figure 43: Dimension block (new datawarehouse) ............................................................................. 47
Figure 44: Dimension flag (new datawarehouse) ................................................................................ 48
Figure 45: Dimension experiment interaction (new datawarehouse) .................................................. 48
Figure 46: Dimension value (new datawarehouse) .............................................................................. 48
Figure 47: Dimension experiment (new datawarehouse) .................................................................... 49
Figure 48: Fact product (new datawarehouse) ..................................................................................... 49
Figure 49: Fact moderation (new datawarehouse) ............................................................................... 50
Figure 50: Fact experimentation (new datawarehouse) ....................................................................... 51
Figure 51: Fact product and dimensions (new datawarehouse) ........................................................... 52
Figure 52: Fact Moderation and dimensions (new datawarehouse) .................................................... 53
Figure 53: Fact Experimentation and dimensions (new datawarehouse) ............................................ 54
Figure 54: Shiny app architecture ........................................................................................................ 55
Figure 55: Date and time selection window ........................................................................................ 56
Figure 56: User IDs transformation ..................................................................................................... 56
Figure 57: KPIs calculation window.................................................................................................... 57
Figure 58: KPIs calculation table result ............................................................................................... 57
Figure 59: Fetching data from the data warehouse code lines ............................................................ 58
Figure 60: Filtering a dataframe code lines .......................................................................................... 59
Figure 61: Dataframe creation code lines ............................................................................................ 60
Figure 62: DAU Calculations code lines ............................................................................................. 61
Figure 63: Calculating the KPIs code lines............................................................................................ 61
Figure 64: New Users difference code lines ........................................................................................ 62
Figure 65: Saving the date and time code lines .................................................................................. 63
Figure 66: dynamic slider code lines ................................................................................................... 63
Figure 67: Executing the KPI function on the server ............................................................ 63
Figure 68: Significance testing table.................................................................................................... 64
Figure 69: Super t.test function code lines........................................................................................... 65
Figure 70: Same users condition checkbox ......................................................................................... 66
Figure 71: Significance test dataframe creation code lines ................................................................. 67
Figure 72: Action per DAU significance testing code lines ................................................................ 68
Figure 73: % of DAU doing the action data manipulation code lines ................................................. 69
Figure 74: Paired / Unpaired data condition ........................................................................................ 69
Figure 75: Action Per Unique Actor .................................................................................................... 70
Figure 76: Table rendering code lines ................................................................................................. 70
Figure 77 : server significance testing code lines ................................................................................ 71
Figure 78: Plotting per day box ........................................................................................................... 72
Figure 79: Plotting per actions box ...................................................................................................... 72
Figure 80: Plotting per day box code lines .......................................................................................... 72
Figure 81: Plots rendering code lines................................................................................................... 73
Figure 82: Adding trace in a plot code lines ........................................................................................ 73
Figure 83: Daily active users plot example.......................................................................................... 73
Figure 84: percentage of users doing an action plot example .............................................................. 74
Figure 85: 3d Plotting example ............................................................................................................ 74
Figure 86: Retention table .................................................................................................................... 75
Figure 87: Retention average chart ...................................................................................................... 75
Figure 88: Creating a dataframe for retention code lines ......................................................... 76
Figure 89: Calculating the retention code lines ................................................................................... 76
Figure 90: Manipulating the table rendering code lines ...................................................................... 76
Figure 91: Weighted average calculation code lines ........................................................................... 77
Figure 92: Retention calculation for existing users code lines ............................................................ 77
Figure 93: Fetching data condition ...................................................................................................... 78
Figure 94: Retention data condition in the server file.......................................................................... 78
Figure 95: Retention dataframe rendering code lines .......................................................................... 79
Figure 96: Radio buttons for charts dynamic changing ....................................................................... 80
Figure 97: Dataframe from S3 ............................................................................................................. 80
Figure 98: Clustering datatable rendering............................................................................................ 81
Figure 99: rendering clusters chart code lines ..................................................................................... 81
Figure 100: Clusters line chart ............................................................................................................. 81
Figure 101: Drop down menu code lines ............................................................................................. 82
Figure 102: Dynamic user – cluster plot code lines ............................................................................. 82
Figure 103: User inside cluster flow chart ........................................................................................... 83
Figure 104: Radio buttons code lines................................................................................................... 83
Figure 105: Plot and radio buttons fusion ............................................................................................ 84
Figure 106: Registrations pie chart per country ................................................................................... 85
Figure 107: Registrations dataframe .................................................................................................... 85
Figure 108: Getting the creation and reply data SQL function................................................ 86
Figure 109: Environment data matrix .................................................................................................. 87
Figure 110: Getting the data by geohash code lines ............................................................................ 87
Figure 111: Getting the geohash neighbours code lines ...................................................................... 88
Figure 112: Filtering the dataframe ..................................................................................................... 89
Figure 113: Manipulating action names code lines ............................................................................. 90
Figure 114: matrix to dataframe code lines ......................................................................................... 91
Figure 115: Cleaning and merging the received engagement dataframe ............................................. 91
Figure 116: Merging the received on posts and on replies rows ......................................................... 91
Figure 117: Creating the modalities for the prediction ........................................................................ 93
Figure 118: Difference between normalized and non-normalized training ......................................... 93
Figure 119: Normalization function..................................................................................................... 93
Figure 120: Application of the normalization function ....................................................................... 93
Figure 121: Linear hyperplane of SVM ............................................................................................... 94
Figure 122: Random forest logic ......................................................................................................... 95
Figure 123: Supervised learning steps ................................................................................................. 96
Figure 124: ROC Curve example ........................................................................................................ 97
Figure 125: Confusion Matrix Example .............................................................................................. 97
Figure 126: modality distribution ........................................................................................................ 98
Figure 127: Training and testing set division ...................................................................................... 99
Figure 128: Training and Testing dataset modality distribution .......................................................... 99
Figure 129: Equation Creation ............................................................................................................. 99
Figure 130: Soft Margin Examples .................................................................................................... 100
Figure 131: Random Forest Iteration 1 Error Curve .......................................................................... 106
Figure 132: Random Forest – TOP 20 Error Curve........................................................................... 107
Figure 133: SVM VS Random Forest ROC Curve (All features) ..................................................... 111
Figure 134: SVM VS Random Forest ROC Curve (TOP 20 Features) ............................................. 112
Figure 135: Welcoming Screen of the Shiny App ............................................................................. 115
Figure 136: User Uploading Screen ................................................................................................... 115
Figure 137: Calculating KPIs and getting the results Screen ............................................................ 116
Figure 138: Plotting the results screen ............................................................................................... 116
Figure 139: Retention Calculation Screen ......................................................................................... 117
Figure 140: Getting insights about the users screen .......................................................................... 117
Figure 141: Data flow from S3 to Redshift using AWS technology .................................. 118
Figure 142: New Datawarehouse view on Datagrip .......................................................................... 119
Figure 143: User Dimension view on datagrip (New Datawarehouse) ............................................. 119
Figure 144: Fact product view on datagrip (New Datawarehouse) ................................................... 119
Figure 145: Content Dimension view on datagrip (New Datawarehouse) ........................................ 120
Figure 146: User Dimension Data (New Datawarehouse) ................................................................ 120
Figure 147: Content Dimension Data (New Datawarehouse) ........................................................... 120
Figure 148: Interaction Dimension Data (New Datawarehouse) ....................................................... 120
Figure 149: Instagram MAU Evolution ............................................................................................. 123
Figure 150: Retention Table .............................................................................................................. 124
Figure 151: Retention of different Social Media ............................................................................... 125
Figure 152: Data warehouse VS Operational System........................................................................ 126
Figure 153: Relation between an Operational System and a Data warehouse .................................. 126
Table of tables
Table 1: Datawarehouse business success criteria ............................................................................... 14
Table 2: Experiment Automation business success criteria ................................................................. 14
Table 3: Machine learning business success criteria ........................................................................... 15
Table 4: Datawarehouse and databases pricing ................................................................................... 16
Table 5: Fact interaction columns and explanation (Old Redshift) ..................................................... 22
Table 6: Basic interactions naming and explanation ........................................................................... 23
Table 7: Administration interactions naming and explanation ............................................................ 23
Table 8: Me section interactions naming and explanation................................................................... 24
Table 9: Channels interactions naming and explanation ..................................................................... 24
Table 10: Hashtags interactions naming and explanation.................................................................... 25
Table 11: Hometown feature interactions naming and explanation .................................................... 25
Table 12: Experiment assignation interactions naming and explanation............................................. 25
Table 13: Fact Interaction columns and explanation (Postgres) ............................................................ 26
Table 14: Dimension Interaction columns and explanation (Postgres) ................................................. 26
Table 15: Dimension city columns and explanation .............................................................................. 27
Table 16: Dimension user columns and explanation ............................................................................. 27
Table 17: Dimension content columns and explanation ........................................................................ 28
Table 18: Dimension time columns and explanation ............................................................................. 28
Table 19: Experiment Rows Example ................................................................................................. 36
Table 20: Taxonomy Categories .......................................................................................................... 39
Table 21: Events Taxonomy and sources ............................................................................................ 42
Table 22: Data format result ................................................................................................................ 59
Table 23: Binary manipulation dataframe result ................................................................................. 69
Table 24: metadata column data .......................................................................................................... 89
Table 25: Merging results .................................................................................................................... 90
Table 26: Matrix to dataframe result ................................................................................................... 91
Table 27: SVM Iteration 1 Features Weights .................................................................................... 103
Table 28: SVM Iteration 2 Features Weights .................................................................................... 103
Table 29: SVM Iteration 3 Features Weights .................................................................................... 104
Table 30: SVM – TOP 20 Features Weights .................................................................................... 105
Table 31: Random Forest Iteration 1 Features Weights .................................................................... 106
Table 32: Random Forest – TOP 20 Features Weights ..................................................................... 107
Table 33: SVM Iteration 1 Confusion Matrix.................................................................................... 108
Table 34: SVM Iteration 2 Confusion Matrix.................................................................................... 108
Table 35: SVM Iteration 3 Confusion Matrix.................................................................................... 109
Table 36: SVM TOP – 20 Confusion Matrix..................................................................................... 109
Table 37: Random Forest Iteration 1 Confusion Matrix .................................................................... 109
Table 38: Random Forest – TOP 20 Confusion Matrix..................................................................... 110
General Introduction
Knowledge is, by definition, any new information, fact or skill acquired by a person or a group of people
through theoretical or practical understanding of a subject. But all these dictionaries and encyclopedias
forget to mention an important part of acquiring that knowledge: the analysis.
In fact, since the beginning of life, human beings have been observing nature and its different
transformations, and gaining knowledge by analyzing it. It is true that they were not really conscious of
what they were doing. Nevertheless, that was it: they were the proper analysts of the early ages. Whether
called scientists by today's researchers or philosophers by literary people, the most important point is
that they were showing us the truth.
Hundreds and thousands of years later, the questions were becoming bigger and bigger, as humanity
had already accumulated a huge amount of information that, fortunately, let it create new
technologies to speed up its discovery processes. That was the industrial revolution.
Electricity, steam power, machines: a whole new army of tools was in the hands of mankind to
tackle problems far beyond what it had been looking at. Finally, about a century later, the first
computer was created. Occupying about 1,800 square feet, it was very different from our present-day
definition of a computer, but it still did the automatic computing that was needed. Although using it was
hard and awkward, it helped the great minds understand that a new era had started, an era that could
take Man far beyond his ambitions, an era of fast computing.
Mathematics started being applied more and more on these machines, and the two disciplines kept
merging, for the sake of insights, for the sake of truth. From the 1970s until today, we have seen more
and more "bunches of code lines" doing complex tasks and helping the world
understand natural, physical or technical phenomena. Data science was born. Along with it, more
and more data was created and spread everywhere: text, images, sound, all the types of data that
transactions or simple electric signals could create. Handling and analyzing all of this also required a
new discipline: data warehousing.
Hand in hand, these two fields (data science and data warehousing) are breaking through the limits of
time and speed to make the world a better place for humanity by understanding the most complex
happenings.
I chose to do this internship with Jodel because I saw in it an opportunity to acquire these analytical
skills. Working with data created by real people was a chance to live the experience of an analyst, a
scientist, a philosopher of modern times.

This report is divided into seven chapters. Chapter 0 presents the general context: the company, its
field of activity, and the project. Chapter 1 is devoted to business understanding, introducing the business
in which I was involved. Chapter 2 concentrates on data understanding. Chapter 3 covers data
preparation. Chapter 4 presents the main theoretical study and modeling. Chapter 5 reflects the
evaluation of our results. Finally, Chapter 6 describes the deployment of our solution in its different
aspects.

Chapter 0: General context

In this first chapter, we present The Jodel Ventures GmbH, the company in which this project was realized,
its business market, the services it offers, and the team in which I fulfilled my end-of-studies
internship. We also present our solution and how we came to carry out this project (the challenges).

1. Project Context:
1.1. Hosting organization and team:
1.1.1. Hosting company:
The Jodel Ventures GmbH is a startup based in Berlin whose main field of activity is social media. Created in 2015 by Alessio Avellan Borgmeyer, its main mission is to let you instantly engage with the community around you. Its vision is to develop a platform to discover, follow and participate in the most relevant conversations with people within roughly 10 km, anonymously.
The Jodel Ventures GmbH provides an iOS and Android location-based app that only requires your geolocation to let you interact with people around you by reading, posting and even sharing on other platforms like WhatsApp or Facebook.
The communities are active in more than 10 countries including Saudi Arabia, and the number of daily active users is more than 1,000,000.
1.1.2. Hosting team:
During the realization of this project, I was part of the Business Intelligence team, freshly recruited and built in the first days of 2017. I was also working with all the other departments according to their needs for insights and analysis. Here is the general structure of The Jodel Ventures GmbH:

[Organization chart: the Board sits above the CTO, CEO and COO, who oversee the Tech, Business Intelligence, Office Management, Human Resources, Product, Growth and Community departments.]

Figure 1: Company Organigram
Most of my time was spent in my department (BI), working on different projects related either to our own analytics or to other departments' needs. Here is the general diagram of the BI team:

Figure 2: Business Intelligence Department structure

1.2. Company methodology - OKRs:


When tackling real problems and issues in such a competitive market, there is no way to compete with other social media companies without following a clear and organized methodology for choosing and ordering tasks. After reviewing the possible ways of dealing with long-term goals, Jodel chose the OKR system.
Introduced at Intel and popularized by John Doerr, it is now an international framework used by giants like Google, LinkedIn and Uber.
Based on choosing objectives and the key results needed to achieve them, it is meant to let every employee see what the others are working on and how far they have progressed. Not achieving your goals does not mean you are a bad employee, but that you should reconsider how you estimate the time a task will take. Finally, these objectives should be chosen every 3 months (one quarter). In our case, and due to our fast evolution, we chose a two-month period instead.

Figure 3: Objectives and Key Results Flow

Every OKR quarter includes a weekly grading of the objectives and key results, usually made first at the individual level, then at the department level, and finally at the company level.
In Jodel's case, we are at the moment setting department and company OKRs only. We also plan to introduce individual OKRs to ensure personal development.
As long as 60-70% of one's objectives and key results are achieved, the quarter is considered successful. On the other hand, if the final grading is too low or too high, one has to reconsider how every task is estimated.
1.3. Project Presentation:
Over the past years, Jodel (as a company) has taken giant steps in the social media world. Although the road is still long and full of obstacles, we can easily say that the growth has been quite impressive. But after reaching a certain level of user experience, it becomes difficult to understand what is happening with your users unless you analyze the data.

1.3.1. Process Automation:
Not that they were not analyzing before, but Jodel employees were obliged to wait for a significant amount of time before getting the needed results. Time, however, is not a luxury one can afford: it only takes a few hours or days for a whole new product to rise from nowhere, compete with you and try to take the best out of your market.

Figure 4: Traditional experiment process

The schema above explains the process of experimentation. As you know, the best way for a company to improve is to run different experiments and track the KPIs it wants to increase before completely rolling out a new feature. As a matter of fact, Facebook is able to run up to 3,000 experiments at the same time to choose the best new features from the ones it is willing to launch. Consequently, in Jodel's case, analyzing the data was taking a lot of time due to the lack of resources (material and human). To remedy this issue, automation seems to be the best solution.

Figure 5: Enhanced experiment process


1.3.2. Datawarehouse Design:
The second inspiration for this project was in fact the different data points that we have. After some days of empirical analysis, the BI department figured out that the client-side and server-side events were split between different databases. This is without mentioning that one of the main investors of Jodel is an IT company which therefore provides a reporting tool and tracks the events with its own SDK. Here is a summary of how the data is divided:

Figure 6: Data flow situation

The goal of this part of the project is to have all the data needed for analysis stored in one unique data warehouse, while keeping the integrity of everything that can be extracted. By doing this, we would be able to connect any reporting tool and create monitoring dashboards for all the teams:

Figure 7: Data flow enhancement

1.3.3. Machine Learning:
Last but not least was the concern about the knowledge of our users. The more you know your product (the users), the more you are able to improve it. Starting from that point, we agreed on doing some analysis and chose to carry on with one important question: what is the optimal user?

Answering this main question will certainly give us more insights about the consumers of the social app and help us make better decisions than before. This will be the machine learning part of our work.

1.3.4. Summary:
The project fulfills three important Objectives that will give a new value to the company:

• Getting the data from a unique point.

• Automating the experimentation process.

• Getting new insights about the users.

Achieving these points will help Jodel compete with its direct competitors and find the right answers to the right questions.

2. Project Methodology:
2.1. Methodologies:
2.1.1. SEMMA:
SEMMA is a methodology created and developed by the SAS Institute that makes the exploration and visualization, along with the selection, transformation and modeling, easier and more understandable for the users of the SAS data mining tools. Here is a graphic representation of this methodology:

Figure 8: SEMMA process flow

• Sample:
Taking a small portion of a large data volume in order to make the data handling more agile. It reduces time and resource costs. By the end of this task, we have a training set, a validation set and a test set.
• Explore:
This phase is about finding unexpected trends and anomalies in our data. The exploration can be done either through numbers or through visualizations. The most common techniques are clustering, factor analysis and correspondence analysis.
• Modify:
Modify is about manipulating the data needed for the modeling, changing its format if needed, and removing what is not needed from our data set.
• Model:
After getting our clean, ready data, it is time to apply algorithms and statistical methods such as neural networks, decision trees or logistic models.
• Assess:
The last part of this method is to evaluate the modeling and check whether it was good. In order to know this, we apply the model to another test set and check the results.

2.2. Choice:
2.2.1. CRISP-DM:
Cross-Industry Standard Process for Data Mining is a data mining process model (along with KDD and SEMMA) developed in the late 1990s by a consortium of companies, and it is still considered today as the best and most generic approach for managing data science projects (source: KDnuggets.com).
The main reasons for using this methodology are that it is truly independent of any tool or technique (unlike SEMMA, which is tied to SAS) and that it supports project documentation, knowledge transfer and training.
It splits any data mining project into six important phases while also allowing different iterations (going back and forth between the steps):

Figure 9: CRISP-DM Phases

• Business Understanding:
This first step is essentially about understanding the business, the need for the specific project and the
resources that we have. It also includes the risks, the costs, the benefits and finally developing a project
plan according to these variables.
• Data Understanding:
This step is about selecting the data requirement and doing an initial data collection in order to explore
and get an overview the its quality.

• Data Preparation:
The data preparation task is the final selection of the data, acquiring it and doing all the necessary cleaning, formatting and integration. It should also be extended to some transformation and enrichment (for wider possibilities of analysis). It can sometimes be the longest part of a data mining project.
• Modelling:
When reaching the modelling phase, a modelling technique should be selected. The better the previous steps were carried out, the better this choice will be: it depends on how well the need and the data were understood. The data scientist/analyst also needs to divide the data set into training and testing subsets for evaluation. Finally, one should examine alternative modeling algorithms and parameter settings (gradient descent, etc.).
• Evaluation:
The evaluation is about asking oneself: "Is this result answering my business question?" or "Is this the wanted output?". It also includes approving the model according to some specifications.
• Deployment:
Whether it is an API or a result saved in an Excel sheet or Word document, the deployment means creating a report containing the findings of your analysis. In the case of an API, it means planning the deployment on an operational system, such as a server on which it will be executed according to your needs. This step also contains a final review of the project and a plan for the next steps.

Figure 10: Methodologies comparison


Chapter 1: Business
Understanding

In this chapter, along with all the upcoming ones, we will always try to have three subchapters covering the three main projects we participated in. Chapter 1 is about understanding the goals of these projects, the success criteria and the resources; in short, it lays the pillars for the next steps of this project.

1. Business Goals:
1.1. Project Background:
Like every new thing that one is going to build, it is important to get an overview of the current state.
In other words, every engineer has to go through what is already existing to know what to do, what to
plan. This is what the project background is about.
1.1.1. Datawarehouse project:
Jodel is a company that is evolving quite fast. Therefore, problems sometimes have to be solved very quickly. As a result, we ended up having data in different sources:
• Postgres Datawarehouse: the first datawarehouse created at Jodel, by the previous CTO. With its ETL process written in Node.js, it is quite fast and contains the basic interactions of the app (register, post, reply, voteup, votedown). The timeframe of this datawarehouse is from 2015 until summer 2017. The data is still not cleaned and a lot of tables are empty because the work was not finished.

Figure 11: Postgres database datagrip view


• Redshift Database: Used as an analytical database, this is the DB most used by the product and BI teams. It contains all the data from November 2016 to summer 2017. Unlike the Postgres DW, it is updated with the latest features and contains a single table with multiple columns. It is also not clean enough and contains messy tables like fact_interaction2 or the test tables; the only table actually used is fact_interaction.

Figure 12: Redshift database Datagrip view

• RLT Cache: The third source of data is the cache used by the reporting tool of our investors. Containing some precalculated queries used to plot charts, it is the most up-to-date and most directly connected source of data. Whenever a new server-side or even client-side event is added, it is fed to the RLT cache through a specific pipeline.
1.1.2. Experiment Automation project:
As a company willing to reach a high number of retained users in a specific, restrained timeframe, it is important to run multiple experiments in order to increase the retention and the engagement of the users. This is how the experimentation process works:

- First draft of the idea
- Meeting with the BI team: define the KPIs, the sample size and the cohort
- Acceptance by the Head of the Product Department
- Development of the feature
- Roll-out of the feature to the experiment groups
- End of the experiment and choice of the data to be analyzed by the BI team
- Analysis of the data by the BI team
- Decision to roll the feature out globally or not

Figure 13: Detailed steps of the experiment process
For a company that wants to evolve fast, these steps can take a considerable amount of time, especially if multiple experiments are run at the same time.
1.1.3. Machine Learning project :
When launching a new feature, or choosing to launch one, you are committing two things that matter for every company: money and employees' time. Therefore, choosing what to change and what to add to the app should be a very wise and calculated step. A wrong change, or one that misses the actual needs of the user base, can lead to the complete crash and end of the app. One of the main examples was Yik Yak1: started as an anonymous app, it asked its users, after gathering a critical mass, to make their profiles non-anonymous. After some weeks of struggling, this led to the death of the app and the shutting down of its servers.
Jodel, as a company that is not yet monetizing and has a specific amount of money to spend in a specific amount of time, has to face the same challenges. Willing to always push the app forward and make changes that could lead to higher engagement and retention of the users, it urgently needs to know what is really critical and important to add.
For now, apart from general metrics like Daily Active Users and Retention, there is no real understanding of the actual activity and of the milestones that influence the users.
1.2. Business Goals List:
Jodel needs to fulfill its users' needs for socializing and interacting, and increase their stickiness to the app. To that end, this is what should be done:
• Make data more accessible and understandable for everyone inside the company to know where
and how to track the users’ behavior.
• Provide fast analytics for the different departments to get results about the experiments and
take decisions.
• Increase the knowledge about the users to know the different trends among them and what to
change exactly to increase their engagement and therefore retention.
1.3. Business Success Criteria:
The business success criteria are what make the company, or the people in charge of a task, say whether they reached their goal or not. In our case, according to the company methodology, the different key results of the objectives are what make a task a success or a failure:

1 https://www.theverge.com/2017/4/28/15480052/yik-yak-shut-down-anonymous-messaging-app-square
1.3.1. Datawarehouse project:

Perspective: Business
Who: Product / Growth / Community
What: Understanding; availability of the data
Measure: Have the full taxonomy integrated; have business-driven fact tables

Perspective: Technical
Who: Business Intelligence Team
What: Integrity; speed; architecture; history
Measure: Full server-side and client-side events; keep track of months of data; availability of the data; unique source of data

Table 1: Datawarehouse business success criteria

1.3.2. Experiment automation project:

Perspective: Business
Who: Product Dept / Community Dept
What: Analysis of the experiments; fast results
Measure: Experiment results in less than 24 hours; have all the KPIs implemented

Perspective: Technical
Who: Business Intelligence Team
What: Process automation; data manipulation
Measure: Have a web app for experiment analysis using RStudio; have the calculations based on sound mathematical/scientific methods

Table 2: Experiment Automation business success criteria

1.3.3. Machine Learning project:

Perspective: Business
Who: The whole company
What: Knowledge acquisition
Measure: Specification of the optimal user

Perspective: Technical
Who: Business Intelligence Team
What: Machine learning analysis; data manipulation
Measure: Model with more than 60% accuracy; top 20 features influencing the optimal user experience

Table 3: Machine learning business success criteria

2. Situation State:
2.1. Inventory of resources:
• Personnel:
o Backend team.
o Product Team.
o Head of Business Intelligence Department (Bachelor and Master in Mathematics).
• Data:
o MongoDB Production Data.
o Live access to the old datawarehouses and databases.
o Live access to the new datawarehouse.
• Computing resources:
o Lenovo, Windows 8.1, i5 processor, 8 GB RAM
o MacBook Air, early 2014, Sierra, i5 processor, 8 GB RAM
o m4.xlarge and m4.4xlarge (16 vCPUs) AWS instances
• Software:
o RStudio 3.2.4.
o DataGrip.
o Draw.io.
o Excel.
o AWS.
o Shiny.

2.2. Costs and benefits:


As seen previously, our project touches three main fields: data warehousing, process automation and machine learning. Understanding how much they will cost and how much the company will save requires a good understanding of the costs of the company's previous processes and settings.
2.2.1. Data warehouse project :
When talking about the material costs and benefits of a datawarehouse, we are mostly talking about the prices of the instances a company uses to store the data.
Like most startups nowadays, Jodel relies on AWS: whether in terms of quality or availability, Amazon has proven able to provide one of the best services.
The new datawarehouse is designed mainly so that all the other data points creating the discrepancies can be shut down.
Here is a table summarizing the costs of the old databases and warehouses and the future cost of the new datawarehouse:
* source: CTO per Interim of Jodel
Database Type of Instance Price/Month
Old Datawarehouse (Postgres) m4.xlarge 1,522.01 €
Old 1-Table Database (Redshift) ds2.xlarge 582.561 €
New Datawarehouse ds2.xlarge 383.264 €

Table 4: Datawarehouse and databases pricing

As explained above, after the full implementation of the new datawarehouse, we will be able to decommission the Postgres datawarehouse and the 1-table database and save 1,522.01 + 582.56 - 383.26 ≈ 1,721.31 €/month.
2.2.2. Experiment automation project:
The benefits of the process automation are maybe not easy to calculate, because it created a whole new behavior inside the company. However, we can use some assumptions to estimate what Jodel was spending and how this changed.
Before the automation, Jodel could run a maximum of 2 experiments per week, and due to the lack of resources (1 product analyst), analyzing 2 experiments also took 1 week. Therefore, the monthly cost of analyzing 8 experiments was 2,500 euros, which is the wage of an analyst.
Coming to our web application, RStudio is free, open-source software that allows you to build this kind of process automation. Moreover, letting the product managers do their own analysis frees the other BI resources. Finally, when creating a Shiny app, RStudio gives you the possibility to use their hosting, but with a limited number of hours.

Figure 14: R Studio Shiny pricing

At Jodel, we chose to deploy it ourselves on an Amazon EC2 instance. This way, we are able to debug it and scale it according to our needs, with an unlimited number of hours and users.

Figure 15: AWS EC2 pricing and types

For performance needs, we use an m4.xlarge instance most of the time, which has the following specs:

Figure 16: M4 instance specifications and types


As a final result, a simple calculation leads us to a total of 0.222 $ × 24 h × 30 days = 159.84 $/month, which is about 136 €. To this we can add having the app maintained once a month by a BI team member for a whole week: 2,500 / 4 = 625 €/month, which gives a total of 761 €/month. This is still very far from the previous figure: it lets the company analyze up to 16 experiments per month while saving 1,739 €/month.
2.2.3. Machine Learning project:
The machine learning task scheduled in this project is purely investigative. Its output is advice about which new feature to implement, so as not to spend money on developers only to end up rolling back all the effort. Usually, a bad feature costs one week of salary of an Android developer (1K), an iOS developer (1K) and a backend developer (1K), which gives a total of 3K euros. The machine learning task, on the other hand, was performed for one month on an RStudio Server installed on an m4.4xlarge Amazon EC2 instance, which costs a total of 213.12 € when used 8 hours per day for 30 days. Adding the wage of a data analyst for one month, we end up with a total of 2,713.12 € paid only once, in exchange for valuable information to guide feature development.

3. Data Mining Goals:


3.1. Business success criteria:
The business success criteria are the specific outputs that allow the projects to be described as done:
• A final design of a data warehouse specific to the product and community needs
• A web app giving the final calculations and plots of the different KPIs of the Product and Community departments
• A fast web app that allows multiple analyses at the same time
• The most influential interactions that lead to the optimal user
3.2. Data mining success criteria:
The data mining success criteria are what matters as an output of our machine learning analysis:
• Weights of the top 20 features, from the highest to the lowest
• A probability of 60% or more that these features influence the user
• A successful use and understanding of SVM

4. Project Plan:
This part of the plan is about describing the different steps that would occur during this project and the
different iterations involved. Moreover, we will talk about the different tools and techniques used to
reach the data mining and business goals.
4.1. Planning schema:
As specified above, the methodology used in this project is CRISP-DM. Although not all our tasks are data mining tasks, they follow some of the iterations of this methodology which, thanks to its agile nature, helps us reach our goals more easily.

[Planning diagram: the datawarehouse design, process automation and machine learning tracks are planned in parallel and iterated across the successive project phases.]

Figure 17: Project planning steps
4.2. Initial assessment of tools and techniques:
4.2.1. Data warehouse project:
• Draw.io: a free online design and modeling tool that lets you export your schemas in different formats such as XML, PDF, etc.
• Slack: a collaborative professional platform for discussions and meetings.
• Trello: a collaborative tool for task assignment.
• DataGrip: an IDE from JetBrains for SQL, also used to visualize the database schema.
4.2.2. Process Automation project:
• R and RStudio: R is an open-source, statistics-oriented programming language used for data mining and data analysis; RStudio is an IDE for R.
• Shiny: a web application framework for R that allows you to turn your analyses and designs into an interactive web app.
• Plotly Package: the R version of Plotly, an online data analytics and visualization tool that allows you to create rich visualizations.
• Bootstrap: a free, open-source front-end web framework that contains different fonts, buttons and other interface components for designing user-friendly websites.
• Slack: a collaborative professional platform for discussions and meetings.
• Trello: a collaborative tool for task assignment.
• DataGrip: an IDE from JetBrains for SQL, also used to visualize the database schema.
4.2.3. Machine learning project:
• R and RStudio: R is an open-source, statistics-oriented programming language used for data mining and data analysis; RStudio is an IDE for R.
• Slack: a collaborative professional platform for discussing and meetings.
• Trello: a collaborative tool for task assignment.
• DataGrip: an IDE from JetBrains for SQL, used also to visualize the database schema.

Chapter 2: Data
Understanding

This chapter will go through the data involved in our project (whether for the data warehouse, the process automation or the machine learning part), its quality, and how it can answer our questions.

1. Data Description:
1.1. Data sources:
Here are the different data sources we used to achieve our goals:
• Old Redshift: the database named internally "Old Redshift" is an analytical database used by the product team to do some simple analyses. Each row represents one event. It runs on Amazon Redshift using the ds2.xlarge instance of AWS. The data is stored there from September 2016 until May. It is updated with 22 million rows per day.
• Old Postgres: the data warehouse named internally "Old Postgres" is the first data warehouse, containing one fact table and multiple dimensions. It has a traditional DW design and runs on PostgreSQL on an m4.xlarge instance. The data is stored there from October 2014 until August. The ETL process was coded in Node.js.
In order to be able to go through all of them, here are the tools we used (a short connection sketch follows the list):
• DataGrip.
• RPostgreSQL and R: one of the best-known database interface packages in R. As the name tells, it is a Postgres driver, but it can handle Redshift too.
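As a hedged illustration of how these sources are reached from R (the endpoint, database name and credentials below are placeholders, not the real ones), a connection and a simple query look roughly like this:

# Minimal sketch of querying the old warehouses from R with DBI + RPostgreSQL.
library(DBI)
library(RPostgreSQL)

con <- dbConnect(PostgreSQL(),
                 host     = "old-redshift.example.internal",  # placeholder endpoint
                 port     = 5439,                             # default Redshift port
                 dbname   = "analytics",
                 user     = Sys.getenv("DB_USER"),
                 password = Sys.getenv("DB_PASSWORD"))

# Example: yesterday's event counts per interaction type
daily_counts <- dbGetQuery(con, "
  SELECT interaction_key, COUNT(*) AS n
  FROM fact_interaction
  WHERE utc_date_key >= CURRENT_DATE - 1
  GROUP BY interaction_key
  ORDER BY n DESC")

dbDisconnect(con)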
1.2. Data Specifications:
1.2.1. Old Redshift:
• fact_interaction table :

Column name Meaning

interaction_key What interaction did user do? See: description of events tracked in the DW

country_key 2-letter country code

city_key Currently not tracked (1 for all rows)

city_name City name of a mapped city

date_key Date and time in yyyy-MM-dd HH:mm:ss

utc_date_key Used as index ! Date and time in yyyy-MM-dd HH:mm:ss

lifecycle_key Number of minutes from registration

user_key Unique user id

content_key Unique id of content

geohash 5-letter geohash location

experiments Which experiments is this user part of?

metadata Additional data for some interactions, for instance flag reasons.

metadata_2 Used to take values of specific experiments

user_agent The version of the OS

Table 5: Fact interaction columns and explanation (Old Redshift)

o interaction_keys tracked in the table:


§ Basic interactions:

Interaction Meaning

action.user.create New registration

action.getposts.details/action.getposts.details_new Click on a post to see the comments

action.post.votedown Downvote content

action.post.voteup Upvote content

action.post.reply Reply

action.post.create Post

action.post.delete Delete your own post

action.post.flag Flag a post

action.post.pin Pin a jodel

action.post.unpin Unpin a jodel

action.getposts.location_triple Get the posts of the “newest” feed

action.getposts.location_discussed Get the posts of the “discussed” feed

action.getposts.location_recent Get the posts of the “newest” feed

action.getposts.location_popular Get the posts of the “popular” feed

action.app.LoudestFeedSelected Click on loudest

action.app.MostCommentedFeedSelected Click on most commented

action.app.NewestFeedSelected Click on newest

action.post.block action of blocking a post. Reason of


blocking is in metadata

Table 6: Basic interactions naming and explanation

§ Administration:

Interaction Meaning

admin.user.blocked User is blocked by Jodel

admin.user.unblocked User gets unblocked by Jodel

Table 7: Administration interactions naming and explanation

§ Me Section:

Interaction Meaning

action.getposts.mine_triple Go to “My Jodels”

action.getposts.mine_popular In “My Jodels” click on “popular”

action.getposts.mine_discussed In “My Jodels” click on “discussed”

action.getposts.mine_voted Go to “My Votes”

action.getposts.mine_replied Go to “My replies”

action.post.moderate Moderate a post

action.getposts.modposts Click on “Moderation”

Table 8: Me section interactions naming and explanation

§ Channels:

Interaction Meaning

action.user.add_channel Add a channel

action.user.remove_channel Remove channel

action.getposts.channel_combo Go to “channels”

action.getposts.channel_recent In a channel click “recent”

action.getposts.channel_discussed In a channel click “discussed”

action.getposts.channel_popular In a channel click “popular”

Table 9: Channels interactions naming and explanation

§ Hashtags:
Interaction Meaning

action.user.add_hashtag_channel Add hashtag channel

action.user.remove_hashtag_channel Remove hashtag channel

action.getposts.hashtag_channel_combo Go to “channels”

action.getposts.hashtag_channel_recent In a channel click “recent”

action.getposts.hashtag_channel_discussed In a channel click “discussed”

action.getposts.hashtag_channel_popular In a channel click “popular”

action.user.remove_hashtag Same as for hashtag_channel

action.getposts.hashtag_discussed Same as for hashtag_channel

action.getposts.hashtag_combo Same as for hashtag_channel

action.getposts.hashtag_recent Same as for hashtag_channel

action.user.add_hashtag Same as for hashtag_channel

action.getposts.hashtag_popular Same as for hashtag_channel

Table 10: Hashtags interactions naming and explanation

§ Hometown Feature:
Interaction Meaning

action.app.SetHomeStarted Click on the city name to start the


Hometown setup
action.app.SetHomeCompleted Set a Hometown

Table 11: Hometown feature interactions naming and explanation

§ Experiment Assignation:
Interaction Meaning

action.user.add_experiment User was assigned an experiment


action.user.remove_experiment User was removed from an experiment

Table 12: Experiment assignation interactions naming and explanation

Figure 18: Fact interaction Datagrip view


1.2.2. Old Postgres:
• fact_interaction table :

Column name Meaning

id Row identifier (no use)

interaction_key What interaction did user do? See: description of events tracked in the DW

country_key 2-letter country code

city_key Numerical key identifying a city. A key of the dim_city table

date_key Date and time in yyyy-MM-dd HH:mm:ss

lifecycle_key Number of minutes from registration of given user to the moment of interaction

user_key Unique user id

content_key Unused

Geohash 12-letter geohash location

Created Unused

Last_edited Unused

Table 13: Fact Interaction columns and explanation (Postgres)

• dim_interaction table:

Column Name Meaning

Id Id of the interaction

Type What type of interaction

Table 14: Dimension Interaction columns and explanation (Postgres)

• dim_city table:

Column Name Meaning

id Numerical city_key of the fact_interaction table

name City name

population Empty at the moment

universities Empty at the moment

Polygon Coordinates of the geohash

Processed Unused

country 2-character country abbreviation

addition date When was the city mapped?

center GPS coordinates of the central point

Table 15: Dimension city columns and explanation

• dim_user table:

Column Name Meaning

id user_key from fact_interaction, unique user id

karma User's total karma

client_type Android or iOS?

blocked Was the user blocked?

created When was the entry created

Last_edited When the last modification in the user happened

Table 16: Dimension user columns and explanation

• dim_content table:

Column Name Meaning

id content_key of the fact_interaction table, unique id of content

karma Upvotes - downvotes on this particular content

message The content itself, pictures as a link

parent In case of reply, parent is the content id of the original post

blocked Was the post blocked?

created Empty

Last edited Empty

Table 17: Dimension content columns and explanation

• dim_time table:

Column Name Meaning

time Timestamp playing the role of the ID

formatted_date The date without the time

hours_of_day The hour

minute_of_hour the minute in the hour

day_of_week Which day of the week it is (1-7)

year_calendar_week the week of the year (1-52)

calendar_month the month number in the calendar (1-12)

calendar_quarter the quarter number in the calendar (1-4)

calendar_year the year number in the calendar

holiday_indicator TRUE if it is a holiday

weekday_indicator TRUE if it is a Sunday

Table 18: Dimension time columns and explanation

Figure 19: Old Datawarehouse relations

2. Data Exploration:
2.1. Datawarehouse project:
As we know, one of the main differences between a database and a datawarehouse is the purpose: one is purely operational, the other one is analytical. We also know that every fact table should be related to a specific business case. Our exploration here aims to understand how all these columns relate to the business and how their values are structured.
• Experiment Assignation: while going through the database (Old Redshift), we see that there is a column named experiments that contains multiple values concatenated in one string.

Figure 20: Experiment column data in Old Redshift

Most of these words may seem like nonsense to the reader, but they are simply the names of specific experiments running for specific users.
• Geohashes: Jodel is a location-based app and therefore needs a system to quantify location so that users see posts according to where they are. For that, the company uses a geocoding system that divides space into grid-shaped buckets: the longer the geohash string, the more precise it is. For business purposes (the radius around every user), Jodel uses 5-character geohashes (a small encoding sketch follows the geohash example below).

Figure 21: Geohashes data in Old Redshift

Figure 22: Geohash technology structure

Examples:
u1vu1:
-latitude: 55.59013367
-longitude: 8.17314148

Figure 23: Geohash Example
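To make the geohash mechanics concrete, here is a minimal base-R sketch of geohash encoding written for this report (an illustrative helper, not Jodel production code); it interleaves longitude and latitude bits and maps each group of five bits onto the base-32 geohash alphabet:

geohash_encode <- function(lat, lon, precision = 5) {
  # Base-32 alphabet used by the geohash scheme (no "a", "i", "l", "o")
  base32 <- strsplit("0123456789bcdefghjkmnpqrstuvwxyz", "")[[1]]
  lat_range <- c(-90, 90)
  lon_range <- c(-180, 180)
  bits <- integer(0)
  even <- TRUE                              # even-indexed bits refine longitude
  while (length(bits) < precision * 5) {
    if (even) {
      mid <- mean(lon_range)
      if (lon >= mid) { bits <- c(bits, 1L); lon_range[1] <- mid } else { bits <- c(bits, 0L); lon_range[2] <- mid }
    } else {
      mid <- mean(lat_range)
      if (lat >= mid) { bits <- c(bits, 1L); lat_range[1] <- mid } else { bits <- c(bits, 0L); lat_range[2] <- mid }
    }
    even <- !even
  }
  # Translate each group of 5 bits into one character of the geohash
  chunks <- matrix(bits, nrow = 5)
  idx <- apply(chunks, 2, function(b) sum(b * 2^(4:0))) + 1
  paste(base32[idx], collapse = "")
}

geohash_encode(55.59013367, 8.17314148)   # returns "u1vu1", the example above

Adding more characters to the precision argument simply refines the bucket further.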

Here is how the countries are divided by geohashes:

Figure 24: Geohashes distribution by countries


• User Location Inside the app: one of the most positive points about Jodel is the user experience and how easy the app is to consume. There is no need for registration or a profile: your geolocation is the only thing needed to automatically set up the app and generate a user ID. That said, it is very important to know where the user is located inside the app.
Here are some findings about what already exists:
- enterFromFeed
- enterFromPush
- loadNextPage

Figure 25: Funnel example

The image above is what we call a funnel. It shows that, depending on where the user is positioned inside the app, he can leave; you would then be able to know why and where he dropped off.
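As a rough, hedged illustration of how these entry points can be counted from the event data (assuming the relevant events were already loaded into a data frame called events, with the interaction_key and metadata columns of the Old Redshift table):

# Count how users enter a conversation, using the metadata of the discoverability
# event (values such as enterFromFeed, enterFromPush, loadNextPage).
entry_events <- subset(events, interaction_key == "action.getposts.details_new")
entry_counts <- sort(table(entry_events$metadata), decreasing = TRUE)

# Share of each entry point among all conversation entries, in percent
round(100 * entry_counts / sum(entry_counts), 1)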
2.2. Process Automation project:
• Core actions:
The process automation includes calculating some specific KPIs for the product team. For this, one needs to understand the actions most often performed by the users.

Figure 26: Users activity inside the app distribution


This figure shows that posting, replying, voting and reading are really the main tasks of the users. We also took other tasks into consideration, such as flagging, which can be considered a proxy for the quality of the content.
• Users Distribution in the clusters:
One of the other analyses that the BI team did was clustering. We could query the resulting clusters from S3 to get an overview of the distribution of the users of the app: how many are very active, how many are quite active, and how many have very low activity.
1: high activity
2: medium activity
3: low activity

Figure 27: Users distribution in the clusters

2.3. Machine Learning project:
• Retention:
One of the most important metrics for Jodel is retention. It really shows which users come back to the app after using it. The comeback timeframe depends on the business point of view; Jodel chooses to calculate it monthly. A small sketch of how such a rate can be computed follows the list below.
- Retention (Weekly):
• All over the world: 27%
• DE: 31.69%
• FI: 38.95%
• DK: 35.21%
• NO: 23.78%
• AT: 26.58%
• SE: 29.04%
• IT: 25.58%
• FR: 32.27%
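Here is the sketch announced above: a hedged, simplified way of computing a comeback rate per registration cohort in R (column names follow the Old Redshift fact_interaction table; the real KPI definition used internally may differ):

# Hypothetical, simplified D7+ comeback rate per registration week.
library(data.table)

events <- as.data.table(events)          # assumes the events were queried beforehand
events[, dt := as.Date(utc_date_key)]

reg <- events[interaction_key == "action.user.create", .(reg_date = min(dt)), by = user_key]
act <- unique(events[, .(user_key, dt)])

cohort  <- merge(reg, act, by = "user_key")
by_user <- cohort[, .(returned = any(dt >= reg_date + 7)), by = .(user_key, reg_date)]
by_week <- by_user[, .(retention = mean(returned)),
                   by = .(reg_week = format(reg_date, "%Y-%U"))]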
• Happy Ratio:
One of the other important KPIs of Jodel is the Happy Ratio. It is calculated as Upvotes / Downvotes and is used as a proxy for the kind of environment the user is growing in. In fact, the upvote option is used when someone likes a post and the downvote option when someone dislikes it. Therefore, two analyses were run, for the D1 comeback and the D14 comeback.
Here are some results using a simple correlation between the happy ratio and the number of comebacks (a sketch of this computation follows the two plots):

Figure 28: Happy ratio influence on D1 activity


Figure 29: Happy ratio influence on D14 activity

As these two plots show, there could be a clear trend expressing how the happy ratio and the retention of the users are linked, and, by the same token, the upvotes and the downvotes.
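For illustration, here is a hedged base-R sketch of that correlation, assuming per-user aggregates have already been computed into a data frame called user_stats (its column names are assumptions made for this example):

# Happy ratio per user, guarding against division by zero on the downvotes side
user_stats$happy_ratio <- user_stats$upvotes / pmax(user_stats$downvotes, 1)

# Rank correlation between the happy ratio and the number of comeback days
cor(user_stats$happy_ratio, user_stats$comeback_days, method = "spearman")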

3. Data Quality:
3.1. Data warehouse project:
3.1.1. Client data:
After checking all our different data sources, it clearly appears that one of the most important missing pieces is the client-side data. Several meetings with the other departments showed a high need for it: most of the use cases they talked about were related to the number of clicks, and to where and when these clicks happened.
3.1.2. Experiment Assignation:
Although the data is available in the Old Redshift, the way the experiment names are written makes querying very difficult. Here are some examples of what you can find in the experiment field:
Experiment
hashtag_prompt_android_hashtag_prompt_android;inapp_notis_global;mentioning_repliers_menti
oning_repliers_global;picture_feed_global_picture_feed_global;screen_shot_sharing_screen_shot
_sharing_global;user_profiling_user_profiling_global
flag_reason_change_flag_reason_change_global;inapp_notis_global;mentioning_repliers_mention
ing_repliers_global;picture_feed_global_picture_feed_global

cell_new_design_cell_new_design_GLOBAL;flag_reason_change_flag_reason_change_global;m
ark_repliers_ios_mark_repliers_global;mentioning_repliers_mentioning_repliers_global;picture_f
eed_global_picture_feed_global;pin_main_feed_pin_main_feed_global;reply_in_feed
channels_berlin_old;flag_reason_change_germany
thankajodler_thankajodler_no

Table 19: Experiment Rows Example

As strange as it seems, the experiment field contains 1, 2, 3 or even 7 experiments assigned simultaneously to the user, and querying them can take ages. To do so, one has to add a LIKE '%name_of_the_experiment%' condition to the queries. As a result, the product employees need to wait 20 to 30 minutes to get 1,000 IDs to analyze. A sketch of such a query, issued from R, is shown below.
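Here is the announced sketch of such a lookup, issued through the connection con shown in the previous chapter (the experiment name is just one example taken from the table above):

# Slow pattern: scanning the whole event table with LIKE on a concatenated column.
ids <- dbGetQuery(con, "
  SELECT DISTINCT user_key
  FROM fact_interaction
  WHERE experiments LIKE '%picture_feed_global%'
  LIMIT 1000")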
3.1.3. User location inside the app:
Although the data is "available" in the database of the old warehouse, the column in which it is written changes its content from one row to another. To better understand this point, here is a concrete example:

Figure 30: Old Redshift metadata issue

The metadata column changes its content from one interaction_key to another:

- When the interaction_key is about discoverability (action.getposts.details_new), metadata contains the entry point (in our case here, enterFromPush and enterFromFeed).
- When the interaction_key is about participation (action.post.reply), metadata contains the id of the post the user replied to.
- When the interaction_key is about a moderation task (action.getposts.modposts), it counts the number of moderated posts.
As a result, we lose the entry point when the user replies, and the parent id when the user discovers content.

3.2. Experiment Automation project:
3.2.1. Core actions:
- Getpostdetails:
The experiment automation is about creating a web app that will automatically calculate the specific KPIs for the business teams. Any discrepancy would eventually lead to false values on which they could base their decisions. One of the flaws we found was the naming of the event of reading a post: for a specific period of time, an experiment was run and the event was renamed to action.getposts.details_new instead of action.getposts.details. As a result, for that period, the same data was divided between two different interaction names; the recoding applied before computing any KPI is sketched below.
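The recoding mentioned above is a one-liner in R once the events are loaded into a data frame called events (a hedged sketch, not the exact application code):

# Unify the two historical names of the "read a post" event into a single key,
# so reads are not split between action.getposts.details and ..._new.
events$interaction_key[events$interaction_key == "action.getposts.details_new"] <-
  "action.getposts.details"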
- Received replies:
While computing the metrics requested by the business-oriented employees, the web app needs to do some merging and linking between dataframes or columns. Therefore, columns that play the same role should share the same format for easier filtering and use. In practice, when searching for who received a reply, or for the ID of the post on which the reply was written, the only result is found in the metadata column, in the format parentId:58b69e78ff23cd0f3d40f1af, whereas a normal content id looks like 55bo0e78ffq33dyf3d40f12f. This makes any data scientist either wait for ages to get the related content or do extra manipulations that cost more resources and time, as in the sketch below.
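A hedged sketch of the extra manipulation this forces on the analyst, again assuming the events sit in a data frame called events:

# Strip the "parentId:" prefix from reply metadata so it can be matched against content ids.
replies <- subset(events, interaction_key == "action.post.reply")
replies$parent_content <- sub("^parentId:", "", replies$metadata)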
3.3. Machine learning project:
3.3.1. Location from a user perspective:
Our machine learning part will be about finding out what the optimal user is, according to the actions they receive, the actions they perform, and what is happening in their feed. A user living in Köln sees what is happening in his geohash and in the 26 other geohashes around him. Therefore, the most accurate way to see what is going on in a user's feed, what he is seeing, sending and receiving, is to use geohashes and not the city or even the country name.
As a simple example, consider yourself living at the border of Germany. Having your geohash sitting on this border would make you see posts from Germany but also from the Netherlands. That said, when a data scientist queries using country_key = 'DE', he will only see what the user sees from Germany and will miss what is happening on the Dutch side (as the structure is one geohash where the user is, plus the 26 other geohashes around him).

Chapter 3: Data Preparation
After understanding the business Jodel is working on and getting to know the different data sources and their possibilities, comes the time of data preparation. No modeling or analysis is possible with messy data: before applying any algorithm, loop or statistical method, one should have a clean base to work on. This is the data preparation chapter. For coherence, this chapter is again divided into our three main projects, even though the subchapters change each time.

1. Datawarehouse project:
After finding out all these discrepancies between the different data sources, one of the main steps is to
create a taxonomy for the events you are willing to send to the new datawarehouse.
As a reminder, a taxonomy is a categorization of different entities.
1.1. Taxonomy:
1.1.1. Sources:
When sending the report that the data engineers and the backend developers will use to create the new datawarehouse, it is really important to explain where every event comes from. That said, we tried to find out what the different possible access points are:
- RLT: the database of the investor's reporting tool, called RLT stats. It contains almost everything related to our app, as most of the backend engineers are the investor's engineers.
- DW: the old Redshift specified previously. It is the most recent and most up-to-date database that we can access as a BI team. It contains the events of all the new experiments.
1.1.2. Categories:
As software programming is nowadays module-oriented, the same goes for the data. As a result of all these code lines, one can easily find where every event comes from and how to categorize it. Here are the different categories found by the BI department:
Category Meaning

First time user The very first actions that a user does/sees when he uses the app

App start The actions that happen when the user connects to the Jodel community

App close Closing the session

Channels All the actions performed using the feature of Channels

Hashtags All the actions performed using the feature of Hashtags


Core user actions The primary features of Jodel (Voting, posting,etc)

Mainfeed Loading and filtering the main feed of the app

Three dot Using the three dots of the app (deleting flagging)

Moderation Doing a moderation task like accepting, skipping or denying a post/reply

Picture feed Using the picture feed feature

Hometown Assigning, changing or choosing your hometown

Me section Accessing the user section (karma, my posts, my pins, etc.)

Administration Actions made by the admins like blocking profiles


Other Example: taking screenshots

LocationFilter Filtering by locations events

LocationTag Location tag feature events

InAppNotification The events fired when the user clicks on a notification inside the app

PushNotification The events fired when the user clicks on a push notification

Table 20: Taxonomy Categories

1.1.3. Events:
In order to have a unified view of everything happening in our mobile app and a good understanding of the different phenomena that could occur, the BI department chose to own the naming of any event that will be fired into the new datawarehouse. Therefore, and after multiple iterations with the different departments, we collected the most important interactions that will be part of our new datawarehouse:
Category Event Client/Server

First time user Register Server

WelcomeScreenShow Client

WelcomeScreenTapConnect Client

App start
OpenSession Client

LocationPermissionRequestAccept Client

LocationPermissionRequestDeny Client

ProfilingShow Client

ProfilingConfirm Client

App close

CloseSession Client

Channels
ChannelsSearch Client

ChannelsJoin Server

ChannelsTapJoin Client

ChannelsUnjoin Server

ChannelsSelect Client

ChannelsEnter Client

ChannelsLeave Client

Hashtags

HashtagsTapMostCommented Client

HashtagsTapLoudest Client

HashtagsTapNewest Client
Core user actions

Upvote Server

Downvote Server

LoadConversation Server

EnterConversation Client

ViewImage Client
TapPin Client

Pin Server

TapUnpin Client

Unpin Both

PostTapPlus Client

PostTapCancel Client

PostTapCamera Client

PostTapSend Client

Post Server

Reply Server
LeaveConversation Client

TapSharePost Client

SharePost Server

GiveThanks Server

Mainfeed

MainTapNewest Client
MainTapMostCommented Client

MainTapLoudest Client

MainSelect Client

Three dot

Flag Server

DeletePost Server
Moderation

ModerationRegister Server

ModerationRemove Server

ModerationAllow Server

ModerationBlock Server

ModerationSkip Server
ModerationTapRules Client

ModerationUpdateStatus Server

Picture feed

PictureFeedEnter Client

PictureFeedLeave Client
Hometown

HometownSwitch Client

HometownStartSetup Client

HometownConfirmSetup Server

Me section
MeTapPins Client

MeTapReplies Client

MeTapSettings Client

MeTapVotes Client

Administration

BlockUser Server
UnblockUser Server

BlockPost Server

UnblockPost Server

Other

TakeScreenshot Client

LocationFilter LocationFilterTapTag Client


LocationFilterLoadFeed Server

LocationFilterTapButton Client

InAppNotification

InAppNotificationView Client

InAppNotificationTap Client

InAppNotificationDismiss Client
PushNotification

PushNotificationTap Client

PushNotificationDismiss Client

PushNotificationView Client

Table 21: Events Taxonomy and sources


1.2. Design:
1.2.1. Dimensions:
- dim_user: one of the most important dimensions; it gives a static view of the user.
o sk_user: surrogate key of the dim_user
o id_user: business key of the dim_user
o created: creation date of the user
o blocked: true if the user has been blocked before
o moderator: true if the user is a moderator
o os_type: ios/android/desktop
o user_type: student / high_school / high_school_graduate / employee / other / unknown

Figure 31: Dimension user (new datawarehouse)

- dim_content: a dimension that contains the static properties of the content.


o sk_content: surrogate key of dim_content
o id_content: business key of the dim_content
o post_type: I for image, T for text
o color: color of the cell containing the post in hexadecimal format
o from_home: true if the user posted from using hometown feature
o parent_id: contains the ID of the parent post when the content is a reply
o channel: contains the name of the channel if the content was created inside a channel
o flagged: true if the post was flagged
o blocked: true if the post was blocked
o lang : tells the language the content was written in

Figure 32: Dimension content (new datawarehouse)

- dim_interaction: a dimension that contains the different names of the interactions(events).


o id_interaction : business key of the dim_interaction
o type_interaction : name of the interaction (all the names are unique)

Figure 33: Dimension interaction (new datawarehouse)

- dim_date: a dimension containing different formats of the dates.


o sk_date : surrogate key of the dim_date
o year_no : number of the year (eq. 2017)
o day_year : number of the day in the year (1-366)
o quarter_num : number of the quarter (1-4)
o month_num : number of the month in the year (1-12)
o month_name : name of the month (January-December)
o month_day_num : number of the day in the month (1-31)
o week_num : number of the week in the year (1-55)
o day_week : number of the day in the week (1-7)
o dt : normal date format

Figure 34: Dimension date (new datawarehouse)

- dim_property: contains specific properties related to the user


o id_dim_property : business key of dim_property
o OJ : true if the user is the Original Jodeler (owner of the original post)
o ChannelJoined : true if the channel has been joined by the user

Figure 35: Dimension property (new datawarehouse)

- dim_inApp_Location: used to localize the position of the user inside the app
o id_inapplocation: business key for dim_inapplocation
o entry_point: different entry points inside the app (ex. Me, Main,
Hashtag,Channel,SearchChannels,etc)
o sorting: type of sorting applied when the action is executed (ex. Newest,
MostCommented,etc)
o filter: type of filtering applied when the action is executed (ex. timeNew,timeDay,
locationHere,locationClose,etc)
o Conversation : true if it is happening in a conversation view

Figure 36: Dimension in app location (new datawarehouse)

- dim_country: contains the different countries of the world


o id_country : business key for dim_country
o country_code : contains the country code (ex. TN for Tunisia, DE for Germany)
o country_name : contains the country name (ex. Tunisia, Germany)

Figure 37: Dimension country (new datawarehouse)

- dim_city: contains the different cities of the world


o id_city : business key of the city
o fk_country : foreign key linking dim_country
o city_name : name of the city

Figure 38: Dimension city (new datawarehouse)

- dim_location: all the geohashes of the world.


o id_location : business key of dim_location
o fk_country : foreign key linking dim_location to dim_country
o fk_city : foreign key linking dim_location to dim_city
o geohash : name of the geohash

Figure 39: Dimension location (new datawarehouse)

- dim_mod_status: dimension containing the moderator status


o id_modstatus : business key of the dimension modstatus
o status : the status of the moderator (shadow, normal, trial)

Figure 40: Dimension moderator status (new datawarehouse)

- dim_mod_decision: dimension containing the moderation decisions


o id_moddecision : business key of the dimension dim_mod_decision
o decision : the decision of the moderator (allow,deny,skip)

Figure 41: Dimension moderator desicion (new datawarehouse)

- dim_block: dimension that contains the different combinations of blocks


o id_block : business key of the dimension block
o block_source : source of the block (ex. internalModeration, autoSpam,etc)
o block_reason : reason of the block (ex. Disclosure of personal information, repost,etc)

Figure 42: Dimension block (new datawarehouse)

- dim_flag : dimension that gives the different combinations of flags

o id_flag : business key of the dimension dim_flag
o flag_source : contains the source of the block (ex. internalModeration, autoSpam,etc)
o flag_reason : contains the reason of the block (ex. Disclosure of personal information,
repost,etc)
o flag_subreason : contains the subreason of the block (ex. subreason 101, subreason
225,etc)

Figure 43: Dimension flag (new datawarehouse)

- dim_exp_interaction : dimension containing the temporary interactions that will be added when an experiment is launched. We chose not to mix them up with dim_interaction, to keep that dimension clean.
o id_exp_interaction : business key of the dimension dim_exp_interaction
o type_temp_interaction : name of the temporary interaction name

Figure 44: Dimension experiment interaction (new datawarehouse)

- dim_value : dimension that will be populated with the different values that will occur in a
specific experiment.
o id_value : business key of the dimension value
o value : a string containing a value according to a specific experiment

Figure 45: Dimension value (new datawarehouse)

- dim_experiment : name of the different experiments and their starting and ending dates.
o id_experiment : Business key of the experiment
o name : name of the experiment
o start_date : starting date of the experiment
o end_date : ending date of the experiment

Figure 46: Dimension experiment (new datawarehouse)

1.2.2. Facts:
- fact_product : first fact of our datawarehouse. This first fact table will be linked to different
dimensions that will give us answers about questions mainly related to the product department.
The design was made in a way so that it could be linked to a reporting tool for further analysis.
o fk_interaction : foreign key linking the fact product to the dimension dim_interaction
o fk_user : foreign key linking the fact product to the dimension dim_user
o fk_date : foreign key linking the fact product to the dimension dim_date
o fk_location : foreign key linking the fact product to the dimension dim_location
o fk_content : foreign key linking the fact product to the dimension dim_content
o fk_inapplocation : foreign key linking the fact product to the dimension
dim_inapp_location
o fk_property : foreign key linking the fact product to the dimension dim_property
o karma : the total number of points the user gets from receiving actions like votes, replies, blocks, etc.
§ Example: you get +2 karma if you receive an upvote and -2 karma if you receive a downvote.
o blocked_count : number of times the user has been blocked.
A sketch of a typical query against this fact table is given after the figure below.

Figure 47: Fact product (new datawarehouse)
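To illustrate how this fact table is meant to be consumed, here is a hedged sketch of a daily-active-users query against the new schema, issued from R (dwh_con stands for a DBI connection to the new warehouse, opened like the one shown in the data understanding chapter; the query is illustrative, not a production report):

# Daily active users over the last 30 days, from fact_product joined to dim_date.
dau <- dbGetQuery(dwh_con, "
  SELECT d.dt, COUNT(DISTINCT f.fk_user) AS dau
  FROM fact_product f
  JOIN dim_date d ON d.sk_date = f.fk_date
  WHERE d.dt >= CURRENT_DATE - 30
  GROUP BY d.dt
  ORDER BY d.dt")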


- fact_moderation : fact table reflecting the moderation business. Related mostly to the community team, it will be a lead and a proxy to see how the moderators* behave. It will answer concrete questions for improving our moderation system.
o fk_block : foreign key linking the fact product to the dimension dim_block
o fk_flag : foreign key linking the fact product to the dimension dim_flag
o fk_date : foreign key linking the fact product to the dimension dim_date
o fk_location : foreign key linking the fact product to the dimension dim_location
o fk_content : foreign key linking the fact product to the dimension dim_content
o fk_user : foreign key linking the fact product to the dimension dim_user
o fk_mod_status : foreign key linking the fact product to the dimension
dim_mod_status
o fk_moddecision : foreign key linking the fact product to the dimension
dim_mod_decision
o fk_interaction : foreign key linking the fact product to the dimension dim_interaction
o weight : the strength of the moderator's decision. This weight changes according to the previous decisions: if the moderator allows bad content, the weight decreases and he no longer influences the allowance or denial.
o queue_size : the number of posts still remaining to moderate.

Figure 48: Fact moderation (new datawarehouse)

*moderators: users who reached a high amount of karma (the threshold varies from one country to another) and did not get banned before. They are able to allow or deny posts.
- fact_experimentation : fact table answering questions about how experiments are going without altering the old, clean data. It will be used by the BI team, whose task will be to run A/B tests to orient the business employees toward the best solution.

o fk_exp_interaction : foreign key linking the fact product to the dimension
dim_exp_interaction
o fk_value : foreign key linking the fact product to the dimension dim_value
o fk_user : foreign key linking the fact product to the dimension dim_user
o fk_date : foreign key linking the fact product to the dimension dim_date
o fk_location : foreign key linking the fact product to the dimension dim_location
o fk_content : foreign key linking the fact product to the dimension dim_content
o fk_inapplocation : foreign key linking the fact product to the dimension
dim_inapp_location
o fk_experiment: foreign key linking the fact product to the dimension dim_experiment
o fk_property: foreign key linking the fact product to the dimension dim_property
o count_usage : number of times the new feature was used by the users assigned to the
experiment

Figure 49: Fact experimentation (new datawarehouse)

1.2.3. Relations:
- Fact_product:

Figure 50: Fact product and dimensions (new datawarehouse)

- Fact_moderation:

Figure 51: Fact Moderation and dimensions (new datawarehouse)

- Fact_experimentation:

Figure 52: Fact Experimentation and dimensions (new datawarehouse)

2. Process Automation project:
The process automation part is about building a web application that calculates all the needed
KPIs and makes the experiment process automatic. To reach this purpose, our app is divided into three
main files, following the standard Shiny app architecture:

Figure 53: Shiny app architecture

- ui.r: the file that controls the layout and appearance of the app. It can also load a CSS
file and embed some JavaScript code.
- server.r : the file that contains all the instructions needed to build the app. It is also used to
program the specific behavior of widgets, buttons, frames, tables, etc.
- helpers.r : the file that contains generic functions called multiple times from the
server file.
All the functions the business employees asked for are coded in these three files. Throughout
the steps explained below, we always tried to keep the code as simple, fast and efficient as possible,
so that the data becomes available at the speed the app's users need.
2.1. Experiment parameters:
While running an experiment, the product team needs to choose the users that will be involved and the
timeframe. This part of the app gives them the opportunity to upload the IDs and to pick the
starting and ending dates.

• ui.r:
For this purpose, the Shiny package provides a component called fileInput that gives you the
possibility to upload a CSV or text file and specify the separator, the quotes, etc.
From our side, we tuned it by adding the start date and end date components, the hour specification, and
the type of experiment (same IDs, different IDs):

Figure 54: Date and time selection window

• server.r:
The server-side part of this module loads the IDs from the uploaded file using the fread function from
the data.table package, chosen because it is faster than read.csv. It also uses shQuote, a
function that transforms the data frame of IDs into a vector of comma-separated, quoted IDs
ready to be used in the SQL query.

Figure 55: User IDs transformation
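To make the step concrete, here is a minimal sketch of this server-side ID handling; the upload slot name input$file1 and the column name V1 are illustrative assumptions, not the exact names used in the app.

library(data.table)

# Read the uploaded IDs quickly; fread is much faster than read.csv on big files
ids.df <- fread(input$file1$datapath, header = FALSE)

# Turn the ID column into a single quoted, comma-separated string for the SQL IN (...) clause
ids.vec <- paste(shQuote(ids.df$V1), collapse = ", ")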


2.2. KPIs Calculation:
This could be one of the most important modules of our web application. In fact, it represents the real
value of our app, which is to deliver results about the experiments in a small amount of time. Before
starting the implementation, we had a meeting with the product and community teams; here are the
metrics they need to get insights:
- Product team:
o Choice between the activity of new users in their first 24 hours and the activity of existing users
o Track creating posts, replying on posts, pinning, flagging, voting up, voting down
and reading content and their averages and percentages
o Export the results of significance in different formats (pdf, csv, copy/paste)
• ui.r:
The UI part of this module lets the user choose between KPIs about what users did in the 24 hours
after their registration and KPIs about what they did as existing users (not new), for both the
experiment and the control group.

Figure 56: KPIs calculation window

Figure 57: KPIs calculation table result


• helpers.r:
The helpers part of the KPI calculation is the key to the process automation project. In fact, it contains
the functions that are linked to the server to produce the final results about the behavior of
our cohorts.
o Getting the data from the Database:
As the three projects specified above were running simultaneously, we were able to connect the Shiny app to
the new data warehouse to get the data for the product team. Accordingly, we created a function
that fetches the data from our data source into the R server instance.

Figure 58: Fetching data from datawarehouse code lines

The function takes the dates the user entered in the previous UI and the vector of user IDs created
from the text file. By not filtering on country or city, we simply track everything the users did,
which is what really matters to us: what they do after being assigned to an experiment.
The function also handles errors: if the vector of users is empty, it simply returns a null vector.
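As an illustration, here is a minimal sketch of such a fetching helper, assuming a DBI connection named con to the data warehouse; the SQL is simplified and the table and column names are illustrative.

library(DBI)

get.experiment.data <- function(start.date, end.date, ids.vec) {
  # Error handling: an empty cohort simply returns NULL
  if (length(ids.vec) == 0 || ids.vec == "") return(NULL)
  query <- paste0(
    "SELECT interaction_key, user_key, utc_date_key ",
    "FROM fact_product ",
    "WHERE utc_date_key BETWEEN '", start.date, "' AND '", end.date, "' ",
    "AND user_key IN (", ids.vec, ")")
  dbGetQuery(con, query)   # returns the table used for the KPI calculations
}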

o Calculating the KPIs:
By joining the different dimensions of our new data warehouse, we end up with a view similar to the
transactional database we had before, but with clean and more complete data. Getting our results
therefore requires some calculations. The function doing this task takes as a parameter the result of the data
gathering, a table called “newdata" in this format:
interaction_key user_key utc_date_key
action.post.create 58hrh957292ndvqk2 2017-05-01 15:08:22.00000
action.post.reply 4730jdurrkt03845sje 2017-05-01 15:08:25.00000
action.post.flag 37391hduen02840sb 2017-05-01 15:08:28.00000
… … …

Table 22: Data format result

It also takes the starting and ending dates specified in the first step of the experiment.
As we fetch all the actions our cohort executes (they are needed for other analyses in the app), it is very
important to create multiple small data frames by filtering on the interactions the product team wants
to track:

Figure 59: Filtering a dataframe code lines
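A minimal sketch of this filtering step, assuming newdata is the table returned by the fetching function; the action names come from the list tracked by the product team.

library(dplyr)

posts.created <- filter(newdata, interaction_key == "action.post.create")
posts.replied <- filter(newdata, interaction_key == "action.post.reply")
posts.upvoted <- filter(newdata, interaction_key == "action.post.voteup")
# ... one small data frame per interaction to track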

The second step of this KPI calculation is to create the data frame that will contain the results of our
calculations: we build a sequence of dates, store the action names in a small table used by the loop,
and define the column names and the number of rows.

Figure 60: Dataframe creation code lines

Here is a description of the columns:


- Date: contains the date of the action
- Action: contains the name of the action
- CountAction: number of times the action was executed
- CountUsers: number of users who did the action
- AverageAction: CountAction / CountUsers
- DAU: Daily Active Users
- % Users Doing the Action: CountUsers / DAU
- Action Per DAU: CountAction / DAU
According to this description, we reach the final part of the function, which calculates all these
values. Nevertheless, it is important to mention that, as R does not handle loops efficiently, we
tried to find another way to reach our main goal: getting all our KPIs without taking too much time.

Figure 61: DAU Calculations code lines

Figure 62: Calculating the KPIs code lines

PS: these conditions are executed according to the number of actions.


For the 24-hours case, the only part that changes is how the functions are chosen. In fact, one should
assume that by choosing the 24-hours option, the cohort uploaded by the product team contains
users who registered in that period of time. The task of our web application is then to check what these
users did in their first 24 hours. For that, we first need to go through every day and the users
registered on that day, add 24 hours to the creation date, and check what happened in between:

Figure 63: New Users difference code lines

The code creates one table with the user IDs and their registration date plus 24 hours (users.registrations) and
another containing all their interactions (users.interactions.filtered). The rest consists of merging the two
tables on the user_key column and keeping the rows whose utc_date_key values (from the interaction table)
are less than the onedayafter values (from the registration table), as shown in the sketch below.
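A minimal sketch of this 24-hours filter; the registration event name action.user.create and the object names are assumptions used only for illustration.

library(dplyr)

users.registrations <- newdata %>%
  filter(interaction_key == "action.user.create") %>%              # registration events (name assumed)
  transmute(user_key, onedayafter = as.POSIXct(utc_date_key) + 24 * 3600)

users.interactions.filtered <- newdata %>%
  inner_join(users.registrations, by = "user_key") %>%
  filter(as.POSIXct(utc_date_key) <= onedayafter)                  # keep only the first 24 hours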
• server.r:
We use the server file to program the behavior of the components. In our case, we link the
functions written in the helpers with the components in the UI. We have two options: either
eventReactive or observeEvent. The difference between them is that eventReactive creates a reactive
value that changes based on the expression we write, while observeEvent is triggered automatically
based on the expression. As our button already exists in the UI and to keep the same way of
programming, we chose observeEvent.
Inside it, two important steps have to be fulfilled: fetching the date from the date component, checking
if the user has selected a time to fetch it along with the date, and then executing the functions explained above
for both the experiment and the control group.

Figure 64: Saving the date and time code lines

This first part initializes the date and time variables, as the date component does not give us the
opportunity to get the time too. For that specific reason we created hour sliders that are only
shown to the user if he wants to add a time to the date.
The renderUI function is responsible for the dynamic part of this component:

Figure 65: dynamic slider code lines

The last part of it consists, of course, of putting the output of our data gathering and KPI calculations into
variables for future use:

Figure 66: Executing the KPI function on the server


2.3. Significance Testing:
Having our KPIs calculated is surely the main goal of this app, but no numbers can be trusted if they
are not statistically significant. In other words, every time these calculations are done, we need to check
whether the change seen between the two groups is a real difference or just values biased by some external
factor. A significant result is therefore one we can trust at a significance level that the company
decision makers agree on with the BI team. The most commonly used confidence level is .95 or 95%, which
means there is at most a 5% chance of observing such a difference by chance; technically, this corresponds
to a threshold of .05 on what we call the p-value.
To do this in R, one should use Student's t-test when the variances of the two samples are equal, or
the Welch approximation otherwise.
For our case we chose to calculate the significance of three KPIs: Action per DAU, % of
DAU who did the action, and Action per unique actor.
• ui.r:
We created multiple tabs in the same frame containing the experiment group KPIs, the control group
KPIs, and the significance testing data frames. To improve the UX, we added colors so
that the rows are red if the results are insignificant and green if they are significant.

Figure 67: Significance testing table


• helpers.r:
The significance calculation is almost the same in its three variants. The only part that changes is the kind of
data frame being tested.
o Super t-test:
As explained above, no significance can truly be calculated without checking the variance and the
pairing of the samples. Two samples are considered paired if they contain the same users in the same number;
they are unpaired otherwise. To handle this, we put t.test and
var.test into one single function that sets the parameters according to our needs.
§ t.test: a function in R that performs one- or two-sample t-tests on vectors of data.
The most used parameters are x and y, which are the data values; paired, which indicates that
we want a paired test; var.equal, which is TRUE if the two vectors have the same variance; and
conf.level, which specifies the confidence level.
§ var.test: a function in R that performs an F-test to compare the variances of two
samples. The most used parameters are the data values and conf.level, the confidence
level you choose.
We put these two functions into one single function called super.t. It first executes a variance test
and, according to the resulting p-value, sets the var.equal option of the t.test to TRUE or FALSE (FALSE if
p-value < 0.05, TRUE otherwise). It is also connected to the UI, where the user specifies whether the
cohort contains the same users, setting Paired to TRUE (same users) or FALSE (different users).

Figure 68: Super t.test function code lines
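A minimal sketch of the super.t idea, assuming x and y are the numeric vectors of the two groups and Paired the flag set in the UI:

super.t <- function(x, y, Paired = FALSE, conf = 0.95) {
  # variance test first: variances are treated as equal only if its p-value is >= 0.05
  equal.var <- var.test(x, y, conf.level = conf)$p.value >= 0.05
  t.test(x, y,
         paired = Paired,         # TRUE when both groups contain the same users
         var.equal = equal.var,   # Student's t-test if TRUE, Welch approximation otherwise
         conf.level = conf)
}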

Figure 69: Same users condition checkbox

o Proportion test:
The proportion test is a statistical test used to see whether the proportions in several groups (in our case two)
are equal. Like any other test, it has a null hypothesis H0, stating that the proportions are equal, and an
alternative H1, stating that they differ. In R, the proportion test is called using the function prop.test
and is used when the cohorts are not paired, i.e., contain different users.

Equation 1: Proportion test

o McNemar test:
The McNemar test is a proportion test used when we have the same cohort measured twice. It tells
us whether the difference between the paired proportions is statistically significant.

Equation 2: McNemar test


o Action per DAU significance function:
This first significance test takes as input the gathered data of the experiment and control groups and
the paired value according to what the user specified. The first task of this function is to create the data
frame that will contain the results and the needed columns. The same columns are also used by the
other significance tests:
- Action: column containing the name of action
- Control Value: mean of the control values
- Experiment Value: mean of the experiment values
- P-value: p-value of the significance test
- How is it?: Significant/Insignificant
- Change: is it a decrease or an increase
- By: experiment mean – control mean

Figure 70: Significance test dataframe creation code lines

The second part of this function consists of grouping by user ID and counting the number of times an action was done.
For this we use group_by and summarize, two functions from the dplyr package (see the sketch below):
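A minimal sketch of this grouping step, assuming experiment.action and control.action hold the rows of one action for each group:

library(dplyr)

x <- experiment.action %>% group_by(user_key) %>% summarize(count = n())
y <- control.action %>% group_by(user_key) %>% summarize(count = n())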

As a result, we will have a data frame looking like this:


user_key count
595ndlkd84740n 10
37494hkmd83n9 0
987qze97fhfn74 7
… …

Finally, we apply our super.t function, explained above, on it:
sig = super.t(x$count,y$count,Paired)
Now that we have our significance, we need to fill our table with understandable data. For this, we
calculate the means and put a condition on what should be written in the “Change” column. It is
written “increase” when
experiment mean > control mean
and “decrease” when
experiment mean < control mean
The same applies to the “How is it?” column: if the p-value is < 0.05 the result is significant; otherwise,
it is insignificant.

Figure 71: Action per DAU significance testing code lines

o % of DAU doing the action significance function:


For this second significance test, the table creation and the columns are the same. The only
part that changes is the type of data frame involved. In fact, this significance is about
the proportion of users doing a specific task; the number of times a user did it makes no
difference to the calculation. This is done by grouping and counting the number of actions per
user, merging the table with our starting table of IDs, changing NAs to 0, and then putting 1 in a new
column called binary whenever the number of actions the user did is different from 0:

Figure 72: % of DAU doing the action data manipulation code lines

As a result, we will have a data frame in this format:


user_key count binary
595ndlkd84740n 10 1
37494hkmd83n9 0 0
987qze97fhfn74 7 1
… … …

Table 23: Binary manipulation dataframe result

Finally, depending on the type of cohorts (same users or not), we use prop.test
(not paired, different cohorts) or mcnemar.test (paired, same cohort), as explained above:

Figure 73: Paired / Unpaired data condition
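A minimal sketch of this paired/unpaired choice, assuming x and y are the binary tables built above for the experiment and control groups:

if (Paired) {
  # same cohort measured twice: McNemar test on the paired binary outcomes
  sig <- mcnemar.test(table(x$binary, y$binary))
} else {
  # different cohorts: compare the proportions of users doing the action
  sig <- prop.test(x = c(sum(x$binary), sum(y$binary)),
                   n = c(nrow(x), nrow(y)))
}
sig$p.value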

o Action per unique actor significance function:


The third significance test only considers the number of users who participated in doing the specific
actions. It is very similar to the Action per DAU significance explained above. The only difference is
that there is no merging with the uploaded starting IDs, as for this test we do not need users who did not
do the action. The only data manipulation is grouping per user ID and counting.

Figure 74: Action Per Unique Actor

o Table Rendering:
One of the most important things to consider while writing an R web app that will be part of
the BI stack is optimization. As the significance tables are and will be used many times, and as the users
were satisfied with their look, we wanted to make the table formatting and rendering generic. By rendering and
formatting, we mean, for example, the colors of the rows according to the significance.
Therefore, we created a function called “TableRendering” that we apply to the different
significance functions. This function uses a package called DT, which is an interface to the JavaScript
library “DataTables”. The function takes the data frame as a parameter and uses the datatable command
to apply the changes to it. Using the “extensions” parameter, you can specify
add-ons such as the PDF and Excel export buttons that will appear next to the table. The
“formatStyle” command adds the colors on the rows and the “formatPercentage” command limits
the number of decimals to show.

Figure 75: Table rendering code lines
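A minimal sketch of a TableRendering-style helper built on DT; the column names follow the significance table described above and the colors are illustrative.

library(DT)
library(magrittr)

TableRendering <- function(df) {
  datatable(df,
            extensions = "Buttons",
            options = list(dom = "Bfrtip",
                           buttons = c("copy", "csv", "pdf"))) %>%
    formatStyle("How is it?", target = "row",
                backgroundColor = styleEqual(c("Significant", "Insignificant"),
                                             c("#c8e6c9", "#ffcdd2"))) %>%
    formatPercentage("P-value", digits = 2)
}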

• server.r:
The server part of the significance testing is also linked to the KPI calculation button, as all the
results need to be visible at the same time for immediate consumption. On the other hand, testing
showed that the mathematical functions we used were weak at handling errors. Here
is how we managed to solve this problem:
- if the cohort is the same and the dates are the same, no significance testing is run. This happens
when the product team just needs to fetch the KPIs of a single cohort.
- if not, the significance tests are calculated right after the KPIs.

Figure 76 : server significance testing code lines

2.4. KPIS Visualizations:


The visualization part of this kind of tool is surely very important. Drawn plots let the user
get a general overview of user behavior. For that, we chose to use plotly, an R package built
on the MIT-licensed plotly.js library for interactive web graphics.
From a business perspective, we created two types of plots: by days and by actions.
• ui.r:
In order to reach a user-friendly interface, we used multiple components. Here are the most important ones:
o actionButton: the action button is used in our case to execute the plotly function. We
tried to make the charts independent from the KPI calculations for one main reason: speed.
o Box: boxes are used to contain our charts and group them into families. We created
a box for the day charts and another one for the action charts. Both of them are collapsible, so the user can
take his time to check the results on a full screen.

Figure 77: Plotting per day box

Figure 78: Plotting per actions box

Figure 79: Plotting per day box code lines

o plotlyOutput: this component handles the result of the plotly charts. By writing
the name of the plot used on the server side, it makes it appear in the browser.
• server.r:
On the server side, for this part of the project, we essentially consume the variables in our
global environment.
The plotly package offers a function called “renderPlotly” that is responsible for sending the plotly
result to the chart names on the UI side of the app. Here is an example with the multiple parameters of
the plotly function:
Figure 80: Plots rendering code lines

- data: the dataframe containing the columns you need


- x = the data of the x axis
- y = the data of the y axis
- type = the type of graph (bar,scatter,pie,…)
- showlegend= show the legend or not (TRUE/FALSE)
- text= the text shown when hovering on a point of the plot
- name = used when we have multiple traces in one graph; the name of the trace
When there is a need to add a second chart to the same graph, plotly has a function called “add_trace”
that can be chained by simply using the R pipe operator “%>%”:

Figure 81: Adding trace in a plot code lines
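A minimal sketch of such a plot, assuming kpis.exp and kpis.ctrl are the KPI data frames of the two groups; the object and column names are illustrative.

library(plotly)

plot_ly(data = kpis.exp, x = ~Date, y = ~DAU,
        type = "scatter", mode = "lines",
        name = "Experiment", text = ~Action, showlegend = TRUE) %>%
  add_trace(data = kpis.ctrl, x = ~Date, y = ~DAU,
            name = "Control", mode = "lines")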

Playing around with these multiple possibilities leads to different charts:

Figure 82: Daily active users plot example


Figure 83: percentage of users doing an action plot example

By adding a z parameter, it is also possible to draw some 3D plots for a better understanding and
additional insights:

Figure 84: 3d Plotting example

2.5. Retention:
New users, having no prior idea about the product, are the ones who can give the best feedback about
a new design or feature. Therefore, we use retention to see how sticky the app becomes for them.
• ui.r:
Designed as another tab in the web app, the retention UX is essentially a data frame containing
the different cohorts and days, plus an average of the retention and how it is evolving. We
also added two radio buttons to let the user choose between the retention of new or
existing IDs.

Figure 85: Retention table

Figure 86: Retention average chart

• helpers.r:
o New users:
In order to get the new users' retention, we created a function taking the start date, the end date and
the experiment data fetched previously. A filter is applied to get the user creation events, and a
data frame is created containing the cohorts as rows and the days as columns. Every cohort
represents the number of users who registered on a given day; the day columns represent the different days
of the experiment.

Figure 87: Creating dataframe for retention code lines

Once all this is ready, we use a for loop to create a temporary variable containing the IDs of
the users registered on day x, and another loop inside it to check how many of them were still present
on each following day.

Figure 88: Calculating the retention code lines

The rest of the function mainly consists of changing all the NAs to 0 and using the assign() function,
which allows us to push multiple variables from a function into the global environment.

Figure 89: Manipulating the table rendering code lines

Finally, the weighted average of every retention day will be calculated for future plotting.

Figure 90: Weighted average calculation code lines

The table contains a column called day that lists all the days of the experiment and another column
called value that sums all the percentages of a specific column in the cohort retention table and
divides the sum by the number of rows whose percentage is ≠ 0%.
o Existing users:
While trying to do the same thing for the existing users, we hit a big obstacle. If we kept
using the same function as for the new users, the algorithm would count the same users multiple times
because there is no registration event to separate them. As a remedy we added a fourth
parameter to the function, the user IDs. This way, we simply add a variable inside the function
that, at every loop iteration, filters out the users already counted and goes through the remaining
users again:

Figure 91: Retention calculation for existing users code lines

As can be seen here, we assign the IDs of the experiment to a variable called “remaining.users”.
The number of IDs in it decreases after every loop iteration by keeping only the IDs that are “%not in%” the
users already counted.
• server.r:
As we are constantly chasing speed with our web app, we made the retention
functions query the data warehouse only when the data does not already exist in our global environment.
This way, after getting the KPIs, the product team can get the retention of the users in seconds.
They can also open the app only to see the retention. In both cases, it is certainly a
time-saving condition:

Figure 92: Fetching data condition

The second condition determines which function to trigger. If the chosen radio button is the new users one,
the new users' retention is calculated; otherwise, the existing users' retention is.

Figure 93: Retention data condition in the server file

The last part consists of rendering the datatable. To make it readable, we used a DT extension
called “FixedColumns”. This extension lets us keep a number of columns fixed when they
contain names or values describing the other columns. This is especially important because the
retention calculation can span a high number of days, more than a normal laptop screen can
display.

Figure 94: Retention dataframe rendering code lines

2.6. User Analytics:


In a previous analysis, the BI team created a clustering of the users and saved the results using Amazon's
S3 service. We reuse it by matching the users of the experiments against those in the clusters
to know which cluster they belong to and how an experiment can influence their movement from one
cluster to another.
• ui.r:
To make the clustering understandable, we created multiple small boxes in this part of the project:
- Upload IDs Clustering Info: containing the means of all the actions our uploaded users have
been doing
- Number of users per cluster: how many users do we have in every cluster
- Clusters per users: Line chart showing the flow of a specific ID in the clusters in the last 4
weeks
- Activity level diagram: line chart showing how the feature means have been changing over
the last 4 weeks for all the different clusters.
For the last box, we included radio buttons linked to our features so that whenever
one is checked, the chart changes dynamically.

Figure 95: Radio buttons for charts dynamic changing

• server.r:
While working on this part, the BI manager prepared a function in the helpers that fetches the
clustering results from the S3 bucket and filters the IDs uploaded by the user, so that we only keep those
involved in the experimentation. After getting a general overview of it and checking it closely, here is
the output of the function:

Figure 96: Dataframe from S3

As you can clearly see, we have 20 columns: the features used for the clustering, the IDs of
the users, the clusters they belong to, and the week and year.
The clustering, in its two versions (3 clusters and 9 clusters), is explained using all the different
components shown in the UI file.
For the data frame containing the summary of the features leading to the clustering, we used
the isoweek function, which gives the number of the week in the year and takes a date as a parameter.
By writing isoweek(Sys.Date()) - 1 we filter this data frame, which contains the data of the last
4 weeks, to get the information closest to our experiment.

Figure 97: Clustering datatable rendering

This gives us a concrete table containing the means of the different interactions that the users did
and received in the last week, plus the cluster they belong to.
The second important chart is the number of users per cluster. For us, the more users move toward
the active cluster (cluster 3), the better. This chart was therefore created to see the general flow of users
over the last 4 weeks: a large number of them moving from 1 to 2 or from 2 to 3 means that our
experiment was efficient. The plot is drawn by taking x as the week number, y as the number of
users, and by coloring the lines according to the cindex column, which contains
the index of the cluster.

Figure 98: rendering clusters chart code lines

Figure 99: Clusters line chart


The third chart shows the path of a user across the clusters. To create a dropdown menu
filled with the uploaded user IDs, a component called uiOutput() is used. This component
allows us to build any UI element on the server side of the app, creating a dynamic UI that reacts to the input.

Figure 100: Drop down menu code lines

The last part of this chart is to make it dynamic. For that, the Shiny package provides a
function called reactive. This function creates a dynamic data frame that changes according to
the user input, and the chart changes with it. As a result, whenever the product employee
enters the ID of a user, the chart updates and shows his path.

Figure 101: Dynamic user – cluster plot code lines

Figure 102: User inside cluster flow chart

This same function is used for the last two charts, as they should also change according to the
user input. This time, however, we no longer use the ID but the feature whose path we want to see.
This gives us a view of how our clusters are changing in general and answers questions
like: How was the upvoting in cluster 1 two weeks ago? What is the downvoting mean this week?
The chart changes according to the radio buttons called actionTypeuser.

Figure 103: Radio buttons code lines

Figure 104: Plot and radio buttons fusion

3. Machine learning project:


3.1. Data Selection:
Our data selection is divided into four main phases: finding the users who were retained, their sent
engagement, their received engagement, and the environment they grew up in (what surrounds them
without being directly linked to them).
3.1.1. Retained users:
After several discussions between the BI team and the product department, we agreed on taking
the registrations of all users around the world. Although their behavior can differ
(Saudi Arabia users account for half of the total activity), we wanted to create a model that fits all the
different cultures as a first iteration. Moreover, we selected these registrations within a timeframe of
one week, since the way users perceive the app can change from one weekday to another.
This week runs from the 15th to the 21st of May, a normal week that was not influenced by any specific action or
event, which clearly diminishes the chance of having biased data.
In summary, we have 98 countries in the week of 15-21 May 2017.

Figure 105: Registrations pie chart per country

Figure 106: Registrations dataframe

The second step of this data selection is to find how many of these users were still with us after a specific
time frame. From a business perspective, the chosen timeframe was 4 weeks later (almost one month);
we observed that in most cases this is when the churn rate levels off. Therefore,
the period in which we check whether our users are still stuck to the app runs from the 12th to the 18th
of June 2017. Any unique appearance of an ID in this timeframe is considered as a retained
one.

Retained users = (registered IDs still active 4 weeks later) / (IDs registered)

Equation 3: Retention calculation

3.1.2. Sent engagement:


In Jodel, we believe that what a user does in his first 72 hours highly influences the rest of his
lifetime inside the app; this has been shown multiple times in previous analyses. As a result, we chose
to fetch all the interactions that the user performs consciously:
- action.app.LoudestFeedSelected
- action.app.MostCommentedFeedSelected
- action.app.SetHomeCompleted
- action.getposts.details_new
- action.getposts.hashtag_combo
- action.oj-downvote
- action.oj-upvote
- action.post.block
- action.post.create
- action.post.flag
- action.post.pin
- action.post.reply
- action.post.share
- action.post.votedown
- action.post.voteup
- action.thanks_given
The second point about the engagement data is that we reuse the IDs of the content that got flagged,
voted up or voted down to fetch the creation of that content and thereby differentiate between a post and
a reply.

Figure 107: Getting the creation and reply data SQL function

PS: “contentvec” is a vector containing the IDs of the content that has been voted up, voted down or
flagged.
3.1.3. Received engagement:
The received engagement is not very different from the sent engagement. The only
change in the data fetching is a first step consisting of getting all the content that the
users posted (posts and replies, text and image) and then using these content IDs to get the actions that
reference them.
3.1.4. Environment:
The last part of our data gathering is the environment data. Every user sees a feed that depends on his
position, and the activity he sees in the feed (negativity, positivity) can really
influence his stickiness. As a result, we chose to collect data that serves as a proxy for the general
engagement or health of the environment.

Figure 108: Environment data matrix

To do this, our SQL query for fetching the data focuses only on the geohash-based
location. We need to get the geohashes where the registrations happened and then the data that lets us
compute these values:

Figure 109: Getting the data by geohash code lines


PS: vec is a vector containing the geohashes of the registrations.
On the other side, a user always sees what is happening in 1 geohash plus its 26 neighbours. For that
purpose, we created a function called getJodelArea on top of an R package called “geohash”.
This function takes one geohash as input and automatically returns the 26 others.
The functions provided by the package that allow us to get this result are:
- gh_neighbours: takes the geohash as an input and gives you the north, northeast, east, southeast,
south, southwest, west and northwest neighbours
- west: takes the geohash as an input and gives you the west neighbor
- east: takes the geohash as an input and gives you the east neighbour
- north: takes the geohash as an input and gives you the north neighbour
- south: takes the geohash as an input and gives you the south neighbor
By combining all these functions, we end up with a data frame containing each geohash and its 26
neighbours for the query; a sketch follows Figure 110 below.

Figure 110: Getting the geohash neighbours code lines
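A minimal sketch of a getJodelArea-style helper, assuming the “geohash” package exposes gh_neighbours(), north(), south(), east() and west() as described above; the exact combination used to reach the 26 neighbours is simplified here.

library(geohash)

getJodelArea <- function(gh) {
  ring  <- unlist(gh_neighbours(gh))                 # the 8 direct neighbours
  wider <- unlist(lapply(c(north(gh), south(gh), east(gh), west(gh)),
                         function(g) unlist(gh_neighbours(g))))
  unique(c(gh, ring, wider))                         # centre geohash + surrounding cells
}

# vec holds the registration geohashes; build one row per (registration, area) pair
area.df <- do.call(rbind, lapply(vec, function(g)
  data.frame(registration_geohash = g, area_geohash = getJodelArea(g))))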

3.2. Data Cleaning:


3.2.1. Sent Engagement:
While trying to distinguish whether the actions the user sent targeted a post or a reply, we found
that some of the referenced content was not in the timeframe we wanted to study. We therefore
agreed to keep only the content returned by our queries.
For that, we filtered the table containing the interactions and the content IDs against the result of our
query, in order to know which type of content (content_type) the users were interacting with. To make
it faster, knowing that the table was big, we used filter(), a dplyr function equivalent to subset.

Figure 111: Filtering the dataframe

This way, we have a table clean of NAs.


Another important step is changing the date type. What interests us here is what the users did
in the 3 days after registration; keeping the exact minute or hour would just make the processing
take more time. Using the as.Date() casting function in R, we
transformed our date columns from YYYY-MM-DD HH:mm:ss.SSSSS to YYYY-MM-DD.
Finally, the only columns kept for future manipulation are interaction_key (the
name of the interaction), date_key (the date of the interaction), and
user_key (the user ID).
3.2.2. Received Engagement:
In order to get the replies received on the posts, and as explained above, we have to use the metadata
column. Adding a parentId (containing the ID of the post) to every reply was a new experiment, rolled out
on the fly by the backend team. Therefore, before any manipulation, we need to get all the
replies of this specific timeframe, ending up with a data frame whose column looks like this:
metadata
parentId:579dhfkg93750f
parentId:579dhf793bd034
parentId:04fn078g93750f

Table 24: metadata column data

This column is then used to merge this table with the table of the posts we created, in order to keep
only the replies that resulted from other users interacting with ours. Consequently, the only way
to merge is to clean this column by keeping only the ID, without the “parentId” string, using
this command:
replies$metadata <- sub("parentId:", "", replies$metadata)
Finally, the same columns are kept for future manipulations: interaction_key, user_key
and content_key.

3.3. Data Construction:
3.3.1. Sent engagement:
The construction part of the sent engagement focuses on building, from all the data we fetched,
a table containing the IDs of the users and the number of times they did each of the specific actions we chose.
Before doing this, and as explained in previous chapters, we need to differentiate between
the actions done on replies and the actions done on posts. After establishing this difference,
we end up with a merged table in this format:
content_key interaction_key.x interaction_key.y
579dhfkg93750f action.post.voteup action.post.create
457dbdfnlso379 action.post.flag action.post.reply
… … …

Table 25: Merging results

We use setDT(), a function of the data.table package, and paste() to append “_reply” or “_post” to
interaction_key.x depending on what we have in interaction_key.y:

Figure 112: Manipulating action names code lines

Here, the new engagement table is the one described above. After all these modifications, we merge it with the
other data frame containing the remaining interactions.
This done, we use a loop that goes through the days and gets the users who registered. The loop then
goes through the data frame containing all the actions and filters it by the IDs of these users, with the
condition that the rows must have a date within the registration date plus 2 days.
As a final step, we end up with a temporary data frame of the user IDs and the interactions. To shape
it according to our needs, we use table(), a function that returns a contingency table with the
interaction_key values as columns and the user_key values as rows. This is then transformed using
as.data.frame.matrix() and rownames_to_column() to create this result:
user_key action.post.block action.post.reply action.post.voteup_post …
48340fhdfsdosf 1 40 5 …
3489dfhsofsdfn 0 19 8 …
68ndnfe870snb 7 11 0 …

Table 26: Matrix to dataframe result

Figure 113: matrix to dataframe code lines
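A minimal sketch of this reshaping step, assuming tmp is the temporary data frame produced by the loop:

library(tibble)

counts <- table(tmp$user_key, tmp$interaction_key)    # contingency table: users x actions
wide   <- as.data.frame.matrix(counts)                # users become row names, actions become columns
wide   <- rownames_to_column(wide, var = "user_key")  # bring the IDs back as a regular column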

3.3.2. Received Engagement:


One of the biggest differences between received and sent actions is that when a user goes online, even
to do nothing, it is counted as an action; as a result, we will luckily never have users with 0 sent
actions. On the other hand, one can receive an interaction from others without being online. That is why,
while building the table of received actions, we always need to add the users that have 0
received engagement: they are also important for our model.
To do this, after getting the received actions on posts and the received actions on replies, we
use the merge function with the data frame containing the users, setting the parameter all=TRUE.
Our final data frame then has a lot of NAs, which we change to 0 using is.na():

Figure 114: Cleaning and merging the received engagement dataframe

Finally, to merge the received actions on posts and on replies, and since some columns could have the same
names (action.post.voteup on posts and action.post.voteup on replies), we add “received” at the
start of the column name and “post” or “reply” at the end, according to the data frame the
column comes from:

Figure 115: Merging the received on posts and on replies rows

3.3.3. Environment:
For the environment, we already created a table containing all the geohashes on the different days from the
registration table. We can indeed have redundant geohash names but with different days, as the activity
of a geohash can change from one day to another. After getting the activity of these locations, the only
construction step left is to create the columns that will contain the data and to use filtering, length() or nrow()
to calculate our KPIs:
- DAU: Daily Active users in the geohash.
- posts_day: posts in that day in the geohash.
- posts_dau: posts per DAU in the geohash.
- reply_dau: replies per DAU in the geohash.
- reply_post: replies / posts in the geohash.
- upvote_post: upvotes / posts in the geohash.
- upvote_dau: upvotes per DAU in the geohash.
- upvote_reply: upvotes / replies in the geohash
- upvote_downvote : upvotes / downvotes in the geohash (happy ratio)
- downvote_dau: downvotes per DAU in the geohash.
- downvote_post: downvotes / posts in the geohash
- downvote_reply: downvotes / replies in the geohash
- netvotes_dau: (upvotes – downvotes) per DAU in the geohash
- netvotes_post: (upvotes – downvotes) / posts in the geohash
- netvotes_reply: (upvotes – downvotes) / replies in the geohash
- flag_dau: flags per DAU in the geohash
- flag_post: flags / posts in the geohash
- flag_reply: flags / replies in the geohash
- block_dau: blocks per DAU in the geohash
- block_post: blocks / posts in the geohash
- block_reply: blocks / replies in the geohash
3.4. Data Integration:
3.4.1. Merging:
The last part of our data preparation is integrating the four big data frames created by all our
manipulations. We now have, for each user, the actions he sent, the actions he received, the environment
in which he registered, and whether he is retained or not.

For that, we simply used merge() on user_key for the sent and received actions. After that, we merged
the environment with the users on the geohash and used a table containing the IDs of the retained users
to add a column containing 1 if retained and 0 if not. These will be our modalities.

Figure 116: Creating the modalities for the prediction

3.4.2. Normalization:
One of the pillars of running a good model is normalization. If the values of the different columns
are on very different scales, the training can take ages. This is what we call feature scaling:

Figure 117: Difference between normalized and non-normalized training

Although R offers multiple ways of scaling, we chose to write one small function that does it quite well:

Figure 118: Normalization function

We then use lapply() to apply it to all the columns of the data frame:

Figure 119: Application of the normalization function
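A minimal sketch of this scaling, assuming features.df holds only the numeric feature columns:

normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))   # min-max scaling into [0, 1]
}

features.scaled <- as.data.frame(lapply(features.df, normalize))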

Chapter 4: Modeling

After gathering all the needed data, cleaning it and constructing it, the last part of this project is to create the
model out of it. This chapter could be one of the shortest ones (as any data scientist knows), since the most
important work was preparing everything for this goal; what remains is choosing the algorithms,
understanding them and applying them over multiple iterations.

1. Modeling technique:
The large community of data analysts and data scientists provides a wide range of
algorithms, tools and methods we can use for modeling and getting answers out of the data.
In our case, the main modeling technique is SVM (Support Vector Machine):
By definition, it is a discriminative classifier that outputs an optimal hyperplane to categorize
the data. The hyperplane is created from a training set. SVM can also be used for regression
(depending on whether the target variable is continuous or discrete).

Figure 120: Linear hyperplane of SVM

SVM can be linear or non-linear. In our case, as we have many features, our data is not linearly
separable; it will therefore be a non-linear, radial SVM. By using what we call a kernel, the
computation becomes much easier for the algorithm, especially with this high number of features (58).

Equation 4: Radial SVM Equation

After some research, we consider that the best package in R is e1071. It is an interface to LIBSVM,
whose underlying implementation in C++ makes it as efficient as possible for the best
results.
Our choice of SVM is based on the fact that it works well in high-dimensional spaces and is
memory efficient.
We will also, for academic purposes, use the Random Forest algorithm and compare both
results. Many people describe it as a "bootstrapping algorithm with a decision tree model" [2].
It builds multiple CART models, creating many trees over the features, which together
form the "forest". At the end, it computes the importance of each feature and gives
it a final weight.
The package used for it in R is randomForest, which implements Breiman's random forest algorithm and
is considered one of the best machine learning packages by the R user community [3].

Figure 121: Random forest logic

[2] https://www.analyticsvidhya.com/blog/2014/06/introduction-random-forest-simplified/
[3] https://www.r-bloggers.com/what-are-the-best-machine-learning-packages-in-r/
Both of these algorithms belong to the supervised learning family. We chose them because we have
a variable to predict (is the user going to be retained or not) and want to know which features
influence it the most.

Figure 122: Supervised learning steps

2. Testing Design:
After preparing our model, we face an important question: is this a good
model, and if so, how do we know?
For that purpose, we have to choose tools that give us these answers and tell us
when to stop our modeling. Otherwise, we would keep trying multiple
combinations of parameters without knowing which ones are best.
2.1. ROC Curve:
In statistics, the Receiver Operating Characteristic curve is a graphical plot that allows you to assess
the diagnostic performance of a binary (two-modality) classifier. It remains informative even with
unbalanced class distributions.
In our case, this curve is used to plot the true positive rate against the false positive rate with
different settings and with the two different algorithms.

Figure 123: ROC Curve example

2.2. Confusion Matrix:


Like the ROC curve, the confusion matrix is a table that gives an overview of the prediction results
compared to the actual values. It helps to know how the model performs and, through a few
calculations, gives an estimate of how much it can be trusted.
Considering that our modalities are YES and NO, these are the different types of values the confusion
matrix can contain:
- True positives (TP): when the prediction is Yes and the real value is Yes.
- True negatives (TN): when the prediction is No and the real value is No.
- False positive (FP): when the prediction is Yes but the real value is No (called Type I error).
- False negative (FN): when the prediction is No but the real value is Yes (called Type II error).

Figure 124: Confusion Matrix Example

Using all these numbers, here are the value that you can get out of it:
- True Positive Rate: this rate tells how often our model predicts YES when the actual value is YES.

True Positive Rate = TP / Actual YES

Equation 5: True positive rate

- False Positive Rate: this rate tells how often our model predicts YES when the actual value is NO.

False Positive Rate = FP / Actual NO

Equation 6: False positive rate

- Accuracy: this rate tells how often our model is correct.

Accuracy = (TP + TN) / Total

Equation 7: Accuracy

- Misclassification Rate: this rate tells how often our model is incorrect.

Misclassification Rate = (FP + FN) / Total

Equation 8: Misclassification rate

- Specificity: this rate tells how often our model says NO when the actual value is NO.

Specificity = TN / Actual NO

Equation 9: Specificity

- Precision: this rate tells how often our YES predictions are correct.

Precision = TP / Predicted YES

Equation 10: Precision

2.3. Dataset Division:


Splitting the dataset is what allows us to test our model. As with every supervised
learning algorithm, having two different datasets lets us build the model from the training set
and verify it on the testing set. Before doing so, one needs to check the distribution of the
modalities in the data so that the samples are representative:

Figure 125: modality distribution

With “optimal_user_df” being our dataset and “retained” our Y, we have 54% of 0 (not retained) and 45%
of 1 (retained). Our samples should therefore keep the same distribution to give us a representative
model and meaningful results.
Although it is possible to do this manually, there are nowadays several functions that handle this
division:

Figure 126: Training and testing set division

The code above uses the sample function: it takes n as the total number of observations and draws 60% of
the indices into a variable. The next command then extracts 60% of the rows for the training set and the
remaining 40% for the test set, as in the sketch below.
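A minimal sketch of this 60/40 split, with optimal_user_df as the final dataset; the seed is illustrative:

set.seed(42)
n     <- nrow(optimal_user_df)
idx   <- sample(n, size = floor(0.6 * n))   # 60% of the row indices
train <- optimal_user_df[idx, ]
test  <- optimal_user_df[-idx, ]

prop.table(table(train$retained))           # check the 0/1 distribution in each set
prop.table(table(test$retained))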
If we check the distribution of our modalities after this division, here is what we get:

Figure 127: Training and Testing dataset modality distribution

After checking that these distributions are almost equal to that of the full dataset, we can proceed.
We also need to take out the columns that will not be features, such as the Y or the user key. For that,
we build the formula using paste() and as.formula(), excluding those columns and collapsing the
others with “+”:

Figure 128: Equation Creation

We also exclude the received shares and the received reads, as the user does not get a notification when
they happen and they therefore cannot influence his behavior.
The final variable, rf.form, is the one used as the modeling formula; it has the format
“retained ~ action.oj_downvote + action.oj_upvote + action.post.block…” (see the sketch below).
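A minimal sketch of the formula construction; the exact names of the excluded columns are assumptions used only for illustration:

excluded <- c("user_key", "retained",
              "received_action.post.share")            # plus the received read column
features <- setdiff(colnames(optimal_user_df), excluded)

rf.form <- as.formula(paste("retained ~", paste(features, collapse = " + ")))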

3. Model building:
3.1. Parameter settings:
As the business questions companies ask grow in complexity and the data becomes more diverse, it is
getting harder and harder to find a ready-made algorithm that fits perfectly. This
is why most tools give every data scientist the opportunity to tune the algorithm
and make it fit the dataset.
Here is an overview of the parameters we tuned in our case:
3.1.1. Cost Parameter:
The cost parameter, called C, tells you how much you want to avoid misclassifying each
observation in the training set. The higher it is, the smaller the margin of the
hyperplane, which matters for non-linear data to avoid false predictions. Conversely,
the smaller it is, the larger the margin of the hyperplane.

Figure 129: Soft Margin Examples

In our project, we took three different values of C to see how much it influences our results:
- 10
- 0.5
- 2lm

3.1.2. Kernel Parameter:
Even though we already explained our choice above, namely the Gaussian kernel, it is important to
present the different kernels R gives us access to and their differences:
- Linear Kernel: being the fastest one, it only performs well when the data is, as the name
explicitly says, linear. The linear kernel is mostly used when the number of features is larger than the number
of observations.

K(x, y) = xᵀy

Equation 11: Linear kernel

- Polynomial Kernel: the best-performing one for NLP (with d = 2); it not only looks at the features
but also at the combinations between them. It is also commonly used for regression
analysis.

K(x, y) = (xᵀy + c)^d

Equation 12: Polynomial kernel

- Gaussian Kernel (radial basis): the most used kernel; it is applied when
the number of observations is higher than the number of features (which is most often the case). It
is called a universal kernel [4], as it yields a predictor that keeps both the estimation and approximation errors under control.

K(x, y) = exp(-γ ||x − y||²)

Equation 13: Gaussian kernel equation

3.1.3. Gamma:
Always working side by side with the cost in non-linear kernels, gamma tells you how far the influence of a single
training example reaches. The lower it is, the farther it influences; the higher it is, the closer its
influence is confined. Making it too big can make the algorithm perform badly,
and C will not be able to regularize the over-fitting.
Making it too small prevents the model from capturing the complexity, or "shape",
of the data. In our case we kept the default gamma, because our
data is very diverse; fixing it to a middle value gives us the possibility to play with the cost.

[4] https://www.quora.com/What-is-the-intuition-behind-Gaussian-kernel-in-SVM-How-can-I-visualize-the-transformation-function-ϕ-that-corresponds-to-the-Gaussian-kernel-Why-is-the-Gaussian-kernel-popular
3.1.4. Number of trees:
The number of trees in the Random Forest algorithm tells how many trees the algorithm should grow.
If the number of observations is large and the number of trees too small, some observations will be
predicted only once, if at all. In our case, we started with a high number of trees, checked
when the error rate becomes stable, and then chose the right amount of trees.
The number of trees chosen for this project is therefore 500.
3.2. Model Training:
After going through the understanding of these two algorithms and their tuning, we now apply them
in practice in order to see the difference between them.
We use the e1071 package for the SVM and the randomForest package for the random forest
algorithm.
3.2.1. SVM:
Thanks to the contributions of the data science community, implementing the
algorithm has become easier and easier, as experts created an interface to LIBSVM exposing all the
parameters to tune. Here are the different iterations for the SVM and their results:
o Iteration 1 – Cost = 10:

It is important to mention that there was no need to set the “type” parameter in the svm() function,
because the Y is a factor; the type is therefore automatically “C-classification”. Setting probability
to TRUE lets us extract the weights of the features.
The total number of observations that were trained is 29126, with 56 features.
The total number of observations that were tested is 19418, with 56 features.
The kernel used is the Gaussian kernel (radial basis).
The cost parameter is 10.
The gamma parameter is 0.01785714.
After training the model for a few minutes, we apply it on the test set to see how well it is
performing (see the sketch below):
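A minimal sketch of this first iteration, assuming the train and test sets and the rf.form formula built earlier:

library(e1071)

svm.fit <- svm(rf.form, data = train,
               kernel = "radial",        # Gaussian (radial basis) kernel
               cost = 10,                # C parameter of this iteration
               gamma = 1 / 56,           # e1071 default: 1 / number of features
               probability = TRUE)

pred <- predict(svm.fit, newdata = test)
table(Actual = test$retained, Predicted = pred)   # confusion matrix on the test set

w <- t(svm.fit$coefs) %*% svm.fit$SV              # common heuristic for feature weights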

Here are the top 10 features that are influencing the prediction and their weights:

Feature Weight
action.getposts.details_new 855.8991
received_action.post.votedown_replies 552.8851
received_action.post.reply_post 520.9039
received_action.post.voteup_post 448.3295
received_action.post.voteup_replies 426.9237
action.post.reply 426.2344
action.post.create 355.9180
action.post.voteup_post 304.3408
action.oj_upvote 301.5199
action.post.pin 294.1277

Table 27: SVM Iteration 1 Features Weights

o Iteration 2 – Cost = 0.5:

The total number of observations that were trained is 29126 with 56 features.
The total number of observations that were tested is 19418, with 56 features.
The Kernel used is the Gaussian Kernel (Radius Based).
The cost parameter is 0.5.
The gamma parameter is 0.01785714.
Feature Weight
action.getposts.details_new 421.7126
received_action.post.votedown_replies 338.4224
received_action.post.reply_post 312.3753
received_action.post.voteup_post 304.1694
action.post.reply 302.7494
received_action.post.voteup_replies 292.9179
action.post.create 260.4468
received_action.post.votedown_post 231.5631
action.post.votedown_post 219.2826
action.oj_upvote 215.4727

Table 28: SVM Iteration 2 Features Weights


o Iteration 3 – Cost = 2lm :
The total number of observations that were trained is 29126 with 56 features.
The total number of observations that were tested is 19418, with 56 features.
The Kernel used is the Gaussian Kernel (Radius Based).
The cost parameter is 2lm .
The gamma parameter is 0.01785714.
Here are the top 10 features that are influencing the predictions and their weights:
Feature Weight
action.getposts.details_new 83.4546
action.post.reply 58.87
received_action.post.votedown_replies 54.49
received_action.post.voteup_replies 54.0964
received_action.post.voteup_post 53.2664
received_action.post.reply_post 51.7330
action.post.create 49.6154
received_action.post.votedown_post 45.131
action.post.votedown_post 41.253
action.post.votedown_reply 37.663

Table 29: SVM Iteration 3 Features Weights

o Iteration 4 – Cost = 10 / TOP 20 Features:


The total number of observations that were trained is 29126 with 20 features.
The total number of observations that were tested is 19418, with 20 features.
The Kernel used is the Gaussian Kernel (Radius Based).
The cost parameter is 10.
The gamma parameter is 0.05.
Here are the top 10 features that are influencing the predictions and their weights:
Feature Weight
action.getposts.details_new 550.1805
received_action.post.votedown_replies 357.85261
env_upvotes_downvotes 308.8288
received_action.post.reply_post 293.8561
action.post.reply 284.43
received_action.post.voteup_post 280.445
received_action.post.voteup_replies 264.7267
env_blocks_reply 254.0478
action.post.pin 218.9994
env_flags_reply 207.8277
Table 30: SVM – TOP 20 Features Weights
3.2.2. Random Forest:
For the modeling with the random forest algorithm, we use the randomForest package of R, a port of the original Fortran implementation proposed by Breiman; a sketch of the corresponding call is given after the parameter list of the first iteration below.
o Iteration 1 – ntrees = 500
The training set contains 29126 observations with 56 features.
The test set contains 19418 observations with 56 features.
The number of trees is ntree = 500.
Setting importance = TRUE gives us the weights (importance scores) of the features.
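As announced above, here is a hedged sketch of how this model could be fitted with the randomForest package; train_set, test_set and the target column retained are again hypothetical placeholder names:

library(randomForest)

# Fit 500 trees and ask for variable importance scores.
rf_model <- randomForest(retained ~ ., data = train_set,
                         ntree = 500, importance = TRUE)

# Apply the model to the test set and build the confusion matrix.
rf_pred <- predict(rf_model, newdata = test_set)
table(Predicted = rf_pred, Actual = test_set$retained)

# Importance scores per feature (mean decrease in accuracy / Gini).
importance(rf_model)
varImpPlot(rf_model)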
Here are the TOP 10 features influencing the predictions and their weights:
Feature Weight
action.getposts.details_new 1216.96093
env_reply_post 381.61395
action.post.voteup_post 381.24712
env_downvotes_post 378.22096
env_post_dau 369.92434
env_downvotes_reply 368.02946
env_downvotes_dau 362.92469
env_reply_dau 360.19532
env_upvotes_downvotes 359.67845
env_upvotes_post 352.43667
Table 31: Random Forest Iteration 1 Features Weights
The choice of 500 trees was made in order to observe when the error rate stops decreasing; that point gives us the ideal number of trees:
Figure 130: Random Forest Iteration 1 Error Curve
As the plot shows, the error rate becomes roughly constant after about 200 trees, so 200 would be a suitable number of trees for the model.
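For reference, the error curve above can be obtained directly from the fitted object; this assumes the hypothetical rf_model object from the sketch above:

# Plot the OOB (and per-class) error rate as a function of the number of trees.
plot(rf_model)
# The raw error values are also available in rf_model$err.rate.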
o Iteration 2 – ntrees = 500 / TOP 20 Features:
The training set contains 29126 observations with 20 features.
The test set contains 19418 observations with 20 features.
The number of trees is ntree = 500.
Setting importance = TRUE gives us the weights (importance scores) of the features.
Here are the TOP 10 features influencing the predictions and their weights:
Feature Weight
action.getposts.details_new 1945.72978
env_upvotes_downvotes 1408.50373
env_blocks_reply 1340.74778
env_flags_reply 1292.91496
action.post.voteup_post 707.68980
action.post.reply 660.10736
received_action.post.voteup_replies 579.57985
action.post.votedown_post 569.60150
received_action.post.voteup_post 512.70467
action.post.voteup_reply 434.27354
Table 32: Random Forest – TOP 20 Features Weights
As before, 500 trees were grown in order to observe when the error rate stops decreasing and to pick the ideal number of trees:
Figure 131: Random Forest – TOP 20 Error Curve
As the plot shows once again, the error rate levels off at around 200 trees; we can therefore consider 200 a good number of trees.
Chapter 5: Evaluation
With the modeling carried out over several iterations, it is time to look at the different results and at the output of our analysis. For that, we use the evaluation tools introduced earlier.
1. Results Assessment:
1.1. SVM:
1.1.1. Iteration 1:
• Confusion Matrix:
Predicted 0 Predicted 1
Actual 0 8626 2023
Actual 1 4724 4045
Table 33: SVM Iteration 1 Confusion Matrix
• Rates:
o True Positive Rate: 46%
o False Positive Rate: 18%
o Accuracy: 65%
o Misclassification Rate: 34%
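The rates above follow the usual definitions; as a small worked check, they can be recomputed in R from the values of Table 33 (small differences with the quoted percentages come from rounding):

# Confusion matrix of iteration 1 (Table 33).
TN <- 8626; FP <- 2023   # Actual 0 row
FN <- 4724; TP <- 4045   # Actual 1 row

true_positive_rate  <- TP / (TP + FN)                    # ~0.46
false_positive_rate <- FP / (FP + TN)                    # ~0.19
accuracy            <- (TP + TN) / (TP + TN + FP + FN)   # ~0.65
misclassification   <- 1 - accuracy                      # ~0.35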
1.1.2. Iteration 2:
• Confusion Matrix:
Predicted 0 Predicted 1
Actual 0 8764 1885
Actual 1 4905 3864
Table 34: SVM Iteration 2 Confusion Matrix
• Rates:
o True Positive Rate: 44%
o False Positive Rate: 17%
o Accuracy: 65%
o Misclassification Rate: 34%
1.1.3. Iteration 3:
• Confusion Matrix:
Predicted 0 Predicted 1
Actual 0 8772 1877
Actual 1 5297 3472
Table 35: SVM Iteration 3 Confusion Matrix
• Rates:
o True Positive Rate: 39%
o False Positive Rate: 17%
o Accuracy: 63%
o Misclassification Rate: 36%
1.2. SVM – TOP 20:
• Confusion Matrix:
Predicted 0 Predicted 1
Actual 0 8592 2057
Actual 1 4716 4053
Table 36: SVM TOP – 20 Confusion Matrix
• Rates:
o True Positive Rate: 46%
o False Positive Rate: 19%
o Accuracy: 65%
o Misclassification Rate: 34%
1.3. Random Forest:
• Confusion Matrix:
Predicted 0 Predicted 1
Actual 0 7576 3037
Actual 1 3672 5097
Table 37: Random Forest Iteration 1 Confusion Matrix
• Rates:
o True Positive Rate: 58%
o False Positive Rate: 28%
o Accuracy: 65%
o Misclassification Rate: 34%
1.4. Random Forest – TOP 20:
• Confusion Matrix:
Predicted 0 Predicted 1
Actual 0 7743 2837
Actual 1 3835 5003
Table 38: Random Forest – TOP 20 Confusion Matrix
• Rates:
o True Positive Rate: 56%
o False Positive Rate: 26%
o Accuracy: 65%
o Misclassification Rate: 34%
1.5. SVM VS Random Forest:
1.5.1. All features:
• Rates Comparison:
The accuracy is the main technical and business goal, and it therefore carries an important weight in choosing the model.
SVM Accuracy Rate (65%) = RF Accuracy Rate (65%)
• ROC Curve:
Figure 132: SVM VS Random Forest ROC Curve (All features)
The ROC curve above shows that the SVM curve stays closer to the left-hand border of the chart (more accurate in that region), while the area under the curve is larger for the Random Forest, which is another indicator of accuracy. In other words, the two algorithms simply trade false positives against true positives in different ways.
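For reproducibility, a hedged sketch of how such ROC curves can be drawn with the ROCR package is shown below; svm_model, rf_model and test_set are the hypothetical objects from the earlier sketches, and the positive class is assumed to be labelled "1":

library(ROCR)

# Class-1 probabilities from the SVM (requires probability = TRUE at fit time).
svm_prob <- attr(predict(svm_model, newdata = test_set, probability = TRUE),
                 "probabilities")[, "1"]
# Class-1 probabilities from the random forest.
rf_prob <- predict(rf_model, newdata = test_set, type = "prob")[, "1"]

# Build TPR/FPR performance objects for both models.
svm_perf <- performance(prediction(svm_prob, test_set$retained), "tpr", "fpr")
rf_perf  <- performance(prediction(rf_prob,  test_set$retained), "tpr", "fpr")

# Overlay the two ROC curves.
plot(svm_perf, col = "blue")
plot(rf_perf,  col = "red", add = TRUE)
abline(0, 1, lty = 2)   # diagonal = random classifier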
1.5.2. TOP 20:
• Rates Comparison:
The goal of trying the TOP 20 features is to see how the algorithms cope with roughly a third of the features, and which one performs better:
SVM Accuracy Rate (65%) = RF Accuracy Rate (65%)
As a result, we can clearly see that there is no difference in accuracy between the SVM model and the Random Forest model. The only difference worth mentioning is the tuning possibilities, which are much broader in the SVM case.
• ROC Curve:
Figure 133: SVM VS Random Forest ROC Curve (TOP 20 Features)
The second ROC curve shows that here too, both traces have pros and cons: while the SVM curve is closer to the left side of the chart, the Random Forest one covers a larger area under the curve.
2. Approved Models:
Given these results, what remains is to choose a model. The second and third iterations of the SVM model can be discarded outright, especially since most of the rates degrade as the cost decreases; iterations 2 and 3 perform poorly compared to the first iteration or to the Random Forest model.
When comparing the SVM and the Random Forest, however, whether on the TOP 20 or on all the features, either choice gives a good model to start with. From a business perspective, we opted for the SVM model for the following reasons:
- 65% accuracy (the same as the Random Forest);
- Fewer false positives than the Random Forest: it is crucial for us to avoid false positives.
As explained before, spending time on users that the algorithm flags as retained while they are not can make us lose a large amount of time and resources.
3. Next Steps:
After getting these results, one of the first steps will be to determine the average amount of activity associated with the heavily weighted features and to try to reproduce it for most of our new and existing users, in order to extract real value from the analysis. The second thing to do is to find a proxy for the get_post_details interactions: this event in the database can mean scrolling, refreshing, clicking, etc., so once we know how many get_post_details a user needs to perform in order to be retained, we still have to identify which concrete interaction it corresponds to. Finally, the analysis so far only used 72 hours of actions and 4 weeks of data; looking at retention further into the future would give us a more solid model and a more reliable prediction of user behavior.
Chapter 6: Deployment
Whatever the quality of the model, of the process automation or of the data warehouse design, this last chapter of the report describes where and how these parts of the project were deployed for use within the company.
1. Process Automation project:
1.1. Deployment environment:
The process automation was implemented as a web app using the Shiny package of R Studio. Nevertheless, to let everyone take full advantage of it, we had to deploy it rather than keep it on a local machine. One option was to use the free servers of R Studio, but these only allow 23 hours per month, which is nothing compared to the availability we need and the high number of experiments we want to run. A paid plan with the same company would cost almost 1000 euros per user.
As only 5 to 6 people will use the app, we deployed it on an Amazon instance under 6 different names, so that every user has his own "session" (proper session management is only possible with the paid R servers).
Using an m4.xlarge EC2 instance from AWS (Amazon Web Services), we have 4 vCPUs and 16 GB of RAM to handle the activity of all these users together. We then had to fix the port and create one link per user, plus one for development:
- http://bi.int.jodel.com:XXXX/BIAnalysis/ : Development
- http://bi.int.jodel.com:XXXX/BIAnalysis1/ : Product Employee 1
- http://bi.int.jodel.com:XXXX/BIAnalysis2/ : Product Employee 2
- http://bi.int.jodel.com:XXXX/BIAnalysis3/ : Product Employee 3
- http://bi.int.jodel.com:XXXX/BIAnalysis4/ : Product Employee 4
- http://bi.int.jodel.com:XXXX/BIAnalysis5/ : Community Team
We also had the help of the Infra team to make the app accessible only through the company's VPN.
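As a simplified illustration of the serving side (not the exact configuration used), each copy of the app can be launched on the instance with a fixed port so that the routing behind the VPN can point to it; the directory path below is a hypothetical placeholder mirroring the BIAnalysis links above:

# Launch one copy of the Shiny app per user directory on a fixed port.
# (Run on the EC2 instance, e.g. inside a screen/tmux session or as a service.)
library(shiny)
runApp(appDir = "/srv/shiny/BIAnalysis1",  # hypothetical path for Product Employee 1
       host = "0.0.0.0",                   # listen on all interfaces (VPN-only via firewall)
       port = 3838)                        # placeholder port, one port per copy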
1.2. Overview:
1.2.1. Screenshots:
Figure 134: Welcoming Screen of the Shiny App
This is the starting screen of the app. It is also the first step for the product team when starting an experiment: choosing the sample size.
Figure 135: User Uploading Screen
The second tab of the app is the upload of the user IDs and the time frame.
Figure 136: Calculating KPIs and getting the results Screen
The third tab, and one of the two possible paths, is to compute the KPIs needed for the experiment.
Figure 137: Plotting the results screen
The fourth tab plots the results of the analysis for a better understanding.
Figure 138: Retention Calculation Screen
The second possibility after uploading the IDs is to calculate retention and plot it in the same frame.
Figure 139: Getting insights about the users screen
The last tab is the user analytics view, which a product team member can use to see how the uploaded users behave within the different clusters created in previous analyses by the BI team.
2. Datawarehouse project:
2.1. Deployment environment:
The backend engineers already had experience with the different Amazon instances and with the previous ETL processes and databases. This is why, this time, they evaluated Amazon's data warehousing service. Using the combination of Amazon Redshift and an S3 bucket, we have a constant backup of the data in S3, while the computation is handled through AWS Lambda, with the possibility of linking the data warehouse to a NoSQL database.
As a result, we will have a fully scalable data warehouse that is flexible for today's needs and for future ones, should we decide to collect new data.
Figure 140: Data Flow from s3 to redshift using AWS technology
Given the amount of data we have and the agreement to keep only 6 months of data in the data warehouse (for resource reasons), the chosen cluster generation is ds2.xlarge. With 2 TB of storage and 0.4 GB/s of I/O speed per node, we assume it will be sufficient, in a first stage, to fulfill the requirements of our DW.
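To illustrate the S3-to-Redshift flow of Figure 140, data landed in S3 is typically loaded with Redshift's COPY command; the sketch below issues such a command from R through DBI/RPostgres, with the connection details, bucket path, table name and IAM role all being hypothetical placeholders:

library(DBI)
library(RPostgres)

# Connect to the Redshift cluster (placeholder credentials).
con <- dbConnect(RPostgres::Postgres(),
                 host = "example-cluster.redshift.amazonaws.com",
                 port = 5439, dbname = "dwh",
                 user = "bi_user", password = Sys.getenv("REDSHIFT_PWD"))

# Bulk-load JSON event files from S3 into a fact table.
dbExecute(con, "
  COPY dwh.fact_interactions
  FROM 's3://example-bucket/events/2017/09/'
  IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
  FORMAT AS JSON 'auto'
  GZIP;
")

dbDisconnect(con)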
2.2. First Results:
One of the most impressive experiences one can have is to see one's own design come to life. As owner of the data warehouse project, I had the opportunity to follow its creation step by step and to test it myself.
Here are the first results:
Figure 141: New Datawarehouse view on Datagrip
Although some tables are still missing, it is a good start, especially since the team is working with a new language (Python). Progress is slow but steady.
Figure 142: User Dimension view on datagrip (New Datawarehouse)
Figure 143: Fact product view on datagrip (New Datawarehouse)
Figure 144: Content Dimension view on datagrip (New Datawarehouse)
As you can see, some of the dimensions are already built, while some of the fact tables are either missing or still have configuration problems. Still, it shows that the deployment is moving forward.
Figure 145: User Dimension Data (New Datawarehouse)
Figure 146: Content Dimension Data (New Datawarehouse)
Figure 147: Interaction Dimension Data (New Datawarehouse)
General Conclusion
Social media is by far one of the most interesting fields to work in today, because the analyst is trying to understand the behavior of the user: a real person behind a screen who reacts as his real self. Moreover, being an analyst at Jodel is even better because it is an anonymous app. As the CEO, the owner of the idea, explained to me, he opted for anonymity to make users feel comfortable and free to behave without constraints. As a result of this feature, we find ourselves analyzing people in their true state and getting the best out of it: for them, by providing the best experience, and for us, by making the best choices to improve our app.
On the other hand, having such a limited amount of information makes it hard to reach all our goals without using proxies. For example, clicking on a post to read it is described in the database by a single event, but that could also mean simply opening the post without reading it. Another point we are still missing is that short posts can be read directly from the main screen of the app, without any click. To overcome these obstacles, APIs exist that record the user's views, and we could also create different events depending on how long a specific window stays open.
Finally, I think that with all the knowledge acquired during these months and the first insights already gained, the next step will be to go deeper into the field of anonymous apps.
In fact, we have all heard stories about Facebook selling personal data or sharing it with other social media for money. Whether they are true or false, users are now afraid of giving away any piece of information and of having their private data travel from one website to another. Finding a platform that protects them by simply not asking for an email or a name will change their conception of connecting to communities, and the aim becomes truly free communication and contribution between human beings. Using machine learning techniques to predict what a user would like or dislike from a content point of view would make this even easier and better. In the long term, this could help us reach what philanthropists and good-hearted people have always dreamed of: making the world a smaller place.
Appendices
1. Literature Review:
1.1. KPIS:
KPI is the abbreviation of Key Performance Indicator.
It is a quantifiable measure that every company uses to get a good overview of its performance. After fixing the strategic goals, companies use KPIs to evaluate their evolution in all possible fields; finance, marketing, sales and human resources should all quantify their changes or progress in order to have a complete overview of their evolution and of the results of their activities.
There are two main families of KPIs:
• Financial KPIs: net profit, net profit margin, gross profit margin.
• Non-financial KPIs: D57, DAU, MAU.
Over its months of activity, Jodel has chosen different KPIs that reflect the activity of the company and its consequences.
1.2. DAU:
DAU is the abbreviation of Daily Active Users.
As the name says, this KPI describes the number of users who were active on a specific date, and it reflects the stickiness of the users to the product.
As an example, Facebook recorded a total of 1.28 billion DAU in March 2017.
1.3. WAU:
WAU is the abbreviation of Weekly Active Users.
Covering a wider range than the DAU, it measures how many users were present every week; a single activity is enough for a user to be counted. It also reflects the performance of a social media product.
1.4. MAU:
MAU is the abbreviation of Monthly Active Users.
Even if its calculation changes from one company to another (Twitter counts users with 30 followings, Facebook counts users who perform core actions such as sharing or commenting), everyone agrees on a single definition: the user has to perform an activity within the app.
Figure 148: Instagram MAU Evolution
1.5. DX7:
DX7 is a metric that changes according to the company's needs: "D" stands for day, "X" for a number between 1 and 7, and "7" for the total number of days in a week. It shows how the app is used by the users within a week and is a sort of personalization of the WAU.
A D57 user is a user who uses the app 5 days per week. Having a high number of D57 users for a mobile app or a social media product means that the users are highly engaged.
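As an illustration of how these activity KPIs can be derived from raw events, here is a hedged sketch using dplyr; the events data frame with user_id and timestamp columns is a hypothetical stand-in for the real event store:

library(dplyr)
library(lubridate)

# DAU: distinct active users per calendar day.
dau <- events %>%
  mutate(day = as_date(timestamp)) %>%
  group_by(day) %>%
  summarise(dau = n_distinct(user_id))

# WAU: distinct active users per week.
wau <- events %>%
  mutate(week = floor_date(as_date(timestamp), unit = "week")) %>%
  group_by(week) %>%
  summarise(wau = n_distinct(user_id))

# D57 users: users active on at least 5 distinct days of a given week.
d57 <- events %>%
  mutate(day = as_date(timestamp),
         week = floor_date(day, unit = "week")) %>%
  distinct(user_id, week, day) %>%
  count(user_id, week, name = "active_days") %>%
  filter(active_days >= 5)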
1.6. Cohort:
In statistics, a cohort is a group of people who share a specific characteristic or who lived through the same experience or event in a specific period of time. Applied to our field of study, this can be registering or commenting on a post.
It can be used in two different ways:
• Prospective cohort study:
a cohort study that selects a sample of people according to a specific characteristic and follows them over time to see how a specific change affects them.
• Retrospective cohort study:
a cohort study that uses historical data to see how being exposed to a specific factor changed a cohort, and compares it to another cohort that was not exposed to this factor.
1.7. Retention:
Customer retention refers to the different activities a company or an organization undertakes to reduce user churn. When a company launches an app or a product, the most important part is acquisition, because it is the source of revenue and profit. On the other hand, it would not make sense for a company that aims to exist for a long time to lose all these consumers. To quantify how many users stay with us, everyone tends to use retention, a metric that shows how a cohort evolves over time.
Figure 149: Retention Table
The rows are the different cohorts that we analyze. In most cases, an analyst takes the people who registered on a specific day; the example above takes all those who registered on each day from 15 to 26 May 2014. The columns are the different days we want to analyze, from registration to day X. The 4th cohort, for example, represents the retention of users who registered in the app on 18 May, from Day 1 to Day 11; the result shows that only 2.13% of them are still using the product.
Retention is considered one of the main KPIs because it reflects the sustainability of the app and makes virality last longer. Most of the well-known social media companies check retention after 90 days: if a cohort subscribes on D1, how many of those users are still with us on D90? At Jodel we usually calculate D30 retention, which is always between 28% and 35%.
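As a hedged sketch, a DN retention figure of this kind can be computed from two hypothetical tables, registrations (user_id, reg_date) and activity (user_id, activity_date):

library(dplyr)

# Share of a registration cohort that is still active exactly N days later.
dn_retention <- function(registrations, activity, cohort_date, n) {
  cohort <- registrations %>% filter(reg_date == cohort_date)
  active_dn <- activity %>%
    filter(activity_date == cohort_date + n) %>%
    semi_join(cohort, by = "user_id")
  n_distinct(active_dn$user_id) / n_distinct(cohort$user_id)
}

# Example call: D30 retention of a cohort registered on a placeholder date.
# dn_retention(registrations, activity, as.Date("2017-08-01"), 30)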
Figure 150: Retention of different Social Media
1.8. Process Automation:
Business Process Automation is a business strategy that aims to contain costs. By costs, we mean material costs such as money, and immaterial costs such as time. This practice is used more and more in companies that are not yet monetizing their product. In fact, some social media companies spend a good part of their lives living on investors' money raised in so-called rounds: according to its valuation, the startup receives a specific amount of money that has to last until the next investment round, or until the product is monetized and the company reaches profitability.
Although there is already a wide range of products that support this automation, some companies opt for an in-house BPA tool. That was the case at Jodel.
1.9. Datawarehouse:
A data warehouse is a relational database that helps business stakeholders analyze data and make the best decisions.
Designed specifically for querying rather than for transactional processing, it mostly contains historical data and is an organized mixture of different data sources.
Although definitions vary from one website or book to another, William "Bill" Inmon, known as the father of data warehousing, had a specific way of introducing the characteristics of a data warehouse:
• Non Volatile: Data that enters the DW should never be altered, updated or deleted.
• Time Variant: The DW data should be time-driven in a way that one can track the status of a
specific entity/data in a specific moment of time. Example: the ability to see the karma of a
user in a particular moment of his lifecycle in the app.
• Subject Oriented: As the DW is an analytical tool, every business case inside a company
would need to answer specific questions related to a specific business case. Therefore, a data
warehouse should always be subject-oriented.
• Integrated: As specified above, Our Analytical DB should contain clean and formatted data.
Discrepancies like having F for female in a row and FEMALE in another one are not allowed.
Completely independent from the source, the final result should be one clear integrated Data
warehouse.
Figure 151: Data warehouse VS Operational System
Figure 152: Relation between an Operational System and a Data warehouse
A data warehouse is composed of two types of tables that are linked to each other:
1.9.1. Dimension:
Usually composed of one or more hierarchies, it describes a dimensional value. It typically contains text or descriptive data, and there are different types of dimensions:
• Junk dimensions - a collection of miscellaneous attributes that are unrelated to any
particular dimensions.
• Degenerate dimensions - data that is dimensional in nature but stored in a fact table.
• Role playing dimensions - a dimension that can play different roles in a fact table
depending on the context.
• Conformed dimensions- a dimension that has exactly the same meaning and content
when being referred to from different fact tables.
1.9.2. Fact:
A fact table is the center of the star schema and is usually surrounded by dimension tables. It contains the measures of the business you want to analyze (revenue for an e-commerce company, for example) and foreign keys to the dimension tables. It always contains quantitative data and uses the attributes in the dimensions to choose how to analyze the data (monthly, women only, by city). There are multiple types of fact tables; the most important ones are:
• Cumulative: this type of fact table describes what has happened over a period of time. For example, it may describe the total sales by product, by store and by day. The facts in this type of table are mostly additive facts.
• Snapshot: this type of fact table describes the state of things at a particular instant in time, and usually includes more semi-additive and non-additive facts.
The quantitative data found in a fact table are called measures and can be:
• Additive measures: the class of fact measures that can be aggregated across all dimensions and their hierarchies. Example: one may add sales across all quarters to obtain the yearly sales.
• Semi-additive measures: the class of fact measures that can be aggregated across all dimensions and their hierarchies except the time dimension. Example: a daily balances fact can be summed up across the customer dimension but not across the time dimension.
• Non-additive measures: the class of fact measures that cannot be aggregated across any dimension or hierarchy. Example: facts that hold percentages or calculated ratios.
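A small, hedged example makes the difference concrete; daily_balances is a hypothetical fact table with one row per customer and day:

library(dplyr)

# Semi-additive measure: balances can be summed across customers for one day...
total_balance_per_day <- daily_balances %>%
  group_by(day) %>%
  summarise(total_balance = sum(balance))

# ...but summing the same balances across days is meaningless;
# across the time dimension one takes e.g. the average or the closing value.
avg_balance_per_customer <- daily_balances %>%
  group_by(customer_id) %>%
  summarise(avg_balance = mean(balance))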
1.10. Taxonomy:
Going back to the source of the word, taxonomy finds its roots in the Greek language and is composed of "order" and "law". In short, it is the science of ordering things according to specific parameters. Historically, it was used mainly in science and biology to put animals into categories according to their characteristics. Nowadays, taxonomy refers to any task of collecting entities for further use or classification. Specifically, this technique is used at Jodel to collect the different events (client and server side) and categorize them for the creation of the data warehouse.
1.11. ETL:
Extract - Transform - Load: the process responsible for pulling data out of different types of sources and pushing it into the data warehouse. This task can be done either with tools such as Talend and MSBI, or coded from scratch using Python (the "Luigi" ETL framework by Spotify).
• Extract: consists of extracting the data from a source system such as an ERP, Google Analytics, a CRM or even text files; the data is consolidated in a staging area or an ODS (depending on the business choice) for future modification to guarantee its integrity.
• Transform: consists of transforming the data coming from different sources into one unique format. This involves modifications for dates, gender, etc.
These modifications can be (a small illustrative sketch of such a transform step is given after this list):
• cleaning: F for female, NULL to 0
• joining: lookup, merge
• transposing: rows to columns or columns to rows
• calculations: computing measures related to the business
• Load: sending all this clean and consistent data to a new database that is our final data warehouse. It really helps to disable all constraints and indexes before the load and to re-enable them when it is finished.
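As announced above, here is a small, hedged illustration of typical transform-step cleaning rules in R; the raw_users and communities data frames and their columns are hypothetical:

library(dplyr)

clean_users <- raw_users %>%
  mutate(
    # Cleaning: harmonise gender codes coming from different sources.
    gender = recode(gender, "F" = "FEMALE", "f" = "FEMALE",
                            "M" = "MALE",   "m" = "MALE"),
    # Cleaning: replace missing karma values by 0.
    karma = coalesce(karma, 0),
    # Transformation: normalise dates to a single format.
    registration_date = as.Date(registration_date, format = "%Y-%m-%d")
  )

# Joining: enrich the users with their home community (lookup-style join).
clean_users <- clean_users %>%
  left_join(communities, by = "community_id")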
1.12. Machine Learning:
As with other notions, machine learning has had different definitions from its creation until today, and data analysts and data scientists are still searching for the perfect one that would really describe this emerging technology or science. Nevertheless, two main definitions keep coming up, and they of course converge to the same meaning. Arthur Samuel described it as "the field of study that gives computers the ability to learn without being explicitly programmed". Tom Mitchell gives a more modern definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E". In summary, machine learning allows software applications to make more accurate predictions without being explicitly programmed for them; this is done using algorithms that take an input and apply statistical and mathematical analysis to produce a prediction as output.
There are two main categories of these algorithms and of learning:
• Supervised learning: requires a dataset with an input and the correct output (the training dataset), with the underlying idea that input and output are related. Supervised learning problems can be categorized into "regression" and "classification": regression tries to predict a continuous output (a value between 1 and 100, for example), while classification tries to predict a discrete output (an accept-or-deny status, for example).
• Unsupervised learning: is used to explore the data and approach problems with an unknown output; in other words, it is used when we have no idea what the results should look like. The other important point is that there is no feedback based on the predictions. We use it to explore the data and understand how observations could be linked (how Google groups news articles, for example).