0% found this document useful (0 votes)
7 views54 pages

rapportml (2)

Uploaded by

riahi.dorsaff
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views54 pages

rapportml (2)

Uploaded by

riahi.dorsaff
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 54

ESPRIT, Private Higher School of Engineering and Technologies

Machine Learning
Sprint IA
Performed by:
Team Dev Dynamos

Team Members:
 Dorsaf RIAHI
 Wissal DAOUD
 Slim-Fady HANAFI
 Raslen FERCHICHI
 Saif ROMDHANI
 Mohamed Aziz BENDEG
 Supervised by: Supervisor: Mrs. Jihen HLEL
Summary
Contents
Chapter 1 – Project Context and Problem Framing................................................................................5
1.1Introduction:.....................................................................................................................................5
1.2Problem Statement :.........................................................................................................................5
1.3Proposed Solution :...........................................................................................................................6
Chapter 2 – Project Structure and Methodology....................................................................................8
Chapitre 3 – Outils et Technologies Utilisées.......................................................................................13
Introduction:....................................................................................................................................13
Chapitre4:Business Understanding and Project Objectives..................................................................17
General Conclusion..............................................................................................................................53
List of Figures

Figure 1 CRISP-DM Methodology workflow applied............................................................................11


Figure 2 Distrubtion of stipends...........................................................................................................21
Figure 3 Top 10 competences..............................................................................................................21
Figure 4: Definition of mean_top_similarity.........................................................................................23
Figure 5: Train bs Train performance comparison................................................................................24
Figure 6: Resultas.................................................................................................................................25
Figure 7 :User interface of internship recommandation......................................................................26
Figure 8: Correlation Matrix.................................................................................................................28
Figure 9 : Distribution of the number of places....................................................................................29
Figure 10 Distribution of internship durations.....................................................................................30
Figure 11 Boxplots................................................................................................................................32
Figure 12 Elbow curve..........................................................................................................................33
Figure 13: Confusion Matrix................................................................................................................34
Figure 14: user Interface success Prediction........................................................................................37
Figure 15: User Interface Failure Prediction.........................................................................................38
Figure 16: Distribution of sentiment....................................................................................................40
Figure 17: Matrix correlation................................................................................................................41
Figure 18: KNN.....................................................................................................................................42
Figure 19 Matrix confusion...................................................................................................................43
Figure 20: Accurancy comparaison Accross Models.............................................................................44
Figure 21: User interface sentiment.....................................................................................................45
Figure 22 Correlation Matrix between numerical features..................................................................47
Figure 23: Outliers................................................................................................................................49
Figure 24: Distribution of participants by education level....................................................................49
Figure 25: courbe Roc-XGboost............................................................................................................51
Figure 26:Learning Curve for KNN........................................................................................................52
Figure 27: User interface dropout Risk.................................................................................................53
References and Bibliography

[1] Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras,


and TensorFlow. O'Reilly Media.
[2] Chollet, F. (2018). Deep Learning with Python. Manning Publications.
[3] Pedregosa et al. (2011). Scikit-learn: Machine Learning in Python. Journal of
Machine Learning Research.
[4] Kaggle.com – Educational datasets and competitions.
[5] https://scikit-learn.org – Scikit-learn documentation.
Chapter 1 – Project Context and Problem
Framing

1.1Introduction:
In today's data-driven world, information has become a key strategic asset.
Artificial Intelligence (AI) and Machine Learning (ML) are now essential tools for
extracting value from data, providing powerful levers for analysis, prediction,
and decision-making. This project is aligned with that dynamic, aiming to
harness the capabilities of machine learning to address concrete analytical
objectives.
Our work relies on a real-world dataset that includes detailed information on
student profiles, internship opportunities, and academic journeys. By applying
appropriate data processing and modeling techniques, we strive to build
predictive models that support insightful decision-making in educational and
career-oriented contexts.
Throughout this report, we follow the CRISP-DM methodology to ensure a clear
and rigorous presentation of our AI sprint. Each stage — from business
understanding to deployment — is carefully detailed with relevant figures,
clean visualizations, and scientifically grounded references, ensuring
coherence, clarity, and value throughout the entire document.
.
1.2Problem Statement :

In an increasingly digitalized educational environment, retaining


students and supporting their personalized learning journeys have
become critical priorities for academic institutions and training
platforms. Despite this, many students struggle to stay engaged,
identify relevant internship opportunities, or receive timely support,
often leading to disengagement or dropout. This project addresses the
core challenge of how machine learning can be leveraged to model
student behavior, detect dropout risks, and recommend tailored
academic or professional pathways in a predictive and personalized
manner. It explores how to segment learners based on behavioral and
performance indicators, recommend internships aligned with
individual profiles, predict the likelihood of dropout, and interpret
emotional signals from student feedback. By structuring our work
around dedicated Data Science Objectives (DSOs), we provide a
comprehensive and intelligent framework to support learners and
optimize educational outcomes.
1.3Proposed Solution :
To address the identified challenges, we propose a modular and data-
driven solution based on supervised and unsupervised machine
learning techniques. The project is organized into multiple targeted
Data Science Objectives (DSOs), each designed to solve a specific
sub-problem related to student engagement, progression, and
personalization. Our approach relies on the CRISP-DM methodology,
ensuring a structured process from data understanding to model
deployment. By using real educational and behavioral data, we
implement classification models (such as XGBoost and Logistic
Regression) to predict dropout risk, clustering techniques for profile
segmentation, and recommendation systems based on similarity
measures to personalize content and internship suggestions. The final
models are deployed in a real-time, user-friendly application using
Flask and Angular, enabling administrators and trainers to make
informed decisions and take early action when students are at risk.
This integrated solution aims to enhance educational outcomes, reduce
dropout rates, and support personalized learning pathways.
Chapter 2 – Project Structure and
Methodology
2.1 Project Structure

This AI sprint project was structured as a collaborative and modular initiative,


where each team member was assigned a distinct Data Science Objective
(DSO). This structure allowed team members to work in parallel on specific
machine learning problems using individual notebooks, promoting autonomy,
focus, and depth of exploration. By distributing the work across clearly defined
objectives—such as dropout prediction, sentiment analysis, or personalized
recommendations—we ensured both coverage and specialization in the
development process.
To guide the implementation of each DSO, the team adopted the CRISP-DM
(Cross Industry Standard Process for Data Mining) methodology, a widely
recognized framework in both academic and industrial data science projects.
CRISP-DM breaks down a data science process into six well-defined stages:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
Its structured, iterative nature enabled a systematic approach to problem-solving,
ensuring coherence and scientific rigor throughout the project lifecycle.
2.2 Applied Methodology and Workflow
The CRISP-DM framework was implemented across all notebooks, ensuring a
consistent and professional approach from the initial problem framing to final
deployment. Each DSO followed this pipeline to deliver robust, interpretable,
and application-ready machine learning components.
The methodology also emphasized data ethics, explainability, and
generalization, which are critical in the context of education-related AI systems
that impact real users.
2.3 Core AI and Data Science Concepts
The project integrated a broad range of essential machine learning and artificial
intelligence techniques, tailored to each DSO. These include:
 Supervised Learning: Training models on labeled data for classification
tasks such as dropout risk prediction. Algorithms used include:
o Logistic Regression
o Random Forest
o Support Vector Machines (SVM)
 Recommendation Systems: Leveraging algorithms like K-Nearest
Neighbors (KNN) and cosine similarity to suggest personalized
internships and content to students based on profile similarity.
 Model Evaluation Metrics:
o Accuracy: Proportion of correct predictions over all cases.
o Precision & Recall: Key indicators for imbalanced classification
problems.
o F1-Score: Harmonic mean of precision and recall.
o Confusion Matrix: A visual tool to assess prediction outcomes
versus actual values.
 Feature Engineering: Creating, modifying, or selecting relevant features
to improve model performance. This included encoding categorical
variables, generating sentiment scores, and scaling numeric inputs.
 Exploratory Data Analysis (EDA) & Visualization: Visual techniques
such as heatmaps, bar charts, and correlation matrices were used to
uncover hidden patterns and detect data quality issues prior to modeling.

2.4 Benefits of This Structure


This interdisciplinary and structured approach enabled not only strong technical
outcomes, but also produced insights that are meaningful in the context of
student engagement, academic support, and career guidance. The
modularity of the project made the integration of different ML components
flexible, and the CRISP-DM methodology guaranteed coherence and traceability
from start to finish.

FIGURE 1 CRISP-DM M ETHODOLOGY WORKFLOW APPLIED


2.5 Key Data Science and AI Concepts Used
In the context of this project, several core concepts from artificial intelligence
and data science were applied to achieve meaningful and actionable results:
 Supervised Learning: Machine learning techniques that train models on
labeled data. In this project, algorithms such as Logistic Regression,
Random Forest, and Support Vector Machine (SVM) were used for
classification tasks including dropout prediction.
 Recommendation Systems: Algorithms designed to suggest items based
on user profiles or content similarity. Here, K-Nearest Neighbors (KNN)
and cosine similarity were employed to recommend internships tailored
to each student’s background and preferences.
 Evaluation Metrics:
 Accuracy: The percentage of correct predictions.
 Precision & Recall: Metrics used to assess model
performance, particularly in imbalanced datasets.
 F1-Score: The harmonic mean of precision and recall,
balancing false positives and false negatives.
 Confusion Matrix: A visual diagnostic tool comparing actual
versus predicted outcomes.
 Feature Engineering: The process of selecting, transforming, or
generating new features to enhance model learning and performance.
 Data Visualization: Essential for Exploratory Data Analysis (EDA),
visualization tools such as heatmaps, bar charts, and correlation
matrices were used to uncover data patterns, anomalies, and feature
relationships.
2.6 Conclusion
This chapter outlined the methodological and structural foundations that guided
the successful execution of the AI sprint project. By adopting the CRISP-DM
framework, the team ensured a clear, step-by-step approach to solving complex
data science problems, from business understanding to model deployment. The
modular division into Data Science Objectives (DSOs) allowed each team
member to explore a specific problem space while contributing to a coherent and
integrated solution. The application of key AI concepts such as supervised
learning, recommendation systems, and data visualization techniques further
reinforced the robustness and practical relevance of the models developed. This
solid foundation not only enabled technical success but also ensured that the
solutions remained aligned with the educational goals of personalization,
retention, and student success.

.
Chapitre 3 – Outils et Technologies
Utilisées
Introduction:

Throughout this AI sprint project, a diverse set of tools and technologies


was employed to cover the entire lifecycle of the machine learning
workflow — from data collection and preprocessing to modeling,
evaluation, deployment, and user interaction.
Each tool was carefully selected based on its performance, ease of
integration, compatibility with the team's working environment, and its
relevance to the defined Data Science Objectives (DSOs). This
technology stack enabled efficient collaboration among team members
while ensuring clarity, reproducibility, and strong performance across all
stages of the project.
The following table summarizes the key technologies used, their specific
roles within the project, and the rationale behind their selection.
🔷 Logo 🛠 Tool / 🎯 Function ✅
Platform Justification

Python Main Flexible,


language for open-source,
data science rich in ML
and modeling libraries

Pandas / Data Reference


NumPy manipulation, libraries for
cleaning, and EDA
preparation
Scikit-learn Classical ML Well-
models: documented
KNN, SVM, and easy to
Logistic implement
Regression
XGBoost Advanced High
model for performance
dropout on structured
prediction data

Kaggle Dataset Widely used


source, platform for
collaboration data science
and projects
competitions
Flask Backend API Lightweight
for prediction and
deployment integrates
well with
ML models
Angular Web Modern and
interface for dynamic
interacting frontend
with framework
predictions
Git / Version Tracks code
GitHub control and changes and
team supports
collaboration teamwork

Google Notebook Cloud access


Colab / development and real-time
Jupyter and model collaboration
testing

Table 1 – Summary of Tools and Technologies Used

This project relied on a robust and well-integrated technology stack to


support the full data science lifecycle. Python served as the central
programming language due to its versatility and widespread adoption in
AI development. Libraries such as Pandas and NumPy were used
extensively for data cleaning, transformation, and manipulation, while
Scikit-learn facilitated the implementation of traditional machine learning
models including K-Nearest Neighbors, Support Vector Machines, and
Logistic Regression. For advanced prediction tasks like dropout risk
classification, the team employed XGBoost, known for its scalability and
high performance. Data sources were primarily obtained from Kaggle,
which provided a rich collection of realistic, structured datasets. For data
visualization and exploratory analysis, Matplotlib and Seaborn were used
to generate clear and interpretable plots such as correlation matrices and
ROC curves. To handle deployment, Flask was chosen as a lightweight
backend API framework, exposing model predictions through RESTful
endpoints. On the frontend, Angular was used to develop an interactive
and dynamic interface, enabling users to submit data and receive real-time
feedback. Google Colab supported collaborative development of Jupyter
notebooks, while Git and GitHub ensured version control and seamless
teamwork. For recommendation and sentiment analysis, the team applied
text-based techniques using TF-IDF vectorization, cosine similarity, and
sentiment scoring tools such as TextBlob. This coherent and
complementary set of tools enabled the team to design, build, and deploy
a reliable and user-friendly AI solution tailored to educational needs.
Chapitre4:Business Understanding and Project
Objectives

1.Business Understanding
1.1 .Project Objective
The central goal of this project is to enhance student engagement and retention
by personalizing their academic and professional journeys through Machine
Learning techniques. In a higher education environment where dropout rates and
misaligned career paths are major concerns, data-driven personalization has the
potential to significantly improve student satisfaction, support services, and
career outcomes.
This aligns with the broader mission of modern educational institutions: to guide
students effectively, increase retention, and match learners with the most
suitable opportunities, whether academic or professional.
1.2Defined Data Science Objectives (DSOs):
To support the overarching goal of increasing student retention and
personalization through machine learning, six complementary Data Science
Objectives (DSOs) were defined. Each DSO targets a specific challenge within
the student journey, from understanding profiles to anticipating disengagement.
The table below summarizes each objective:
DSO Description

Identify distinct student profiles using unsupervised learning (e.g.,


1. Profile Segmentation
clustering) to support differentiated educational strategies.

2. Internship Offer Group internship offers based on shared characteristics to better match
Segmentation students with suitable opportunities.

Develop a recommendation system that suggests relevant internship


3. Personalized Internship
opportunities based on a student’s academic background and
Recommendation
application history.

4. Personalized Content Provide customized academic or extracurricular content tailored to each


Recommendation student's learning path and interests.

5. Academic Dropout Risk Predict the likelihood of academic dropout using classification models,
Prediction allowing early intervention and support.

Extract and interpret students’ emotional and motivational signals from


6. Sentiment Analysis
textual feedback using Natural Language Processing (NLP) techniques.

Table2:DSO
2. Data Understanding
This section presents the data exploration performed for each defined Data
Science Objective (DSO). Each team member worked independently and
applied Exploratory Data Analysis (EDA) to extract meaningful insights and
prepare the data for modeling.
2.1 DSO 1 – Personalized Internship Recommendation :
Objective
This module aims to build an intelligent system that recommends internship
offers to students based on their academic background and application
preferences. The core goal is to match students with the most relevant
opportunities using similarity-based machine learning techniques.
2.1.1Data Understanding:
The initial dataset includes detailed information about:
 Student IDs and features extracted from their academic path.
 Internship offers, including titles, descriptions, locations, required skills,
and stipends.
 Historical data such as past applications and acceptances.
Key Observations:
 Most internships are located in major cities and sectors like Data Science,
Web Development, and Marketing.
 The stipend (Stipend) distribution was highly skewed, with some unpaid
or extremely low-paid internships.
 Text features such as internship titles and skill descriptions were rich in
semantic information, justifying NLP techniques.
FIGURE 2 DISTRUBTION OF STIPENDS

FIGURE 3 T OP 10 COMPETENCES
2.1.2Data Preparation :
To prepare the dataset for modeling, several preprocessing steps were
implemented to ensure the quality and consistency of the data used for the
recommendation engine. First, a text preprocessing pipeline was applied
to internship titles and skills, including lowercasing, stop word removal,
and stemming. This was followed by the application of TF-IDF
vectorization, which transformed the textual data into numerical vectors
capturing the importance of each term across all internship descriptions.
In parallel, feature engineering was performed to encode relevant
categorical variables such as city, internship duration, and offer category.
These features were then combined with the text-based vectors to
construct a comprehensive student-offer profile vector.
To measure the similarity between students and internship offers, a cosine
similarity matrix was computed using the TF-IDF vectors. This allowed
the system to quantify how closely each student’s profile matched each
offer. Finally, a Top-N retrieval mechanism was implemented using K-
Nearest Neighbors (KNN) to extract the top 5 most relevant internship
recommendations for each student based on the similarity scores.
To assess the quality of the recommendation engine, a custom evaluation
metric was defined: the mean cosine similarity of the Top-5
recommended items. The function mean_top_similarity() computes, for
each job offer, the average similarity between it and its top 5 closest
recommendations (excluding itself). This method provides a reliable
measure of how semantically relevant the recommendations are. The
global average over the dataset offers a solid indicator of model
performance.

FIGURE 4: DEFINITION OF MEAN_TOP _SIMILARITY

In order to validate the generalization ability of the recommendation


system, we conducted a comparative evaluation between the training set
and the test set using the previously defined Top-5 mean similarity metric.
The results show an average similarity of 0.9253 on the training set and
0.8337 on the test set, indicating strong model performance with minimal
overfitting and good generalization to unseen data.
FIGURE 5: TRAIN BS T RAIN PERFORMANCE COMPARISON

The recommendation model is based on a content-based filtering


approach, which relies on computing the similarity between the student's
profile and each internship offer. Since this problem does not require a
traditional supervised learning setup with target labels, the modeling
process focuses on accurately capturing and comparing feature vectors.
The core of the model uses TF-IDF vector representations of internship
titles and required skills. These vectors are then used to calculate cosine
similarity scores, representing how semantically close each offer is to a
given student profile. The combination of textual data (title, skills) and
structured features (city, duration, category) ensures a rich and
representative profile for both students and internships.
To retrieve recommendations, the model applies a K-Nearest Neighbors
(KNN) strategy on the similarity matrix. For each student, the system
identifies the top N internship offers with the highest similarity scores —
typically the top 5 — and returns them as personalized suggestions. This
approach ensures that recommendations are contextually relevant, even
for students who have not previously applied to any internship.
FIGURE 6: RESULTAS
3. Deployment

FIGURE 7 :U SER INTERFACE OF INTERNSHIP RECOMMANDATION


2.3 DSO 2 – Academic Dropout Risk Prediction:
Objective
The goal of this DSO is to build a predictive model capable of identifying
students at risk of dropping out. Early detection of such students enables timely
academic support and intervention, which can greatly improve retention and
student success.
Data Understanding
The dataset includes detailed student records and internship-related features. The
exploration phase involved:
 Loading and verifying the dataset structure
 Handling missing values
 Statistical analysis of key variables (stipend, duration, etc.)
 Visualization through histograms, boxplots, and correlation heatmaps
 Detection and treatment of outliers
As part of the data understanding phase, a correlation matrix was used to
analyze the relationships between numerical variables. The matrix
revealed low correlations overall, indicating the importance of combining
multiple features to build predictive models. Distribution analysis showed
skewness in internship duration and class imbalance in the success
variable. These findings guided the selection of appropriate preprocessing
and modeling techniques in subsequent phases.
FIGURE 8: C ORRELATION MATRIX

Distribution of Available Positions


The variable nbr_places (number of internship positions) was analyzed to
understand its distribution across offers. The histogram shows a right-skewed
distribution, with the majority of internship offers providing between 250 and
350 positions. Very few offers exceed 600 places, indicating that large-scale
opportunities are rare.
FIGURE 9 : D ISTRIBUTION OF THE NUMBER OF PLACES

📝 Observation: The data is dense around the center, forming a nearly bell-
shaped curve, though slightly skewed to the right.
This insight is important for later modeling steps, as it highlights potential
outliers and helps understand the availability landscape for internship
candidates.
Distribution of Hackathon Duration
The histogram above shows the distribution of the duree_jours variable, which
represents the duration of hackathons (or internships) in days. The data appears
evenly spread across the range from 1 to 10 days, with no significant skewness
or concentration at a particular value.
FIGURE 10 DISTRIBUTION OF INTERNSHIP DURATIONS

📝 Observation: The frequency is relatively stable across all values, indicating a


uniform distribution of internship durations in the dataset.
This uniformity suggests that duration alone may not be a discriminative factor
in predicting outcomes, but could interact with other variables such as success or
dropout status.
3. Data Preparation
The data preparation phase involved transforming the raw dataset into a clean
and machine-learning-ready structure. Several key steps were carried out to
handle missing data, encode categorical variables, standardize features, and
ensure consistency across all observations.
Handling Missing Values
The dataset was first inspected for missing values. Variables with null or empty
entries were treated using appropriate imputation techniques depending on the
data type:
 Numerical values were filled with mean or median imputation.
 Categorical fields were imputed using the most frequent value (mode).
This step ensured that no NaN values interfered with the modeling pipeline.
Removing Redundant or Irrelevant Columns
Several columns that had no predictive value (e.g., IDs, timestamps, descriptive
text with no structure) were removed to reduce dimensionality and improve
model efficiency.

Encoding Categorical Variables


All categorical variables were encoded into numerical format:
 Label encoding was applied to binary or ordinal variables.
 One-hot encoding was used for nominal features such as location,
category, or sector.
This step transformed textual data into machine-readable vectors.
Outlier Detection and Treatment
Outliers were identified using boxplots and z-score thresholds. Key numeric
fields such as nbr_places and duree_jours were inspected. Some extreme values
were either removed or capped to avoid distorting model performance.
FIGURE 11 BOXPLOTS

4. Modeling
In this phase, we built and evaluated machine learning models to predict the risk
of academic dropout. Two classification algorithms were implemented and
compared: K-Nearest Neighbors (KNN) and Support Vector Machine
(SVM).

The first model implemented was K-Nearest Neighbors (KNN), a distance-based


classification algorithm. It predicts a student’s risk of dropout by examining the
outcomes of the k most similar students in the training set. To prepare for
modeling, the dataset was split into 80% training data and 20% test data to allow
for reliable performance evaluation.
To determine the optimal number of neighbors, the elbow method was applied.
By plotting the model's error rate for different values of k, the curve revealed
that k = 13 provided the best trade-off between underfitting and overfitting.
Once the ideal value of k was selected, the model was trained on the
standardized feature set, ensuring all variables contributed equally to the
distance calculations.
Upon testing, the model returned binary predictions (0 for no risk, 1 for dropout
risk) for each student. The quality of the predictions was assessed using a
confusion matrix, ROC curve, and standard classification metrics. The model
demonstrated strong performance, showing high accuracy and a good ability to
discriminate between at-risk and retained students.

FIGURE 12 ELBOW CURVE


The performance of the K-Nearest Neighbors (KNN) model was assessed using
a confusion matrix, which visually represents the model’s classification ability
on the test dataset. As shown in the figure, the model correctly predicted 390
students as at risk (Échec) and 99 students as likely to succeed (Succès).
However, it also misclassified 87 students who would have succeeded as
failures, and 195 successful students were wrongly predicted as dropouts.
These results suggest that the model has a strong ability to identify at-risk
students (true negatives), but a lower precision in detecting successful ones. This
imbalance can be interpreted as the model being more conservative —
prioritizing risk detection over success recognition, which is often preferable in
educational contexts where preventing dropouts is critical.
The overall structure of errors reveals that while the model is not perfect, it
provides valuable guidance for early intervention, particularly for students most
likely to disengage or fail.

FIGURE 13: C ONFUSION MATRIX


5. Evaluation
The performance of the predictive models was rigorously evaluated using a
combination of quantitative metrics and visual diagnostic tools. The evaluation
involved calculating standard classification metrics such as accuracy, precision,
recall, and F1-score, as well as examining the ROC curve and AUC to gauge the
discriminative ability of the models. In particular, the confusion matrix obtained
from the KNN model—illustrating a high number of correctly identified at-risk
students versus some misclassifications among successful cases—suggests that
the model is effectively prioritizing the detection of potential dropouts. While
the model shows a strong ability to flag at-risk individuals, adjustments such as
threshold tuning or further feature engineering might be explored to improve the
precision for predicting student success. Overall, these evaluation steps confirm
that the model performs adequately for early intervention strategies in
educational settings, ultimately enabling more targeted support for students at
risk of dropping out.

6. Deployment
After completing the evaluation phase, the final step of the CRISP-DM process
involves deploying the predictive system into a real educational setting. The
objective is to transform the machine learning model into a fully operational
application that supports academic staff in identifying at-risk students and taking
timely, data-driven actions.
In our case, the trained KNN and SVM models were deployed through a Flask-
based REST API, enabling seamless integration with a front-end interface. This
setup allows real-time predictions using new student data submitted through a
web form or dashboard.
The deployment solution includes the following components:
 Automated Prediction Interface: Academic staff or system administrators
can input student records through the front-end interface. The Flask API
processes the data, runs the trained model, and returns a prediction
indicating dropout risk (0 = low risk, 1 = high risk).
 Interactive Dashboard: The front-end displays prediction results clearly,
along with visual indicators such as risk level, confidence score, and
contextual explanations. This helps non-technical users (e.g., advisors)
interpret the results effectively.
 Alert and Notification System: When a student is predicted to be at high
risk, the system can trigger alerts (e.g., banners, color flags) on the
dashboard to prompt academic follow-up or guidance intervention.
 Continuous Learning via Feedback Loop: The system architecture allows
for future enhancements where outcomes (e.g., success or dropout status)
can be fed back into the model pipeline. This enables model retraining
and continuous improvement based on real student behavior.
By deploying the solution via Flask and integrating it into a user-friendly front-
end, the system bridges the gap between machine learning output and actionable
educational decisions. It demonstrates how AI can be effectively applied to
support institutional goals such as student retention and success.
User Interface Integration
To provide a user-friendly and accessible way for academic staff to interact with
the prediction model, a web interface was developed and integrated into the
existing educational platform. This front-end module allows users to input key
parameters related to a hackathon or training session—such as location, number
of seats, duration, difficulty level, and importance level—and instantly receive a
prediction on the probability of success.
Once the user clicks the "Tester" button, the data is sent to the back-end Flask
API, which processes the input using the trained machine learning model (KNN
or SVM). The prediction is then returned and displayed dynamically in the
interface, indicating either:
 ✅ Succès: if the probability of success exceeds the model threshold (e.g.,
62.84% in the second example).
 ❌ Échec: if the predicted likelihood of success is lower (e.g., 43.21%).
The result box is color-coded (green for success, red for failure) to ensure instant
visual feedback, and can optionally be followed by recommendations or next
steps
This interface demonstrates how artificial intelligence can be operationalized in
real time, offering valuable decision support in educational planning and student
engagement strategies.

FIGURE 14: USER INTERFACE SUCCESS PREDICTION


FIGURE 15: U SER INTERFACE F AILURE PREDICTION
DSO 3 – Sentiment Analysis :
1. Business Understanding
The goal of this analysis is to understand how a student's average sentiment
score—represented by the variable sentiment_moyen—relates to their academic
status. Instead of performing natural language processing on raw text, this DSO
leverages a precomputed numerical sentiment score to determine whether
students expressing more positive emotions tend to succeed academically.
This can help academic institutions integrate emotional indicators into their
student monitoring systems to detect disengagement, low motivation, or
potential dropout risk.
2. Data Understanding
The dataset includes the variable sentiment_moyen, which ranges from -1 (very
negative) to +1 (very positive). This score was likely computed using an
external NLP pipeline such as TextBlob or VADER applied to student
comments.
A histogram was generated to visualize the distribution of sentiment scores. The
result shows a concentration around neutral to slightly positive values,
suggesting that most students tend to express mildly positive or neutral
feedback.
FIGURE 16: DISTRIBUTION OF SENTIMENT
3. Data Preparation
Since sentiment_moyen is already numeric and clean, no additional text processing was
required. However:
 The variable was included in the set of numeric predictors for correlation and model
input.
 All numeric variables (including sentiment) were standardized using a StandardScaler
to ensure uniform contribution to distance-based models.
 Non-numeric features were either dropped or encoded, depending on their relevance.

4. Sentiment Score Analysis and Integration into Modeling:


4.1 Correlation with Academic Outcome
To determine the impact of emotional tone on student performance, the
sentiment_moyen variable was analyzed in relation to status_academique, the target
variable representing academic outcome.
 A correlation matrix was generated including all key numeric features.
 sentiment_moyen displayed a moderate positive correlation with academic success.
 This suggests that students who express more positive sentiments are statistically more
likely to perform better or complete their academic program
FIGURE 17: M ATRIX CORRELATION
4.2 Feature Role in Predictive Models
The sentiment_moyen variable was integrated into multiple machine learning models
as a behavioral predictor:
 Standardization was applied to align its scale with other features.
 It was combined with variables such as note_moyenne, score_activite, and
temps_platforme.
 Models tested include: K-Nearest Neighbors, Decision Tree, Logistic Regression, and
XGBoost.
📌 Observation: While sentiment_moyen alone was not sufficient for prediction, its
inclusion slightly improved overall model performance, particularly in capturing
disengaged or at-risk students.
FIGURE 18: KNN
The graph above shows the cross-validation accuracy of the K-Nearest Neighbors
(KNN) model for different values of K, with sentiment_moyen included as a feature.
As the number of neighbors increases, the model's stability improves gradually, with
the best validation accuracy observed around K = 26 to 28.
This confirms that sentiment_moyen contributes positively, albeit modestly, to the
classification performance when combined with other features. Although the precision
values remain moderate (~58%), the trend illustrates the model’s ability to capture
useful emotional signals embedded in the sentiment score.
Interpretation: The inclusion of sentiment in the model leads to better generalization
performance, and optimal K selection remains essential for tuning KNN behavior.
5. Evaluation
The performance of the KNN classifier was evaluated using standard classification
metrics, including accuracy, precision, recall, and the confusion matrix.
5.1 Confusion Matrix – KNN Model

FIGURE 19 MATRIX CONFUSION


This confusion matrix provides the following insights:
 The model correctly predicted 199 non-success cases and 217 success cases.
 It made 125 false negatives (actual 1, predicted 0) and 160 false positives (actual 0,
predicted 1).
 The model demonstrates a relatively balanced performance, though with room to
improve its precision on class 0 (non-success).
5.2 Model Comparison Overview:

FIGURE 20: A CCURANCY COMPARAISON ACCROSS MODELS

The bar chart above compares the accuracy scores of various


classification algorithms applied to the same dataset that included the
sentiment_moyen feature

From this comparison, we observe that:


 Logistic Regression and SVM yielded the best results in terms of
accuracy.
 Models like Decision Tree and Random Forest performed slightly lower,
likely due to overfitting or sensitivity to feature interaction.
 The inclusion of sentiment_moyen had a consistent, positive impact
across all models, although the effect varied in magnitude.
6. Deployment:

FIGURE 21: U SER INTERFACE SENTIMENT


6.1 Conclusion
This DSO explored the use of a precomputed sentiment score
(sentiment_moyen) as a predictive and analytical feature in the academic
context. Rather than applying traditional NLP to raw textual data, the
study focused on integrating this emotional indicator into classical
machine learning pipelines.
Key findings include:
 The distribution of sentiment scores was concentrated around neutral and
slightly positive values.
 sentiment_moyen showed a moderate positive correlation with academic
success, supporting its relevance.
 When used in models such as KNN, SVM, Logistic Regression, and
XGBoost, sentiment contributed to improved predictive performance.
 Additional analyses revealed slight demographic and academic
differences in sentiment expression (e.g., older students tended to express
more positivity).
While sentiment_moyen is not a dominant predictor, it adds a human
behavioral layer to data-driven decision-making and provides valuable
insights into student engagement and satisfaction.
DSO – Predicting Dropout Risk in Training Programs

1. Business Understanding
The objective of this DSO is to predict whether a training program participant is at
risk of dropping out (at_risk = 1). This predictive model helps improve training
retention, personalize follow-up, and enhance participant engagement.

1. Data Understanding:
To better understand the relationships between input features and the
target variable (at_risk), a Pearson correlation matrix was generated.
The strongest negative correlations with at_risk were observed for:
grade (-0.38)
profil_complet (-0.37)
score_activite (-0.34)
These findings suggest that academic performance, engagement, and
user profile completeness are important factors in dropout risk.
Other features like nb_sessions, note_theorique, or note_pratique
showed weak correlation individually, but may still contribute when
combined in a predictive model.
This correlation analysis helps prioritize features before modeling and
supports the feature selection strategy used in the next phase.

FIGURE 22 CORRELATION MATRIX BETWEEN NUMERICAL FEATURES


3. Data Preparation
Before training the machine learning models, several preprocessing steps were applied
to ensure the dataset was clean, consistent, and ready for classification tasks.
3.1 Handling Missing Values
Missing values were checked using df.isnull().sum() and handled appropriately.
Depending on the feature:
 Some missing values were imputed with the mean or mode.
 Others were dropped if not critical or if the proportion was too small.
3.2 Encoding Categorical Variables
Categorical features such as:
 Status (e.g., "Completed", "Dropped", etc.)
 grade (if stored as a category or label instead of numeric)
...were encoded using:
 LabelEncoder for binary or ordinal variables
 Or manual mappings (e.g., {'Completed': 0, 'Dropped': 1})
This step ensures that all features are in a numeric format compatible with machine
learning models.
FIGURE 23: OUTLIERS

FIGURE 24: DISTRIBUTION OF PARTICIPANTS BY EDUCATION LEVEL


This bar chart displays the number of participants grouped by niveau_etude, which
was encoded numerically as:
 0 → e.g., Secondary / High School
 1 → e.g., Bachelor
 2 → e.g., Master or higher
The distribution appears balanced, with a slight predominance of level 0. This
confirms that the dataset is not biased toward a specific education level, allowing the
model to generalize across different participant profiles.

4.Modeling :
Several classification algorithms were implemented and evaluated, including K-Nearest
Neighbors (KNN), Support Vector Machine (SVM), Decision Tree, and Logistic Regression.
After comparative analysis, XGBoost was selected as the final model due to its strong
performance, regularization ability, and compatibility with mixed data types.
To confirm its learning capacity, an initial ROC curve was plotted during training, showing an
AUC score of 1.00 on the training set, as seen below:
FIGURE 25: COURBE R OC-XG BOOST
📷 Figure – ROC Curve of XGBoost on Training Set
To evaluate model behavior with increasing data, a learning curve was plotted for the K-
Nearest Neighbors (KNN) model.
FIGURE 26:L EARNING CURVE FOR KNN
The chart shows that while the training accuracy gradually improves with more samples, the
validation accuracy remains relatively low and unstable (around 0.67–0.68). This indicates
that the KNN model is overfitting the training data and fails to generalize properly, especially
with large datasets and high dimensionality.
6. Deployement:

FIGURE 27: U SER INTERFACE DROPOUT R ISK

The prediction model was successfully integrated into a web application developed with
Angular. This interface allows administrators to view a list of training participants along with
real-time dropout risk predictions. The deployment uses a Flask API backend that serves the
XGBoost model, and displays the result in the “Risk” column of the participant table,
indicating whether each user is “At Risk” or “Not at Risk.” The integration enables instant
classification based on profile data such as grade, platform activity, sentiment score, and
profile completeness. This deployment supports decision-making by helping instructors or
program managers identify vulnerable participants and intervene early. The interface is clean,
responsive, and ready for production use, with possibilities for automated alerts and
continuous model updates as new data becomes available.
The objective of this DSO was to build a predictive model capable of identifying participants
at risk of dropping out from a training program. Multiple machine learning models were
tested, and XGBoost was selected for its superior performance and interpretability.
The results showed that variables such as grade, profile completeness, platform activity,
and sentiment score were the most influential in predicting dropout. The final model
achieved a high AUC score and demonstrated strong generalization capabilities.
General Conclusion

This AI sprint project successfully demonstrated the application of


machine learning techniques in addressing key educational challenges,
particularly in enhancing student engagement, predicting dropout risks,
and offering personalized academic and professional recommendations.
By structuring our work around the CRISP-DM methodology and distinct
Data Science Objectives (DSOs), we ensured a coherent, modular, and
goal-driven approach throughout the project lifecycle.
The use of supervised models such as XGBoost and K-Nearest Neighbors,
combined with unsupervised methods like clustering and NLP-based
sentiment analysis, enabled us to generate actionable insights and
intelligent predictions. Our models not only identified students at risk but
also offered recommendations tailored to their unique profiles, supporting
institutional goals related to retention and satisfaction.
Furthermore, the integration of these models into a real-time web
application through Flask and Angular bridged the gap between predictive
analytics and operational decision-making. The deployed system provided
a user-friendly interface for academic staff, reinforcing the project’s
practical impact.
Overall, this project illustrates how AI can be harnessed to deliver
personalized, scalable, and proactive support in educational settings.
Future work could focus on expanding the dataset, refining models with
additional behavioral and social data, and implementing continuous
feedback loops to improve system performance over time.

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy