Machine Learning
AI Sprint
Performed by:
Team Dev Dynamos
Team Members:
Dorsaf RIAHI
Wissal DAOUD
Slim-Fady HANAFI
Raslen FERCHICHI
Saif ROMDHANI
Mohamed Aziz BENDEG
Supervised by: Mrs. Jihen HLEL
Contents
Chapter 1 – Project Context and Problem Framing
    1.1 Introduction
    1.2 Problem Statement
    1.3 Proposed Solution
Chapter 2 – Project Structure and Methodology
Chapter 3 – Tools and Technologies Used
    Introduction
Chapter 4 – Business Understanding and Project Objectives
General Conclusion
List of Figures
1.1 Introduction:
In today's data-driven world, information has become a key strategic asset.
Artificial Intelligence (AI) and Machine Learning (ML) are now essential tools for
extracting value from data, providing powerful levers for analysis, prediction,
and decision-making. This project is aligned with that dynamic, aiming to
harness the capabilities of machine learning to address concrete analytical
objectives.
Our work relies on a real-world dataset that includes detailed information on
student profiles, internship opportunities, and academic journeys. By applying
appropriate data processing and modeling techniques, we strive to build
predictive models that support insightful decision-making in educational and
career-oriented contexts.
Throughout this report, we follow the CRISP-DM methodology to ensure a clear
and rigorous presentation of our AI sprint. Each stage — from business
understanding to deployment — is carefully detailed with relevant figures,
clean visualizations, and scientifically grounded references, ensuring
coherence, clarity, and value throughout the entire document.
1.2 Problem Statement:
Chapter 3 – Tools and Technologies Used
Introduction:
1. Business Understanding
1.1 Project Objective
The central goal of this project is to enhance student engagement and retention
by personalizing their academic and professional journeys through Machine
Learning techniques. In a higher education environment where dropout rates and
misaligned career paths are major concerns, data-driven personalization has the
potential to significantly improve student satisfaction, support services, and
career outcomes.
This aligns with the broader mission of modern educational institutions: to guide
students effectively, increase retention, and match learners with the most
suitable opportunities, whether academic or professional.
1.2 Defined Data Science Objectives (DSOs):
To support the overarching goal of increasing student retention and
personalization through machine learning, six complementary Data Science
Objectives (DSOs) were defined. Each DSO targets a specific challenge within
the student journey, from understanding profiles to anticipating disengagement.
The table below summarizes each objective:
DSO | Description
2. Internship Offer Segmentation | Group internship offers based on shared characteristics to better match students with suitable opportunities.
5. Academic Dropout Risk Prediction | Predict the likelihood of academic dropout using classification models, allowing early intervention and support.
Table 2: Data Science Objectives (DSOs)
2. Data Understanding
This section presents the data exploration performed for each defined Data
Science Objective (DSO). Each team member worked independently and
applied Exploratory Data Analysis (EDA) to extract meaningful insights and
prepare the data for modeling.
2.1 DSO 1 – Personalized Internship Recommendation:
Objective
This module aims to build an intelligent system that recommends internship
offers to students based on their academic background and application
preferences. The core goal is to match students with the most relevant
opportunities using similarity-based machine learning techniques.
2.1.1 Data Understanding:
The initial dataset includes detailed information about:
Student IDs and features extracted from their academic path.
Internship offers, including titles, descriptions, locations, required skills,
and stipends.
Historical data such as past applications and acceptances.
Key Observations:
Most internships are located in major cities and sectors like Data Science,
Web Development, and Marketing.
The stipend (Stipend) distribution was highly skewed, with some unpaid
or extremely low-paid internships.
Text features such as internship titles and skill descriptions were rich in
semantic information, justifying NLP techniques.
Figure 2: Distribution of Stipends
Figure 3: Top 10 Skills
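The two figures above can be reproduced with a short exploratory script. The sketch below is illustrative only; it assumes the offers are loaded into a pandas DataFrame with a numeric Stipend column and a comma-separated Skills column, and the file name is hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

offers = pd.read_csv("internship_offers.csv")  # hypothetical file name

# Figure 2: stipend distribution (highly right-skewed, many unpaid offers)
offers["Stipend"].plot(kind="hist", bins=50, title="Distribution of Stipends")
plt.xlabel("Stipend")
plt.show()

# Figure 3: ten most frequent required skills
skills = offers["Skills"].str.split(",").explode().str.strip()
skills.value_counts().head(10).plot(kind="barh", title="Top 10 Skills")
plt.show()
```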
2.1.2 Data Preparation:
To prepare the dataset for modeling, several preprocessing steps were
implemented to ensure the quality and consistency of the data used for the
recommendation engine. First, a text preprocessing pipeline was applied
to internship titles and skills, including lowercasing, stop word removal,
and stemming. This was followed by the application of TF-IDF
vectorization, which transformed the textual data into numerical vectors
capturing the importance of each term across all internship descriptions.
In parallel, feature engineering was performed to encode relevant
categorical variables such as city, internship duration, and offer category.
These features were then combined with the text-based vectors to
construct a comprehensive student-offer profile vector.
To measure the similarity between students and internship offers, a cosine
similarity matrix was computed using the TF-IDF vectors. This allowed
the system to quantify how closely each student’s profile matched each
offer. Finally, a Top-N retrieval mechanism was implemented using K-
Nearest Neighbors (KNN) to extract the top 5 most relevant internship
recommendations for each student based on the similarity scores.
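A minimal sketch of this pipeline is shown below, covering TF-IDF vectorization, the cosine similarity matrix, and Top-5 retrieval with K-Nearest Neighbors. Column names such as Title_clean and Skills_clean are assumptions standing in for the cleaned text produced by the preprocessing step, not the project's exact code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors

# Combine the cleaned title and skills text of each internship offer
corpus = (offers["Title_clean"] + " " + offers["Skills_clean"]).tolist()

# TF-IDF turns each offer into a weighted term vector
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(corpus)

# Cosine similarity between all profile/offer vectors
similarity = cosine_similarity(tfidf_matrix)

# Top-5 retrieval with KNN on the TF-IDF vectors
# (6 neighbors: the item itself plus its 5 closest matches)
knn = NearestNeighbors(n_neighbors=6, metric="cosine")
knn.fit(tfidf_matrix)
distances, indices = knn.kneighbors(tfidf_matrix)
top5_indices = indices[:, 1:]  # drop the first column (the item itself)
```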
To assess the quality of the recommendation engine, a custom evaluation
metric was defined: the mean cosine similarity of the Top-5
recommended items. The function mean_top_similarity() computes, for
each job offer, the average similarity between it and its top 5 closest
recommendations (excluding itself). This method provides a reliable
measure of how semantically relevant the recommendations are. The
global average over the dataset offers a solid indicator of model
performance.
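A possible implementation of this metric, consistent with the description above and reusing the similarity matrix from the previous sketch, could look like the following (the exact project code may differ):

```python
import numpy as np

def mean_top_similarity(similarity, k=5):
    """Average cosine similarity between each offer and its k closest
    recommendations (excluding itself), averaged over the whole dataset."""
    scores = []
    for i in range(similarity.shape[0]):
        row = similarity[i].copy()
        row[i] = -np.inf                # exclude the offer itself
        top_k = np.sort(row)[-k:]       # k highest similarities
        scores.append(top_k.mean())
    return float(np.mean(scores))

print(f"Mean Top-5 similarity: {mean_top_similarity(similarity):.3f}")
```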
📝 Observation: The data is dense around the center, forming a nearly bell-
shaped curve, though slightly skewed to the right.
This insight is important for later modeling steps, as it highlights potential
outliers and helps understand the availability landscape for internship
candidates.
Distribution of Hackathon Duration
The histogram above shows the distribution of the duree_jours variable, which
represents the duration of hackathons (or internships) in days. The data appears
evenly spread across the range from 1 to 10 days, with no significant skewness
or concentration at a particular value.
Figure 10: Distribution of Internship Durations
4. Modeling
In this phase, we built and evaluated machine learning models to predict the risk
of academic dropout. Two classification algorithms were implemented and
compared: K-Nearest Neighbors (KNN) and Support Vector Machine
(SVM).
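The comparison can be set up as in the sketch below. The DataFrame, file name, and feature set are assumptions; the target column at_risk follows the definition given later in this report.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report

df = pd.read_csv("participants.csv")  # hypothetical file name
X, y = df.drop(columns=["at_risk"]), df["at_risk"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Both models are scaled pipelines, since KNN and SVM are distance/margin based
models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True)),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test)))
```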
6. Deployment
After completing the evaluation phase, the final step of the CRISP-DM process
involves deploying the predictive system into a real educational setting. The
objective is to transform the machine learning model into a fully operational
application that supports academic staff in identifying at-risk students and taking
timely, data-driven actions.
In our case, the trained KNN and SVM models were deployed through a Flask-
based REST API, enabling seamless integration with a front-end interface. This
setup allows real-time predictions using new student data submitted through a
web form or dashboard.
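A minimal sketch of such a Flask endpoint is shown below; the route name, payload fields, and model file are illustrative assumptions rather than the exact deployed code.

```python
from flask import Flask, request, jsonify
import joblib
import pandas as pd

app = Flask(__name__)
model = joblib.load("dropout_model.pkl")  # trained KNN or SVM pipeline

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON object with the student's feature values
    features = pd.DataFrame([request.get_json()])
    prediction = int(model.predict(features)[0])  # 0 = low risk, 1 = high risk
    return jsonify({"dropout_risk": prediction})

if __name__ == "__main__":
    app.run(debug=True)
```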
The deployment solution includes the following components:
Automated Prediction Interface: Academic staff or system administrators
can input student records through the front-end interface. The Flask API
processes the data, runs the trained model, and returns a prediction
indicating dropout risk (0 = low risk, 1 = high risk).
Interactive Dashboard: The front-end displays prediction results clearly,
along with visual indicators such as risk level, confidence score, and
contextual explanations. This helps non-technical users (e.g., advisors)
interpret the results effectively.
Alert and Notification System: When a student is predicted to be at high
risk, the system can trigger alerts (e.g., banners, color flags) on the
dashboard to prompt academic follow-up or guidance intervention.
Continuous Learning via Feedback Loop: The system architecture allows
for future enhancements where outcomes (e.g., success or dropout status)
can be fed back into the model pipeline. This enables model retraining
and continuous improvement based on real student behavior.
By deploying the solution via Flask and integrating it into a user-friendly front-
end, the system bridges the gap between machine learning output and actionable
educational decisions. It demonstrates how AI can be effectively applied to
support institutional goals such as student retention and success.
User Interface Integration
To provide a user-friendly and accessible way for academic staff to interact with
the prediction model, a web interface was developed and integrated into the
existing educational platform. This front-end module allows users to input key
parameters related to a hackathon or training session—such as location, number
of seats, duration, difficulty level, and importance level—and instantly receive a
prediction on the probability of success.
Once the user clicks the "Tester" button, the data is sent to the back-end Flask
API, which processes the input using the trained machine learning model (KNN
or SVM). The prediction is then returned and displayed dynamically in the
interface, indicating either:
✅ Succès (success): if the probability of success exceeds the model threshold (e.g., 62.84% in the second example).
❌ Échec (failure): if the predicted likelihood of success is lower (e.g., 43.21%).
The result box is color-coded (green for success, red for failure) to ensure instant visual feedback, and can optionally be followed by recommendations or next steps.
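For illustration, the kind of request sent when "Tester" is clicked might look like the sketch below, shown here with Python's requests library; the endpoint URL, route name, and field names are assumptions, and the response format is only an example.

```python
import requests

payload = {
    "location": "Tunis",
    "seats": 30,
    "duration_days": 5,
    "difficulty": 3,
    "importance": 2,
}
response = requests.post("http://localhost:5000/predict_success", json=payload)
print(response.json())  # e.g. {"success_probability": 0.6284, "label": "Succès"}
```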
This interface demonstrates how artificial intelligence can be operationalized in
real time, offering valuable decision support in educational planning and student
engagement strategies.
1. Business Understanding
The objective of this DSO is to predict whether a training program participant is at
risk of dropping out (at_risk = 1). This predictive model helps improve training
retention, personalize follow-up, and enhance participant engagement.
2. Data Understanding:
To better understand the relationships between input features and the
target variable (at_risk), a Pearson correlation matrix was generated.
The strongest negative correlations with at_risk were observed for:
grade (-0.38)
profil_complet (-0.37)
score_activite (-0.34)
These findings suggest that academic performance, engagement, and
user profile completeness are important factors in dropout risk.
Other features like nb_sessions, note_theorique, or note_pratique
showed weak correlation individually, but may still contribute when
combined in a predictive model.
This correlation analysis helps prioritize features before modeling and
supports the feature selection strategy used in the next phase.
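The correlation analysis itself can be reproduced with a few lines of pandas; the sketch below assumes the participant data is loaded from a hypothetical CSV file with the columns named above.

```python
import pandas as pd

df = pd.read_csv("participants.csv")  # hypothetical file name
correlations = df.corr(numeric_only=True)["at_risk"].drop("at_risk")
print(correlations.sort_values())
# Expected strongest negative correlations: grade, profil_complet, score_activite
```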
4. Modeling:
Several classification algorithms were implemented and evaluated, including K-Nearest
Neighbors (KNN), Support Vector Machine (SVM), Decision Tree, and Logistic Regression.
After comparative analysis, XGBoost was selected as the final model due to its strong
performance, regularization ability, and compatibility with mixed data types.
To confirm its learning capacity, an initial ROC curve was plotted during training, showing an
AUC score of 1.00 on the training set, as seen below:
Figure 25: ROC Curve of XGBoost on the Training Set
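An illustrative way to fit the XGBoost classifier and plot this training-set ROC curve is sketched below; hyperparameters, file and variable names are assumptions, not the project's exact configuration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import RocCurveDisplay
from xgboost import XGBClassifier

df = pd.read_csv("participants.csv")  # hypothetical file name
X, y = df.drop(columns=["at_risk"]), df["at_risk"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

xgb = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                    eval_metric="logloss", random_state=42)
xgb.fit(X_train, y_train)

# ROC on the training set; an AUC of 1.00 here mainly reflects fit to the
# training data, so held-out metrics remain the reference for evaluation
RocCurveDisplay.from_estimator(xgb, X_train, y_train)
```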
To evaluate model behavior with increasing data, a learning curve was plotted for the K-
Nearest Neighbors (KNN) model.
Figure 26: Learning Curve for KNN
The chart shows that while the training accuracy gradually improves with more samples, the
validation accuracy remains relatively low and unstable (around 0.67–0.68). This indicates
that the KNN model is overfitting the training data and fails to generalize properly, especially
with large datasets and high dimensionality.
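A learning curve of this kind can be generated as sketched below, reusing X and y from the modeling sketch above; the train sizes and cross-validation settings are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

train_sizes, train_scores, val_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=5), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5, scoring="accuracy"
)
plt.plot(train_sizes, train_scores.mean(axis=1), label="Training accuracy")
plt.plot(train_sizes, val_scores.mean(axis=1), label="Validation accuracy")
plt.xlabel("Number of training samples")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```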
6. Deployment:
The prediction model was successfully integrated into a web application developed with
Angular. This interface allows administrators to view a list of training participants along with
real-time dropout risk predictions. The deployment uses a Flask API backend that serves the
XGBoost model, and the interface displays the result in the “Risk” column of the participant
table, indicating whether each user is “At Risk” or “Not at Risk.” The integration enables instant
classification based on profile data such as grade, platform activity, sentiment score, and
profile completeness. This deployment supports decision-making by helping instructors or
program managers identify vulnerable participants and intervene early. The interface is clean,
responsive, and ready for production use, with possibilities for automated alerts and
continuous model updates as new data becomes available.
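For illustration, the trained model could be persisted and mapped to the labels shown in the “Risk” column as sketched below; the file name and the example feature values are assumptions, and xgb refers to the fitted model from the modeling sketch above.

```python
import joblib
import pandas as pd

joblib.dump(xgb, "xgb_dropout_model.pkl")      # after training
model = joblib.load("xgb_dropout_model.pkl")   # at API startup

# Example participant record with the features described above (values illustrative)
participant = pd.DataFrame([{
    "grade": 11.5, "profil_complet": 0.6,
    "score_activite": 0.4, "sentiment_score": 0.2,
}])
label = "At Risk" if model.predict(participant)[0] == 1 else "Not at Risk"
```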
The objective of this DSO was to build a predictive model capable of identifying participants
at risk of dropping out from a training program. Multiple machine learning models were
tested, and XGBoost was selected for its superior performance and interpretability.
The results showed that variables such as grade, profile completeness, platform activity,
and sentiment score were the most influential in predicting dropout. The final model
achieved a high AUC score and demonstrated strong generalization capabilities.
General Conclusion