Assignment 1 Predict Student Success
URL to dataset:
https://archive-beta.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success
Assignment 1a (due Week 2): you should complete the following sections ONLY:
Overview (Objective)
Assignment 1b (due Week 3): all sections of this template should be completed. Modifications of the
three sections submitted in Assignment 1a should be made based on feedback from the instructor.
This template should be used in conjunction with the assignment instructions. The size of the text area
below will expand to the length of your response; the area should not be interpreted as a required or
suggested length of response. Responses within the text area should be single spaced with Times New
Roman 12pt font. The body of the document will likely be 6-9 pages, not including the Appendix; length
may vary depending on specifics of the analysis and the dataset. As needed, APA format in-text citations
should be included, along with a full references list at the end of the document.
Overview
Problem Domain: give some background and context about the problem domain (application area).
For instance, if you are doing the analysis for predicting heart disease, provide some context about the
disease and include some interesting statistics about it. Also, discuss how the method is relevant for the
chosen problem.
In education, understanding the factors that contribute to student success is important for
designing effective educational policies. By leveraging data and statistical models, such as
logistic regression, we can gain insight into the predictors that significantly influence student
outcomes.
Student success refers to academic achievement and the ability to meet goals, such as obtaining
high grades or graduating on time. The goal of this project is to identify patterns and
relationships between various attributes or characteristics of students and their likelihood of
succeeding academically. Many schools also receive funding and grants based on student success
and graduation rates.
Logistic regression is particularly relevant for this problem as it is well-suited for binary
classification tasks, where the outcome variable is categorical and takes one of two possible
values. In our case, the outcome variable will represent whether a student is successful (1) or not
(0). By conducting a logistic regression analysis, we can uncover important insights about the
factors that contribute to student success. This information can inform targeted interventions,
resource allocation, and personalized support systems to improve educational outcomes.
Ultimately, the goal is to leverage data-driven approaches to enhance student success rates and
promote a more equitable and inclusive educational system.
Objective: clearly state the objective of the analysis in relation to the kind of algorithm you are
employing. Use specific language as to what question(s) you are trying to answer using the specific
analysis/modeling type.
The objective of this analysis is to develop a logistic regression model to predict student
success based on relevant attributes and variables. The logistic regression model will allow us
to answer the following questions:
Which factors significantly influence student success? How accurately can we predict student
success? Which variables have the strongest predictive power?
Through the logistic regression model, we can determine the variables that contribute the most
to the prediction of student success. This information can guide educational institutions in
prioritizing interventions and support systems.
We can then provide educational stakeholders with actionable insights to improve student
success rates. The logistic regression model will serve as a tool to identify the most influential
factors, assess the predictive accuracy, and enhance our understanding of the relationships
between various student attributes and academic outcomes.
Analysis
Exploratory Analysis: describe the data including the source, the collection method, and variables.
Perform exploratory analysis. Also, select few key variables (including the target variable for
supervised learning) and study their distributions using plots such as histograms, box plot, bar chart,
etc.
The data was aggregated from several university data sets and includes information known by
the university at the time of enrollment as well as student success after the first and second
semesters. Key variables include Marital Status, Previous Education Level, Parents' Education
Level, Parental Occupations, Time Attended Class (Morning, Evening), 1st and 2nd Semester
Courses and Grades, as well as GDP, Unemployment, and other student factors such as gender,
age, special needs, tuition, etc.
Figure 1.1 Shows the Frequency of data that falls into the three target categories.
Based on my exploratory data analysis, I can see that most students graduated, but more
dropped out than remained enrolled. Most students are younger than 25 and single. Most had
finished secondary education but had not completed another degree course. Using box plots, I
can see that while grade averages for the 1st and 2nd semesters remained similar, there were
students who received zeros. Additionally, very few students attended night classes, which could
indicate they were full-time students.
Figure 1.2 Shows marital status at time of enrollment. 1 – single 2 – married 3 – widower 4 –
divorced 5 – de facto union 6 – legally separated.
schooling 14 - 10th year of schooling 15 - 10th year of schooling - not completed 19 - Basic
education 3rd cycle (9th/10th/11th year) or equiv. 38 - Basic education 2nd cycle (6th/7th/8th
year) or equiv. 39 - Technological specialization course 40 - Higher education - degree (1st
cycle) 42 - Professional higher technical course 43 - Higher education - master (2nd cycle)
Figure 1.4 Shows number of students who attended night classes (Label 0) or daytime classes
(Label 1).
Figure 1.5 First semester grades for students enrolled in the course
Figure 1.6 Second semester grades for students enrolled in the same course as the first
semester.
My exploratory analysis also helped me to determine that I will most likely need to do some
data processing to improve the readability of my charts. Many of the variables in this data set
use integers to represent categorical data, which can be hard to read without referencing the
data set information to understand what category each integer represents. Using Python, I can
easily convert the values back and forth depending on the need.
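The conversion between integer codes and readable labels can be sketched as follows. This is a minimal illustration, not the code used for the figures: the toy data frame and the column name "Marital status" are assumptions, and the code-to-label mapping is taken from the Figure 1.2 caption.

```python
import pandas as pd

# Marital status codes as listed in the Figure 1.2 caption.
MARITAL_STATUS = {
    1: "single",
    2: "married",
    3: "widower",
    4: "divorced",
    5: "de facto union",
    6: "legally separated",
}

# Toy frame standing in for the real data set (values are made up,
# and the column name is an assumption).
toy_df = pd.DataFrame({"Marital status": [1, 2, 1, 4]})

# Integer code -> readable label, useful for chart axes.
toy_df["Marital status (label)"] = toy_df["Marital status"].map(MARITAL_STATUS)

# Readable label -> integer code, useful before modeling.
label_to_code = {label: code for code, label in MARITAL_STATUS.items()}
round_trip = toy_df["Marital status (label)"].map(label_to_code)

print(toy_df)
```

Keeping the mapping in one dictionary means the same lookup can serve both directions, so chart labels and model inputs stay consistent.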
Preprocessing: armed with the exploratory analysis, perform the necessary preprocessing, both general
and specific types appropriate for the modeling type being employed.
To make the provided logistic regression testing and validation work, I converted the strings
in the Target column of the data set into numerical values, mapping Dropout to 0, Graduate
to 1, and Enrolled to 2. I then dropped the Enrolled students, as they can still either
graduate or drop out. Doing so resulted in losing roughly 700 rows of data but leaves me
with a training set of a little over 2900 rows and a test set of a little over 700.
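The preprocessing steps described above can be sketched on a small stand-in table. The ten rows below are made up for illustration; the mapping, the 80/20 split, and random_state=4 follow the appendix code, while the single feature column is only an assumption to keep the example short.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real file; these rows are invented for illustration.
toy_df = pd.DataFrame({
    "Age at enrollment": [19, 20, 35, 18, 22, 41, 19, 24, 27, 20],
    "Target": ["Graduate", "Dropout", "Enrolled", "Graduate", "Dropout",
               "Enrolled", "Graduate", "Graduate", "Dropout", "Graduate"],
})

# Map the string labels to integers, then drop the unresolved 'Enrolled' rows.
toy_df["Target"] = toy_df["Target"].map({"Dropout": 0, "Graduate": 1, "Enrolled": 2})
binary_df = toy_df[toy_df["Target"] != 2]

X = binary_df[["Age at enrollment"]].to_numpy()
y = binary_df["Target"].to_numpy()

# Hold out 20% of the remaining rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)
print("Rows kept:", len(binary_df))
print("Train set:", X_train.shape, "Test set:", X_test.shape)
```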
Model Fitting: explain the key steps and activities you perform to fit the model. Experiment (as
appropriate) with parameter tuning. This is key: what separates a highly accurate model from less
accurate ones is the amount of performance tuning performed.
Next, I dropped the Parents' Qualifications and Parents' Occupations features to see the effect on
accuracy. Since accuracy did not change, these features appear to add little predictive value for
our Target in this model. I then removed Marital Status, Unemployment, and GDP, which had
minimal effect on accuracy: none of the scores changed by a meaningful amount, although False
Positives increased by 2 and True Positives increased by 2.
This has left me with the Previous Qualification (Degree), Age, 1st Sem Grades, and 2nd Sem
Grades as features. I ran the model again dropping each of these to confirm if they were key
features or not.
I found that 2nd Sem Grades had the biggest impact on accuracy and decided to add in the
Enrolled credit amounts for the 1st and 2nd Semesters.
Previous Qualification seemed to have no impact and was removed. Age and 1st Sem Grades
had minimal impact but were retained. After adding Enrolled credits, the accuracy scores did
not change, so they were not kept in the final model.
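The drop-one-feature procedure described above can be sketched as a small loop. Everything in this example is an assumption made for illustration: the data is synthetic, the feature names are placeholders, and the helper holdout_accuracy is hypothetical. Only the model settings (C=0.01, solver='liblinear') and the 80/20 split mirror the appendix. Unlike the appendix, the scaler here is fitted on the training rows only, which is one common way to avoid leaking test information.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (made up): the outcome depends mostly on grade_2.
rng = np.random.default_rng(0)
n_students = 600
age = rng.normal(23, 6, n_students)
grade_1 = rng.normal(12, 3, n_students)
grade_2 = 0.8 * grade_1 + rng.normal(2.4, 2, n_students)
unrelated = rng.normal(0, 1, n_students)
log_odds = 0.9 * (grade_2 - 12) - 0.1 * (age - 23)
y = (rng.random(n_students) < 1 / (1 + np.exp(-log_odds))).astype(int)

names = ["age", "grade_1", "grade_2", "unrelated"]
X = np.column_stack([age, grade_1, grade_2, unrelated])


def holdout_accuracy(features, target):
    """Fit the model on 80% of the rows and return accuracy on the other 20%."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.2, random_state=4
    )
    scaler = StandardScaler().fit(X_train)
    model = LogisticRegression(C=0.01, solver="liblinear")
    model.fit(scaler.transform(X_train), y_train)
    return model.score(scaler.transform(X_test), y_test)


baseline = holdout_accuracy(X, y)
print(f"All features: {baseline:.3f}")

# Remove one feature at a time and compare against the baseline.
accuracy_without = {}
for index, name in enumerate(names):
    reduced = np.delete(X, index, axis=1)
    accuracy_without[name] = holdout_accuracy(reduced, y)
    print(f"Without {name}: {accuracy_without[name]:.3f}")
```

A single train/test split can make small differences look larger or smaller than they are, so repeating the comparison with cross-validation would give a steadier picture.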
Results
Model Properties: explain the components of the fitted model and their characteristics. Leverage
functions to summarize the model properties. Also, leverage visualization as required.
My final components are Age at Enrollment, 1st Sem Grades, and 2nd Sem Grades.
The fitted model gives the log-odds of graduating as a linear function of the standardized features:
log-odds(Graduate) = -0.37(Age) + 0.39(1st Sem Grades) + 0.96(2nd Sem Grades) + 0.34
This aligns with my model testing, in which 2nd Sem Grades had the biggest influence, with Age and
1st Sem Grades having smaller effects.
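To show how these coefficients translate into a prediction, the sketch below passes the linear score through the logistic (sigmoid) function. The coefficients and intercept are the ones reported above; the function name graduation_probability and the example inputs are hypothetical. Inputs are assumed to be standardized values (z-scores), since the appendix scales the features before fitting.

```python
import math

# Coefficients and intercept as reported in the Model Properties section.
COEFFICIENTS = {"age": -0.37, "grade_1": 0.39, "grade_2": 0.96}
INTERCEPT = 0.34


def graduation_probability(z_scores):
    """Turn standardized feature values into a predicted probability of graduating."""
    log_odds = INTERCEPT + sum(
        COEFFICIENTS[name] * z_scores[name] for name in COEFFICIENTS
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


# A student who is exactly average on every feature (all z-scores are 0).
average = graduation_probability({"age": 0.0, "grade_1": 0.0, "grade_2": 0.0})

# The same student with 2nd semester grades one standard deviation higher.
stronger = graduation_probability({"age": 0.0, "grade_1": 0.0, "grade_2": 1.0})

# exp(coefficient) is the multiplicative change in the odds of graduating
# for a one standard deviation increase in that feature.
odds_ratios = {name: math.exp(value) for name, value in COEFFICIENTS.items()}

print(f"Average student: {average:.3f}")
print(f"Higher 2nd semester grades: {stronger:.3f}")
print(odds_ratios)
```

Reading the coefficients as odds ratios makes the comparison concrete: a value above 1 raises the odds of graduating, and a value below 1 lowers them.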
Figure 2.1 Shows a bar chart of the coefficients of the final model
Output Interpretation: explain the result and interpret the final model output using terms that reflect
the application area and in relation to the stated objective. This is where you check whether or not the
stated objective is met.
Evaluation: employ appropriate metrics to quantitatively evaluate the performance of the
fitted model. For supervised classification, this includes simple accuracy, precision & recall
(or sensitivity & specificity), all of which can be generated from a confusion matrix, or ROC.
Figure 2.2 Confusion Matrix for the final model
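As a sketch of how the metrics named above follow from a confusion matrix, the example below computes them from four counts. The counts are made-up placeholders, not the results of this model; "positive" is assumed to mean Graduate (1).

```python
# Hypothetical confusion-matrix counts (NOT this model's results).
true_positives = 50   # graduates predicted as graduates
false_negatives = 10  # graduates predicted as dropouts
false_positives = 5   # dropouts predicted as graduates
true_negatives = 35   # dropouts predicted as dropouts

total = true_positives + false_negatives + false_positives + true_negatives

# Share of all predictions that were correct.
accuracy = (true_positives + true_negatives) / total

# Of the students predicted to graduate, how many actually did.
precision = true_positives / (true_positives + false_positives)

# Of the students who actually graduated, how many were found (sensitivity).
recall = true_positives / (true_positives + false_negatives)

# Of the students who actually dropped out, how many were found.
specificity = true_negatives / (true_negatives + false_positives)

# Harmonic mean of precision and recall.
f1_score = 2 * precision * recall / (precision + recall)

# Jaccard index for the positive class: overlap divided by union.
jaccard = true_positives / (true_positives + false_positives + false_negatives)

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("specificity", specificity),
                    ("F1", f1_score), ("Jaccard", jaccard)]:
    print(f"{name}: {value:.3f}")
```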
Conclusion
Summary: highlight the main findings in relation to the stated objective. You don’t need to
discuss the details of the analysis and the model such as accuracy here, just focus on the key
findings.
I was surprised to find that parents' education and occupation, as well as whether a student had
a previous degree, had such a negligible effect on the final model. The key finding of this
model is that grades have the biggest impact on graduation. Age also affects graduation, which
makes sense, as older students tend to have more life events that can get in the way of success.
Limitations & Improvement areas: discuss the limitations of the analysis and identify
potential improvement areas for future work. This could be related to the data, algorithm, or a
combination of the two.
There are many limitations to this analysis. The first is measuring student success by
graduation alone. Nothing in the data points to the success of students who drop out and become
entrepreneurs. A much better metric for student success would be income and/or net worth
measured after set periods of time. This would allow students to figure out what they
can do to improve their economic standing, as well as predict how long it will take them to pay
off a given degree. Additionally, this data was collected from multiple universities, but
there is no indication of which ones. Based on the funding and the creators' locations, it is
reasonable to assume this data is from Portugal or, at best, various European universities. This
restricts the analysis, as the model is only applicable to students similar to those
found in the data.
Appendix
# %% [markdown]
# ## We are going to create a machine learning model to determine whether a student will graduate or drop out, so that the institution can take proactive action to support at-risk students.
#
# # What is the difference between linear regression and logistic regression?
# ## Linear regression is appropriate for predicting dependent variables that are composed of continuous values (e.g., predicting house prices), but it is inappropriate for predicting dependent variables that are categorical (e.g., yes or no, true or false, etc.).
#
# # Python libraries
# ## Pandas: "Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license" (Wikipedia, 2023).
#
# ## Numpy: "NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays" (Wikipedia, 2023).
#
# ## Scikit-Learn: "Scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support-vector machines, ..." (Wikipedia, 2023).
#
# ## Matplotlib: "Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK" (Wikipedia, 2023).
#
# %%
#Import libraries that are required for the creation of the machine learning model
import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# %% [markdown]
# #Dataset
# ## The dataset that we will be using is a collection of student information used to predict student success.
# %%
#Load the data from a CSV file
data_df = pd.read_csv("data.csv", delimiter=';')
# Inspect the data type pandas inferred for each column
column_data_types = data_df.dtypes
print(column_data_types)
# %% [markdown]
# # Data Pre-Processing
# %%
# Convert strings in 'Target' column to numeric values
data_df['Target'] = data_df['Target'].map({'Dropout': 0, 'Graduate': 1, 'Enrolled': 2})
# Drop data for students still in 'Enrolled' status as they are not completed results
data_df = data_df[data_df['Target'] != 2]
# %% [markdown]
# # Exploratory Analysis
# ## We will conduct exploratory analysis using histograms and bar charts of various key variables.
#
# %%
# Plot a histogram of the 'Target' variable
plt.figure(figsize=(8, 6))
sns.histplot(data=data_df, x='Target')
plt.title('Histogram of Target')
plt.xlabel('Target')
plt.ylabel('Frequency')
plt.show()
plt.figure(figsize=(8, 6))
sns.histplot(data=data_df, x='Mother\'s qualification')
plt.title('Histogram of Mother\'s qualification')
plt.xlabel('Mother\'s qualification')
plt.ylabel('Frequency')
plt.show()
plt.figure(figsize=(10, 8))
sns.countplot(data=data_df, x='Previous qualification')
plt.title('Bar Chart of Previous Qualification')
plt.xlabel('Previous Qualification')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
# %% [markdown]
# # In this step, we need to define our X and our Y. X = Features or independent variables and Y = Dependent variable or target vector
# %%
X = np.asarray(data_df[['Age at enrollment', 'Curricular units 1st sem (grade)', 'Curricular units 2nd sem (grade)']])
X[0:5]
# %%
y = np.asarray(data_df['Target'])
y[0:5]
# %% [markdown]
# #In this step, we normalize our dataset.
# %%
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
# %% [markdown]
# # In this step we split the dataset into train/test sets
# %%
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape, y_train.shape)
print ('Test set:', X_test.shape, y_test.shape)
# %%
from sklearn.linear_model import LogisticRegression
# You can experiment with these solvers to determine if they can yield greater accuracy: 'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR
# %%
yhat = LR.predict(X_test)
yhat
# %% [markdown]
# # Let's evaluate our machine learning model
#
# ## Jaccard index
# ### The Jaccard index is the size of the intersection of the predicted and true label sets divided by the size of their union. For a binary label this is TP / (TP + FP + FN); a score of 1.0 means the predictions match the true labels exactly.
# %%
from sklearn.metrics import jaccard_score
jaccard_score(y_test, yhat, pos_label=0)
# %% [markdown]
# ### Log loss (logarithmic loss) measures the performance of a classifier where the predicted output is a probability value between 0 and 1. "The more the predicted probability diverges from the actual value, the higher is the log-loss value" (Gaurav Dembla, 2020).
#
# Reference
# Gaurav Dembla. (2020). Intuition behind Log-loss score. Retrieved on March 19, 2023 from https://towardsdatascience.com/intuition-behind-log-loss-score-4e0c9979680a
# %%
from sklearn.metrics import log_loss
yhat_prob = LR.predict_proba(X_test)
log_loss(y_test, yhat_prob)
# %% [markdown]
# # Confusion matrix
# %%
from sklearn.metrics import classification_report, confusion_matrix
import itertools
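def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        # Convert each row of counts into proportions of the true class
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    # Draw the matrix as a heat map and label both axes with the class names
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
print(confusion_matrix(y_test, yhat, labels=[1,0]))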
# %%
# Compute confusion matrix
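cnf_matrix = confusion_matrix(y_test, yhat, labels=[1,0])
np.set_printoptions(precision=2)
# Plot the non-normalized confusion matrix (class order follows labels=[1,0])
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['Graduate=1', 'Dropout=0'], normalize=False,
                      title='Confusion matrix')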
# %% [markdown]
# # Now, lets calculate the precision and recall
# %%
print (classification_report(y_test, yhat))
# %% [markdown]
# ### Based on the count of each section, we can calculate precision and recall of each label:
#
# ### Precision is a measure of the accuracy provided that a class label has been predicted. It is defined by: precision = TP / (TP + FP)
#
# ### Recall is the true positive rate. It is defined as: Recall = TP / (TP + FN)
#
# ### So, we can calculate the precision and recall of each class.
#
# ### F1 score: Now we are in the position to calculate the F1 scores for each label based on the precision and recall of that label.
#
# ### The F1 score is the harmonic average of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0. It is a good way to show that a classifier has a good value for both recall and precision.
# %% [markdown]
# # Model Characteristics
# %%
intercept = LR.intercept_
coefficients = LR.coef_
print("Intercept:", intercept)
print("Coefficients:", coefficients)
# %% [markdown]
# ### Graphing the Coefficients
# %%
# Reshape the coefficients array
coefficients = coefficients.reshape(-1)
References