
DATA 430 Technical Report Josh Short

Assignment 1 (a & b): Logistic Regression

Utilizing Logistic Regression to Predict Students' Academic Success

URL to dataset:
https://archive-beta.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success

Assignment 1a (due Week 2): you should complete the following sections ONLY:

- Overview (Problem Domain)
- Overview (Objective)
- Analysis (Exploratory Analysis)

Assignment 1b (due Week 3): all sections of this template should be completed. Modifications of the
three sections submitted in Assignment 1a should be made based on feedback from the instructor.

This template should be used in conjunction with the assignment instructions. The size of the text area
below will expand to the length of your response; the area should not be interpreted as a required or
suggested length of response. Responses within the text area should be single spaced with Times New
Roman 12pt font. The body of the document will likely be 6-9 pages, not including the Appendix; length
may vary depending on specifics of the analysis and the dataset. As needed, APA format in-text citations
should be included, along with a full references list at the end of the document.

Overview

Problem Domain: give some background and context about the problem domain (application area).
For instance, if you are doing the analysis for predicting heart disease, provide some context about the
disease and include some interesting statistics about it. Also, discuss how the method is relevant for the
chosen problem.

In education, understanding the factors that contribute to student success is important for designing effective educational policies. By leveraging data and statistical models, such as logistic regression, we can gain insights into the predictors that significantly influence student outcomes.

Student success refers to academic achievement and the ability to meet goals, such as obtaining high grades or graduating on time. The goal of this project is to identify patterns and relationships between various attributes or characteristics of students and their likelihood of succeeding academically. Many schools also receive funding and grants based on student success and graduation rates.

Logistic regression is particularly relevant for this problem as it is well suited for binary classification tasks, where the outcome variable is categorical and takes one of two possible values. In our case, the outcome variable will represent whether a student is successful (1) or not (0). By conducting a logistic regression analysis, we can uncover important insights about the factors that contribute to student success. This information can inform targeted interventions, resource allocation, and personalized support systems to improve educational outcomes. Ultimately, the goal is to leverage data-driven approaches to enhance student success rates and promote a more equitable and inclusive educational system.
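
To make this concrete, here is a minimal sketch (illustrative only, not the final model) of how logistic regression turns a real-valued log-odds score built from student attributes into a success probability via the logistic (sigmoid) function:

import numpy as np

def predict_success_probability(log_odds):
    # The sigmoid maps any real-valued log-odds to a probability in (0, 1)
    return 1 / (1 + np.exp(-log_odds))

# Example: a log-odds of 1.2 implies roughly a 77% predicted chance of success
print(predict_success_probability(1.2))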
Objective: clearly state the objective of the analysis in relation to the kind of algorithm you are
employing. Use specific language as to what question(s) you are trying to answer using the specific
analysis/modeling type.

The objective of this analysis is to develop a logistic regression model to predict student
success based on relevant attributes and variables. The logistic regression model will allow us
to answer the following questions:

- Which factors significantly influence student success?
- How accurately can we predict student success?
- Which variables have the strongest predictive power?

Through the logistic regression model, we can determine the variables that contribute the most
to the prediction of student success. This information can guide educational institutions in
prioritizing interventions and support systems.

We can then provide educational stakeholders with actionable insights to improve student
success rates. The logistic regression model will serve as a tool to identify the most influential
factors, assess the predictive accuracy, and enhance our understanding of the relationships
between various student attributes and academic outcomes.

Analysis

Exploratory Analysis: describe the data including the source, the collection method, and variables.
Perform exploratory analysis. Also, select few key variables (including the target variable for
supervised learning) and study their distributions using plots such as histograms, box plot, bar chart,
etc.

The data was aggregated from several university data sets and includes information known by the
university at the time of enrollment as well as student outcomes after the first and second
semesters. Key variables include marital status, previous education level, parents' education
levels and occupations, time attended class (morning or evening), 1st and 2nd semester courses
and grades, as well as GDP, unemployment rate, and other student factors such as gender, age,
special needs, and tuition status.

Figure 1.1 shows the frequency of records in each of the three target categories.
Based on my exploratory data analysis, I can see that most students graduated, but more
dropped out than remained enrolled. Most students are younger than 25 and single. Most had
finished secondary education but had not completed another degree course. Using box plots, I
can see that while grade averages for the 1st and 2nd semesters remained similar, there were
students who received zeros. Additionally, very few students attended night classes, which could
indicate that most were full-time students.

Figure 1.2 shows marital status at time of enrollment: 1 = single, 2 = married, 3 = widower,
4 = divorced, 5 = facto union, 6 = legally separated.

Figure 1.3 shows students' previous qualifications at time of enrollment: 1 = secondary education,
2 = higher education (bachelor's degree), 3 = higher education (degree), 4 = higher education
(master's), 5 = higher education (doctorate), 6 = frequency of higher education, 9 = 12th year of
schooling (not completed), 10 = 11th year of schooling (not completed), 12 = other (11th year of
schooling), 14 = 10th year of schooling, 15 = 10th year of schooling (not completed), 19 = basic
education 3rd cycle (9th/10th/11th year) or equiv., 38 = basic education 2nd cycle (6th/7th/8th
year) or equiv., 39 = technological specialization course, 40 = higher education degree (1st cycle),
42 = professional higher technical course, 43 = higher education master (2nd cycle).

Figure 1.4 shows the number of students who attended night classes (label 0) or daytime classes
(label 1).

Figure 1.5 shows first semester grades for students enrolled in the course.

Figure 1.6 shows second semester grades for students enrolled in the same course as the first
semester.

My exploratory analysis also helped me determine that I will most likely need to do some
data processing to improve the readability of my charts. Many of the variables in this data set
use integers to represent categorical data, which can be hard to read without referencing the
data set documentation to understand what category each integer represents. Using Python, I can
easily convert the values back and forth depending on the need.
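
For example, a simple dictionary mapping (a sketch based on the marital status codes documented with the dataset, assuming the data is loaded into data_df as in the Appendix) converts the integer codes to readable labels and back:

marital_labels = {1: 'single', 2: 'married', 3: 'widower',
                  4: 'divorced', 5: 'facto union', 6: 'legally separated'}

# Codes to labels for readable charts
data_df['Marital status'] = data_df['Marital status'].map(marital_labels)

# Labels back to codes for modeling
marital_codes = {label: code for code, label in marital_labels.items()}
data_df['Marital status'] = data_df['Marital status'].map(marital_codes)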

Preprocessing: armed with the exploratory analysis, perform the necessary preprocessing, both general
and specific types appropriate for the modeling type being employed.

In order to make the provided logistic regression testing and validation work, I converted the
strings in the Target column of the data set into numerical values, mapping Dropout to 0,
Graduate to 1, and Enrolled to 2. I then dropped the Enrolled students, as they can still either
hit the target of Graduate or fail and drop out. Doing so resulted in losing roughly 700 rows of
data but leaves me with a training set of a little over 2,900 rows and a test set of a little
over 700.
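
A short sketch of these preprocessing steps (mirroring the full code in the Appendix; the roughly 2,900/700 split assumes the same 80/20 test size and data file):

import pandas as pd
from sklearn.model_selection import train_test_split

data_df = pd.read_csv("data.csv", delimiter=';')
data_df['Target'] = data_df['Target'].map({'Dropout': 0, 'Graduate': 1, 'Enrolled': 2})
data_df = data_df[data_df['Target'] != 2]  # drop students still enrolled

X = data_df.drop(columns='Target')
y = data_df['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
print(len(y_train), len(y_test))  # roughly 2,900 training rows and 700 test rows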
Model Fitting: explain the key steps and activities you perform to fit the model. Experiment (as
appropriate) with parameter tuning. This is key; what separates a highly accurate model from a less
accurate one is the amount of performance tuning performed.

I originally used the liblinear solver with the following features:

'Previous qualification', 'Mother\'s qualification', 'Father\'s qualification', 'Mother\'s occupation', 'Father\'s occupation', 'Marital status', 'Age at enrollment', 'Curricular units 1st sem (grade)', 'Curricular units 2nd sem (grade)', 'Unemployment rate', 'GDP'

These resulted in the following model accuracy scores:

Jaccard score: 0.5340136054421769
Log loss: 0.4414910791717128
False positives: 20; false negatives: 117
F1 (Dropout): 0.70
F1 (Graduate): 0.86

Next, I dropped the parents' qualifications and occupations to see the effect on accuracy. Since
doing so did not affect accuracy, those features appear to add little predictive value for our
target. Next, I removed marital status, unemployment rate, and GDP; this had minimal effect on
accuracy. None of the scores changed by meaningful amounts, but false positives increased by 2
and true positives increased by 2.

This left me with previous qualification (degree), age, 1st semester grades, and 2nd semester
grades as features. I ran the model again, dropping each of these in turn to confirm whether
they were key features.

I found that 2nd semester grades had the biggest impact on accuracy and decided to add in the
enrolled credit amounts for the 1st and 2nd semesters. Previous qualification seemed to have no
impact and was removed. Age and 1st semester grades had minimal impact but were retained. After
adding enrolled credits, the accuracy scores did not change.
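
A sketch of this ablation procedure (not the exact code used; it assumes the preprocessed data_df from the Appendix and refits the model once per dropped feature):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def score_without(data_df, features, dropped):
    # Fit on all candidate features except the dropped one; return test accuracy
    kept = [f for f in features if f != dropped]
    X = StandardScaler().fit_transform(data_df[kept])
    y = data_df['Target']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
    model = LogisticRegression(C=0.01, solver='liblinear').fit(X_train, y_train)
    return model.score(X_test, y_test)

features = ['Previous qualification', 'Age at enrollment',
            'Curricular units 1st sem (grade)', 'Curricular units 2nd sem (grade)']
for dropped in features:
    print(dropped, score_without(data_df, features, dropped))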
Results

Model Properties: explain the components of the fitted model and their characteristics. Leverage
functions to summarize the model properties. Also, leverage visualization as required.

My final features are age at enrollment, 1st semester grades, and 2nd semester grades.
The fitted model (on standardized features) gives the log-odds of graduating as:

log-odds = 0.34 - 0.37(Age) + 0.39(1st Sem Grades) + 0.96(2nd Sem Grades)

This aligns with my model testing: 2nd semester grades had the biggest influence, with age and
1st semester grades having smaller effects.
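
To make the equation concrete, a hypothetical worked example: for a student one standard deviation above the mean in 2nd semester grades and average on the other features, the log-odds are 0.34 + 0.96 = 1.30, which corresponds to a predicted graduation probability of 1 / (1 + e^-1.30) ≈ 0.79.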

Figure 2.1 shows a bar chart of the coefficients of the final model.

Output Interpretation: explain the result and interpret the final model output using terms that reflect
the application area and in relation to the stated objective. This is where you check whether or not the
stated objective is met.

Interpreting the coefficients (all features were standardized before fitting, so each coefficient
reflects a one standard deviation change, and coefficients act on the log-odds of graduating
rather than directly on the probability):

Age at enrollment: the coefficient for age is -0.37, indicating that a one standard deviation
increase in age decreases the log-odds of graduating by 0.37. This suggests that older students
may have a slightly lower likelihood of success compared to younger students, according to the
model.

1st semester grades: the coefficient is 0.39, implying that a one standard deviation increase in
1st semester grades increases the log-odds of graduating by 0.39. This suggests that better
grades in the first semester positively influence the likelihood of student success.

2nd semester grades: the coefficient is 0.96, indicating that a one standard deviation increase
in 2nd semester grades increases the log-odds of graduating by 0.96. This implies that the grades
achieved in the second semester have the most significant influence on student success, as
indicated by the largest coefficient magnitude.
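
Because these coefficients act on the log-odds scale, exponentiating them yields more intuitive odds ratios. A brief sketch (assuming the fitted model LR from the Appendix):

import numpy as np

# Each odds ratio is the multiplicative change in the odds of graduating
# per one standard deviation increase in the standardized feature
odds_ratios = np.exp(LR.coef_.reshape(-1))
for name, ratio in zip(['Age at enrollment', '1st sem grades', '2nd sem grades'], odds_ratios):
    print(name, round(ratio, 2))

For example, e^0.96 ≈ 2.6, so a one standard deviation increase in 2nd semester grades multiplies the odds of graduating by roughly 2.6.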

Evaluation: employ appropriate metrics to quantitatively evaluate the performance of the
fitted model. For supervised classification, this includes simple accuracy, precision & recall
(or sensitivity & specificity), all of which can be generated from a confusion matrix, or ROC.

              precision    recall  f1-score   support

           0       0.88      0.58      0.70       274
           1       0.79      0.95      0.86       452

    accuracy                           0.81       726
   macro avg       0.83      0.77      0.78       726
weighted avg       0.82      0.81      0.80       726

Figure 2.2 shows the confusion matrix for the final model.
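
Since the evaluation guidance also mentions ROC, a brief sketch (assuming the fitted model LR and the test split from the Appendix) computes the ROC AUC from predicted probabilities:

from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Probability of the positive class (Graduate = 1)
probs = LR.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, probs))

fpr, tpr, _ = roc_curve(y_test, probs)
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle='--')  # chance diagonal
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()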

Conclusion

Summary: highlight the main findings in relation to the stated objective. You don’t need to
discuss the details of the analysis and the model such as accuracy here, just focus on the key
findings.

I was surprised to find that parents' education and occupation, as well as whether a student had
a previous degree, had such a negligible effect on the final model. The key finding of this
model is that grades have the biggest impact on graduation. Age also affects graduation, which
makes sense, as older students tend to have more life events that can get in the way of success.

Limitations & Improvement areas: discuss the limitations of the analysis and identify
potential improvement areas for future work. This could be related to the data, algorithm, or a
combination of the two.

There are several limitations to this analysis. The first is measuring student success via
graduation rates alone. Nothing in the data points to the success of students who drop out and
become entrepreneurs. A better metric for student success would be measuring income and/or net
worth after set periods of time. This would then allow students to figure out which things they
can do to improve their economic standing, as well as predict how long it will take them to pay
off a given degree. Additionally, this data was collected from multiple universities, but there
is no indication of which ones. Based on the funding sources and the creators' locations, it is
reasonable to assume the data is from Portugal or, at most, various European universities. This
restricts the analysis: the model is only applicable to students similar to those found in the
data.

Appendix

# %% [markdown]
# ## We are going to create a machine learning model to predict whether a student will graduate or drop out, in order to take proactive action to support at-risk students.
#
# # What is the difference between linear regression and logistic regression?
# ## Linear regression is appropriate for predicting dependent variables that are composed of continuous values (e.g., predicting house prices), but it is inappropriate for predicting dependent variables that are categorical (e.g., yes or no, true or false, etc.).
#
# # Python libraries
# ## Pandas: "Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license" (Wikipedia, 2023).
#
# ## NumPy: "NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays" (Wikipedia, 2023).
#
# ## Scikit-learn: "Scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support-vector machines, ..." (Wikipedia, 2023).
#
# ## Matplotlib: "Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK" (Wikipedia, 2023).

# %%
#Import libraries that are required for the creation of the machine learning model
import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# %% [markdown]
# #Dataset
# ## The dataset that we will be using is a collection of student information used to predict
student success.

# %%
# Load the data from a CSV file
data_df = pd.read_csv("data.csv", delimiter=';')

# Display first rows to ensure data imported correctly
data_df.head()

# Check the data type of every column in the DataFrame
column_data_types = data_df.dtypes
print(column_data_types)

# %% [markdown]
# # Data Pre-Processing

# %%
# Convert strings in 'Target' column to numeric values
data_df['Target'] = data_df['Target'].map({'Dropout': 0, 'Graduate': 1, 'Enrolled': 2})

# Drop data for students still in 'Enrolled' status as they are not completed results
data_df = data_df[data_df['Target'] != 2]

# data_df = data_df[['Target', 'Previous qualification', "Mother's qualification", "Father's qualification", 'Curricular units 1st sem (grade)', 'Curricular units 2nd sem (grade)', 'Unemployment rate', 'GDP']]
data_df.head()

# %% [markdown]
# # Exploratory Analysis
# ## We will conduct exploratory analysis using histograms, bar charts, and box plots of various key variables.
#

# %%
# Plot a histogram of the 'Target' variable
plt.figure(figsize=(8, 6))
sns.histplot(data=data_df, x='Target')
plt.title('Histogram of Target')
plt.xlabel('Target')
plt.ylabel('Frequency')
plt.show()

# Plot a histogram of the 'Mother\'s qualification' variable
plt.figure(figsize=(8, 6))
sns.histplot(data=data_df, x='Mother\'s qualification')
plt.title('Histogram of Mother\'s qualification')
plt.xlabel('Mother\'s qualification')
plt.ylabel('Frequency')
plt.show()

# Plot a histogram of the 'Age at enrollment' variable
plt.figure(figsize=(8, 6))
sns.histplot(data=data_df, x='Age at enrollment')
plt.title('Histogram of Age at Enrollment')
plt.xlabel('Age at Enrollment')
plt.show()

# Plot a bar chart of the 'Marital status' variable
plt.figure(figsize=(8, 6))
sns.countplot(data=data_df, x='Marital status')
plt.title('Bar Chart of Marital Status')
plt.xlabel('Marital Status')
plt.ylabel('Count')
plt.show()

# Pie chart of scholarship holders
plt.figure(figsize=(8, 6))
data_df['Scholarship holder'].value_counts().plot.pie(autopct='%1.1f%%')
plt.title('Pie Chart of Scholarship Holder')
plt.ylabel('')
plt.show()

# Bar chart of previous qualification
plt.figure(figsize=(10, 8))
sns.countplot(data=data_df, x='Previous qualification')
plt.title('Bar Chart of Previous Qualification')
plt.xlabel('Previous Qualification')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

# Bar chart of attendance time
plt.figure(figsize=(8, 6))
sns.countplot(data=data_df, x='Daytime/evening attendance\t')
plt.title('Bar Chart of Daytime/Evening Attendance')
plt.xlabel('Daytime/Evening Attendance')
plt.ylabel('Count')
plt.show()

# Box plot of 1st semester grades
plt.figure(figsize=(8, 6))
sns.boxplot(data=data_df, y='Curricular units 1st sem (grade)')
plt.title('Box Plot of Curricular Units 1st Sem Grade')
plt.ylabel('Grade')
plt.show()

# Box plot of 2nd semester grades
plt.figure(figsize=(8, 6))
sns.boxplot(data=data_df, y='Curricular units 2nd sem (grade)')
plt.title('Box Plot of Curricular Units 2nd Sem Grade')
plt.ylabel('Grade')
plt.show()

# %% [markdown]
# # In this step, we need to define our X and our y. X = features (independent variables) and y = dependent variable (target vector).

# %%
X = np.asarray(data_df[['Age at enrollment', 'Curricular units 1st sem (grade)', 'Curricular units 2nd sem (grade)']])
X[0:5]

# %%
y = np.asarray(data_df['Target'])
y[0:5]

# %% [markdown]
# # In this step, we standardize our dataset (zero mean, unit variance).

# %%
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

# %% [markdown]
# # In this step we split the dataset into train/test sets

# %%
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape, y_train.shape)
print ('Test set:', X_test.shape, y_test.shape)

# %%
from sklearn.linear_model import LogisticRegression

# We will need the confusion matrix for the assignment
from sklearn.metrics import confusion_matrix

# You can experiment with these solvers to determine if they yield greater accuracy:
# 'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train, y_train)
LR
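
# %% [markdown]
# ### A minimal sketch (not part of the original run): loop over the solvers listed above and compare test accuracy. Results may vary slightly with C and random_state.

# %%
for solver in ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']:
    model = LogisticRegression(C=0.01, solver=solver).fit(X_train, y_train)
    print(solver, model.score(X_test, y_test))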

# %%
yhat = LR.predict(X_test)
yhat

# %% [markdown]
# # Let's evaluate our machine learning model
#
# ## Jaccard index
# ### The Jaccard index measures the similarity between the predicted and true label sets: the size of their intersection divided by the size of their union. A score of 1.0 indicates perfect agreement between predictions and true labels.

# %%
from sklearn.metrics import jaccard_score
jaccard_score(y_test, yhat, pos_label=0)

# %% [markdown]
# ### Log loss (logarithmic loss) measures the performance of a classifier where the predicted output is a probability value between 0 and 1. "The more the predicted probability diverges from the actual value, the higher is the log-loss value" (Dembla, 2020).
#
# Reference
# Dembla, G. (2020). Intuition behind log-loss score. Retrieved March 19, 2023, from https://towardsdatascience.com/intuition-behind-log-loss-score-4e0c9979680a

# %%
from sklearn.metrics import log_loss
yhat_prob = LR.predict_proba(X_test)
log_loss(y_test, yhat_prob)

# %% [markdown]
# # Confusion matrix

# %%
from sklearn.metrics import classification_report, confusion_matrix
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

print(confusion_matrix(y_test, yhat, labels=[1,0]))

# %%
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels=[1,0])
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix; with labels=[1,0], the first class is
# Graduate (1) and the second is Dropout (0)
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['Graduate (1)', 'Dropout (0)'], normalize=False,
                      title='Confusion matrix')

# %% [markdown]
# # Now, let's calculate the precision and recall

# %%
print (classification_report(y_test, yhat))

# %% [markdown]
# ### Based on the count of each section, we can calculate the precision and recall of each label:
#
# ### Precision is a measure of the accuracy provided that a class label has been predicted. It is defined by: precision = TP / (TP + FP)
#
# ### Recall is the true positive rate. It is defined as: recall = TP / (TP + FN)
#
# ### So, we can calculate the precision and recall of each class.
#
# ### F1 score: now we are in a position to calculate the F1 score for each label based on the precision and recall of that label.
#
# ### The F1 score is the harmonic average of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0. It is a good way to show that a classifier has a good value for both recall and precision.
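
# %% [markdown]
# ### A small illustrative sketch (assuming cnf_matrix from above, built with labels=[1,0]): computing precision, recall, and F1 for the Graduate class by hand to confirm the formulas.

# %%
# With labels=[1,0], row/column 0 of cnf_matrix corresponds to Graduate (1)
TP, FN = cnf_matrix[0, 0], cnf_matrix[0, 1]
FP, TN = cnf_matrix[1, 0], cnf_matrix[1, 1]
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)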

# %% [markdown]
# # Model Characteristics

# %%
intercept = LR.intercept_
coefficients = LR.coef_
print("Intercept:", intercept)
print("Coefficients:", coefficients)

# %% [markdown]
# ### Graphing the Coefficients

# %%
# Reshape the coefficients array
coefficients = coefficients.reshape(-1)

# X is a NumPy array after scaling, so it has no .columns attribute;
# list the feature names explicitly instead
feature_names = ['Age at enrollment', 'Curricular units 1st sem (grade)',
                 'Curricular units 2nd sem (grade)']

# Plotting the coefficients
plt.figure(figsize=(8, 6))
plt.bar(range(len(coefficients)), coefficients)
plt.xticks(range(len(coefficients)), feature_names, rotation=45)
plt.xlabel('Predictor Variables')
plt.ylabel('Coefficient')
plt.title('Logistic Regression Coefficients')
plt.show()

References

Realinho, V., Vieira Martins, M., Machado, J., & Baptista, L. (2021). Predict students' dropout
and academic success [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5MC89

