Predicting Student Academic Performance Using Data Mining Methods

This study aims to predict student academic performance at university using pre-university marks and examination marks from the first two years of study, without socio-economic or demographic data. The researchers used data mining techniques to analyze marks from 214 civil engineering students from two batches at a university in Pakistan. Various classification algorithms were applied to these marks to predict the students' final class interval, which was divided into five categories based on final degree marks. The best results were obtained using decision tree algorithms.

IJCSNS International Journal of Computer Science and Network Security, VOL.17 No.5, May 2017 187

Predicting Student Academic Performance using Data Mining Methods

Raheela Asif¹, Saman Hina¹, Saba Izhar Haque¹
¹ N.E.D University of Engineering & Technology, Department of Computer Science & Software Engineering, Karachi, 75270, Pakistan

Manuscript received May 5, 2017
Manuscript revised May 20, 2017

Summary
The aim of this study is to use data mining techniques to predict students' graduation performance in the final year at university using only pre-university marks and examination marks from the early years at university; no socio-economic or demographic features are used.
Key words:
Educational data mining, predicting performance, decision trees

1. Introduction
In the past three decades, computer hardware technology has become very powerful. This has boosted the database and information industry. As a result, a large number of databases and information repositories are available, and organizations store large amounts of data. This has increased the need for powerful data analysis, which is not possible without powerful tools. Data mining tools analyze data from different perspectives and summarize the results as useful information. They are employed to operate on large amounts of data to find hidden patterns and associations that can be helpful in decision making [1]. The application of data mining methods to educational data is called Educational Data Mining (EDM), which is a novel and promising field [2]. Researchers and experts in education are using EDM techniques in higher education institutions to enhance learning.
This paper focuses on the capabilities of data mining in higher learning institutions for the study of educational data. It reflects on how data mining may help to improve decision-making processes in universities. This work aims at predicting students' academic performance at the end of a four-year bachelor's degree program and at identifying effective indicators of at-risk students in the early years of their study. It provides the institution with the information it needs to outline measures to improve quality.
The paper is arranged as follows: The next section is devoted to the literature review. Section 3 describes the data collection and methodology used for this study. Results and discussions are presented in Section 4. Finally, Section 5 concludes the paper.

2. Literature Review
The literature review shows that predicting performance at the higher education level has attracted substantial attention in the recent past and remains a focus of research and discussion. A number of studies investigated the performance of students at the higher level [3,4,5,6,7].
The study conducted by [3] employs the Adaptive Neuro-Fuzzy Inference System (ANFIS) to predict student academic performance, which will help the students to improve their academic success.
Acharya and Sinha [4] apply machine learning algorithms for the prediction of students' results. They found that the best results were obtained with the decision tree class of algorithms.
Kaur et al. [5] identify slow learners among students and display them using a predictive data mining model based on classification algorithms.
Guruler et al. [6] attempt to find the student demographics that are associated with success by using decision trees.
Vandamme et al. [7] use decision trees, neural networks and linear discriminant analysis to make early predictions of students' academic success in the first academic year at university.
The literature reviewed above shows that it is possible to predict the performance of students with a reasonable accuracy. All the mentioned works use cross validation to assess their results. However, we take one batch to train the classifier and the other batch to test the prediction results. This aspect differentiates our work from other works.

3. Data and Methodology

3.1 Data
In this study, we used the data of two academic cohorts, or batches, of the Civil Engineering Department at NEDUET, Pakistan, comprising altogether 214 undergraduate students enrolled in the academic batches of 2005–06 and
2006–07. We use pre-university marks, i.e. HSC (High School Certificate) marks, and the examination marks of students in the courses taught in the first and second academic years, shown in Table 1. The prediction variable is the class interval, which is calculated on the basis of the final marks of the degree. The final marks of the degree are divided into 5 class intervals: Class_A (90%–100%), Class_B (80%–89%), Class_C (70%–79%), Class_D (60%–69%), and Class_E (50%–59%).

Table 1: List of variables used in study
Variable         Description
Class Interval   5 possible values (Class_A, Class_B, Class_C, Class_D and Class_E)
Adj_Marks        HSC Examination total marks
Maths_Marks      HSC Examination Mathematics marks
MPC              Maths + Physics + Chemistry marks
CE-101           Engineering Drawing-I
CE-102           Engineering Mechanics
CE-103           Surveying-I
CE-104           Engineering Materials
EE-102           Electrical Engineering
HS-101           English
HS-105/127       Pakistan Studies
ME-105           Applied Thermodynamics
MS-105           Applied Chemistry
MS-111           Calculus
MS-121           Applied Physics
CE-201           Surveying-II
CE-202           Introduction to Computing
CE-203           Engineering Drawing-II
CE-204           Fluid Mechanics-I
CE-205           Mechanics of Solids-I
CE-206           Engineering Geology
CE-209           Structural Analysis-I
MS-331           Applied Probability & Statistics
HS-205/206       Islamic Studies
MS-221           Linear Algebra & Ordinary Differential Equations
HS-303           Engineering Economics

3.2 Methodology
To predict the performance of the students as early as possible, we use HSC marks and the marks in first and second year courses. We used the data of batch 2005–06 to train the prediction models, which were then used to predict the class intervals of students in the next batch, i.e. 2006–07. Batch and interval statistics are presented in Table 2.

Table 2: Batches and Class Interval Statistics
                                      Number of students in class interval
Academic Cohort   Total students   Class_A   Class_B   Class_C   Class_D   Class_E
2005–06           99               -         3         46        44        6
2006–07           115              -         3         51        44        17

Table 2 shows that the distribution of students amongst the class intervals is unbalanced. The 'Class_C' interval contains the most students. Always predicting the class interval 'Class_C' would have an accuracy of 44.34% (51 of the 115 students in the 2006–07 batch). This is the baseline accuracy that we want to improve.
We ran a number of classifiers: Decision Tree produced with Gini Index (DT-GI), Decision Tree produced with Information Gain (DT-IG), Decision Tree produced with Accuracy (DT-Acc), Naive Bayes (NB), Neural Networks (NN), Random Forest produced with Gini Index (RF-GI), Random Forest produced with Information Gain (RF-IG) and Random Forest produced with Accuracy (RF-Acc).

4. Analysis and Results
Table 3 shows the accuracy and kappa results for the classifiers. We also applied other classifiers, such as Decision Tree with Gain Ratio, Rule Induction with Information Gain, Rule Induction with Accuracy, 1-NN, Linear Regression and Support Vector Machines. Their results are not reported here as their classification accuracies are not above the baseline.

Table 3: Prediction accuracy and Kappa results
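As a rough illustration of the evaluation described in this section, the sketch below computes the majority-class baseline for the 2006–07 batch from the Table 2 counts, together with accuracy and kappa. It is not the authors' RapidMiner process: it assumes the reported kappa is Cohen's kappa, and the function names `accuracy` and `cohen_kappa` are hypothetical helpers written for this example.

```python
from collections import Counter
from typing import List

# Class counts of the 2006-07 (test) batch, copied from Table 2.
TEST_COUNTS = {"Class_A": 0, "Class_B": 3, "Class_C": 51, "Class_D": 44, "Class_E": 17}


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of students whose class interval is predicted correctly."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true: List[str], y_pred: List[str]) -> float:
    """Cohen's kappa (assumed here to be the kappa reported in the paper)."""
    n = len(y_true)
    observed = accuracy(y_true, y_pred)
    true_freq, pred_freq = Counter(y_true), Counter(y_pred)
    # Agreement expected by chance from the two label distributions.
    expected = sum(true_freq[c] * pred_freq[c] for c in true_freq) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate case: both lists hold one identical label
    return (observed - expected) / (1.0 - expected)


# Rebuild the true labels of the test batch from the counts.
y_true = [label for label, count in TEST_COUNTS.items() for _ in range(count)]

# Baseline: always predict the most frequent class interval.
majority = max(TEST_COUNTS, key=TEST_COUNTS.get)
y_pred = [majority] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
baseline_kappa = cohen_kappa(y_true, y_pred)
print(f"majority class: {majority}")
print(f"baseline accuracy: {baseline_accuracy:.4f}, kappa: {baseline_kappa:.4f}")
```

With these counts the baseline comes out at 51/115, in line with the figure of about 44.3% quoted in the text, and its kappa is zero, which is why a classifier needs to beat both numbers to be of interest.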
To improve the accuracy of the classifiers, we apply different feature selection techniques available in RapidMiner. The Recursive Feature Elimination (RFE) operator available in RapidMiner has four criteria to weight attributes and choose subsets of variables: weight by Gini index (GI), weight by information gain ratio (IG), weight by chi-squared (Chi-SS) and weight by rule induction. We have four different subsets of variables from the four criteria of the RFE operator. Each subset contains seven variables. It is interesting to observe that two subsets contain HSC marks. This means that HSC marks play an important role in students' university performance at the Civil Engineering Department. The prediction models of Table 3, i.e. decision tree produced with the criterion Gini index (DT-GI), decision tree produced with the criterion information gain (DT-IG), decision tree produced with the criterion accuracy (DT-Acc), naive Bayes (NB), neural networks (NN), random forest produced with the criterion Gini index (RF-GI), random forest produced with the criterion information gain (RF-IG) and random forest produced with the criterion accuracy (RF-Acc), were built again using these four subsets of variables. Figure 1 gives the results of the feature selection algorithms.

Fig. 1 Comparison of classifiers' accuracy after applying feature selection

We can see from Figure 1 that there is no feature selection technique that improves the accuracy for all classifiers or a big majority of them. However, the accuracy for RFE-Chi-SS improves for two classifiers and stays the same for three classifiers. RFE-IG gives the best accuracies for two of the decision trees as compared to the other feature selection techniques. We are more interested in the decision tree results as they are understandable and can be used in implementing some policy. The set of attributes selected by RFE-Chi-SS is: CE-102, CE-103, CE-202, CE-203, CE-204, CE-206, MS-331. The set of attributes selected by RFE-IG is: Adj_Marks, CE-101, CE-102, CE-103, CE-202, CE-204, MS-331. If we take the intersection of these two sets, we have 5 courses in common, i.e. CE-102, CE-103, CE-202, CE-204 and MS-331. The meaning of these courses is given in Table 1.

We also investigated the Pearson's correlation of first and second year courses with the final marks obtained in the examination. The results are presented in the correlation table below.

Table 2: Correlation results between first and second year courses and final marks

The five courses that we selected through the intersection of the subsets of RFE-IG and RFE-Chi-SS include one non-core course of second year (i.e. MS-331), two core courses from first year and two core courses from second year. They are highlighted in the correlation table. We can see from that table that all these five courses have a high correlation with the final marks.

This subset of 5 courses was used with the same eight classifiers. The results are presented in Table 4. The three decision trees that are obtained by using these 5 courses are shown in Fig. 2, Fig. 3 and Fig. 4.

Table 4: Comparison of prediction accuracies after applying feature selection based on the intersection of RFE-Chi-SS and RFE-IG
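The intersection of the two RFE attribute sets listed above can be checked with a short sketch. The attribute codes and course titles are copied from the text and Table 1; the variable names are illustrative only and are not part of the authors' RapidMiner setup.

```python
# Attribute subsets reported in the text for two of the RFE weighting criteria.
RFE_CHI_SS = {"CE-102", "CE-103", "CE-202", "CE-203", "CE-204", "CE-206", "MS-331"}
RFE_IG = {"Adj_Marks", "CE-101", "CE-102", "CE-103", "CE-202", "CE-204", "MS-331"}

# Course titles from Table 1, restricted to the courses both criteria select.
COURSE_TITLES = {
    "CE-102": "Engineering Mechanics",
    "CE-103": "Surveying-I",
    "CE-202": "Introduction to Computing",
    "CE-204": "Fluid Mechanics-I",
    "MS-331": "Applied Probability & Statistics",
}

# Attributes selected by both criteria, in a stable order for reporting.
common = sorted(RFE_CHI_SS & RFE_IG)

for code in common:
    print(f"{code}: {COURSE_TITLES[code]}")
```

Keeping only the attributes that two independent weighting criteria agree on is a simple way to arrive at a smaller subset; here it leaves five courses out of the seven in each set.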
Fig. 2: Decision tree produced with the Gini index with K=5

Fig. 3: Decision tree produced with the information gain with K=5

Fig. 4: Decision tree produced with the accuracy with K=5

By examining the above trees, one can observe that there are two indicators of low performance: CE-102 and CE-202. A low performance in CE-102 leads to a leaf C or D, and a low performance in CE-202 leads to a leaf with a D or E interval in all three trees. This suggests that students having a mark lower than or equal to 48 in CE-102 are likely to achieve their degree with a poor mark. It also suggests that students having 52 or less in CE-202 are likely to obtain 52 or less in other subjects as well, because of the way the final mark is calculated.
The 2 indicators of low performance comprise one course from first year and one from second year. CE-102, the first year course, should be taken as an indicator to warn students in first year. This can be summarized as follows:
• In first year, students whose marks are around or below 48 in CE-102 are likely to have a mark in the 'D' interval at the end of the degree.
• In second year, students whose marks are around or below 52 in CE-202 are likely to have a mark in the 'D' or 'E' interval at the end of the degree.

The above findings can be used to implement some policy. For example, the instructors of the course CE-102 could report students with marks equal to or less than 48. There is a possibility that these students are at risk and need more academic support. A similar identification of at-risk students could take place in second year, where the instructors could report students whose marks are 52 or less in CE-202. These suggestions may help the University to pay extra attention to those students who are at risk by arranging more academic facilities, e.g. extra classes or extra consultation hours with the instructors.
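The two indicators described above could be turned into a simple reporting rule. The following is a minimal sketch under stated assumptions: the thresholds 48 (CE-102) and 52 (CE-202) are the values quoted in the text, while the function `at_risk_courses`, the student identifiers and the marks are invented for illustration and do not come from the study's data.

```python
from typing import Dict, List

# Thresholds quoted in the text for the two indicator courses.
INDICATOR_THRESHOLDS = {"CE-102": 48.0, "CE-202": 52.0}


def at_risk_courses(marks: Dict[str, float]) -> List[str]:
    """Return the indicator courses where the mark is at or below the threshold.

    Courses the student has not taken yet (missing keys) are skipped.
    """
    flagged = []
    for course, threshold in INDICATOR_THRESHOLDS.items():
        mark = marks.get(course)
        if mark is not None and mark <= threshold:
            flagged.append(course)
    return flagged


# Hypothetical students and marks, made up for this example.
students = {
    "S1": {"CE-102": 45, "CE-202": 70},
    "S2": {"CE-102": 60, "CE-202": 52},
    "S3": {"CE-102": 75},  # second-year mark not available yet
}

report = {student: at_risk_courses(marks) for student, marks in students.items()}
for student, flagged in report.items():
    status = ", ".join(flagged) if flagged else "no indicator triggered"
    print(f"{student}: {status}")
```

A rule of this kind only marks students for follow-up by instructors; it does not replace the decision trees, which combine several courses to predict the class interval.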
5. Conclusion
The result of the study shows that we can predict the graduation performance in a four-year university program with a reasonable accuracy using only pre-university marks and marks of first and second year courses, with no socio-economic or demographic features, and that the model established for one cohort generalizes to the following cohort. This makes the implementation of a performance support system in a university simpler because, from an administrative point of view, it is easier to gather the marks of students than their socio-economic data. The result also shows that decision trees can be used to identify the courses that act as indicators of low performance. By identifying these courses, we can warn students earlier in the degree program.

References
[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed. San Francisco: Morgan Kaufmann, pp. 5-7, 2006.
[2] R.S.J.D. Baker and K. Yacef, “The State of Educational Data Mining in 2009: A Review and Future Visions”, 2nd International Conference on Educational Data Mining, Proceedings. Cordoba, Spain, pp. 1, 3-17, July 1-3, 2009.
[3] A. Altaher and O. BaRukab, “Prediction of Student’s Academic Performance Based on Adaptive Neuro-Fuzzy Inference”, International Journal of Computer Science and Network Security, Vol. 17, No. 1, January 2017.
[4] A. Acharya and D. Sinha, “Early prediction of student performance using machine learning techniques”, International Journal of Computer Applications, Vol. 107, No. 1, December 2014.
[5] P. Kaur, M. Singh and G. S. Josan, “Classification and prediction based data mining algorithms to predict slow learners in education sector”, 3rd International Conference on Recent Trends in Computing 2015 (ICRTC-2015).
[6] H. Guruler, A. Istanbullu and M. Karahasan, “A new student performance analysing system using knowledge discovery in higher educational databases”, Computers & Education, pp. 247-254, 2010.
[7] J. P. Vandamme, N. Meskens and J. F. Superby, “Predicting Academic Performance by Data Mining Methods”, Education Economics, Vol. 15, No. 4, 2007.
