Predicting Student Academic Performance Using Data Mining Methods
Summary
The aim of this study is to use data mining techniques to predict
students’ graduation performance in the final year at university
using only pre-university marks and examination marks of the early
years at university; no socio-economic or demographic features are
used.
Key words:
Educational data mining, predicting performance, decision trees

1. Introduction
In the past three decades, computer hardware technology has become
very powerful. This has boosted the database and information
industry. As a result, a large number of databases and information
repositories are available and organizations store large amounts of
data. This has increased the need for powerful data analysis, which
is not possible without powerful tools. Data mining tools analyze
data from different perspectives and summarize the results as useful
information. They are employed to operate on large amounts of data
to find hidden patterns and associations that can be helpful in
decision making [1]. The application of data mining methods to
educational data is called Educational Data Mining (EDM), which is a
novel and promising field [2]. Researchers and experts in education
are using EDM techniques in higher education institutions to enhance
learning.
This paper focuses on the capabilities of data mining in higher
learning institutions for the study of educational data. It reflects
on how data mining may help to improve decision-making processes in
universities. This work aims at predicting students’ academic
performance at the end of a four-year bachelor’s degree program and
identifying effective indicators of at-risk students in the early
years of their study. It provides the institution with the
information it needs to outline measures to improve quality.
The paper is arranged as follows: The next section is devoted to the
literature review. Section 3 describes the data collection and
methodology used for this study. Results and discussions are
presented in Section 4. Finally, Section 5 concludes the paper.

2. Literature Review
The literature shows that predicting performance at the higher
education level has attracted substantial attention in the recent
past and remains a focus of research and discussion. A number of
studies have investigated the performance of students at the higher
education level [3,4,5,6,7].
The study conducted by [3] employs the Adaptive Neuro-Fuzzy
Inference System (ANFIS) to predict student academic performance,
which will help students to improve their academic success.
Acharya and Sinha [4] apply machine learning algorithms to the
prediction of students’ results. They found that the best results
were obtained with the decision tree class of algorithms.
Kaur et al. [5] identify slow learners among students and present
them through a predictive data mining model using
classification-based algorithms.
Guruler et al. [6] attempt to find the student demographics that are
associated with success by using decision trees.
Vandamme et al. [7] use decision trees, neural networks and linear
discriminant analysis to make early predictions of students’
academic success in the first academic year at university.
The studies reviewed above show that it is possible to predict the
performance of students with reasonable accuracy. All the mentioned
works use cross validation to assess their results. In contrast, we
take one batch to train the classifier and the other batch to test
the prediction results. This aspect distinguishes our work from the
other works.

3. Data and Methodology

3.1 Data
In this study, we used the data of two academic cohorts or batches
of the Civil Engineering Department at NEDUET, Pakistan, comprising
altogether 214 undergraduate students enrolled in the academic
batches of 2005–06 and
2006–07. We use pre-university marks, i.e. HSC (High School
Certificate) marks, and the examination marks of students in the
courses taught in the first and second academic years, shown in
Table 1. The prediction variable is the class interval, which is
calculated on the basis of the final marks of the degree. The final
marks of the degree are divided into 5 class intervals: Class_A
(90%–100%), Class_B (80%–89%), Class_C (70%–79%), Class_D
(60%–69%), and Class_E (50%–59%).

Table 1: List of variables used in study
Variable        Description
Class Interval  5 possible values (Class_A, Class_B, Class_C, Class_D and Class_E)
Adj_Marks       HSC Examination total marks
Maths_Marks     HSC Examination Mathematics marks
MPC             Maths + Physics + Chemistry marks
CE-101          Engineering Drawing-I
CE-102          Engineering Mechanics
CE-103          Surveying-I
CE-104          Engineering Materials
EE-102          Electrical Engineering
HS-101          English
HS-105/127      Pakistan Studies
ME-105          Applied Thermodynamics
MS-105          Applied Chemistry
MS-111          Calculus
MS-121          Applied Physics
CE-201          Surveying-II
CE-202          Introduction to Computing
CE-203          Engineering Drawing-II
CE-204          Fluid Mechanics-I
CE-205          Mechanics of Solids-I
CE-206          Engineering Geology
CE-209          Structural Analysis-I
MS-331          Applied Probability & Statistics
HS-205/206      Islamic Studies
MS-221          Linear Algebra & Ordinary Differential Equations
HS-303          Engineering Economics

3.2 Methodology
To predict the performance of the students as early as possible, we
use HSC marks and the marks in first and second year courses. We
used the data of batch 2005–06 to train the prediction models, which
were then used to predict the class intervals of students in the
next batch, i.e. 2006–07. Batch and interval statistics are
presented in Table 2.

Table 2: Batches and Class Interval Statistics
Academic  Total number  Students in
Cohort    of students   Class_A  Class_B  Class_C  Class_D  Class_E
2005–06   99            -        3        46       44       6
2006–07   115           -        3        51       44       17

Table 2 shows that the distribution of students among the class
intervals is unbalanced. The ‘Class_C’ interval contains the most
students. Predicting ‘Class_C’ for every student would have an
accuracy of 44.34%. This is the baseline accuracy that we want to
improve.
We ran a number of classifiers: Decision Tree produced with Gini
Index (DT-GI), Decision Tree produced with Information Gain (DT-IG),
Decision Tree produced with Accuracy (DT-Acc), Naive Bayes (NB),
Neural Networks (NN), Random Forest produced with Gini Index
(RF-GI), Random Forest produced with Information Gain (RF-IG) and
Random Forest produced with Accuracy (RF-Acc).

4. Analysis and Results
Table 3 shows the accuracy and kappa results for the classifiers. We
also applied other classifiers, namely Decision Tree with Gain
Ratio, Rule Induction with Information Gain, Rule Induction with
Accuracy, 1-NN, Linear Regression and Support Vector Machines. Their
results are not reported here because their classification
accuracies are not above the baseline.

Table 3: Prediction accuracy and Kappa results
IJCSNS International Journal of Computer Science and Network Security, VOL.17 No.5, May 2017 189
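As a rough illustration of the class intervals, the majority-class baseline and the accuracy and kappa measures, the following Python sketch uses only the class counts of Table 2. It "trains" on the 2005–06 batch by remembering its most frequent class and "tests" on the 2006–07 batch, which gives an accuracy of roughly 44.3%, in line with the 44.34% baseline quoted in the text. The function names are hypothetical, the handling of fractional marks is an assumption, and the study itself was carried out in RapidMiner, so this is not the authors' implementation.

```python
from collections import Counter

# Class intervals from Section 3.1 as (lower bound in percent, label).
# Assumption: a fractional mark such as 89.5 falls into the lower interval.
INTERVALS = [(90, "Class_A"), (80, "Class_B"), (70, "Class_C"),
             (60, "Class_D"), (50, "Class_E")]


def class_interval(final_percent):
    """Map a final percentage to its class interval label."""
    for lower, label in INTERVALS:
        if final_percent >= lower:
            return label
    raise ValueError("marks below 50% fall outside the five intervals")


def expand(counts):
    """Turn {label: count} into a flat list of labels."""
    return [label for label, n in counts.items() for _ in range(n)]


def accuracy(y_true, y_pred):
    """Fraction of students whose class interval is predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_freq, pred_freq = Counter(y_true), Counter(y_pred)
    p_e = sum((true_freq[k] / n) * (pred_freq[k] / n) for k in true_freq)
    return (p_o - p_e) / (1 - p_e)


# Class counts copied from Table 2 (no student is in Class_A).
train_counts = {"Class_B": 3, "Class_C": 46, "Class_D": 44, "Class_E": 6}
test_counts = {"Class_B": 3, "Class_C": 51, "Class_D": 44, "Class_E": 17}

# "Train" the trivial model on 2005-06: remember its most frequent class.
majority = max(train_counts, key=train_counts.get)

# "Test" on 2006-07: predict that class for every student.
y_true = expand(test_counts)
y_pred = [majority] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
baseline_kappa = cohen_kappa(y_true, y_pred)
print(majority, round(100 * baseline_accuracy, 1), baseline_kappa)
```

A constant prediction agrees with the true labels only as often as chance would, so its kappa is zero. This is why kappa is a useful companion to accuracy when the class intervals are unbalanced.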
To improve the accuracy of the classifiers, we apply different
feature selection techniques available in RapidMiner. The Recursive
Feature Elimination (RFE) operator available in RapidMiner offers
four criteria to weight attributes and choose subsets of variables:
weight by Gini index (GI), weight by information gain ratio (IG),
weight by chi-squared (Chi-SS) and weight by rule induction. We thus
obtain four different subsets of variables from the four criteria of
the RFE operator. Each subset contains seven variables. It is
interesting to observe that two subsets contain HSC marks. This
indicates that HSC marks play an important role in students’
university performance at the Civil Engineering Department. The
prediction models of Table 3, i.e. decision tree produced with the
criterion Gini index (DT-GI), decision tree produced with the
criterion information gain (DT-IG), decision tree produced with the
criterion accuracy (DT-Acc), naive Bayes (NB), neural networks (NN),
random forest trees produced with the criterion Gini index (RF-GI),
random forest trees produced with the criterion information gain
(RF-IG) and random forest trees produced with the criterion accuracy
(RF-Acc), were built again using these four subsets of variables.
Figure 1 gives the results of the feature selection algorithms.

We also investigated the Pearson’s correlation of first and second
year courses with the final marks obtained in the examination. The
correlation results are presented in Table 4.

Table 4: Correlation results between first and second year courses
and final marks
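The Pearson correlation between the marks in a course and the final marks can be illustrated with a short Python sketch. The course codes are taken from Table 1, but all marks below are hypothetical values chosen for illustration and are not data from the study; the function name `pearson` is likewise an assumption.

```python
import math


def pearson(x, y):
    """Pearson's r between two equally long sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant marks")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical marks of five students (NOT data from the study).
courses = {
    "CE-101": [55, 62, 70, 78, 85],
    "MS-111": [80, 60, 75, 65, 70],
}
final_marks = [58, 60, 71, 75, 88]

# Correlate every course with the final marks and rank by strength.
correlations = {code: pearson(marks, final_marks)
                for code, marks in courses.items()}
ranked = sorted(correlations, key=lambda c: abs(correlations[c]),
                reverse=True)
for code in ranked:
    print(code, round(correlations[code], 3))
```

Ranking by the absolute value keeps strongly negative correlations near the top as well, since they are as informative for prediction as strongly positive ones.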
Fig. 3: Decision tree produced with the information gain with K=5
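A decision tree grown with the information gain criterion chooses, at each node, the split that most reduces the entropy of the class intervals. The Python sketch below shows this computation for a single threshold split on the marks of one course. The marks and labels are hypothetical, the function names are assumptions, and the exact split search used by RapidMiner may differ.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())


def information_gain(marks, labels, threshold):
    """Entropy reduction from splitting students at marks <= threshold."""
    left = [l for m, l in zip(marks, labels) if m <= threshold]
    right = [l for m, l in zip(marks, labels) if m > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing gains nothing
    n = len(labels)
    weighted = ((len(left) / n) * entropy(left)
                + (len(right) / n) * entropy(right))
    return entropy(labels) - weighted


def best_threshold(marks, labels):
    """Try each observed mark as a threshold; keep the most informative."""
    candidates = sorted(set(marks))[:-1]
    if not candidates:
        raise ValueError("need at least two distinct marks to split")
    return max(candidates,
               key=lambda t: information_gain(marks, labels, t))


# Hypothetical marks in one course and the final class intervals
# of eight students (NOT data from the study).
marks = [52, 58, 61, 66, 72, 75, 80, 84]
labels = ["Class_D"] * 4 + ["Class_C"] * 4

threshold = best_threshold(marks, labels)
gain = information_gain(marks, labels, threshold)
print(threshold, gain)
```

In this toy data the course separates the two class intervals perfectly, so the best split removes all of the one bit of entropy. A course whose marks yield a high gain near the root of the tree is a candidate indicator of low performance.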
5. Conclusion
The result of the study shows that we can predict the
graduation performance in a four-year university program
using only pre-university marks and marks of first and
second year courses, no socio-economic or demographic
features, with a reasonable accuracy, and that the model
established for one cohort generalizes to the following
cohort. It makes the implementation of a performance
support system in a university simpler because from an
administrative point of view, it is easier to gather marks of
students than their socio-economic data. The result also
shows that decision trees can be used to identify the
courses that act as indicators of low performance. By
identifying these courses, we can give warning to students
earlier in the degree program.
References
[1] J. Han, and M. Kamber, Data Mining Concepts and
Techniques, 2nd ed. San Francisco: Morgan Kaufmann,
pp.5-7, 2006.
[2] R.S.J.D Baker, and K. Yacef, “The State of Educational
Data Mining in 2009: A Review and Future Visions”, 2nd
International Conference on Educational Data
Mining, Proceedings. Cordoba, Spain, pp. 1, 3-17, July 1-3,
2009.
[3] A. Altaher, O. BaRukab, “Prediction of Student’s
Academic Performance Based on Adaptive Neuro-Fuzzy
Inference”, International Journal of Computer Science and
Network Security, Vol.17 No.1, January 2017.
[4] A. Acharya, D. Sinha, “Early prediction of student
performance using machine learning techniques”,
International Journal of Computer Applications, Volume
107, No. 1, December 2014.
[5] P. Kaur, M. Singh, G. S. Josan, “Classification and
prediction based data mining algorithms to predict slow
learners in education sector”, 3rd International Conference
on Recent Trends in Computing 2015 (ICRTC-2015).
[6] H. Guruler, A. Istanbullu, M. Karahasan, “A new
student performance analysing system using knowledge
discovery in higher educational databases”, Computers &
Education, pp. 247-254, 2010.
[7] J. P. Vandamme, N. Meskens, J. F. Superby,
“Predicting Academic Performance by Data Mining
Methods”, Education Economics, Volume 15, No. 4, 2007.