The Analysis of Student Performance Using Data Mining

Leo Willyanto Santoso and Yulia

Abstract This paper presents a study of data mining in the education sector to
model the performance of students enrolled in a university. Two data mining tasks
were carried out. First, a descriptive task based on the K-means algorithm was
used to identify several student clusters. Second, a classification task used
two classification techniques, decision tree and Naïve Bayes, to predict
dropout caused by poor performance in a student’s first four semesters. The
student academic data collected during the admission process were
used to train and test the models, which were assessed using cross-validation.
Experimental results show that the prediction of student dropout improves,
and student performance can be monitored, when the data from the previous academic
enrollment are added.

Keywords Data mining · Education · Drop out · Student performance

1 Introduction

Data mining represents a significant computational advance in obtaining information
from hidden relationships between variables. This discipline aims to extract useful
knowledge from a high volume of data in which this knowledge is initially unknown;
applying mining techniques uncovers these relationships. The application
of the technologies and tools of data mining in various educational contexts
is known as educational data mining (EDM) or data mining in education [1].
The contributions of data mining in education have been used to increase understanding
of the educational process, with the main objective of providing teachers
and researchers with recommendations for the improvement of the teaching-learning
process. By implementing data mining applications in education, teachers and administrators
could organize educational resources in a more efficient way.

L. W. Santoso (B) · Yulia

Petra Christian University, Surabaya, Indonesia
e-mail: leow@petra.ac.id

© Springer Nature Singapore Pte Ltd. 2019 559

S. K. Bhatia et al. (eds.), Advances in Computer Communication
and Computational Sciences, Advances in Intelligent Systems and Computing 924,
https://doi.org/10.1007/978-981-13-6861-5_48

The objective of EDM is to apply data mining to traditional teaching systems,
in particular to learning content management systems and intelligent web-based
education systems. Each of these systems has different data sources for knowledge
discovery. After the pre-processing of the data in each of these systems, the different
techniques of data mining are applied: statistics and visualization, grouping and
classification, association rules and data mining.
The academic information stored in the databases of educational institutions
is very useful in the teaching and learning process, which is why there
has been significant research interest in the analysis of academic information.
This research applies data mining techniques to the academic records of the
students who were admitted in the academic periods between July 2010 and June 2014,
through the construction of a descriptive data mining model that makes it possible
to create the different profiles of the admitted students from their socioeconomic
information. For the development of the research, the CRISP-DM methodology was used;
it structures the lifecycle of a data mining project in six phases, described at four
levels, which interact with each other during the development of the research [2].
This paper is organized as follows: Sect. 2 contains the background of the research
and a review of the state of the art of data mining and the use of its techniques in
the education sector. Section 3 covers the understanding of the data, in which
a preliminary exploration of the data is performed, and the preparation of the data,
which covers all the activities necessary for the construction of the final dataset:
the selection of tables, records, and attributes. Section 4 focuses on the design and
evaluation of a descriptive and classification model. Finally, in Sect. 5, the
conclusions and future work are presented.

2 Literature Review

Data mining is widely used in many interdisciplinary fields [3], including the
education sector, where it has been the subject of many studies.
Araque et al. [4] conducted a study on the factors that affect university dropout by
developing a prediction model. This model measures a student’s risk of abandonment
from socioeconomic information and academic records, using decision tree and
logistic regression techniques, to quantify students at high risk of
dropping out.
Kotsiantis et al. [5] studied learning algorithms for the prediction of
student desertion, i.e., when a student abandons studies. Their research was
motivated by the large number of students who do not complete their courses in
universities that offer distance education. A large number of tests were carried out
with the academic data, and decision tree, neural network, Naive Bayes, logistic
regression, and support vector machine algorithms were compared to assess the
performance of the proposed system. The analysis of the results showed that the Naive
Bayes algorithm is the most appropriate to predict the performance of students in a
distance education system.

Kuna et al. [6], in their work “The discovery of knowledge obtained through the
process of Induction of decision trees,” used decision trees to model classifications
of the data. One of the main results obtained was the characterization of students at
high risk of abandoning their university studies.
Kovacic [7] studied socioeconomic variables such as age, gender, ethnicity,
disability, employment status, and the distance study program. The objective of the
research was to identify students at high risk of dropping out. Decision trees
and logistic regression were the data mining techniques used in this research.
Yadav et al. [8] presented a data mining project to generate predictive models
and identify students at high risk of dropping out, taking into account student records
at the first enrollment. The quality of the prediction models was examined with
the decision tree algorithms ID3, C4.5, and ADT. The ADT machine
learning algorithm can learn predictive models from student data from previous
years. With the same technique, Quadril and Kalyankar [9] presented a
data mining study to construct and evaluate a predictive model to determine the
probability of desertion of a particular student; they used the decision tree
technique to classify the students with the C4.5 algorithm.
Zhang and Oussena [10] proposed the construction of a course management
system based on data mining. Once the data were processed in the system, the
authors identified the characteristics of students who did not succeed in the semester.
In this research, support vector machine, Naive Bayes, and decision tree were used.
The highest precision in the classification was obtained with the Naive Bayes
algorithm, while the decision tree obtained one of the lowest values.
The evaluation of the important attributes that may affect student performance
could improve the quality of the higher education system [11–14]. Radaideh et al.
[15] presented a classification model implemented with the ID3 and C4.5 decision
tree algorithms and with Naive Bayes. The classification accuracy of the three
algorithms was not very high; to generate a high-quality classification model, it is
necessary to add enough attributes. In a similar study, Yudkselturk et al. [16]
examined the prediction of dropout in online academic programs. To classify students
who dropped out, three mining techniques were applied: decision tree, Naive Bayes, and
neural network. These algorithms were trained and evaluated using a cross-validation
technique. In addition, Pal [17] presented a data mining application to generate
predictive models taking into account the records of the students of the first
period. The decision tree is used for validation and training to find the best classifier
to predict students who dropped out.
Bhise et al. [18] studied the evaluation factors of students to improve performance,
using a grouping technique based on the K-Means algorithm to characterize
the student population. Moreover, Erdogan and Timor [19] studied the
relationship between university students’ entrance examination results and their
academic success. The study was carried out using cluster analysis
and the K-Means algorithm. Bhardwaj and Bhardwaj [20] presented the application of
data mining in the environment of engineering education, examining the relationship
between the university and the results obtained by students through K-Means
algorithm techniques.

3 Data Analysis and Modeling

This section focuses on the understanding of the data, where visualization techniques
such as histograms are applied in order to perform a preliminary exploration of the
records and verify the quality of the data. Once the analysis is done, we proceed with
the data preparation phase, which includes the tasks of selecting the data to which
the modeling techniques will be applied for their respective analysis.
The first task is collecting the initial data. The objective of this task is to obtain
the data sources from the academic information system of the University. The first
dataset groups the socioeconomic information and the results of the admission tests
(Language, English, Mathematics, and Logic). The second dataset is made up
of the academic and grading history obtained by the students: the academic year
and period of the student’s admission; the program in which he/she is enrolled; the
student’s academic situation (academic blocking due to low academic performance
or no academic blocking); and the number of academic credits registered, approved,
lost, canceled, and failed. The queries were run on the PostgreSQL
database management system. The two datasets were then concatenated,
yielding a flat file with 55 attributes and 1665 records of students admitted
and enrolled in the systems and electronics engineering programs.
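The concatenation step can be pictured as a join of the two datasets on a shared student key. The following is only a minimal sketch: the paper does not describe how the join was done, so the key name `student_id`, the column names, and the two sample records are hypothetical, and the sketch assumes one academic row per student.

```python
import csv
import io


def join_records(admission_rows, academic_rows, key="student_id"):
    """Inner-join two lists of dicts on `key`, one flat record per student."""
    academic_by_id = {row[key]: row for row in academic_rows}
    flat = []
    for row in admission_rows:
        match = academic_by_id.get(row[key])
        if match is not None:
            merged = dict(row)
            # Add the academic attributes, skipping the duplicated key column.
            merged.update({k: v for k, v in match.items() if k != key})
            flat.append(merged)
    return flat


def to_csv(rows):
    """Serialize the joined records as CSV text (the flat file)."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample rows; the real attributes are not listed in the paper.
admission = [
    {"student_id": "s1", "math_score": 70, "stratum": 2},
    {"student_id": "s2", "math_score": 55, "stratum": 1},
]
academic = [
    {"student_id": "s1", "credits_approved": 16, "academic_block": "no"},
    {"student_id": "s2", "credits_approved": 8, "academic_block": "yes"},
]
flat_file = join_records(admission, academic)
```

Calling `to_csv(flat_file)` then gives one header line followed by one line per student, which is the shape of file a tool such as RapidMiner can import.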
The next task is data exploration. Exploratory analysis allows detailed
analysis of some variables and the identification of their characteristics; for this,
visualization tools such as tables and graphs were used, with the purpose of describing
the data mining objectives of the comprehension phase.
The task of checking the quality of the data reviews the data for lost values
and for values that are missing because of coding errors. In this section, the
quality of the data corresponding to the socioeconomic information of the admitted
students is verified.
The next task is data selection. In this task, the data relevant
to the data mining objectives are selected. A first pre-processing step
for the final selection of the data is the selection of attributes. The dataset
has 55 attributes or variables that contain values that may or
may not contribute to the study; this assessment is based on the initial exploration
of the data and on the description of the fields defined in the variable dictionary.
In the dataset selected for the modeling, no errors were found in the fields. The
problems found in some records were missing values, caused by inadequate processing
at the time of data entry, in fields such as email,
residence address, telephone number, date of birth, blood type, and ethnicity, which
are attributes considered not relevant to the case under study.
To develop the model, the RapidMiner application for machine learning,
analysis, and data mining was used; this program allows the development of data analysis
processes by linking operators in a graphical environment. For the
implementation of the algorithm, the K-Means operator of the grouping and segmentation
library was used, with the Euclidean distance used to evaluate the quality of the
groups found. The algorithm handles both numerical and categorical values.
However, additional pre-processing was performed to normalize all the numerical
attributes between 0 and 1 with the normalize operator. All attributes must have the
same scale for a fair comparison between them.

Fig. 1 Selection of the Group number (K) for students admitted

Table 1 Distribution of the number of records per group after applying the K-means algorithm
Measure             Group 0   Group 1   Group 2   Group 3   Group 4
Number of records   317       130       418       389       409
Percentage (%)      19        8         25        23        25

A grouping model was applied to the dataset to characterize the admitted
students, create the different profiles of the students in the groups found,
and determine what other factors define the separation of groups produced by the
K-Means algorithm.
Repeated iterations were performed to determine the value of K, the number
of groups. The value of K varied from 2 to 14. The results were evaluated based on
the sum of squared errors (SSE) of each iteration; for the selection of the number of
groups, the elbow method was used.
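The search over K can be sketched as follows. This is an illustrative stand-in, not the RapidMiner K-Means operator used in the study: it is a plain Lloyd's-algorithm implementation in pure Python, the helper names are hypothetical, and the two-dimensional toy points are generated only to make the example runnable. For each K it records the SSE; the elbow is the K after which the SSE stops dropping sharply.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, sse)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                new_centroids.append(centroids[i])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    sse = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, sse


def sse_curve(points, k_values, restarts=5):
    """Best SSE over a few random restarts for each candidate K."""
    return {k: min(k_means(points, k, seed=s)[1] for s in range(restarts)) for k in k_values}


# Toy data (an assumption): three blobs of already-normalized 2-D points.
rng = random.Random(42)
toy_points = [
    (cx + rng.uniform(-0.05, 0.05), cy + rng.uniform(-0.05, 0.05))
    for cx, cy in [(0.1, 0.1), (0.5, 0.9), (0.9, 0.2)]
    for _ in range(20)
]
curve = sse_curve(toy_points, range(2, 15))  # K from 2 to 14, as in the text
for k, sse in curve.items():
    print(k, round(sse, 4))
```

Plotting `curve` (K on the horizontal axis, SSE on the vertical axis) gives the kind of chart from which the number of groups is read off with the elbow method.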
Figure 1 shows the iterations performed to find the value of K in the first dataset of
the admitted students. The value K = 5 was selected, where the SSE is equal to
7954. The K-Means algorithm produced a model with five groups; the description
of these groups is expected to characterize the profiles of admitted students.
Table 1 shows the distribution of the number of records and the percentage of each
of the resulting groups. Groups 2 and 4 contain the largest number of records; in
contrast, the lowest percentage of records is in group 1.
It was necessary to “de-normalize” the model, that is, to put each of the values of
the variables back in its original range. The analysis of the model was made with the
socioeconomic information and the results of the admission tests; then, the
academic situation of the students who had an academic block within four
enrollments was analyzed in each group.
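Normalizing to the range 0 to 1 and later de-normalizing are inverse operations. A minimal sketch, assuming min-max scaling (the usual meaning of normalizing between 0 and 1; the paper does not give the formula) and using made-up test scores:

```python
def fit_min_max(values):
    """Return the (minimum, maximum) needed to scale and later restore a column."""
    return min(values), max(values)


def normalize(values, lo, hi):
    """Map values linearly so that lo -> 0 and hi -> 1."""
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


def denormalize(scaled, lo, hi):
    """Inverse of normalize: map values in [0, 1] back to the original range."""
    return [s * (hi - lo) + lo for s in scaled]


# Hypothetical admission-test scores for one attribute.
scores = [35.0, 50.0, 80.0, 95.0]
lo, hi = fit_min_max(scores)
scaled = normalize(scores, lo, hi)       # values used for clustering
restored = denormalize(scaled, lo, hi)   # values used to describe the groups
```

The same `lo` and `hi` must be kept for each attribute, since cluster centroids computed on the scaled data can only be read in original units after applying `denormalize` with those bounds.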

Table 2 Distribution of the number of students with four enrollments

Group               Enrollment 1   Enrollment 2   Enrollment 3   Enrollment 4   Total
Group 0             32             13             11             10             66
Group 1             127            63             47             29             266
Group 2             178            58             72             23             331
Group 3             160            90             48             41             339
Group 4             66             32             45             52             195
Number of records   563            256            223            155            1197
Percentage (%)      47             21             19             13             100

Table 3 Academic status of the students in each group in the first four enrollments (values in %)

Group     Enrollment 1             Enrollment 2             Enrollment 3             Enrollment 4
Status    No block   Acad. block   No block   Acad. block   No block   Acad. block   No block   Acad. block
Group 0   23         26            18         2             14         3             15         0
Group 1   17         30            13         11            15         2             9          2
Group 2   43         11            15         3             22         0             7          0
Group 3   22         26            21         6             14         0             11         1
Group 4   16         17            12         5             23         0             26         1

Table 2 shows the distribution of the number of records in each of the groups in
the first four semesters or academic enrollments. Groups 2 and 3 contain
the largest number of records, with 28% in each group. Group 0, on
the other hand, has the smallest number of records, with 6%. Of the records, 47%
are students in the first semester, 21% are in the second enrollment, 19% in the third
enrollment, and the remaining 13% have four academic enrollments.
Table 3 shows the academic status of the students in each group in the first four
enrollments. Group 1 has the highest percentage of students
with academic block, in contrast to group 2, which has the lowest percentage
of students with academic block. Group 0 is characterized by good performance
in the admission tests; 26% of its students have a block in the first enrollment,
2% in the second, and 3% in the third enrollment. Group 1 groups the students with
the lowest performance in the admission tests, similar to group 2; 30% of its students
have a block in the first enrollment, 11% in the second, and 2% in each of the third
and fourth enrollments.
Group 2 is characterized by grouping the students with the lowest performance in
the admission tests and the smallest number of students with blocks. About 11% of the
students have a block in the first enrollment and 3% in the second enrollment. Group
3 is characterized by grouping the students with good performance in the admission
tests, similar to group 0; 26% of its students have a block in the first enrollment,
6% in the second, and 1% in the fourth. Finally, group 4 is characterized by
grouping the smallest number of students with blocks. About 17% of students with one
enrollment have a block, 5% correspond to students with two enrollments, and 1%
to students with four enrollments.

4 Result and Analysis

In this section, two data mining models for analyzing the academic and non-academic
data of the students are presented. The models used two classification techniques,
decision tree and Naïve Bayes, in order to predict the loss of academic status due
to low academic performance. The historical academic records and the
data collected during the admission process were used to train the models, which
were evaluated using cross-validation.
Table 4 presents, for each academic period, the total number of students with first
enrollment and the number of students with academic block due to underperformance.
Table 5 shows the number of students with academic block in each period or
enrollment. The largest number of students with academic block occurs in the
first enrollment. The second, third, and fourth enrollments show a decrease in the
number of students with blocks. The 2010–1 entry period had the highest number of
students with academic blocks in each academic enrollment.
The classification model proposed in this research uses the socioeconomic infor-
mation. The classification model uses two widely used techniques, decision trees
and a Bayesian classifier. The reason for selecting these algorithms is their great
simplicity and interpretability.
The decision tree is the first technique used to classify the data. The algorithm
builds the tree recursively, using the criterion of the highest
information gain ratio; that is, at each step it chooses the attribute that best
classifies the data. An instance is classified by following the path of conditions
from the root to a leaf, which corresponds to a class label. A decision tree can
easily be converted into a set of classification rules. The most representative algorithm
is C4.5, which handles both categorical and continuous attributes.
The root node is the attribute whose gain is maximum.
C4.5 uses pessimistic pruning to eliminate unnecessary branches in the
decision tree and to improve classification accuracy.
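The attribute-selection criterion can be illustrated with a short sketch. It computes entropy, information gain, and the gain ratio commonly associated with C4.5 for categorical attributes; it is not the RapidMiner decision tree operator, and the attribute names and the four sample rows are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in target entropy obtained by splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def gain_ratio(rows, attribute, target):
    """Information gain divided by the entropy of the split itself."""
    split_info = entropy([r[attribute] for r in rows])
    if split_info == 0:
        return 0.0  # attribute has a single value: useless split
    return information_gain(rows, attribute, target) / split_info


def best_attribute(rows, attributes, target):
    """Attribute with the highest gain ratio; it becomes the next tree node."""
    return max(attributes, key=lambda a: gain_ratio(rows, a, target))


# Hypothetical records: math level separates the classes, gender does not.
rows = [
    {"math_level": "low", "gender": "f", "status": "block"},
    {"math_level": "low", "gender": "m", "status": "block"},
    {"math_level": "high", "gender": "f", "status": "no_block"},
    {"math_level": "high", "gender": "m", "status": "no_block"},
]
root = best_attribute(rows, ["math_level", "gender"], "status")
```

Applying `best_attribute` recursively to each subset of rows, and stopping when a subset is pure or a depth limit is reached, yields the tree; pruning is a separate step not shown here.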

Table 4 Registration number and academic blocks per academic period

Academic condition   2010–1  2010–2  2011–1  2011–2  2012–1  2012–2  2013–1  2013–2  2014–1  2014–2
No block             115     145     137     119     100     146     131     151     110     166
Acad. block          70      32      49      30      31      22      25      35      27      23
Total records        185     177     186     149     131     168     156     186     137     189

Table 5 Academic block by entry period or first enrollment
Academic block, by academic period (columns) and entry period (rows)

Entry period   2010–1  2010–2  2011–1  2011–2  2012–1  2012–2  2013–1  2013–2  2014–1  2014–2
2010–1         40      15      7       5       0       1       2       0       0       0
2010–2                 28      0       0       3       0       0       0       0       0
2011–1                         39      6       2       2       0       0       0       0
2011–2                                 22      8       0       0       0       0       0
2012–1                                         20      11      0       0       0       0
2012–2                                                 14      8       0       0       0
2013–1                                                         15      10      0       0
2013–2                                                                 28      7       0
2014–1                                                                         26      1
2014–2                                                                                 23

Table 6 Academic situation in the first four enrollments

Academic situation   Enrollment 1   Enrollment 2   Enrollment 3   Enrollment 4
No block             309            190            214            145
Acad. block          255            66             9              10
Total records        564            256            223            155

Table 7 Test, training and validation dataset (80% training and validation data, 20% test data)

Enrollment     Total records   Train/val. no block   Train/val. acad. block   Test no block   Test acad. block
Enrollment 1   564             247                   204                      62              51
Enrollment 2   256             152                   53                       38              13
Enrollment 3   223             171                   7                        43              2
Enrollment 4   155             116                   8                        29              2

The second technique considered for the construction of the model is a
Bayesian classifier. It is one of the most effective classification models. Bayesian
classifiers are based on Bayesian networks; these are probabilistic graphical models
that allow modeling, in a simple and precise way, the probability distribution
underlying a dataset. Bayesian networks are graphical representations of dependency and
independence relationships between the variables present in the dataset that facilitate
the understanding and interpretability of the model. Numerous algorithms have been
proposed to estimate these probabilities. Naive Bayes is one of the most widely used
practical learning algorithms because of its simplicity, resistance to noise, short
processing time, and high predictive power.
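A Naive Bayes classifier for categorical attributes can be sketched in a few lines. This is a generic illustration with Laplace smoothing, not the RapidMiner operator; the class name, the attribute names, and the five training records are hypothetical.

```python
from collections import Counter, defaultdict
from math import log


class CategoricalNaiveBayes:
    """Naive Bayes over categorical attributes with Laplace smoothing."""

    def fit(self, rows, labels, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter(labels)
        self.value_counts = defaultdict(Counter)  # (attribute, class) -> value counts
        self.domains = defaultdict(set)           # attribute -> values seen in training
        for row, label in zip(rows, labels):
            for attribute, value in row.items():
                self.value_counts[(attribute, label)][value] += 1
                self.domains[attribute].add(value)
        return self

    def log_scores(self, row):
        """Unnormalized log posterior of each class for one record."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            score = log(count / total)  # class prior
            for attribute, value in row.items():
                seen = self.value_counts[(attribute, label)][value]
                domain_size = len(self.domains[attribute])
                # Attributes are treated as independent given the class.
                score += log((seen + self.alpha) / (count + self.alpha * domain_size))
            scores[label] = score
        return scores

    def predict(self, row):
        scores = self.log_scores(row)
        return max(scores, key=scores.get)


# Hypothetical training records.
train_rows = [
    {"math": "low", "works": "yes"},
    {"math": "low", "works": "no"},
    {"math": "high", "works": "no"},
    {"math": "high", "works": "yes"},
    {"math": "high", "works": "no"},
]
train_labels = ["block", "block", "no_block", "no_block", "no_block"]
model = CategoricalNaiveBayes().fit(train_rows, train_labels)
prediction = model.predict({"math": "low", "works": "yes"})
```

For the record `{"math": "low", "works": "yes"}` the smoothed scores are 0.4 × 0.75 × 0.5 = 0.15 for `block` and 0.6 × 0.2 × 0.4 = 0.048 for `no_block`, so the sketch predicts `block`.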
Different models were trained and tested to predict if a student will be blocked
in a particular enrollment. The first model analyzed the loss of academic status
based on socioeconomic information and the results of the tests collected during the
admission process. The second model was analyzed with the initial information of the
enrollment process and the academic records of the first four registrations. Table 6
describes the number of registrations in the first four enrollments with academic
status (No Block and Academic Block).
For the design of the model, the RapidMiner application was used; this is a program
for machine learning and data mining with a modular concept, which
allows the design of learning models by chaining operators for various problems. For
the validation of the classification model, a stratified sampling technique was used.
The dataset was partitioned with the split data operator; this operator partitions
the dataset into subsets according to the defined size and the selected technique. For
the implementation of the algorithms, the decision tree operator and the
Naive Bayes operator were used. Table 7 shows the number of records
in the first four enrollments; 80% of the records were taken as the training set,
evaluated with 10-fold cross-validation, and 20% of the sample was used as a test set.
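A stratified 80/20 split keeps the class proportions the same in both partitions. The sketch below is a generic illustration rather than the split data operator itself; the function name is hypothetical, and the class sizes (309 records with no block, 255 with academic block) are the first-enrollment figures from Table 6.

```python
import random
from collections import defaultdict


def stratified_split(records, label_of, train_fraction=0.8, seed=0):
    """Split records so each class contributes train_fraction of its members."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for record in records:
        by_class[label_of(record)].append(record)
    train, test = [], []
    for members in by_class.values():
        members = members[:]          # do not reorder the caller's list
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Placeholder records carrying only a class label and an index.
records = [("no_block", i) for i in range(309)] + [("acad_block", i) for i in range(255)]
train, test = stratified_split(records, label_of=lambda r: r[0])

train_counts = {c: sum(1 for r in train if r[0] == c) for c in ("no_block", "acad_block")}
test_counts = {c: sum(1 for r in test if r[0] == c) for c in ("no_block", "acad_block")}
```

With this rounding the split gives 247 and 204 training records and 62 and 51 test records, which agrees with the Enrollment 1 row of Table 7.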
To estimate the performance of the model, the X-Validation operator was used.
This operator defines a 10-fold cross-validation process on the input
dataset to evaluate the learning algorithm. The performance of the model was
measured with the performance binomial classification operator. This operator presents
the performance results of the algorithm in terms of accuracy, precision, recall, error,
and ROC curve. To analyze the errors generated by a classification model, the
confusion matrix is used. It is a visualization tool that is used in supervised learning.
Each column of the matrix represents the number of predictions of each class, while
each row represents the instances in the real class.

Table 8 Prediction model of the loss of academic condition with the training and validation dataset
Prediction      Decision tree                             Naïve Bayes
Measure         Enrollment 2  Enrollment 3  Enrollment 4  Enrollment 2  Enrollment 3  Enrollment 4
F-measure       0             0             30.77%        0             53.41%        38.51%
Precision       0             0             33.33%        0.00%         56.27%        44.17%
Recall          0.00%         0.00%         28.57%        0.00%         51.45%        36.00%
Accuracy        54.76%        74.14%        94.93%        92.76%        59.43%        69.81%
Error           45.24%        25.86%        5.07%         7.24%         40.57%        30.19%
AUC             0.5           0.5           0             0             0.608         0.63
Kappa           0.0           0.0           0.282         −0.015        0.177         0.19
Specificity     100%          100%          97.61%        99.23%        66.02%        81.65%
Sensitivity     0.00%         0.00%         28.57%        0.00%         51.45%        36.00%
False positive  0%            0%            2%            1%            19%           14%
False negative  45%           26%           3%            6%            22%           17%

The following measurements are calculated during the experiment: accuracy,
classification error, exhaustiveness (recall), precision, F-measure, specificity, sensitivity,
false-negative rate, false-positive rate, and area under the curve (AUC).
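Most of these measures follow directly from the four cells of a binary confusion matrix. The sketch below uses the standard definitions, taking academic block as the positive class; the function name is hypothetical, kappa is included because the result tables report it, and AUC is omitted because it needs the classifier's scores rather than the matrix. The example counts are an inference, not figures stated in the paper: they are the counts implied by the 38 no-block and 13 academic-block test records of Table 7 together with the sensitivity and specificity reported for the decision tree in the second enrollment in Table 10.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard measures from a binary confusion matrix (positive = academic block)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # also called sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    # Cohen's kappa: agreement beyond what the marginal totals would give by chance.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0
    return {
        "accuracy": accuracy,
        "error": 1 - accuracy,
        "precision": precision,
        "recall": recall,
        "sensitivity": recall,
        "specificity": specificity,
        "f_measure": f_measure,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
        "kappa": kappa,
    }


# Inferred counts: 13 blocked students all detected, 9 of 38 unblocked students flagged.
metrics = binary_metrics(tp=13, fp=9, fn=0, tn=29)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

These counts reproduce the precision (59.09%), F-measure (74.29%), specificity (76.32%), and kappa (0.622) reported for that model, and they imply an accuracy of 42/51 ≈ 82.35% with an error of ≈ 17.65%. The false-positive and false-negative rates here are the standard per-class rates; the tables appear to report these errors as a share of all records instead (9/51 ≈ 18%).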
In this stage, different models were trained and tested to classify students with
academic block in the first four academic enrollments, using the socioeconomic
information. For the configuration of the experiments, we used 10-fold cross-validation
to train the models, and for the evaluation of the models we used the test dataset:
80% of the sample was used as training and validation
data, and 20% was used as a test set. In the decision tree technique
with training and validation data, the tree depth was varied from 1 to 20; the lowest
classification error was found at depth 3, where the error begins to show some stability
in each of the four academic periods. Finally, the trained and validated models were
evaluated with the test dataset.
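The experimental loop, 10-fold cross-validation repeated for each tree depth from 1 to 20, can be sketched as follows. The functions are hypothetical helpers, not RapidMiner operators, and `error_for_depth` is a caller-supplied callback standing in for "train a tree of this depth on the training indices and return its error on the validation indices". The record count 451 is the number of training and validation records for Enrollment 1 in Table 7 (247 + 204).

```python
import random


def k_fold_indices(n_records, k=10, seed=0):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal, disjoint folds
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


def cross_validated_error(n_records, evaluate, k=10, seed=0):
    """Average the error returned by evaluate(train, validation) over the folds."""
    errors = [evaluate(train, val) for train, val in k_fold_indices(n_records, k, seed)]
    return sum(errors) / len(errors)


def select_depth(n_records, error_for_depth, depths=range(1, 21)):
    """Return (best_depth, {depth: error}) for a caller-supplied error_for_depth."""
    scores = {
        depth: cross_validated_error(
            n_records, lambda train, val, d=depth: error_for_depth(d, train, val)
        )
        for depth in depths
    }
    return min(scores, key=scores.get), scores


splits = list(k_fold_indices(451))
fold_sizes = sorted(len(validation) for _, validation in splits)
```

Each record lands in exactly one validation fold, so with 451 records nine folds hold 45 records and one holds 46. The test set plays no part in this loop; it is used once, after the depth has been chosen.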
Table 8 presents the results of the prediction model of the loss of the academic
condition with training and validation data, comparing the different classification
techniques in terms of the different performance parameters.
Analyzing the results of the training and validation dataset with the
information from the admission process, it is observed that the Bayesian classifier
presents the best accuracy in terms of academic block records that were correctly
classified. In the third enrollment, it increased by 7% with respect to the decision
tree. Similarly, reviewing the area under the curve (AUC), the decision tree in the
first and second enrollment shows a poor performance below 0.5. The Naive Bayes
algorithm presents the highest percentage of cases with no academic block that
were classified incorrectly as academic block. The decision tree presents the highest
proportion of cases with academic block that were classified incorrectly as no
academic block.

Table 9 Prediction model of the loss of the academic condition using the training and validation data
Prediction      Decision tree                             Naïve Bayes
Measure         Enrollment 2  Enrollment 3  Enrollment 4  Enrollment 2  Enrollment 3  Enrollment 4
F-measure       74.42%        0             11.11%        71.00%        41.67%        40.00%
Precision       66.40%        0.00          10.00%        60.75%        29.41%        33.33%
Recall          86.67%        0.00%         12.50%        87.00%        71.43%        50.00%
Accuracy        84.45%        94.31%        87.05%        81.02%        92.06%        90.19%
Error           15.55%        5.69%         12.95%        18.98%        25.86%        9.81%
AUC             0.851         0             0             0.912         0             0
Kappa           0.637         −0.224        0.042         0.578         0.295         0.350
Specificity     83.58%        98.20%        92.12%        78.90%        92.93%        92.94%
Sensitivity     86.67%        0.00%         12.50%        87.00%        71.43%        50.00%
False positive  13%           2%            7%            16%           7%            6%
False negative  1%            4%            6%            3%            1%            3%

Table 9 presents the results of the prediction model of the loss of the
academic condition with the information from the admission process and the
academic record of the previous semester, using the training and validation data; the
different classification techniques are compared in terms of different performance
parameters.
Analyzing the results of the training and validation dataset, we observe how the
decision tree increased its level of accuracy in the second and fourth enrollment. The
Bayesian classifier increased the accuracy of records with academic blocks that were
correctly classified. Similarly, by reviewing the area under the curve (AUC), both
algorithms in the second enrollment have a good performance above 0.7.
Table 10 presents the results of the prediction model of the loss of the
academic condition with the information from the admission process and the
academic record of the previous semester, using the test data; the different
classification techniques are compared in terms of performance parameters.
Analyzing the results of the test dataset, we observe that the decision tree presents
the highest number of predictions with academic block that were correctly classified
in the second enrollment. Likewise, reviewing the area under the curve
(AUC), the Naive Bayes algorithm presents a good performance, with an area greater
than 0.9, in comparison with the decision tree algorithm.

Table 10 Prediction model of the loss of academic condition using the test data
Prediction      Decision tree                             Naïve Bayes
Measure         Enrollment 2  Enrollment 3  Enrollment 4  Enrollment 2  Enrollment 3  Enrollment 4
F-measure       74.29%        0             0             70.27%        0%            0%
Precision       59.09%        0.00%         0.00%         54.17%        0.00%         0.00%
Recall          100%          0.00%         0.00%         100%          0.00%         0.00%
Accuracy        82.35%        95.56%        93.55%        78.43%        93.33%        90.32%
Error           17.65%        4.44%         6.45%         21.57%        6.67%         9.68%
AUC             0.882         0.500         0.534         0.913         0.907         0.828
Kappa           0.622         0.00          0.000         0.556         −0.031        −0.045
Specificity     76.32%        100%          100%          71.05%        97.67%        96.55%
Sensitivity     100%          0.00%         0.00%         100%          0.00%         0.00%
False positive  18%           0%            0%            22%           2%            6%
False negative  0%            4%            6%            0%            4%            3%

5 Conclusion

In recent years, there has been great interest in data analysis in educational institu-
tions, in which high volumes of data are generated, given the new techniques and
tools that allow an understanding of the data. For this research, a set of data was
compiled from the database of the “X” University with socioeconomic information
and the academic record of the previous enrollment, for the training and validation
of the descriptive and predictive models.
The objective of applying the K-Means algorithm in the descriptive
model was to analyze the student population of the university to identify similar
characteristics among the groups. It was interesting to establish that some initial
socioeconomic characteristics made it possible to define some profiles or groups. In
the evaluation of the model, it was observed that the student’s socioeconomic information
affects the results of their academic performance, showing that the groups with the
highest academic performance in the knowledge test results were found in the schools
with low socioeconomic status.
The classification model presented in this paper analyzed the socioeconomic
information and the academic record of the student’s previous enrollment. With the test
data, the decision tree algorithm performed better than the Naive Bayes algorithm when
the academic record of the previous semester was added.
The analysis of the data shows that there are different types of performance
according to the student’s socioeconomic profile and academic record, demonstrating
that it is feasible to make predictions and that this research can be a very useful
tool for decision making.
This research can support decision making by the permanency and graduation program
of the University, and it can serve as a starting point for future data mining
research in education. To improve the performance of the model, other sources of
data should be integrated, such as information about the student from the senior
year of high school, before entering the university.

References

1. Romero, C., Ventura, S.: Educational data mining: a survey from 1995 to 2005. Expert Syst.
Appl. 33(1), 135–146 (2007)
2. Chapman, P.: CRISP-DM 1.0: Step-by-Step Data Mining Guide. SPSS, New York (2000)
3. Khamis, A., Xu, Y., Mohamed, A.: Comparative study in determining features extraction for
islanding detection using data mining techniques: correlation and coefficient analysis. Int. J.
Electr. Comput. Eng. (IJECE) 7(3), 1112–1134 (2017)
4. Araque, F., Roldan, C., Salguero, A.: Factors influencing university drop-out rates. Comput.
Educ. 53(3), 563–574 (2009)
5. Kotsiantis, S., Pierrakeas, C., Pintelas, P.: Preventing student dropout in distance learning
systems using machine learning techniques. In: Proceedings of the 7th International Con-
ference Knowledge-Based Intelligent Information and Engineering System (KES), Oxford,
pp. 267–274 (2003)
6. Kuna, H., Garcia-Martinez, R., Villatoro, F.: Pattern discovery in university students desertion
based on data mining. In: Proceedings of the IV Meeting on Dynamics of Social and Economic
Systems, Buenos Aires, pp. 275–285 (2009)
7. Kovacic, Z.: Predicting student success by mining enrolment data. Res. Higher Educ. J. 15,
1–20 (2012)
8. Yadav, S., Bharadwaj, B., Pal, S.: Mining educational data to predict student’s retention: a
comparative study. Int. J. Comput. Sci. Inf. Secur. (IJCSIS) 10(2), 113–117 (2012)
9. Quadri, M., Kalyankar, N.: Drop out feature of student data for academic performance using
decision tree techniques. Global J. Comput. Sci. Technol. 10(2), 2–5 (2010)
10. Zhang, Y., Oussena, S., Clark, T., Kim, H.: Use data mining to improve student retention
in higher education—a case study. In: Proceedings of the 12th International Conference on
Enterprise Information Systems, pp. 190–197 (2010)
11. Santoso, L., Yulia.: Predicting student performance using data mining. In: Proceedings of the
5th International Conference on Communication and Computer Engineering (ICOCOE) (2018)
12. Rao, M., Gurram, D., Vadde, S., Tallam, S., Chand, N., Kiran, L.: A predictive model for
mining opinions of an educational database using neural networks. Int. J. Electr. Comput. Eng.
(IJECE) 5(5), 1158–1163 (2015)
13. Santoso, L., Yulia.: Data warehouse with big data technology for higher education. Proc.
Comput. Sci. 124(1), 93–99 (2017)
14. Santoso, L., Yulia.: Analysis of the impact of information technology investments—a survey
of Indonesian universities. ARPN JEAS 9(12), 2404–2410 (2014)
15. Al-Radaideh, Q., Al-Shawakfa, E., Al-Najjar, M.: Mining student data using decision trees. In:
Proceedings of the 2006 International Conference on Information Technology (ACIT), pp. 1–5
(2006)
16. Yukselturk, E., Ozekes, S., Turel, Y.: Predicting dropout student: an application of data mining
methods in an online education program. Eur. J. Open Distance E-Learning 17(1), 118–133
(2014)
17. Pal, S.: Mining educational data to reduce dropout rates of engineering students. Int. J. Inf.
Eng. Electron. Bus. 4(2), 1–7 (2012)
18. Bhise, R., Thorat, S., Supekar, A.: Importance of data mining in higher education system.
IOSR J. Humanit. Soc. Sci. 6(6), 18–21 (2013)
19. Erdogan, S., Timor, M.: A data mining application in a student database. J. Aeronaut. Space
Technol. 2(2), 53–57 (2005)
20. Bhardwaj, A., Bhardwadj, A.: Modified K-means clustering algorithm for data mining in edu-
cation domain. Int. J. Adv. Res. Comput. Sci. Softw. Eng. 3(11), 1283–1286 (2013)
