Interview Bot Development with Natural Language Processing and Machine Learning
http://ijtech.eng.ui.ac.id
Abstract. Interviews for competency assessment play an essential role in Human Resource
Management practices. However, the traditional competency interview process requires considerable
time and cost and often involves face-to-face meetings that may endanger both interviewers and
interviewees during a pandemic. This study presents the development of an interview bot
that identifies competencies based on the Behavioural Event Interview (BEI) method by using artificial
intelligence technology. The bot automates the interview process to explore a person’s
competency levels based on past behavioural experiences. The development of the interview bot
involved two main activities. The first is the data training process, which develops learning models that
determine competency levels from valid participant responses. The second is the
testing and evaluation of the model for assessing competency levels. We found that our
method can predict a person's competency levels from their responses with acceptable
accuracy. The interview bot is a valuable and reliable tool for conducting
online interviews and supporting the assessment centre process, especially under physical
and social distancing constraints. It provides flexibility in terms of time and place for participants,
and the interview is delivered in the Indonesian language. The interview bot is more cost-efficient than
traditional interviews that use the same behavioural event interview method, and it may also be
preferable for millennials.
Keywords: Artificial intelligence; Behavioural event interview; Chat bot; Interview bot; Machine
learning
1. Introduction
Recognizing the importance of competencies for competitive advantages, the
Government of Indonesia (GOI), as a policymaker, issued regulations that encourage both
governmental and private business organizations to increase employee competencies. One
of the government’s regulations is the Decree of the Employment Minister of the Republic
of Indonesia, Number 2 of 2016, concerning the National Work Competency
Standardization System. The statute contains a comprehensive and synergic arrangement
of national work competency standards intended to improve Indonesian human resources
competencies. With the issuance of the regulation, Indonesian workers must meet the
established competency standards to be able to work in an organization. Therefore,
*Corresponding author’s email: j.siswanto@ti.itb.ac.id, Tel.: +62-22-2508149; Fax: +62-22-2508149
doi: 10.14716/ijtech.v13i2.5018
Siswanto et al. 275
and Google Assistant. A chatbot has several advantages, including ease of access, efficiency,
availability, scalability, cost, and insight. Chatbot technology has been applied in various
fields, such as handling e-commerce queries (Pricilla et al., 2018), web shopping helpers,
hotel reservation agents, and FAQ agents (Siddig & Hines, 2019), and various digital
consumers (Rese et al., 2020). However, chatbot applications for supporting HRM practices
remain underdeveloped.
A chatbot architecture can be further developed to have an information retrieval
function and interactively “generate” questions by applying AI technology. This may work
in two ways and can support artificial interviews (Suakanto et al., 2021). AI technology
includes machine learning, deep learning, neural networks, and natural language
processing (NLP). Cowgill (2018) used machine learning for hiring white-collar workers.
The challenge of developing AI and machine learning for HRM is related to the number of
data sets, which tends to be relatively small by data science standards (Tambe et al., 2019).
As a branch of AI, NLP has been employed in human interview systems. An interactive
interview bot system based on NLP was developed to conduct interviews and generate
results automatically (Yakkundi et al., 2019). One of the critical benefits of NLP is its ability
to process and understand unstructured text data automatically.
Conducting competency assessments using interview bots provides many advantages.
The interview process is conducted with prospective job applicants or employees, who will
be assessed for competencies by bots that have been designed to present adaptive
multilevel interview questions and have the ability to analyse the initial competency level
from the answers given. It is expected that an interview bot will consistently assess
competency levels to reduce interviewers’ subjective bias. Interview bots may also provide
a suitable interface for millennials, who prefer to interact with the help of intelligent
computer applications. Moreover, during the pandemic, in which face-to-face interviews
need to be avoided or at least minimized, interview bots will substantially contribute to
preventing the spread of viruses through face-to-face interview processes. In addition,
interview bots will allow companies to increase their assessment capacity and reduce
interview costs. This study presents an interview bot application using the BEI method for
interview text in the Indonesian language that helps organizations and companies assess
competency levels more accurately and efficiently. This research focuses on developing a
text-based interview bot algorithm for competency assessment and evaluating its
performance.
responding to questions. The interviewee also needs to answer several questions for
validation purposes in this introduction. Finally, when the system determines that the
interviewee is registered and valid, the process continues; otherwise, they need to return
to the registration process or report to the HR department. The second stage is the
competency assessment process. The interview bot prepares open questions based on the
situation, task, actions, and results (STAR) structure. Once each question has been fully
answered, the system records and analyzes the response and assigns a competency level according
to the learned model. This process is repeated until all competencies to be measured
have been assessed in line with the competency dictionary provided. The final
stage is validating the results and closing the session. Interviewees will be shown the
descriptions of their competency levels based on the interview bot’s diagnosis. When the
interviewees agree with the results, the final competency levels will be recorded.
Otherwise, interviewees’ concerns will be recorded and reanalyzed by the system or
provided to assessors for further manual analyses. At the closing of the session, the system
will inform the interviewee that any information provided during the assessment is kept
confidential and solely used for official purposes. No data can be read or copied by an
unauthorized person. The simplified flow design for the interview bot is shown in Figure 1.
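The three stages described above can be summarised as a simple control flow. The following Python sketch is illustrative only: the function name `run_session`, its parameters, and the step labels are hypothetical and are not taken from the implemented system.

```python
def run_session(is_registered, competencies, agrees_with_result):
    """Return the ordered steps of one interview session (hypothetical labels)."""
    steps = ["introduction"]

    # Stage 1: an unregistered or invalid interviewee leaves the flow.
    if not is_registered:
        steps.append("return_to_registration_or_hr")
        return steps

    # Stage 2: one STAR-based assessment per competency in the dictionary.
    for name in competencies:
        steps.append(f"assess:{name}")

    # Stage 3: show the diagnosis, then record either the result or the concern.
    steps.append("show_results")
    if agrees_with_result:
        steps.append("record_final_levels")
    else:
        steps.append("record_concerns_for_reanalysis")
    steps.append("close_session")
    return steps


print(run_session(True, ["Teamwork", "Integrity"], agrees_with_result=False))
```

Representing the session as an ordered list of steps keeps the branching points (validation failure, disagreement with the result) explicit and easy to test.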
frequently used in NLP. Then, Bayesian inference, one of the most popular statistical
techniques, is applied. With Bayesian inference, the prior probabilities of an event are
updated when new data are gathered. The applied data training method is presented in
Figure 2.
In the next stage, the computation of term frequency (TF) is applied to every keyword
of competency levels. A word that appears more frequently receives more weight, or a higher
probability. TF and inverse document frequency (IDF) are prevalent feature extraction techniques. They are
used as a statistical measure to represent the importance level of the words in a set of
sentences or documents. Equation 1 shows the formula to compute TF.
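$$\mathrm{TF} = \frac{\text{Number of times the word appears in the document}}{\text{Total words in the document}}$$
or
$$\mathrm{TF} = P_{m,i} = \frac{\text{Number of times word}_{m,i}\text{ appears}}{\text{Total words in level } m} \tag{1}$$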
Figure 4 The applied model contains competency levels and their sets of words and TF results
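As a minimal sketch of Equation 1, the following Python function computes the weight of each keyword per competency level as its count divided by the total number of words at that level. The function name `term_frequencies`, the input layout, and the toy Indonesian keywords are assumptions made for illustration; they are not the study's data or code.

```python
from collections import Counter


def term_frequencies(keywords_by_level):
    """Map each level m to {word i: P_mi}, where P_mi = count / total words in level m."""
    model = {}
    for level, words in keywords_by_level.items():
        total = len(words)
        counts = Counter(words)
        model[level] = {word: n / total for word, n in counts.items()} if total else {}
    return model


# Toy training keywords (made up): level -> keywords extracted from responses.
training = {
    1: ["aturan", "aturan", "patuh", "evaluasi"],
    2: ["diskusi", "lapor"],
}
model = term_frequencies(training)
print(model[1]["aturan"])  # 2 occurrences out of 4 words -> 0.5
```

Because the weights at each level sum to 1, they can be read as the relative share of each keyword within that level.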
2.3. Testing Method
After the applied model is formed, the testing and evaluation process can be conducted.
The developed model is the word collection for each level and each competency. The dataset
contains the keywords and their weights $P_{m,i}$, where $m$ denotes the competency level and $i$ indexes the
$i$-th keyword at the associated competency level. The proposed testing method is shown
in Figure 5.
The testing is carried out by applying the same NLP process as in the training stage.
In this stage, the NLP process produces keywords, each carrying the
score computed for it at each level. After the keywords are obtained, we can compute the value of
each competency level, which represents the TF of competency level m. Then, the formula
in Equation 2 is applied to calculate the score on each level.
$$P_m = \sum_{i} P_{m,i} \tag{2}$$
where $P_m$ is the total weight at competency level $m$, and $P_{m,i}$ is the weight of word $i$ at
competency level $m$.
The proposed competency level is the competency level that has the maximum score
out of all levels. Therefore, the competence level is formulated by Equation 3.
$$L = \max(P_m) \tag{3}$$
where $L$ is the competence level, that is, the level $m$ with the highest total weight, and $P_m$ is the total weight at competency level $m$.
From the computation, each competency level is given a score. For example, based on
Figure 6, P1 = 0.56, P2 = 0.34, P3 = 0.12, and P4 = 0. Here, P1 is the
highest score, which we mark with max* in Figure 6. Thus, the machine
concludes that the competency level is level 1.
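The scoring step in Equations 2 and 3 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the toy weights are made up, and counting each response keyword only once is an assumption, since the text does not say how repeated keywords are handled.

```python
def level_scores(model, response_keywords):
    """Equation 2: P_m is the sum of P_mi over the matched keywords i."""
    unique = set(response_keywords)  # assumption: each keyword counts once
    return {
        level: sum(weights.get(word, 0.0) for word in unique)
        for level, weights in model.items()
    }


def predict_level(scores):
    """Equation 3: return the level m whose total weight P_m is largest."""
    return max(scores, key=scores.get)


# Scores quoted in the text for the worked example.
scores = {1: 0.56, 2: 0.34, 3: 0.12, 4: 0.0}
print(predict_level(scores))  # 1

# Toy model with made-up weights, for illustration only.
toy_model = {1: {"aturan": 0.5, "patuh": 0.25}, 2: {"diskusi": 0.5}}
print(level_scores(toy_model, ["aturan", "patuh", "rapat"]))  # {1: 0.75, 2: 0.0}
```

If two levels tie, `max` returns the first one encountered; the text does not specify a tie-breaking rule, so this behaviour is incidental.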
2.4. All-Zero Result Condition
In some cases, the computation yields a score of 0 for
all competency levels. This condition is called the “all-zero result”. For example: P1 = 0, P2
= 0, P3 = 0, and P4 = 0. This condition is interpreted as the machine being “incapable” of
producing the score or predicting competency levels. The likely reason for this is that the
interviewee’s answer is too short, inconsequential, or irrelevant to the question. This
situation also indicates no keyword match between the set of “keywords” extracted from
the interviewee’s response and the set of keywords in the competency dictionary. In actual
practice, this situation may also occur in conventional assessment scoring. In a traditional
assessment of competency, there are cases in which the assessor cannot determine the level
of competency from the snippet of a response. This occurs when the
interviewee does not answer questions following the STAR structure. Typically, the interviewee
responds too generally or too briefly, or the answer does not meet expectations. In this situation,
assessors may leave the score empty (blank) so that it is not confused with the lowest
score. Hence, the interviewee’s answers cannot be used to form a conclusion.
For such cases, we suggest a parameter that counts the records a machine can interpret
and, conversely, the records it cannot (those that produce an all-zero result). It is important to note that
an empty or blank “score” is not the same as competency level 0 (zero). Competency
level zero means neutral, while a blank level means that the competency level cannot be
determined by the interview bot or the assessors.
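The distinction between a blank result and competency level 0 can be made explicit in code. In the following hedged Python sketch, a blank is represented by `None`; the function name `predict_or_blank` and the use of `None` are illustrative choices, not part of the described system.

```python
from typing import Dict, Optional


def predict_or_blank(scores: Dict[int, float]) -> Optional[int]:
    """Return the best-scoring level, or None for an all-zero result."""
    if all(score == 0 for score in scores.values()):
        return None  # blank: the level cannot be determined
    return max(scores, key=scores.get)


print(predict_or_blank({1: 0, 2: 0, 3: 0, 4: 0}))  # None (all-zero result)
print(predict_or_blank({0: 0.4, 1: 0.1}))          # 0 (a valid level, not a blank)
```

Keeping `None` separate from `0` lets later steps count uninterpretable records without mistaking them for a neutral competency level.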
2.5. Evaluation Method
A new evaluation metric is suggested for evaluating our interview bot for
competency level measurement. It relates to the amount of data that can be interpreted by
the machine learning model. We propose the concept of coverage, which shows the number of
interpretable records compared to the total records. Thus, the formula which represents
the coverage is shown in Equation 4.
$$C = \frac{N_i}{N_i + N_u} \tag{4}$$
where $C$ is the coverage, $N_i$ is the number of interpretable records, and $N_u$ is the number
of uninterpretable records.
Given the coverage, we use accuracy as the measure of success for this method.
Accuracy is also an essential factor in NLP and machine learning. The accuracy
formula is presented in Equation 5.
$$A = \frac{N_a}{N_a + N_x} \tag{5}$$
where $A$ is the accuracy, $N_a$ is the number of accurate records, and $N_x$ is the number of
inaccurate records.
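Equations 4 and 5 can be computed together from paired bot and assessor levels. The Python sketch below is illustrative: the function name `evaluate` and the sample pairs are made up, and it assumes that accuracy is computed only over interpretable records, which is consistent with the accuracy, over-judgment, and under-judgment percentages summing to 100% in the reported results.

```python
def evaluate(pairs):
    """pairs: (bot_level, assessor_level) tuples; bot_level is None when blank."""
    interpretable = [(bot, human) for bot, human in pairs if bot is not None]
    n_i = len(interpretable)              # interpretable records
    n_u = len(pairs) - n_i                # uninterpretable (all-zero) records
    n_a = sum(1 for bot, human in interpretable if bot == human)  # accurate
    n_x = n_i - n_a                       # inaccurate
    coverage = n_i / (n_i + n_u) if pairs else 0.0   # Equation 4
    accuracy = n_a / (n_a + n_x) if n_i else 0.0     # Equation 5
    return coverage, accuracy


# Made-up sample: two of five responses produced an all-zero result.
sample = [(2, 2), (1, 3), (None, 4), (4, 4), (None, 2)]
coverage, accuracy = evaluate(sample)
print(f"coverage={coverage:.1%}, accuracy={accuracy:.1%}")
```

Reporting the two numbers side by side shows whether a high accuracy was obtained on only a small interpretable share of the records.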
Q (machine): What would you do if you become a leader to encourage your team to follow the pace of works with existing norms,
values, and ethics?
I (interviewee): In the beginning, I encouraged them to read the rules first. We conduct regular briefings about the current rules. After
that, I will evaluate and check it regularly.
Q (machine): How much effort do you make to behave according to your organisation's rules or norms, values, and ethics?
I (interviewee): I always try to obey the established norms and values. I always try to comply with the rules or norms. By remembering
the norm, I will evaluate myself against what I have done so far continuously.
Q (machine): If you are already working and have a leader, there are times when the leader gives policies that are not following the
rules. What would you do if your leader does like that?
I (interviewee): I will try to discuss with the leader regarding this matter. However, if the leader tends to ignore or does not listen to
me, I will discuss it with or report to the higher leader.
Figure 6 Example of the bot asking questions and an interviewee giving feedback or answering the
bot in Indonesian, with translation
282 Interview Bot Development with Natural Language Processing and Machine Learning
The experiments can also be performed using the split-half concept, that is, splitting the data into
two parts. The split can be used for cross-testing
between data training and data testing. For example, when the data training uses data group
A, the data testing uses data group B. The combinations of training and testing groups are
used to determine the characteristics of the results and the accuracy. The testing
results can be seen in the results section.
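The cross-testing combinations can be enumerated programmatically. The following Python sketch assumes that each data group, as well as the union of both groups, is used in turn for training and for testing; the function name `train_test_pairs` and the placeholder records are hypothetical.

```python
from itertools import product


def train_test_pairs(group_a, group_b):
    """Return (train_name, test_name, train_data, test_data) for every combination."""
    groups = {"A": group_a, "B": group_b, "A&B": group_a + group_b}
    return [
        (train, test, groups[train], groups[test])
        for train, test in product(groups, repeat=2)
    ]


# Placeholder records standing in for interview responses.
pairs = train_test_pairs(["a1", "a2"], ["b1"])
for train, test, train_data, test_data in pairs:
    print(f"train on {train} ({len(train_data)} records), test on {test} ({len(test_data)} records)")
```

With two groups this yields nine training and testing combinations, including the cross cases in which the model is tested on data it was not trained on.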
3.3. Performance Results
The AI approach is also used to map and evaluate participants’ answers into specific
competency values. The interview bot will automatically assess the participants’ responses
and compare the results with those assigned by human experts. If the gap between the
interview bot and human judgment is too large, the bot is considered incapable of
mapping the results. Conversely, if the difference between the assessments of the bot and
human experts is small, the bot is considered able to perform competency
assessments. Table 3 illustrates comparisons of competency levels provided by the
interview bot and assessors.
Table 3 Illustration of competency levels provided by interview bot and assessors
Group Data  Employee  Competency          Assessor Score  Interview Bot Score  Difference  Status
A           1         Teamwork            2               2                    0           Accurate
A           2         Result Oriented     3               1                    -2          Inaccurate (Under)
A           3         Communication       4               4                    0           Accurate
B           4         Integrity           2               2                    0           Accurate
B           5         Result Oriented     2               2                    0           Accurate
B           6         People Development  2               4                    2           Inaccurate (Over)
The results displayed in Table 3 comprise four accurate results and two inaccurate results.
Thus, the accuracy A is 4/(4+2) × 100% = 66.7%.
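The statuses and the accuracy in Table 3 can be reproduced with a short Python sketch. The rows are copied from Table 3; the function name `judge` is hypothetical, and the difference is taken as the interview bot score minus the assessor score, which matches the signs shown in the table.

```python
# (group, employee, competency, assessor score, interview bot score) from Table 3.
rows = [
    ("A", 1, "Teamwork", 2, 2),
    ("A", 2, "Result Oriented", 3, 1),
    ("A", 3, "Communication", 4, 4),
    ("B", 4, "Integrity", 2, 2),
    ("B", 5, "Result Oriented", 2, 2),
    ("B", 6, "People Development", 2, 4),
]


def judge(assessor, bot):
    """Return (difference, status) for one record."""
    difference = bot - assessor
    if difference == 0:
        return difference, "Accurate"
    if difference > 0:
        return difference, "Inaccurate (Over)"
    return difference, "Inaccurate (Under)"


results = [judge(assessor, bot) for _, _, _, assessor, bot in rows]
n_accurate = sum(1 for _, status in results if status == "Accurate")
accuracy = n_accurate / len(results)
print(results)
print(f"accuracy = {accuracy:.1%}")  # 4 of 6 records -> 66.7%
```

The same classification also yields the over-judgment and under-judgment shares, here one record each out of six.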
The experiments are carried out by combining the different
groups of data to identify the characteristics of the learning model. For instance, the data
training uses data group A, and then the data testing uses data group B. Each combination of
training and testing data is evaluated to obtain the characteristics of the
learning process and its accuracy. The performance results can be seen in Table 4.
Table 4 Performance evaluation
No  Data Training  Data Testing  Coverage (C)  Accuracy (A)  Over Judgment  Under Judgment
1   A              A             98.1%         96.1%         3.9%           0.0%
2   A              A&B           39.9%         72.8%         10.4%          16.8%
3   A              B             37.3%         70.0%         11.2%          18.8%
4   B              B             96.0%         79.3%         15.5%          5.1%
5   B              A&B           95.8%         78.1%         16.6%          5.4%
6   B              A             90.4%         48.9%         40.4%          10.6%
7   A&B            A             98.1%         64.7%         27.5%          7.8%
8   A&B            B             96.0%         79.0%         16.3%          4.7%
9   A&B            A&B           96.1%         78.4%         16.8%          4.8%
3.4. Discussion
AI has been researched for use in developing the next generation of HRM (Margherita,
2021; Pereira, 2021). Robots or chatbots have been considered for recruitment and
selection in HRM (Pereira, 2021). The developed interview bot, as a further enhancement
of a chatbot, provides better performance. Based on our experiments, the interview bot
provides good and acceptable outcomes, with an accuracy of at least 70% in seven of the nine
scenarios in Table 4. The highest
accuracy was achieved when both training and testing data were taken from the same data
set. For example, if both training and testing use data group A, the coverage is 98.1%
and the accuracy is 96.1%. Similarly, when data group B is used for both the data training
and data testing, the coverage is 96.0% and the accuracy is 79.3%. The accuracy of data group
B is somewhat lower than that of data group A. This is because data group B comprises
traditional BEI manuscripts.
A challenge arises when the training and testing data come from different groups. For example, when
the training uses data A and the testing uses data B, the experiment delivers 70.0% accuracy
and 37.3% coverage. This accuracy is acceptable, although the coverage is lower
than when the data training and data testing use the same data. When data B is used
for data training and the data testing uses data A, the accuracy is 48.9% with 90.4% coverage.
This case is notable because the coverage is higher but the accuracy is lower than in the reverse combination. Hence,
the variability of data training is a crucial part of the learning process.
In the following scenario, all the data (A & B) are used for data training, which can
yield better coverage since the amount of training data is greater. When all data are
used for data training, the coverage ranges from 96.0% to 98.1%, and the accuracy
ranges from 64.7% to 79.0%.
This study found that the degree of data variability determines the machine learning
process. Therefore, adding to the amount of data training does not guarantee that the
accuracy will increase. Still, the coverage is expected to increase. This study could support
the development of interview bots for interviewing without human assistance. Although the
accuracy is below 90% in most scenarios, the study can be used as a starting point to develop competence
mapping based on an interview bot and machine learning. As the interview bot is used more
often and in more situations, its accuracy is expected to improve. Our findings support Reilly’s (2018)
conclusion that incorporating AI at work can make employee recruitment processes faster
and more effective. Furthermore, Margherita et al. (2021) also noted that the positive
impacts of digital technology on HRM include conversational AI and hiring tools and talent
experience management platforms.
4. Conclusions
This study has successfully developed an interview bot that uses machine learning to
determine the competence level of a person. In this research, we use an Indonesian
language dataset. To converse with human participants, we use NLP technology. The interview
bot achieved an accuracy of at least 70% in most of the tested scenarios. The results of this study can be used
as the basis for developing an interview bot that is closer to professional interviews. One of
the important aspects of this system is its dataset. With a more extensive and comprehensive
dataset, it is possible that the system would be richer in information and achieve better
accuracy. For future work, the system could be enhanced to use voice interaction instead
of text-based chat. The frequency, types, and depth of the questions could also be made
more adaptive to match the psychological aspects of the interviewee.
References
Berawi, M.A., 2018. The Fourth Industrial Revolution: Managing Technology Development
for Competitiveness. International Journal of Technology, Volume 9(1), pp. 1–4
Berawi, M.A., 2020. Managing Artificial Intelligence Technology for Added Value.
International Journal of Technology, Volume 11(1), pp. 1–4
Cowgill, B., 2018. Bias and Productivity in Humans and Algorithms: Theory and Evidence
from Résumé Screening. Working paper, Columbia University, New York
Eubanks, B., 2017. Artificial Intelligence for HR: Use AI to Support and Develop a Successful
Workforce. Kogan Page. Available online at
https://books.google.com/books?hl=en&lr=&id=hrN7DwAAQBAJ&oi=fnd&pg=PP1&dq=Artificial+Intelligence+for+HR+Use+AI+to+Support+and+Develop+a+Successful+Workforce&ots=jc5-Zm6nbw&sig=rXBsT7SPUpQbJWXMAh6O9gDxsbo
Kim, S., Lee, J., Gweon, G., 2019. Comparing Data from Chatbot and Web Surveys: Effects of
Platform and Conversational Style on Survey Response Quality. In: Proceedings of the
2019 CHI Conference on Human Factors in Computing Systems, pp. 1–12