Abstract
Background Large language models like ChatGPT have revolutionized the field of natural language processing with their capability to comprehend and generate textual content, showing great potential to play a role in medical education. This study aimed to quantitatively evaluate and comprehensively analyze the performance of ChatGPT on three types of national medical examinations in China: the National Medical Licensing Examination (NMLE), the National Pharmacist Licensing Examination (NPLE), and the National Nurse Licensing Examination (NNLE).
Methods We collected questions from the Chinese NMLE, NPLE and NNLE from 2017 to 2021. In the NMLE and NPLE, each exam consists of 4 units, while in the NNLE, each exam consists of 2 units. Questions with figures, tables or chemical structures were manually identified and excluded by a clinician. We applied a direct instruction strategy via multiple prompts to force ChatGPT to generate a clear answer and to distinguish between single-choice and multiple-choice questions.
Results ChatGPT failed to pass the accuracy threshold of 0.6 in any of the three types of examinations over the five years. Specifically, in the NMLE, the highest recorded accuracy was 0.5467, attained in both 2018 and 2021. In the NPLE, the highest accuracy was 0.5599, in 2017. In the NNLE, the best result was achieved in 2017, with an accuracy of 0.5897, which is also the highest accuracy in our entire evaluation. ChatGPT's performance showed no significant difference across units, but a significant difference across question types. ChatGPT performed well in a range of subject areas, including clinical epidemiology, human parasitology, and dermatology, as well as in various medical topics such as molecules, health management and prevention, and diagnosis and screening.
Conclusions These results indicate that ChatGPT failed the NMLE, NPLE and NNLE in China for the years 2017 to 2021, but they also demonstrate the great potential of large language models in medical education. In the future, high-quality medical data will be required to improve performance.
Keywords Medical education, Medical examination, Artificial intelligence, Natural language processing, ChatGPT
† Hui Zong and Jiakun Li contributed equally to this work.
*Correspondence: Bairong Shen, bairong.shen@scu.edu.cn
1 Department of Urology and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, No. 37, Guoxue Alley, Chengdu 610212, China
a pass rate of 29.19%. In 2018, there were 566,613 test-takers, with 79,900 successful candidates and a pass rate of 14.10%. In 2019, there were 133,000 successful candidates, resulting in a pass rate of 18.72%. In 2020, there were 610,132 test-takers, but the number of successful candidates was not released. In 2021, there were 450,973 test-takers, with 80,840 successful candidates and a pass rate of 17.93%. In 2022, there were 495,419 test-takers, with 97,400 successful candidates and a pass rate of 19.66%. For the NNLE, the total number of test-takers each year from 2012 to 2020 ranged between approximately 690,000 and 730,000, with the number of successful candidates ranging from approximately 380,000 to 420,000.
In this study, we aimed to quantitatively evaluate the performance of ChatGPT on three types of national medical examinations in China, namely the NMLE, NPLE and NNLE. To enhance the reliability of our findings, we meticulously collected a substantial corpus of real-world medical question-answer data from examinations conducted from 2017 to 2021. We also conducted a comparative analysis of performance across different units. For cases where incorrect responses were generated, we solicited feedback from domain experts and performed a thorough assessment and error analysis. Our study yields valuable insights for researchers and developers seeking to improve the performance of large language models in the medical domain.
Methods
Medical examination datasets
We collected questions from the Chinese NMLE, NPLE and NNLE from 2017 to 2021. In the NMLE, each exam consists of 4 units, each with 150 questions, for a total of 600 questions. In the NPLE, each exam consists of 4 units, each with 120 questions, for a total of 480 questions. In the NNLE, each exam consists of 2 units, each with 120 questions, for a total of 240 questions. Based on the requirements of the examinations, a correct response rate exceeding the accuracy threshold of 0.6 is considered to meet the passing criteria. Questions with figures, tables or chemical structures were manually identified and excluded by a clinician with five years of clinical experience.
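As an illustration only (not the authors' released code), the exam structure and filtering criterion described above can be expressed in Python roughly as follows; the field names of the Question record are assumptions made for this sketch.

from dataclasses import dataclass

# Exam structure described above: number of units and questions per unit.
EXAM_STRUCTURE = {
    "NMLE": {"units": 4, "questions_per_unit": 150},  # 600 questions in total
    "NPLE": {"units": 4, "questions_per_unit": 120},  # 480 questions in total
    "NNLE": {"units": 2, "questions_per_unit": 120},  # 240 questions in total
}

PASS_THRESHOLD = 0.6  # correct-response rate required to pass an exam


@dataclass
class Question:
    exam: str        # "NMLE", "NPLE" or "NNLE"
    year: int        # 2017-2021
    unit: int        # unit index within the exam
    stem: str        # background description of the question
    choices: dict    # e.g. {"A": "...", "B": "...", ...}
    answer: str      # gold answer key(s), e.g. "D" or "ABD"
    has_figure_table_or_structure: bool = False  # flagged by the reviewing clinician


def keep_question(q: Question) -> bool:
    """Exclude questions containing figures, tables or chemical structures."""
    return not q.has_figure_table_or_structure


def is_passing(accuracy: float) -> bool:
    """An exam is considered passed when accuracy reaches the 0.6 threshold."""
    return accuracy >= PASS_THRESHOLD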
Model setting
We employed ChatGPT, an artificial intelligence chatbot built upon generative pre-trained transformer technology. The official API was utilized to invoke the chatbot, with gpt-3.5-turbo as the model parameter and default values for the other parameters. As shown in Fig. 1, the input question consisted of the background description and the choices. We applied a direct instruction strategy via prompts such as "Please return the most correct answer", "Only one best option can be selected", "The correct choice is" and "This is a multiple-choice question, please return the correct answer". These prompts force the model to generate a clear answer, as well as to distinguish between single-choice and multiple-choice questions.
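The client code itself is not given in the paper; the following is a minimal sketch of how such a query could be issued with the official openai Python package (current interface assumed), using the prompts quoted above. In the actual experiments both input and output were in Chinese, and the helper name ask_chatgpt is hypothetical.

from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

SINGLE_CHOICE_INSTRUCTION = (
    "Please return the most correct answer. Only one best option can be selected."
)
MULTIPLE_CHOICE_INSTRUCTION = (
    "This is a multiple-choice question, please return the correct answer."
)


def ask_chatgpt(stem: str, choices: dict, multiple_choice: bool = False) -> str:
    """Send one exam question to gpt-3.5-turbo and return the raw text reply."""
    instruction = MULTIPLE_CHOICE_INSTRUCTION if multiple_choice else SINGLE_CHOICE_INSTRUCTION
    question_text = stem + "\n" + "\n".join(f"{k}. {v}" for k, v in choices.items())
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # model parameter used in the study
        messages=[{"role": "user", "content": question_text + "\n" + instruction}],
        # all other parameters are left at their default values, as described above
    )
    return response.choices[0].message.content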
Evaluation
For each question, the response of ChatGPT was reviewed by an experienced clinician to determine the predicted answer, which was then compared with the true answer. The score was based on whether the answers matched: a score of 1 was awarded if the predicted answer agreed with the true answer, whereas a score of 0 was given if they disagreed. The evaluation was conducted on all datasets of the NMLE, NPLE and NNLE over the past five years.
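In code, this scoring rule reduces to exact agreement between the clinician-adjudicated prediction and the answer key; a brief sketch (function names are ours):

def score(predicted: str, true: str) -> int:
    """1 if the adjudicated prediction matches the gold answer key, else 0."""
    return int(predicted.strip().upper() == true.strip().upper())


def accuracy(predictions: list, answers: list) -> float:
    """Fraction of questions answered correctly; the exam pass mark is 0.6."""
    scores = [score(p, a) for p, a in zip(predictions, answers)]
    return sum(scores) / len(scores)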
Data analysis
Data processing was performed in Python (version 3.9.13, Python Software Foundation) using Jupyter Notebook. Statistical analysis was performed using GraphPad Prism 9 software. The significance threshold for differences among groups was set at p < 0.05.
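The specific statistical tests are not named in this excerpt, and the analysis itself was run in GraphPad Prism. Purely as an illustration of an equivalent group comparison in Python, the sketch below applies a chi-square test of independence to hypothetical correct/incorrect counts for single-choice versus multiple-choice questions; the counts are invented for the example and are not the study's data.

from scipy.stats import chi2_contingency

# Hypothetical [correct, incorrect] counts for each question type (illustration only).
single_choice = [520, 400]
multiple_choice = [30, 70]

chi2, p_value, dof, expected = chi2_contingency([single_choice, multiple_choice])
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")  # a difference is significant if p < 0.05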
Results
Overall performance
As shown in Fig. 2, ChatGPT failed to pass the accuracy threshold of 0.6 in any of the three types of examinations over the five years. Specifically, in the Chinese NMLE, the highest recorded accuracy was 0.5467, which was attained in both 2018 and 2021. In the Chinese NPLE, the highest accuracy was 0.5599, in 2017. In the Chinese NNLE, the most impressive result was achieved in 2017, with an accuracy of 0.5897, which is also the highest accuracy in our entire evaluation. Conversely, the 2019 NPLE exam resulted in the lowest accuracy, with a recorded value of 0.4356.

Detailed performance
The score of each unit in the Chinese NMLE is shown in Table 1. The performance of ChatGPT varied across units and years. In 2017 and 2020, ChatGPT performed best in Unit 2. In 2018 and 2019, ChatGPT performed best in Unit 1. In 2021, ChatGPT performed best in both Unit 2 and Unit 3. In 2018 and 2021, ChatGPT correctly answered 328 out of 600 questions. These differences arise because the complexity and difficulty of the questions in each unit vary from year to year. On average, ChatGPT achieved the highest score in Unit 2 (84.6), followed by Unit 1 (79.8), Unit 3 (78.2) and Unit 4 (75.4).
Fig. 1 The overview of the interaction with ChatGPT. The question includes a background description and choices from three national licensing examinations: the Chinese National Medical Licensing Examination (NMLE), the National Pharmacist Licensing Examination (NPLE) and the National Nurse Licensing Examination (NNLE). The prompt was designed to force a clear answer, as well as the ability to recognize single-choice or multiple-choice questions. The responses of ChatGPT were manually reviewed by an experienced clinician to determine the answer. The correct answer to this question is "D. Cor pulmonale". It should be noted that while English text is shown in the figure, the experiment itself used Chinese text as both the input and output language
Fig. 2 The performance of ChatGPT on three national licensing examinations over a period of five years, from 2017 to 2021. The examinations included the Chinese National Medical Licensing Examination (NMLE), the National Pharmacist Licensing Examination (NPLE) and the National Nurse Licensing Examination (NNLE)
Table 1 The score of each unit in Chinese National Medical Licensing Examination
Year 2017 2018 2019 2020 2021 Average
Unit 1 score 81 87 84 71 76 79.8
Unit 2 score 96 83 75 83 86 84.6
Unit 3 score 79 79 72 75 86 78.2
Unit 4 score 65 79 75 78 80 75.4
Total score 321 328 306 307 328 318
Questions 600 600 597 600 600 -
Accuracy 53.50% 54.67% 51.26% 51.17% 54.67% 53.05%
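For reference, the accuracy row in Table 1 is simply the total score divided by the number of evaluated questions for that year; a quick check in Python:

# Accuracy = correctly answered questions / evaluated questions (values from Table 1).
nmle = {2017: (321, 600), 2018: (328, 600), 2019: (306, 597),
        2020: (307, 600), 2021: (328, 600)}
for year, (correct, total) in nmle.items():
    print(f"{year}: {correct}/{total} = {correct / total:.2%}")
# 2017: 53.50%, 2018: 54.67%, 2019: 51.26%, 2020: 51.17%, 2021: 54.67%
# The Average column reports the mean of these yearly accuracies (53.05%).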
Table 2 The score of each unit in Chinese National Pharmacist Licensing Examination
Year 2017 2018 2019 2020 2021 Average
Unit 1 score 57 55 51 48 49 52
Unit 2 score 63 54 52 60 61 58
Unit 3 score 60 56 49 53 47 53
Unit 4 score 77 58 51 59 65 62
Total score 257 223 203 220 222 225
Questions 459 450 466 458 463 -
Accuracy 55.99% 49.56% 43.56% 48.03% 47.95% 49.02%
Table 3 The score of each unit in Chinese National Nurse Licensing Examination
Year 2017 2018 2019 2020 2021 Average
Unit 1 score 65 57 72 54 67 63
Unit 2 score 73 74 57 53 63 64
Total score 138 131 129 107 130 127
Questions 234 232 238 232 238 -
Accuracy 58.97% 56.47% 54.20% 46.12% 54.62% 54.08%
The score of each unit in the Chinese NPLE is shown in Table 2. In the NPLE, each unit has 120 questions, and each exam has 480 questions. We identified and removed the questions that included figures, tables or chemical structures. Such questions appeared most often in 2018 (30), followed by 2020 (22), 2017 (21), 2021 (17) and 2019 (14). On average, ChatGPT performed best in Unit 4 (62), followed by Unit 2 (58), Unit 3 (53) and Unit 1 (52). In 2017, ChatGPT achieved its highest score, correctly answering 257 out of 459 questions.

Table 3 shows the detailed score of each unit of the Chinese NNLE. In total, 26 questions that included figures, tables or chemical structures were removed. In 2017 and 2018, ChatGPT performed better in Unit 2 than in Unit 1. Conversely, in 2019, 2020 and 2021, ChatGPT performed better in Unit 1 than in Unit 2. On average, ChatGPT's performance in the two units showed no noticeable difference.

In comparison, ChatGPT exhibited better proficiency in the NNLE (54.08%), with the NMLE (53.05%) and NPLE (49.02%) following behind. This result corresponds to the complexity and difficulty of the exam questions.

Performance on different units and question types
Figure 3 presents a comparative analysis of ChatGPT's performance across units and question types. The results show that there was no significant difference across units in the NMLE (Fig. 3A), NPLE (Fig. 3B) and NNLE (Fig. 3C). However, in the case of the NPLE (Fig. 3D), ChatGPT demonstrated higher performance on single-choice questions than on multiple-choice questions, with a highly significant difference (p < 0.0001).

Performance on different subjects and topics
To better understand why ChatGPT failed the Chinese medical examinations, we took the 2021 NMLE exam as an example and labeled the medical subjects and topics of each question (Fig. 4). The results revealed that ChatGPT excelled in clinical epidemiology, human parasitology and dermatology, with all questions answered correctly. However, the model faltered in subjects such as pathology, pathophysiology, public health regulations, physiology and anatomy, where the proportion of correct answers was less than 0.5. Additionally, we observed that ChatGPT performed admirably in topics related to molecules, health management and prevention, and diagnosis and screening, but its performance was lackluster in topics such as clinical manifestations, indicator values, structural location, cell and tissue. Interestingly, we found no significant difference in performance between case-based and non-case-based questions.
Fig. 3 The performance of ChatGPT on different units and question types. Across units, there were no significant differences within (A) the Chinese National Medical Licensing Examination (NMLE), (B) the National Pharmacist Licensing Examination (NPLE) or (C) the National Nurse Licensing Examination (NNLE). (D) However, ChatGPT demonstrated higher performance on single-choice questions than on multiple-choice questions, with a highly significant difference (ns, no significant difference; ****p < 0.0001)
Fig. 4 The performance of ChatGPT on different subjects, topics and types of questions in the 2021 NMLE exam
Thirdly, while ChatGPT has a remarkable ability to process and generate text, its proficiency in numerical computation is limited. Questions involving mathematical calculation, such as dosage calculations and the interpretation of laboratory values, may pose challenges for a language model. Additionally, in some cases the task requires reading the question and selecting the most suitable answer while suboptimal answers are also present among the given choices; in such cases, ChatGPT is forced to select a single choice as the answer, which can limit its content comprehension and lead to incorrect answers. These findings provide deep insight into the strengths and weaknesses of ChatGPT in Chinese medical examinations and pave the way for future research to improve the model's capabilities in this domain.

The potential of large language models in medical education
As a significant milestone in the development of artificial intelligence, ChatGPT, driven by a large language model, has powerful capabilities in language understanding and content generation. With its remarkable potential, ChatGPT could be a valuable resource for students in acquiring medical knowledge and learning clinical skills, and could serve as an informative assistant for teachers in preparing teaching materials and evaluating course projects.

In our study, ChatGPT achieved an accuracy of over 0.5 in most of the exams, indicating significant potential for ChatGPT in medical education. A previous study showed that in the Chinese Rural General Medical Licensing Examination, only 55% of students were able to pass the written examination [29].
In China, the significant healthcare burden necessitates a vast number of licensed clinical staff and healthcare providers. However, the rigorous examinations lead to low pass rates, exacerbating the shortage of licensed practitioners, especially in rural areas. Large language models present a promising avenue for enhancing medical education and advancing healthcare reform, with the potential to reduce the medical burden.

Finally, the advancement of artificial intelligence (AI), and specifically of large language models, in medical education requires public benchmarking datasets and fair evaluation metrics for performance assessment. There is also a need to interact with human experts across multiple dimensions and to obtain continuous feedback. In addition, the use of such models must consider data privacy and cognitive bias, and comply with regulations.

Author contributions
H.Z., J.L., E.W., R.W., J.L. and B.S. were involved in the study conceptualization. J.L. collected and preprocessed the data. H.Z. conducted the data analysis, results interpretation and manuscript preparation. H.Z. and B.S. contributed to the review and editing of the manuscript. B.S. supervised the study. All authors read and approved the final manuscript.

Funding
This work was supported by the National Natural Science Foundation of China (32270690 and 32070671).

Data availability
The data analyzed and reported in this study are available at https://github.com/zonghui0228/LLM-Chinese-NMLE.git.

Declarations

Ethics approval and consent to participate
Not applicable (NA).

Consent for publication
Not applicable (NA).
18. Salvagno M, Taccone FS, Gerli AG. Can artificial intelligence help for scientific writing? Crit Care. 2023;27(1):75.
19. Kung TH, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198.
20. Gilson A, et al. How does ChatGPT perform on the United States Medical Licensing Examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023;9:e45312.
21. Bhayana R, Krishna S, Bleakney RR. Performance of ChatGPT on a radiology board-style examination: insights into current strengths and limitations. Radiology. 2023;307(5):e230582.
22. Giannos P. Evaluating the limits of AI in medical specialisation: ChatGPT's performance on the UK Neurology Specialty Certificate Examination. BMJ Neurol Open. 2023;5(1):e000451.
23. Humar P, et al. ChatGPT is equivalent to first year plastic surgery residents: evaluation of ChatGPT on the plastic surgery in-service exam. Aesthet Surg J. 2023.
24. Weng TL, et al. ChatGPT failed Taiwan's Family Medicine Board Exam. J Chin Med Assoc. 2023.
25. Wang YM, Shen HW, Chen TJ. Performance of ChatGPT on the pharmacist licensing examination in Taiwan. J Chin Med Assoc. 2023.
26. Seghier ML. ChatGPT: not all languages are equal. Nature. 2023;615(7951):216.
27. Wang X. Experiences, challenges, and prospects of the National Medical Licensing Examination in China. BMC Med Educ. 2022;22(1):349.
28. Huh S. Are ChatGPT's knowledge and interpretation ability comparable to those of medical students in Korea for taking a parasitology examination? A descriptive study. J Educ Eval Health Prof. 2023;20:1.
29. Han X, et al. Performance of China's new medical licensing examination for rural general practice. BMC Med Educ. 2020;20(1):314.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.