
Zong et al. BMC Medical Education (2024) 24:143
https://doi.org/10.1186/s12909-024-05125-7

RESEARCH (Open Access)

Performance of ChatGPT on Chinese national medical licensing examinations: a five-year examination evaluation study for physicians, pharmacists and nurses

Hui Zong1†, Jiakun Li1†, Erman Wu1, Rongrong Wu1, Junyu Lu1 and Bairong Shen1*

Abstract

Background: Large language models like ChatGPT have revolutionized the field of natural language processing with their capability to comprehend and generate textual content, showing great potential to play a role in medical education. This study aimed to quantitatively evaluate and comprehensively analyze the performance of ChatGPT on three types of national medical examinations in China: the National Medical Licensing Examination (NMLE), the National Pharmacist Licensing Examination (NPLE), and the National Nurse Licensing Examination (NNLE).

Methods: We collected questions from the Chinese NMLE, NPLE and NNLE from 2017 to 2021. In the NMLE and NPLE, each exam consists of 4 units, while in the NNLE, each exam consists of 2 units. Questions with figures, tables or chemical structures were manually identified and excluded by a clinician. We applied a direct instruction strategy via multiple prompts to force ChatGPT to generate a clear answer, with the capability to distinguish between single-choice and multiple-choice questions.

Results: ChatGPT failed to pass the accuracy threshold of 0.6 in any of the three types of examinations over the five years. Specifically, in the NMLE, the highest recorded accuracy was 0.5467, attained in both 2018 and 2021. In the NPLE, the highest accuracy was 0.5599, in 2017. In the NNLE, the best result was achieved in 2017, with an accuracy of 0.5897, which is also the highest accuracy in our entire evaluation. ChatGPT's performance showed no significant difference across units, but a significant difference across question types. ChatGPT performed well in a range of subject areas, including clinical epidemiology, human parasitology, and dermatology, as well as in various medical topics such as molecules, health management and prevention, and diagnosis and screening.

Conclusions: These results indicate that ChatGPT failed the NMLE, NPLE and NNLE in China from 2017 to 2021, but they show the great potential of large language models in medical education. In the future, high-quality medical data will be required to improve performance.

Keywords: Medical education, Medical examination, Artificial intelligence, Natural language processing, ChatGPT

† Hui Zong and Jiakun Li contributed equally to this work.
*Correspondence: Bairong Shen, bairong.shen@scu.edu.cn
1 Department of Urology and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, No. 37, Guoxue Alley, Chengdu 610212, China

© The Author(s) 2024. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use,
sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and
the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included
in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will
need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The
Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available
in this article, unless otherwise stated in a credit line to the data.
Introduction

In the last decade, artificial intelligence (AI) technology has undergone a rapid evolution, achieving noteworthy breakthroughs in numerous fields [1, 2]. Recently, one breakthrough that has garnered considerable attention is ChatGPT [3], an AI chatbot powered by the generative pre-trained transformer (GPT) architecture, specifically GPT-3.5 with 175 billion parameters. This technology was developed through reinforcement learning from human feedback and trained on extensive textual data. ChatGPT exhibits remarkable capabilities across various tasks, including but not limited to intelligent dialogue [4], knowledge question answering [5], and text generation [6], showcasing unprecedented potential for further development.

In the medical domain, there has been growing interest in exploring large language models for tasks such as biomedical question answering (BioGPT [7]) and automatic dialogue generation (DialoGPT [8, 9]). Regrettably, these studies have so far demonstrated limited practical utility in clinical practice. ChatGPT, however, with its powerful language understanding and generation capabilities, shows significant potential in clinical response generation [5, 6], clinical decision support [4, 10, 11], medical education [12, 13], literature information retrieval [14], scientific writing [15–18], and beyond.

Recent studies have demonstrated that ChatGPT can pass the United States Medical Licensing Exam (USMLE) [19, 20], a Radiology Board-style Examination [21], the UK Neurology Specialty Certificate Examination [22], and the Plastic Surgery In-Service Exam [23], with results comparable to those of human experts. Nevertheless, other studies have indicated that ChatGPT failed to pass a Family Medicine Board Exam [24] and a Pharmacist Qualification Examination [25]. Possible explanations for this performance difference include language and cultural differences and variations in examination content [26]. These studies highlight ChatGPT's ability to comprehend the complex language used in medical contexts and its potential for use in medical education. However, current research is limited in two aspects: firstly, it largely focuses on the English language, and secondly, it predominantly emphasizes the physician's examination. Additional investigation is necessary to explore the potential of ChatGPT in other, non-English languages and in various medical examinations, which could deliver substantial benefits for its expanded application in the medical domain.

China, with a population of over 1.4 billion, faces a significant medical burden. The provision of healthcare services involves a collaborative effort among physicians, pharmacists, and nurses, who work diligently to offer the best possible care to patients. Physicians are responsible for diagnosing and treating illnesses, pharmacists ensure the appropriate medication is dispensed and administered correctly, and nurses attend to patients' daily medical management and care. Due to limited medical resources, medical professionals in China face immense pressure but remain committed to providing high-quality services. The advent of ChatGPT offers a promising solution to ease this burden by delivering intelligent, efficient, and precise medical services to physicians, pharmacists, and nurses.

Medical examinations, including the Chinese National Medical Licensing Examination (NMLE), the Chinese National Pharmacist Licensing Examination (NPLE), and the Chinese National Nurse Licensing Examination (NNLE), are implemented by the government to improve professional standards, ensure medical safety, and enhance the quality of healthcare services [27]. In the NMLE, there are 4 units, each containing 150 questions, for a total of 600 questions. The NMLE is designed around 4 modules: Basic Medical Sciences, Medical Humanities, Clinical Medicine, and Preventive Medicine. It is important to note that the questions of each module are randomly distributed across the units, and the number of questions devoted to each module is not fixed. In the NPLE, there are 4 units, each with 120 questions, for a total of 480 questions. The 4 units correspond to 4 specific modules: Pharmaceutical Knowledge I, Pharmaceutical Knowledge II, Pharmaceutical Management and Regulations, and Comprehensive Pharmacy Knowledge and Skills. In the NNLE, there are 2 units, each with 120 questions, for a total of 240 questions. Unit 1 focuses on clinical knowledge and Unit 2 on clinical skills. Through these medical examinations, the medical knowledge, clinical skills, and ethical standards mastered by medical staff can significantly improve the quality of their services. This, in turn, can reduce the incidence of medical errors and accidents and protect patients' fundamental right to health and safety.

These medical licensing examinations aim to comprehensively evaluate a candidate's knowledge of medical science, clinical examination, disease diagnosis, surgical treatment, patient prognosis, policies, and regulations, among other areas. Successfully passing these examinations is a prerequisite for obtaining professional certification as a physician, pharmacist, or nurse. The annual number of test-takers is high, while the number of successful candidates remains relatively low. For the NMLE, according to the official website and news reports, there were approximately 530,000 test-takers in 2017, around 600,000 in 2018, around 540,000 in 2019, around 490,000 in 2020, around 530,000 in 2021, and around 510,000 in 2022. For the NPLE, according to data from the official website of the Certification Center for Licensed Pharmacist of NMPA, in 2017 the number of test-takers was 523,296, with
a pass rate of 29.19%. In 2018, there were 566,613 test-takers, with 79,900 successful candidates and a pass rate of 14.10%. In 2019, there were 133,000 successful candidates, for a pass rate of 18.72%. In 2020, there were 610,132 test-takers, but the number of successful candidates was not released. In 2021, there were 450,973 test-takers, with 80,840 successful candidates and a pass rate of 17.93%. In 2022, there were 495,419 test-takers, with 97,400 successful candidates and a pass rate of 19.66%. For the NNLE, the total number of test-takers each year from 2012 to 2020 ranged between approximately 690,000 and 730,000, with the number of successful candidates ranging from approximately 380,000 to 420,000.

In this study, we aimed to quantitatively evaluate the performance of ChatGPT on three types of national medical examinations in China, namely the NMLE, NPLE, and NNLE. To enhance the reliability of our findings, we meticulously collected a substantial corpus of real-world medical question-answer data from examinations conducted from 2017 to 2021. We also conducted a comparative analysis of performance across different units. For cases where incorrect responses were generated, we solicited feedback from domain experts and performed a thorough assessment and error analysis. Our study yields valuable insights for researchers and developers seeking to improve large language models' performance in the medical domain.

Methods

Medical examination datasets
We collected questions from the Chinese NMLE, NPLE and NNLE from 2017 to 2021. In the NMLE, each exam consists of 4 units, each with 150 questions, for a total of 600 questions. In the NPLE, each exam consists of 4 units, each with 120 questions, for a total of 480 questions. In the NNLE, each exam consists of 2 units, each with 120 questions, for a total of 240 questions. Based on the requirements of the examinations, a correct response rate exceeding the accuracy threshold of 0.6 is considered to meet the passing criteria. Questions with figures, tables or chemical structures were manually identified and excluded by a clinician with five years of clinical experience.

Model setting
We employed ChatGPT, an artificial intelligence chatbot built upon generative pre-trained transformer technology. The official API was utilized to invoke the chatbot, with gpt-3.5-turbo as the model parameter and default values for the other parameters. As shown in Fig. 1, the input question consisted of the background description and choices. To elicit clear responses, we applied a direct instruction strategy via prompts such as "Please return the most correct answer", "Only one best option can be selected", "The correct choice is", and "This is a multiple choices question, please return the correct answer". These prompts force the model to generate a clear answer, and give it the capability to distinguish between single-choice and multiple-choice questions.

Evaluation
For each question, the response of ChatGPT was reviewed by an experienced clinician to determine the predicted answer, which was then compared with the true answer. The score was calculated based on whether the answers matched: a score of 1 was awarded if the predicted answer agreed with the true answer, and a score of 0 if it did not. The evaluation was conducted on all datasets of the NMLE, NPLE and NNLE over the past five years.

Data analysis
Data processing was performed in Python (version 3.9.13, Python Software Foundation) using Jupyter Notebook. Statistical analysis was performed using GraphPad Prism 9. The significance threshold for differences among groups was set at p < 0.05.

Results

Overall performance
As shown in Fig. 2, ChatGPT failed to pass the accuracy threshold of 0.6 in any of the three types of examinations over the five years. Specifically, in the Chinese NMLE, the highest recorded accuracy was 0.5467, attained in both 2018 and 2021. In the Chinese NPLE, the highest accuracy was 0.5599, in 2017. In the Chinese NNLE, the best result was achieved in 2017, with an accuracy of 0.5897, which is also the highest accuracy in our entire evaluation. Conversely, the 2019 NPLE exam yielded the lowest accuracy, with a recorded value of 0.4356.

Detailed performance
The score of each unit in the Chinese NMLE is shown in Table 1. The performance of ChatGPT varied across units and years. In 2017 and 2020, ChatGPT performed best on Unit 2. In 2018 and 2019, ChatGPT performed best on Unit 1. In 2021, ChatGPT performed best on both Unit 2 and Unit 3. In 2018 and 2021, ChatGPT correctly answered 328 out of 600 questions. This is because the complexity and difficulty of the questions in each unit vary from year to year. On average, ChatGPT achieved the highest score on Unit 2 (84.6), followed by Unit 1 (79.8), Unit 3 (78.2), and Unit 4 (75.4).

The score of each unit in the Chinese NPLE is shown in Table 2. In the NPLE, each unit has 120 questions, and each exam has 480 questions. We identified and removed the
Fig. 1 Overview of the interaction with ChatGPT. The questions, each consisting of a background description and choices, were taken from three national licensing examinations: the Chinese National Medical Licensing Examination (NMLE), National Pharmacist Licensing Examination (NPLE) and National Nurse Licensing Examination (NNLE). The prompt was designed to force a clear answer, as well as the ability to recognize single-choice or multiple-choice questions. The responses of ChatGPT were manually reviewed by an experienced clinician to determine the answer. The correct answer to this question is "D. Cor pulmonale". Note that while English text is shown in the figure, the experiment itself used Chinese text as both the input and output language
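The prompting workflow described above can be sketched as follows. This is a minimal illustration rather than the authors' actual code: it assumes the legacy `openai` Python package's `ChatCompletion` interface, and the helper names (`build_prompt`, `ask_chatgpt`) are hypothetical; only the prompt wording follows the paper's examples.

```python
# Sketch of the direct-instruction prompting described above (an assumption,
# not the authors' exact implementation). Prompt strings follow the paper.

def build_prompt(question: str, choices: list[str], multiple: bool) -> str:
    """Combine the background description, choices, and a direct instruction."""
    if multiple:
        instruction = ("This is a multiple choices question, "
                       "please return the correct answer.")
    else:
        instruction = ("Please return the most correct answer. "
                       "Only one best option can be selected.")
    return question + "\n" + "\n".join(choices) + "\n" + instruction

def ask_chatgpt(question: str, choices: list[str], multiple: bool = False) -> str:
    """Send one exam question to gpt-3.5-turbo via the official API."""
    import openai  # legacy 0.x interface assumed; requires an API key

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # default values for the other parameters
        messages=[{"role": "user",
                   "content": build_prompt(question, choices, multiple)}],
    )
    return response["choices"][0]["message"]["content"]
```

In the study, the returned free-text answer was then reviewed by a clinician to determine the predicted option, so no automatic answer parsing is shown here.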

Fig. 2 The performance of ChatGPT on three national licensing examinations over the five-year period from 2017 to 2021. The examinations included the Chinese National Medical Licensing Examination (NMLE), National Pharmacist Licensing Examination (NPLE) and National Nurse Licensing Examination (NNLE)
Table 1 The score of each unit in Chinese National Medical Licensing Examination
Year 2017 2018 2019 2020 2021 Average
Unit1 score 81 87 84 71 76 79.8
Unit2 score 96 83 75 83 86 84.6
Unit3 score 79 79 72 75 86 78.2
Unit4 score 65 79 75 78 80 75.4
Total score 321 328 306 307 328 318
Questions 600 600 597 600 600 -
Accuracy 53.50% 54.67% 51.26% 51.17% 54.67% 53.05%
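The accuracy row of Table 1 follows directly from the unit scores and question counts above (total correct divided by scored questions), as this quick check shows:

```python
# Recompute the NMLE accuracy row of Table 1 from the per-unit scores.
unit_scores = {  # year -> [Unit 1, Unit 2, Unit 3, Unit 4]
    2017: [81, 96, 79, 65],
    2018: [87, 83, 79, 79],
    2019: [84, 75, 72, 75],
    2020: [71, 83, 75, 78],
    2021: [76, 86, 86, 80],
}
questions = {2017: 600, 2018: 600, 2019: 597, 2020: 600, 2021: 600}

PASS_THRESHOLD = 0.6  # official pass criterion used in the paper

for year, scores in unit_scores.items():
    total = sum(scores)
    accuracy = total / questions[year]
    print(f"{year}: {total}/{questions[year]} = {accuracy:.2%}, "
          f"pass: {accuracy >= PASS_THRESHOLD}")
# 2018 and 2021 both give 328/600 = 54.67%, the best NMLE years, still below 0.6
```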

Table 2 The score of each unit in Chinese National Pharmacist Licensing Examination
Year 2017 2018 2019 2020 2021 Average
Unit1 score 57 55 51 48 49 52
Unit2 score 63 54 52 60 61 58
Unit3 score 60 56 49 53 47 53
Unit4 score 77 58 51 59 65 62
Total score 257 223 203 220 222 225
Questions 459 450 466 458 463 -
Accuracy 55.99% 49.56% 43.56% 48.03% 47.95% 49.02%

Table 3 The score of each unit in Chinese National Nurse Licensing Examination
Year 2017 2018 2019 2020 2021 Average
Unit1 score 65 57 72 54 67 63
Unit2 score 73 74 57 53 63 64
Total score 138 131 129 107 130 127
Questions 234 232 238 232 238 -
Accuracy 58.97% 56.47% 54.2% 46.12% 54.62% 54.08%
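The average accuracies reported in Tables 1, 2 and 3 (53.05%, 49.02% and 54.08%) are consistent with simple means of the five yearly accuracies, which this small check confirms:

```python
# Verify that each table's "Average" accuracy is the mean of its yearly accuracies.
yearly_accuracy = {  # percentages from Tables 1-3, years 2017-2021
    "NMLE": [53.50, 54.67, 51.26, 51.17, 54.67],
    "NPLE": [55.99, 49.56, 43.56, 48.03, 47.95],
    "NNLE": [58.97, 56.47, 54.20, 46.12, 54.62],
}
for exam, accs in yearly_accuracy.items():
    print(exam, round(sum(accs) / len(accs), 2))
# NMLE 53.05, NPLE 49.02, NNLE 54.08, matching the reported averages
```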

questions that included figures, tables or chemical structures. Such questions were most numerous in 2018 (30), followed by 2020 (22), 2017 (21), 2021 (17), and 2019 (14). On average, ChatGPT performed best on Unit 4 (62), followed by Unit 2 (58), Unit 3 (53), and Unit 1 (52). In 2017, ChatGPT achieved its highest score, correctly answering 257 out of 459 questions.

Table 3 shows the detailed score of each unit of the Chinese NNLE. In total, 26 questions that included figures, tables or chemical structures were removed. In 2017 and 2018, ChatGPT performed better on Unit 2 than on Unit 1. Conversely, in 2019, 2020 and 2021, ChatGPT performed better on Unit 1 than on Unit 2. On average, ChatGPT's performance on the two units showed no noticeable difference.

In comparison, ChatGPT exhibited better proficiency on the NNLE (54.08%), with the NMLE (53.05%) and NPLE (49.02%) following behind. This result corresponds to the complexity and difficulty of the exam questions.

Performance on different units and question types
Figure 3 presents a comparative analysis of ChatGPT's performance across units and question types. The results show that there was no significant difference across units in the NMLE (Fig. 3A), NPLE (Fig. 3B), or NNLE (Fig. 3C). However, in the NPLE (Fig. 3D), ChatGPT demonstrated higher performance on single-choice questions than on multiple-choice questions, with a highly significant difference (p < 0.0001).

Performance on different subjects and topics
To better understand why ChatGPT failed the Chinese medical examinations, we took the 2021 NMLE exam as an example and labeled the medical subjects and topics of each question (Fig. 4). The results revealed that ChatGPT excelled in clinical epidemiology, human parasitology, and dermatology, answering all questions correctly. However, the model faltered in subjects such as pathology, pathophysiology, public health regulations, physiology, and anatomy, in which the proportion of correct answers was less than 0.5. Additionally, we observed that ChatGPT performed admirably on topics related to molecules, health management and prevention, and diagnosis and screening, but its performance was lackluster on topics such as clinical manifestations, indicator values, structural location, cells, and tissues. Interestingly, we found no significant difference in performance between case-based and non-case-based questions.
Fig. 3 The performance of ChatGPT on different units and question types. Across units, there were no significant differences in the (A) Chinese National Medical Licensing Examination (NMLE), (B) National Pharmacist Licensing Examination (NPLE), and (C) National Nurse Licensing Examination (NNLE). (D) However, ChatGPT demonstrated higher performance on single-choice questions than on multiple-choice questions, with a highly significant difference (ns, no significant difference; ****p < 0.0001)
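The paper does not name the specific test behind the p < 0.0001 result (statistics were computed in GraphPad Prism), so the sketch below uses a standard pooled two-proportion z-test on invented correct/incorrect counts, purely to illustrate how such a single- versus multiple-choice comparison could be made; the counts are placeholders, not the study's data.

```python
import math

def two_proportion_z(correct1: int, n1: int, correct2: int, n2: int):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = correct1 / n1, correct2 / n2
    pooled = (correct1 + correct2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical counts: 220/400 single-choice vs 20/80 multiple-choice correct
z, p = two_proportion_z(220, 400, 20, 80)
print(f"z = {z:.2f}, p = {p:.2g}")  # a gap this large is significant well below 0.0001
```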

Discussion

In this study, we evaluated the performance of ChatGPT, an artificial intelligence chatbot, in answering medical exam questions from the Chinese NMLE, NPLE, and NNLE from 2017 to 2021.

ChatGPT failed the NNLE, NMLE and NPLE in China
The results of our study revealed that ChatGPT was unsuccessful in meeting the requirements of the three primary medical licensure assessments in China, namely the NMLE, NPLE and NNLE, spanning 2017 to 2021. There are several possible reasons for this.

Firstly, according to OpenAI, ChatGPT was trained on data of which the vast majority is in English, with only a small amount in other languages such as Chinese. A richer training dataset allows the model to learn more knowledge. Recent studies have shown that ChatGPT passed the United States Medical Licensing Examination [19, 20]. However, it failed to pass the Taiwanese Pharmacist Licensing Examination [25] and performed worse than medical students on a Korean parasitology examination [28]. These findings suggest that ChatGPT may require additional training data in non-English languages to enhance its performance on non-English medical exams.

Secondly, there are differences in medical policies, legal regulations, and management agencies across countries with different languages or cultures. ChatGPT, being a language model trained on a diverse range of data, may not possess an in-depth understanding of a specific legal framework and its requirements. In the Chinese NMLE, some questions relate to healthcare policies, while in the Chinese NPLE, the entire Unit 4 is officially designated as pharmaceutical management and regulations. These questions cover topics such as drug production, market circulation, pharmaceutical management, and legal regulations, and aim to assess awareness of, and the ability to comply with, regulations in clinical practice. Generally, these questions are relatively short in length and clear in meaning. While ChatGPT has acquired a wealth of knowledge on the healthcare policies of English-speaking countries from its extensive English dataset, it may have difficulty correctly understanding the healthcare policies of non-English-speaking countries, leading to erroneous responses to related questions. Additionally, healthcare policies undergo regular updates over time, making such questions more challenging.
Fig. 4 The performance of ChatGPT on different subjects, topics and types of questions in the 2021 NMLE exam

Thirdly, while ChatGPT has a remarkable ability to process and generate text, its proficiency in numerical computation is limited. Questions involving mathematical calculation, such as dosage calculation and laboratory value interpretation, may pose challenges for a language model. Additionally, in some cases the task requires reading the question and selecting the most suitable answer from choices that include suboptimal options; in such cases, ChatGPT is forced to select a single choice as the answer, which can limit its content comprehension and lead to incorrect answers. These findings provide deep insight into the strengths and weaknesses of ChatGPT on Chinese medical examinations and pave the way for future research to improve the model's capabilities in this domain.

The potential of large language models in medical education
As a significant milestone in the development of artificial intelligence, ChatGPT, driven by a large language model, has powerful capabilities in language understanding and content generation. With its remarkable potential, ChatGPT could be a valuable resource for students acquiring medical knowledge and learning clinical skills, and could serve as an informative assistant for teachers preparing teaching materials and evaluating course projects.

In our study, ChatGPT achieved an accuracy of over 0.5 on most of the exams, indicating significant potential for ChatGPT in medical education. A previous study showed that in the Chinese Rural General Medical Licensing Examination, only 55% of students were able to pass the written examination [29]. In China, the
significant healthcare burden necessitates a vast number of licensed clinical staff and healthcare providers. However, the rigorous examinations lead to low pass rates, exacerbating the shortage of licensed practitioners, especially in rural areas. Large language models present a promising avenue for enhancing medical education and advancing healthcare reform, with the potential to reduce the medical burden.

Finally, the advancement of artificial intelligence (AI), specifically large language models, in medical education needs public benchmarking datasets and fair evaluation metrics for performance assessment. There is also a need to interact with human experts across multiple dimensions and to obtain continuous feedback. In addition, the use of such models must also consider data privacy and cognitive bias, and comply with regulations.

Limitations
Our study has some limitations. First, the questions of the Chinese NMLE, NPLE, and NNLE are all in multiple-choice format. While this format met our study purposes, it did not fully showcase the content generation capabilities of ChatGPT. In the future, it would be beneficial to include more open-ended questions. Second, we evaluated the performance of ChatGPT on medical examinations with zero-shot learning; better performance may be achieved by incorporating knowledge-enhanced training methods. Third, different variations of the prompt may affect ChatGPT's responses, leading to diverse answers. Therefore, it is imperative to develop innovative techniques that can generate more consistent and trustworthy responses in the future. Finally, further investigation is needed to determine the underlying factors contributing to this substandard performance, and to explore broader applications of ChatGPT in medical education and clinical decision-making support.

Conclusion
In conclusion, we evaluated the performance of ChatGPT on three types of national medical examinations in China (the NMLE, NPLE, and NNLE) from 2017 to 2021. The results indicate that ChatGPT failed to meet the official pass criterion (an accuracy threshold of 0.6) in any of the three types of examinations over the five years. The performance of ChatGPT varied across units and years, with the highest accuracy achieved on the NNLE of 2017. ChatGPT exhibited relatively better proficiency on the NNLE, with the NMLE and NPLE following closely behind. ChatGPT performed well in a range of subject areas, including clinical epidemiology, human parasitology, and dermatology, as well as in various medical topics such as molecules, health management and prevention, and diagnosis and screening.

Author contributions
H.Z., J.L., E.W., R.W., J.L. and B.S. were involved in the study conceptualization. J.L. collected and preprocessed the data. H.Z. conducted data analysis, results interpretation and manuscript preparation. H.Z. and B.S. contributed to the review and editing of the manuscript. B.S. supervised the study. All authors read and approved the final manuscript.

Funding
This work was supported by the National Natural Science Foundation of China (32270690 and 32070671).

Data availability
The data analyzed and reported in this study are available at https://github.com/zonghui0228/LLM-Chinese-NMLE.git.

Declarations

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
The authors declare no competing interests.

Received: 31 August 2023 / Accepted: 1 February 2024

References
1. Bhinder B, et al. Artificial Intelligence in Cancer Research and Precision Medicine. Cancer Discov. 2021;11(4):900–15.
2. Moor M, et al. Foundation models for generalist medical artificial intelligence. Nature. 2023;616(7956):259–65.
3. van Dis EAM, et al. ChatGPT: five priorities for research. Nature. 2023;614(7947):224–6.
4. Sarink MJ, et al. A study on the performance of ChatGPT in infectious diseases clinical consultation. Clin Microbiol Infect. 2023.
5. Lee TC, et al. ChatGPT answers common patient questions about colonoscopy. Gastroenterology. 2023.
6. Young JN, et al. The utility of ChatGPT in generating patient-facing and clinical responses for melanoma. J Am Acad Dermatol. 2023.
7. Luo R, et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief Bioinform. 2022;23(6).
8. Zhang Y, et al. DIALOGPT: large-scale generative pre-training for conversational response generation. Online: Association for Computational Linguistics; 2020.
9. Das A, et al. Conversational bots for psychotherapy: a study of generative transformer models using domain-specific dialogues. Dublin, Ireland: Association for Computational Linguistics; 2022.
10. Komorowski M, Del Pilar Arias Lopez M, Chang AC. How could ChatGPT impact my practice as an intensivist? An overview of potential applications, risks and limitations. Intensive Care Med. 2023.
11. Munoz-Zuluaga C, et al. Assessing the accuracy and clinical utility of ChatGPT in laboratory medicine. Clin Chem. 2023.
12. Yang H. How I use ChatGPT responsibly in my teaching. Nature. 2023.
13. Abd-Alrazaq A, et al. Large language models in medical education: opportunities, challenges, and future directions. JMIR Med Educ. 2023;9:e48291.
14. Jin Q, Leaman R, Lu Z. Retrieve, summarize, and verify: how will ChatGPT affect information seeking from the medical literature? J Am Soc Nephrol. 2023.
15. Kovoor JG, Gupta AK, Bacchi S. ChatGPT: effective writing is succinct. BMJ. 2023;381:1125.
16. Shafiee A. Matters arising: authors of research papers must cautiously use ChatGPT for scientific writing. Int J Surg. 2023.
17. Gao CA, et al. Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers. NPJ Digit Med. 2023;6(1):75.
18. Salvagno M, Taccone FS, Gerli AG. Can artificial intelligence help for scientific writing? Crit Care. 2023;27(1):75.
19. Kung TH, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198.
20. Gilson A, et al. How does ChatGPT perform on the United States Medical Licensing Examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023;9:e45312.
21. Bhayana R, Krishna S, Bleakney RR. Performance of ChatGPT on a Radiology Board-style examination: insights into current strengths and limitations. Radiology. 2023;307(5):e230582.
22. Giannos P. Evaluating the limits of AI in medical specialisation: ChatGPT's performance on the UK Neurology Specialty Certificate Examination. BMJ Neurol Open. 2023;5(1):e000451.
23. Humar P, et al. ChatGPT is equivalent to first year plastic surgery residents: evaluation of ChatGPT on the Plastic Surgery In-Service Exam. Aesthet Surg J. 2023.
24. Weng TL, et al. ChatGPT failed Taiwan's Family Medicine Board Exam. J Chin Med Assoc. 2023.
25. Wang YM, Shen HW, Chen TJ. Performance of ChatGPT on the Pharmacist Licensing Examination in Taiwan. J Chin Med Assoc. 2023.
26. Seghier ML. ChatGPT: not all languages are equal. Nature. 2023;615(7951):216.
27. Wang X. Experiences, challenges, and prospects of the National Medical Licensing Examination in China. BMC Med Educ. 2022;22(1):349.
28. Huh S. Are ChatGPT's knowledge and interpretation ability comparable to those of medical students in Korea for taking a parasitology examination? A descriptive study. J Educ Eval Health Prof. 2023;20:1.
29. Han X, et al. Performance of China's new medical licensing examination for rural general practice. BMC Med Educ. 2020;20(1):314.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
