
Zong et al. BMC Medical Education (2024) 24:143
https://doi.org/10.1186/s12909-024-05125-7

RESEARCH (Open Access)

Performance of ChatGPT on Chinese national medical licensing examinations: a five-year examination evaluation study for physicians, pharmacists and nurses

Hui Zong1†, Jiakun Li1†, Erman Wu1, Rongrong Wu1, Junyu Lu1 and Bairong Shen1*

Abstract

Background: Large language models like ChatGPT have revolutionized the field of natural language processing with their capability to comprehend and generate textual content, showing great potential to play a role in medical education. This study aimed to quantitatively evaluate and comprehensively analyze the performance of ChatGPT on three types of national medical examinations in China: the National Medical Licensing Examination (NMLE), the National Pharmacist Licensing Examination (NPLE), and the National Nurse Licensing Examination (NNLE).

Methods: We collected questions from the Chinese NMLE, NPLE and NNLE from 2017 to 2021. In the NMLE and NPLE, each exam consists of 4 units, while in the NNLE, each exam consists of 2 units. Questions with figures, tables or chemical structures were manually identified and excluded by a clinician. We applied a direct instruction strategy via multiple prompts to force ChatGPT to generate a clear answer, with the capability to distinguish between single-choice and multiple-choice questions.

Results: ChatGPT failed to pass the accuracy threshold of 0.6 in any of the three types of examinations over the five years. Specifically, in the NMLE, the highest recorded accuracy was 0.5467, attained in both 2018 and 2021. In the NPLE, the highest accuracy was 0.5599, in 2017. In the NNLE, the best result was achieved in 2017, with an accuracy of 0.5897, which is also the highest accuracy in our entire evaluation. ChatGPT's performance showed no significant difference across units, but a significant difference across question types. ChatGPT performed well in a range of subject areas, including clinical epidemiology, human parasitology, and dermatology, as well as in various medical topics such as molecules, health management and prevention, and diagnosis and screening.

Conclusions: These results indicate that ChatGPT failed the NMLE, NPLE and NNLE in China from 2017 to 2021, but they show the great potential of large language models in medical education. In the future, high-quality medical data will be required to improve performance.

Keywords: Medical education, Medical examination, Artificial intelligence, Natural language processing, ChatGPT

† Hui Zong and Jiakun Li contributed equally to this work.
*Correspondence: Bairong Shen, bairong.shen@scu.edu.cn
1 Department of Urology and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, No. 37, Guoxue Alley, Chengdu 610212, China

© The Author(s) 2024. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use,
sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and
the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included
in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will
need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The
Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available
in this article, unless otherwise stated in a credit line to the data.
Introduction

In the last decade, artificial intelligence (AI) technology has undergone a rapid evolution, achieving noteworthy breakthroughs in numerous fields [1, 2]. Recently, one breakthrough that has garnered considerable attention is ChatGPT [3], an AI chatbot powered by the generative pre-trained transformer (GPT) architecture, specifically GPT-3.5 with 175 billion parameters. This technology was developed through reinforcement learning from human feedback and trained on extensive textual data. ChatGPT exhibits remarkable capabilities across various tasks, including but not limited to intelligent dialogue [4], knowledge question answering [5], and text generation [6], showcasing unprecedented potential for further development.

In the medical domain, there has been growing interest in exploring large language models for tasks such as biomedical question answering (BioGPT [7]) and automatic dialogue generation (DialoGPT [8, 9]). Regrettably, these studies have so far demonstrated limited practical utility in clinical practice. ChatGPT, however, with its powerful language understanding and generation capabilities, shows significant potential in clinical response generation [5, 6], clinical decision support [4, 10, 11], medical education [12, 13], literature information retrieval [14], scientific writing [15–18], and beyond.

Recent studies have demonstrated that ChatGPT can pass the United States Medical Licensing Exam (USMLE) [19, 20], a Radiology Board-style Examination [21], the UK Neurology Specialty Certificate Examination [22], and the Plastic Surgery In-Service Exam [23], with results comparable to those of human experts. Nevertheless, other studies have indicated that ChatGPT failed to pass a Family Medicine Board Exam [24] and a Pharmacist Qualification Examination [25]. Possible explanations for this performance difference include language and cultural differences and variations in examination content [26]. These studies highlight ChatGPT's ability to comprehend the complex language used in medical contexts and its potential for use in medical education. However, current research is limited in two aspects: firstly, it largely focuses on the English language, and secondly, it predominantly emphasizes the physician's examination. Additional investigation is necessary to explore the potential of ChatGPT in other, non-English languages and in various medical examinations, which could deliver substantial benefits for its expanded application in the medical domain.

China, with a population of over 1.4 billion, faces a significant medical burden. The provision of healthcare services involves a collaborative effort among physicians, pharmacists, and nurses, who work diligently to offer the best possible care to patients. Physicians are responsible for diagnosing and treating illnesses, pharmacists ensure the appropriate medication is dispensed and administered correctly, and nurses attend to patients' daily medical management and care. Due to limited medical resources, medical professionals in China face immense pressure but remain committed to providing high-quality services. The advent of ChatGPT offers a promising solution to ease this burden by delivering intelligent, efficient, and precise medical services to physicians, pharmacists, and nurses.

Medical examinations, including the Chinese National Medical Licensing Examination (NMLE), the Chinese National Pharmacist Licensing Examination (NPLE), and the Chinese National Nurse Licensing Examination (NNLE), are implemented by the government to improve professional standards, ensure medical safety, and enhance the quality of healthcare services [27]. In the NMLE, there are 4 units, each containing 150 questions, for a total of 600 questions. The NMLE is designed around 4 modules: Basic Medical Sciences, Medical Humanities, Clinical Medicine, and Preventive Medicine. It is important to note that the questions of each module are randomly distributed across the units, and the number of questions devoted to each module is not fixed. In the NPLE, there are 4 units, each with 120 questions, for a total of 480 questions. The 4 units correspond to 4 specific modules: Pharmaceutical Knowledge I, Pharmaceutical Knowledge II, Pharmaceutical Management and Regulations, and Comprehensive Pharmacy Knowledge and Skills. In the NNLE, there are 2 units, each with 120 questions, for a total of 240 questions. Unit 1 focuses on clinical knowledge and Unit 2 on clinical skills. Through these medical examinations, the medical knowledge, clinical skills, and ethical standards mastered by medical staff can significantly improve the quality of their services. This, in turn, can reduce the incidence of medical errors and accidents and protect patients' fundamental right to health and safety.

These medical licensing examinations aim to comprehensively evaluate a candidate's knowledge of medical science, clinical examination, disease diagnosis, surgical treatment, patient prognosis, policies, and regulations, among other areas. Successfully passing these examinations is a prerequisite for obtaining professional certification as a physician, pharmacist, or nurse. The annual number of test-takers is high, while the number of successful candidates remains relatively low. For the NMLE, according to the official website and news reports, there were approximately 530,000 test-takers in 2017, around 600,000 in 2018, around 540,000 in 2019, around 490,000 in 2020, around 530,000 in 2021, and around 510,000 in 2022. For the NPLE, according to data from the official website of the Certification Center for Licensed Pharmacist of NMPA, in 2017 the number of test-takers was 523,296, with
a pass rate of 29.19%. In 2018, there were 566,613 test-takers, with 79,900 successful candidates and a pass rate of 14.10%. In 2019, there were 133,000 successful candidates, for a pass rate of 18.72%. In 2020, there were 610,132 test-takers, but the number of successful candidates was not released. In 2021, there were 450,973 test-takers, with 80,840 successful candidates and a pass rate of 17.93%. In 2022, there were 495,419 test-takers, with 97,400 successful candidates and a pass rate of 19.66%. For the NNLE, the total number of test-takers each year from 2012 to 2020 ranged between approximately 690,000 and 730,000, with the number of successful candidates ranging from approximately 380,000 to 420,000.

In this study, we aimed to quantitatively evaluate the performance of ChatGPT on three types of national medical examinations in China, namely the NMLE, NPLE, and NNLE. To enhance the reliability of our findings, we meticulously collected a substantial corpus of real-world medical question-answer data from examinations conducted from 2017 to 2021. We also conducted a comparative analysis of performance across different units. For cases where incorrect responses were generated, we solicited feedback from domain experts and performed a thorough assessment and error analysis. Our study yields valuable insights for researchers and developers seeking to improve large language models' performance in the medical domain.

Methods

Medical examination datasets
We collected questions from the Chinese NMLE, NPLE and NNLE from 2017 to 2021. In the NMLE, each exam consists of 4 units, each with 150 questions, for a total of 600 questions. In the NPLE, each exam consists of 4 units, each with 120 questions, for a total of 480 questions. In the NNLE, each exam consists of 2 units, each with 120 questions, for a total of 240 questions. Based on the requirements of the examinations, a correct response rate exceeding the accuracy threshold of 0.6 is considered to meet the passing criteria. Questions with figures, tables or chemical structures were manually identified and excluded by a clinician with five years of clinical experience.

Model setting
We employed ChatGPT, an artificial intelligence chatbot built upon generative pre-trained transformer technology. The official API was utilized to invoke the chatbot, with gpt-3.5-turbo as the model parameter and default values for the other parameters. As shown in Fig. 1, the input question consisted of the background description and choices. To elicit clear responses, we applied a direct instruction strategy via prompts such as "Please return the most correct answer", "Only one best option can be selected", "The correct choice is", and "This is a multiple choices question, please return the correct answer". These prompts force the model to generate a clear answer, and give it the capability to distinguish between single-choice and multiple-choice questions.

Evaluation
For each question, the response of ChatGPT was reviewed by an experienced clinician to determine the predicted answer, which was then compared with the true answer. The score was calculated based on whether the answers matched: a score of 1 was awarded if the predicted answer agreed with the true answer, and a score of 0 if it did not. The evaluation was conducted on all datasets of the NMLE, NPLE and NNLE over the past five years.

Data analysis
Data processing was performed in Python (version 3.9.13, Python Software Foundation) using Jupyter Notebook. Statistical analysis was performed using GraphPad Prism 9. The significance threshold for differences among groups was set at p < 0.05.

Results

Overall performance
As shown in Fig. 2, ChatGPT failed to pass the accuracy threshold of 0.6 in any of the three types of examinations over the five years. Specifically, in the Chinese NMLE, the highest recorded accuracy was 0.5467, attained in both 2018 and 2021. In the Chinese NPLE, the highest accuracy was 0.5599, in 2017. In the Chinese NNLE, the best result was achieved in 2017, with an accuracy of 0.5897, which is also the highest accuracy in our entire evaluation. Conversely, the 2019 NPLE exam yielded the lowest accuracy, with a recorded value of 0.4356.

Detailed performance
The score of each unit in the Chinese NMLE is shown in Table 1. The performance of ChatGPT varied across units and years. In 2017 and 2020, ChatGPT performed best on Unit 2. In 2018 and 2019, ChatGPT performed best on Unit 1. In 2021, ChatGPT performed best on both Unit 2 and Unit 3. In 2018 and 2021, ChatGPT correctly answered 328 out of 600 questions. This is because the complexity and difficulty of the questions in each unit vary from year to year. On average, ChatGPT achieved the highest score on Unit 2 (84.6), followed by Unit 1 (79.8), Unit 3 (78.2), and Unit 4 (75.4).

The score of each unit in the Chinese NPLE is shown in Table 2. In the NPLE, each unit has 120 questions, and each exam has 480 questions. We identified and removed the
Fig. 1 Overview of the interaction with ChatGPT. The questions, each consisting of a background description and choices, were taken from three national licensing examinations: the Chinese National Medical Licensing Examination (NMLE), National Pharmacist Licensing Examination (NPLE) and National Nurse Licensing Examination (NNLE). The prompt was designed to force a clear answer, as well as the ability to recognize single-choice or multiple-choice questions. The responses of ChatGPT were manually reviewed by an experienced clinician to determine the answer. The correct answer to this question is "D. Cor pulmonale". Note that while English text is shown in the figure, the experiment itself used Chinese text as both the input and output language
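The prompting workflow described above can be sketched as follows. This is a minimal illustration rather than the authors' actual code: it assumes the legacy `openai` Python package's `ChatCompletion` interface, and the helper names (`build_prompt`, `ask_chatgpt`) are hypothetical; only the prompt wording follows the paper's examples.

```python
# Sketch of the direct-instruction prompting described above (an assumption,
# not the authors' exact implementation). Prompt strings follow the paper.

def build_prompt(question: str, choices: list[str], multiple: bool) -> str:
    """Combine the background description, choices, and a direct instruction."""
    if multiple:
        instruction = ("This is a multiple choices question, "
                       "please return the correct answer.")
    else:
        instruction = ("Please return the most correct answer. "
                       "Only one best option can be selected.")
    return question + "\n" + "\n".join(choices) + "\n" + instruction

def ask_chatgpt(question: str, choices: list[str], multiple: bool = False) -> str:
    """Send one exam question to gpt-3.5-turbo via the official API."""
    import openai  # legacy 0.x interface assumed; requires an API key

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # default values for the other parameters
        messages=[{"role": "user",
                   "content": build_prompt(question, choices, multiple)}],
    )
    return response["choices"][0]["message"]["content"]
```

In the study, the returned free-text answer was then reviewed by a clinician to determine the predicted option, so no automatic answer parsing is shown here.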

Fig. 2 The performance of ChatGPT on three national licensing examinations over the five-year period from 2017 to 2021. The examinations included the Chinese National Medical Licensing Examination (NMLE), National Pharmacist Licensing Examination (NPLE) and National Nurse Licensing Examination (NNLE)
Table 1 The score of each unit in Chinese National Medical Licensing Examination
Year 2017 2018 2019 2020 2021 Average
Unit1 score 81 87 84 71 76 79.8
Unit2 score 96 83 75 83 86 84.6
Unit3 score 79 79 72 75 86 78.2
Unit4 score 65 79 75 78 80 75.4
Total score 321 328 306 307 328 318
Questions 600 600 597 600 600 -
Accuracy 53.50% 54.67% 51.26% 51.17% 54.67% 53.05%
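The accuracy row of Table 1 follows directly from the unit scores and question counts above (total correct divided by scored questions), as this quick check shows:

```python
# Recompute the NMLE accuracy row of Table 1 from the per-unit scores.
unit_scores = {  # year -> [Unit 1, Unit 2, Unit 3, Unit 4]
    2017: [81, 96, 79, 65],
    2018: [87, 83, 79, 79],
    2019: [84, 75, 72, 75],
    2020: [71, 83, 75, 78],
    2021: [76, 86, 86, 80],
}
questions = {2017: 600, 2018: 600, 2019: 597, 2020: 600, 2021: 600}

PASS_THRESHOLD = 0.6  # official pass criterion used in the paper

for year, scores in unit_scores.items():
    total = sum(scores)
    accuracy = total / questions[year]
    print(f"{year}: {total}/{questions[year]} = {accuracy:.2%}, "
          f"pass: {accuracy >= PASS_THRESHOLD}")
# 2018 and 2021 both give 328/600 = 54.67%, the best NMLE years, still below 0.6
```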

Table 2 The score of each unit in Chinese National Pharmacist Licensing Examination
Year 2017 2018 2019 2020 2021 Average
Unit1 score 57 55 51 48 49 52
Unit2 score 63 54 52 60 61 58
Unit3 score 60 56 49 53 47 53
Unit4 score 77 58 51 59 65 62
Total score 257 223 203 220 222 225
Questions 459 450 466 458 463 -
Accuracy 55.99% 49.56% 43.56% 48.03% 47.95% 49.02%

Table 3 The score of each unit in Chinese National Nurse Licensing Examination
Year 2017 2018 2019 2020 2021 Average
Unit1 score 65 57 72 54 67 63
Unit2 score 73 74 57 53 63 64
Total score 138 131 129 107 130 127
Questions 234 232 238 232 238 -
Accuracy 58.97% 56.47% 54.2% 46.12% 54.62% 54.08%
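The average accuracies reported in Tables 1, 2 and 3 (53.05%, 49.02% and 54.08%) are consistent with simple means of the five yearly accuracies, which this small check confirms:

```python
# Verify that each table's "Average" accuracy is the mean of its yearly accuracies.
yearly_accuracy = {  # percentages from Tables 1-3, years 2017-2021
    "NMLE": [53.50, 54.67, 51.26, 51.17, 54.67],
    "NPLE": [55.99, 49.56, 43.56, 48.03, 47.95],
    "NNLE": [58.97, 56.47, 54.20, 46.12, 54.62],
}
for exam, accs in yearly_accuracy.items():
    print(exam, round(sum(accs) / len(accs), 2))
# NMLE 53.05, NPLE 49.02, NNLE 54.08, matching the reported averages
```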

questions that included figures, tables or chemical structures. Such questions were most numerous in 2018 (30), followed by 2020 (22), 2017 (21), 2021 (17), and 2019 (14). On average, ChatGPT performed best on Unit 4 (62), followed by Unit 2 (58), Unit 3 (53), and Unit 1 (52). In 2017, ChatGPT achieved its highest score, correctly answering 257 out of 459 questions.

Table 3 shows the detailed score of each unit of the Chinese NNLE. In total, 26 questions that included figures, tables or chemical structures were removed. In 2017 and 2018, ChatGPT performed better on Unit 2 than on Unit 1. Conversely, in 2019, 2020 and 2021, ChatGPT performed better on Unit 1 than on Unit 2. On average, ChatGPT's performance on the two units showed no noticeable difference.

In comparison, ChatGPT exhibited better proficiency on the NNLE (54.08%), with the NMLE (53.05%) and NPLE (49.02%) following behind. This result corresponds to the complexity and difficulty of the exam questions.

Performance on different units and question types
Figure 3 presents a comparative analysis of ChatGPT's performance across units and question types. The results show that there was no significant difference across units in the NMLE (Fig. 3A), NPLE (Fig. 3B), or NNLE (Fig. 3C). However, in the NPLE (Fig. 3D), ChatGPT demonstrated higher performance on single-choice questions than on multiple-choice questions, with a highly significant difference (p < 0.0001).

Performance on different subjects and topics
To better understand why ChatGPT failed the Chinese medical examinations, we took the 2021 NMLE exam as an example and labeled the medical subjects and topics of each question (Fig. 4). The results revealed that ChatGPT excelled in clinical epidemiology, human parasitology, and dermatology, answering all questions correctly. However, the model faltered in subjects such as pathology, pathophysiology, public health regulations, physiology, and anatomy, in which the proportion of correct answers was less than 0.5. Additionally, we observed that ChatGPT performed admirably on topics related to molecules, health management and prevention, and diagnosis and screening, but its performance was lackluster on topics such as clinical manifestations, indicator values, structural location, cells, and tissues. Interestingly, we found no significant difference in performance between case-based and non-case-based questions.
Fig. 3 The performance of ChatGPT on different units and question types. Across units, there were no significant differences in the (A) Chinese National Medical Licensing Examination (NMLE), (B) National Pharmacist Licensing Examination (NPLE), and (C) National Nurse Licensing Examination (NNLE). (D) However, ChatGPT demonstrated higher performance on single-choice questions than on multiple-choice questions, with a highly significant difference (ns, no significant difference; ****p < 0.0001)
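The paper does not name the specific test behind the p < 0.0001 result (statistics were computed in GraphPad Prism), so the sketch below uses a standard pooled two-proportion z-test on invented correct/incorrect counts, purely to illustrate how such a single- versus multiple-choice comparison could be made; the counts are placeholders, not the study's data.

```python
import math

def two_proportion_z(correct1: int, n1: int, correct2: int, n2: int):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = correct1 / n1, correct2 / n2
    pooled = (correct1 + correct2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical counts: 220/400 single-choice vs 20/80 multiple-choice correct
z, p = two_proportion_z(220, 400, 20, 80)
print(f"z = {z:.2f}, p = {p:.2g}")  # a gap this large is significant well below 0.0001
```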

Discussion

In this study, we evaluated the performance of ChatGPT, an artificial intelligence chatbot, in answering medical exam questions from the Chinese NMLE, NPLE, and NNLE from 2017 to 2021.

ChatGPT failed the NNLE, NMLE and NPLE in China
The results of our study revealed that ChatGPT was unsuccessful in meeting the requirements of the three primary medical licensure assessments in China, namely the NMLE, NPLE and NNLE, spanning 2017 to 2021. There are several possible reasons for this.

Firstly, according to OpenAI, ChatGPT was trained on data of which the vast majority is in English, with only a small amount in other languages such as Chinese. A richer training dataset allows the model to learn more knowledge. Recent studies have shown that ChatGPT passed the United States Medical Licensing Examination [19, 20]. However, it failed to pass the Taiwanese Pharmacist Licensing Examination [25] and performed worse than medical students on a Korean parasitology examination [28]. These findings suggest that ChatGPT may require additional training data in non-English languages to enhance its performance on non-English medical exams.

Secondly, there are differences in medical policies, legal regulations, and management agencies across countries with different languages or cultures. ChatGPT, being a language model trained on a diverse range of data, may not possess an in-depth understanding of a specific legal framework and its requirements. In the Chinese NMLE, some questions relate to healthcare policies, while in the Chinese NPLE, the entire Unit 4 is officially designated as pharmaceutical management and regulations. These questions cover topics such as drug production, market circulation, pharmaceutical management, and legal regulations, and aim to assess awareness of, and the ability to comply with, regulations in clinical practice. Generally, these questions are relatively short in length and clear in meaning. While ChatGPT has acquired a wealth of knowledge on the healthcare policies of English-speaking countries from its extensive English dataset, it may have difficulty correctly understanding the healthcare policies of non-English-speaking countries, leading to erroneous responses to related questions. Additionally, healthcare policies undergo regular updates over time, making such questions more challenging.
Fig. 4 The performance of ChatGPT on different subjects, topics and types of questions in the 2021 NMLE exam

Thirdly, while ChatGPT has a remarkable ability to process and generate text, its proficiency in numerical computation is limited. Questions involving mathematical calculation, such as dosage calculation and laboratory value interpretation, may pose challenges for a language model. Additionally, in some cases the task requires reading the question and selecting the most suitable answer from choices that include suboptimal options; in such cases, ChatGPT is forced to select a single choice as the answer, which can limit its content comprehension and lead to incorrect answers. These findings provide deep insight into the strengths and weaknesses of ChatGPT on Chinese medical examinations and pave the way for future research to improve the model's capabilities in this domain.

The potential of large language models in medical education
As a significant milestone in the development of artificial intelligence, ChatGPT, driven by a large language model, has powerful capabilities in language understanding and content generation. With its remarkable potential, ChatGPT could be a valuable resource for students acquiring medical knowledge and learning clinical skills, and could serve as an informative assistant for teachers preparing teaching materials and evaluating course projects.

In our study, ChatGPT achieved an accuracy of over 0.5 on most of the exams, indicating significant potential for ChatGPT in medical education. A previous study showed that in the Chinese Rural General Medical Licensing Examination, only 55% of students were able to pass the written examination [29]. In China, the
significant healthcare burden necessitates a vast number of licensed clinical staff and healthcare providers. However, the rigorous examinations lead to low pass rates, exacerbating the shortage of licensed practitioners, especially in rural areas. Large language models present a promising avenue for enhancing medical education and advancing healthcare reform, with the potential to reduce the medical burden.

Finally, the advancement of artificial intelligence (AI), specifically large language models, in medical education needs public benchmarking datasets and fair evaluation metrics for performance assessment. There is also a need to interact with human experts across multiple dimensions and to obtain continuous feedback. In addition, the use of such models must also consider data privacy and cognitive bias, and comply with regulations.

Limitations
Our study has some limitations. First, the questions of the Chinese NMLE, NPLE, and NNLE are all in multiple-choice format. While this format met our study purposes, it did not fully showcase the content generation capabilities of ChatGPT. In the future, it would be beneficial to include more open-ended questions. Second, we evaluated the performance of ChatGPT on medical examinations with zero-shot learning; better performance may be achieved by incorporating knowledge-enhanced training methods. Third, different variations of the prompt may affect ChatGPT's responses, leading to diverse answers. Therefore, it is imperative to develop innovative techniques that can generate more consistent and trustworthy responses in the future. Finally, further investigation is needed to determine the underlying factors contributing to this substandard performance, and to explore broader applications of ChatGPT in medical education and clinical decision-making support.

Conclusion
In conclusion, we evaluated the performance of ChatGPT on three types of national medical examinations in China (the NMLE, NPLE, and NNLE) from 2017 to 2021. The results indicate that ChatGPT failed to meet the official pass criterion (an accuracy threshold of 0.6) in any of the three types of examinations over the five years. The performance of ChatGPT varied across units and years, with the highest accuracy achieved on the NNLE of 2017. ChatGPT exhibited relatively better proficiency on the NNLE, with the NMLE and NPLE following closely behind. ChatGPT performed well in a range of subject areas, including clinical epidemiology, human parasitology, and dermatology, as well as in various medical topics such as molecules, health management and prevention, and diagnosis and screening.

Author contributions
H.Z., J.L., E.W., R.W., J.L. and B.S. were involved in the study conceptualization. J.L. collected and preprocessed the data. H.Z. conducted data analysis, results interpretation and manuscript preparation. H.Z. and B.S. contributed to the review and editing of the manuscript. B.S. supervised the study. All authors read and approved the final manuscript.

Funding
This work was supported by the National Natural Science Foundation of China (32270690 and 32070671).

Data availability
The data analyzed and reported in this study are available at https://github.com/zonghui0228/LLM-Chinese-NMLE.git.

Declarations

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
The authors declare no competing interests.

Received: 31 August 2023 / Accepted: 1 February 2024

References
1. Bhinder B, et al. Artificial Intelligence in Cancer Research and Precision Medicine. Cancer Discov. 2021;11(4):900–15.
2. Moor M, et al. Foundation models for generalist medical artificial intelligence. Nature. 2023;616(7956):259–65.
3. van Dis EAM, et al. ChatGPT: five priorities for research. Nature. 2023;614(7947):224–6.
4. Sarink MJ, et al. A study on the performance of ChatGPT in infectious diseases clinical consultation. Clin Microbiol Infect. 2023.
5. Lee TC, et al. ChatGPT answers common patient questions about colonoscopy. Gastroenterology. 2023.
6. Young JN, et al. The utility of ChatGPT in generating patient-facing and clinical responses for melanoma. J Am Acad Dermatol. 2023.
7. Luo R, et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief Bioinform. 2022;23(6).
8. Zhang Y, et al. DIALOGPT: large-scale generative pre-training for conversational response generation. Online: Association for Computational Linguistics; 2020.
9. Das A, et al. Conversational bots for psychotherapy: a study of generative transformer models using domain-specific dialogues. Dublin, Ireland: Association for Computational Linguistics; 2022.
10. Komorowski M, Del Pilar Arias Lopez M, Chang AC. How could ChatGPT impact my practice as an intensivist? An overview of potential applications, risks and limitations. Intensive Care Med. 2023.
11. Munoz-Zuluaga C, et al. Assessing the accuracy and clinical utility of ChatGPT in laboratory medicine. Clin Chem. 2023.
12. Yang H. How I use ChatGPT responsibly in my teaching. Nature. 2023.
13. Abd-Alrazaq A, et al. Large language models in medical education: opportunities, challenges, and future directions. JMIR Med Educ. 2023;9:e48291.
14. Jin Q, Leaman R, Lu Z. Retrieve, summarize, and verify: how will ChatGPT affect information seeking from the medical literature? J Am Soc Nephrol. 2023.
15. Kovoor JG, Gupta AK, Bacchi S. ChatGPT: effective writing is succinct. BMJ. 2023;381:1125.
16. Shafiee A. Matters arising: authors of research papers must cautiously use ChatGPT for scientific writing. Int J Surg. 2023.
17. Gao CA, et al. Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers. NPJ Digit Med. 2023;6(1):75.
18. Salvagno M, Taccone FS, Gerli AG. Can artificial intelligence help for scientific writing? Crit Care. 2023;27(1):75.
19. Kung TH, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198.
20. Gilson A, et al. How does ChatGPT perform on the United States Medical Licensing Examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023;9:e45312.
21. Bhayana R, Krishna S, Bleakney RR. Performance of ChatGPT on a Radiology Board-style examination: insights into current strengths and limitations. Radiology. 2023;307(5):e230582.
22. Giannos P. Evaluating the limits of AI in medical specialisation: ChatGPT's performance on the UK Neurology Specialty Certificate Examination. BMJ Neurol Open. 2023;5(1):e000451.
23. Humar P, et al. ChatGPT is equivalent to first year plastic surgery residents: evaluation of ChatGPT on the Plastic Surgery In-Service Exam. Aesthet Surg J. 2023.
24. Weng TL, et al. ChatGPT failed Taiwan's Family Medicine Board Exam. J Chin Med Assoc. 2023.
25. Wang YM, Shen HW, Chen TJ. Performance of ChatGPT on the Pharmacist Licensing Examination in Taiwan. J Chin Med Assoc. 2023.
26. Seghier ML. ChatGPT: not all languages are equal. Nature. 2023;615(7951):216.
27. Wang X. Experiences, challenges, and prospects of the National Medical Licensing Examination in China. BMC Med Educ. 2022;22(1):349.
28. Huh S. Are ChatGPT's knowledge and interpretation ability comparable to those of medical students in Korea for taking a parasitology examination? A descriptive study. J Educ Eval Health Prof. 2023;20:1.
29. Han X, et al. Performance of China's new medical licensing examination for rural general practice. BMC Med Educ. 2020;20(1):314.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
