OPEN ACCESS
Review
Language model and its interpretability
in biomedicine: A scoping review
Daoming Lyu,1,2 Xingbo Wang,1,2 Yong Chen,3 and Fei Wang1,2,*
SUMMARY
With advancements in large language models, artificial intelligence (AI) is undergoing a paradigm shift
where AI models can be repurposed with minimal effort across various downstream tasks. This provides
great promise in learning generally useful representations from biomedical corpora, at scale, which would
empower AI solutions in healthcare and biomedical research. Nonetheless, our understanding of how they
work, when they fail, and what they are capable of remains underexplored due to their emergent prop-
erties. Consequently, there is a need to comprehensively examine the use of language models in biomed-
icine. This review aims to summarize existing studies of language models in biomedicine and identify
topics ripe for future research, along with the technical and analytical challenges with respect to interpretability.
We expect this review to help researchers and practitioners better understand the landscape of language
models in biomedicine and what methods are available to enhance the interpretability of their models.
INTRODUCTION
Recent progress in large language models, e.g., GPT,1 BERT,2 and ChatGPT, presents a chance to rethink artificial intelligence (AI) sys-
tems, with language as a means to facilitate interaction between humans and AI. Generally, a language model is a probability distribution
$p(w_1, w_2, \ldots, w_M)$ over a sequence of word tokens, with $w_m \in \mathcal{U}$ and $\mathcal{U}$ being a vocabulary, as shown in Figure 2A. But why compute
the probability of a word sequence? In many application scenarios, the goal is to produce word sequences as output. For
example, the goal of text summarization is to convert long texts into concise summaries. By computing the probability distribution over ut-
terances, a word sequence can be generated by sampling tokens from this learned probability distribution.
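To make this concrete, the following toy example (our illustration, not code from any cited work) hard-codes the conditional distributions of a tiny bigram model and generates a word sequence by repeatedly sampling the next token:

```python
import random

# Hard-coded conditionals p(w_m | w_{m-1}) for a toy bigram language model;
# a real model would learn these probabilities from a corpus.
cond_prob = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"patient": 0.7, "doctor": 0.3},
    "a": {"patient": 0.5, "doctor": 0.5},
    "patient": {"recovered": 0.8, "</s>": 0.2},
    "doctor": {"recovered": 0.4, "</s>": 0.6},
    "recovered": {"</s>": 1.0},
}

def generate(max_len=10):
    """Sample a word sequence token by token from the learned distribution."""
    tokens = ["<s>"]
    while tokens[-1] != "</s>" and len(tokens) < max_len:
        dist = cond_prob[tokens[-1]]
        words, probs = zip(*dist.items())
        tokens.append(random.choices(words, weights=probs)[0])
    return [t for t in tokens if t not in ("<s>", "</s>")]

print(generate())  # e.g., ['the', 'patient', 'recovered']
```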
A simple approach to computing the probability distribution of a word sequence is to use statistical techniques, such as relative frequency
counts. However, this is very data-intensive and suffers from high variance: even grammatical sentences will have zero probability if they have
not occurred in the training data. An alternative is to decompose the probability into a product of conditional probabilities via the chain rule. N-gram models make a crucial simpli-
fying approximation by conditioning on only the last $n-1$ words. However, these traditional probabilistic language models require smooth-
ing techniques to avoid the situation $p(w_1, w_2, \ldots, w_M) = 0$ when there is a rare or unseen word. Besides, these models are computationally
intensive for large histories of text and cannot capture the long-range dependencies in language. Neural language models use neural net-
works or deep neural networks to model language, such as feedforward neural networks, recurrent neural networks, and transformer neural
networks. Neural language models have significant advantages over traditional probabilistic language models. Compared to n-gram models,
neural language models are not constrained by a restricted context and can incorporate contexts from arbitrarily distant words, while re-
maining computationally and statistically tractable. Besides, neural language models generalize better over contexts of similar words and
are more accurate at word prediction. In this survey, we focus on neural language models and use the term ‘‘language model’’ (LM) to
refer to them.
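For reference, the factorization that underlies this discussion is the chain rule, and the n-gram model approximates each conditional by truncating the history (both are standard definitions restated from the text above):

$$p(w_1, w_2, \ldots, w_M) = \prod_{m=1}^{M} p(w_m \mid w_1, \ldots, w_{m-1}) \approx \prod_{m=1}^{M} p(w_m \mid w_{m-n+1}, \ldots, w_{m-1}).$$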
LMs usually use (low-dimensional) latent feature representations to implicitly capture the syntactic or semantic features of the language. The
representation needs to be learned afresh for each new natural language processing (NLP) task, and in many cases, the size of the training
data limits the quality of the latent feature representation. Given that the nuances of language are common to all NLP tasks, one could posit
that generic latent feature representations could be learned once from some generic task and then shared across all NLP tasks. Language
modeling, where the model needs to learn how to predict the next word given previous words, is such a generic task with abundant naturally
occurring text to pre-train such a model (hence the name pre-trained language models). There are some benefits in pre-training, including
(i) learning a universal representation through the massive corpus for downstream tasks, (ii) achieving an improved generalization ability and
faster convergence with model initialization, and (iii) mitigating the overfitting issues in scenarios with limited data. There are several classes of
pre-trained language models: autoregressive language models (GPT,1 GPT-2,3 ELMo4), masked language models (BERT,2 XLM,5 T5,6 MASS7),
permuted language models (XLNet8), and denoising autoencoders (BART,9 mBART10), which are categorized by their ways of masking tokens,
overcoming the mismatch issue, and recovering the original inputs.
1Institute of Artificial Intelligence for Digital Health, Weill Cornell Medicine, New York, NY, USA
2Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
3Department of Biostatistics, Epidemiology & Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
*Correspondence: few2001@med.cornell.edu
https://doi.org/10.1016/j.isci.2024.109334
Figure 1. PRISMA flow diagram of study selection: language models in healthcare and biomedical research
Pre-trained language models can also be categorized from other perspectives. For example, they can be divided into non-contextual and contextual models according to the representation used for
downstream tasks. Depending on the application scenario, they can also be categorized as knowledge-enriched LMs, multilingual or language-specific
LMs, multimodal LMs, domain-specific LMs, and compressed LMs.
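The difference between the autoregressive and masked families above comes down to how training pairs are constructed; the sketch below (our simplified illustration, not a real pre-training pipeline) contrasts the two:

```python
import random

tokens = ["the", "patient", "was", "given", "aspirin"]

# Autoregressive (GPT-style): predict each token from its left context only.
causal_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Masked (BERT-style): corrupt a random subset of tokens and predict them from
# the full bidirectional context; 15% is the masking rate used by BERT.
masked_input, targets = [], {}
for i, tok in enumerate(tokens):
    if random.random() < 0.15:
        masked_input.append("[MASK]")
        targets[i] = tok
    else:
        masked_input.append(tok)

print(causal_pairs[0])        # (['the'], 'patient')
print(masked_input, targets)  # e.g., ['the', '[MASK]', ...] {1: 'patient'}
```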
Healthcare and biomedicine represent vast domains of application, encompassing diverse areas of focus. Healthcare entails the delivery of
care to patients via diagnosis, treatment, and health administration, while biomedical research concentrates on the scientific understanding of
disease and the discovery of new therapeutic approaches. Both areas necessitate significant resources, time, and comprehensive medical
knowledge. Language models can be trained on diverse sources or modalities of data in the biomedical domain, which have the potential
to serve as a central storage of medical knowledge. In this way, they can be accessed and queried by medical professionals (e.g., healthcare
providers and biomedical researchers) and by the public. By leveraging their strong adaptability through fine-tuning or prompting, language
models can be effectively tailored to suit various specific tasks within healthcare and biomedicine. Despite the imminent widespread adoption
of these models, our current understanding of how they work, when they fail, and what they are even capable of remains underexplored due to
their emergent properties and complexity. Consequently, there is a need to examine the utilization of language models in healthcare and
biomedicine.
Interpretability, often used interchangeably with explainability, refers to the ability to explain or provide meaning to model predictions. In
particular, interpretability aims to describe the inner structure of a model in a manner that is easily understandable by humans.11 In the med-
ical domain, for example, there are great challenges in clinical decision support, such as diagnostic/prognostic/treatment uncertainties, and
imbalanced, heterogeneous, noisy, sparse, high-dimensional datasets. Due to their powerful capacity, language models can be used for
various use cases, including predicting the future diagnosis of depression in a temporal manner for mental health research,12 recommending
medications,13 extracting cancer phenotypes,14 and predicting a patient’s likelihood of readmission to the hospital.15 In these high-stakes
decisions, however, one concern in deploying such models is that they can still produce frequent misclassifications. Besides,
it has been widely shown that such models are not robust and may encounter failures in the presence of both artificial and natural noise.16
Due to the black-box nature of such models, there is no easily discernible logic connecting the data to the decisions of the models. Therefore,
providing explanations is critical to holding people and institutions accountable when models malfunction and gaining scientific understanding
about models. To reach a level of explainable and usable machine intelligence, we need to not only learn from data, extract knowledge,
generalize, and mitigate the curse of dimensionality but also disentangle the underlying explanatory factors of the data.
Therefore, the purpose of this scoping review is to map different types of corpora and language models used in existing healthcare and
biomedical literature to their application tasks. Further, it seeks to identify topics ripe for future research, along with the technical and analyt-
ical challenges with respect to interpretability. The processing and reporting of the results of this review were guided by the Preferred Reporting
Items for Systematic Reviews and Meta-Analyses guidelines, as shown in Figure 1. We performed the literature search from various resources
to find relevant articles published between Jan. 2015 and Dec. 2022: (i) the primary databases including Google Scholar, IEEE Xplore, ACM
Digital Library, and PubMed; and (ii) additional resources such as the ACL Anthology. The search strategy for ‘‘language models for healthcare
and biomedical research’’ was: (‘‘language models’’ OR ‘‘Transformer’’ OR ‘‘deep neural networks’’ OR ‘‘pre-trained models’’) AND (‘‘health’’ OR
‘‘biomedical’’ OR ‘‘biomedicine’’). The search strategy for ‘‘interpretability of language models’’ was: (‘‘language models’’ OR ‘‘Transformer’’ OR
‘‘deep neural networks’’ OR ‘‘pre-trained models’’) AND (‘‘health’’ OR ‘‘biomedical’’ OR ‘‘biomedicine’’) AND (‘‘explainability’’ OR ‘‘interpret-
ability’’). Exclusions for the study selection were: (a) articles not published in English; (b) commentaries or editorials; (c) the full text of the
article is not accessible; (d) the language models are not based on deep neural networks; and (e) the outcome is not related to healthcare and
biomedical research. There are a few limitations to this study: (i) we focused on language models and limited ourselves to the corpora
listed in the Results section, without including other types of corpora, such as speech data, audio recordings, video recordings, physiological
data, and medical robotic data; and (ii) the retrieved studies are all in English, which might result in the underrepresentation of language
model applications in non-English-speaking countries. Despite these limitations, our review provides a landscape of the current literature on lan-
guage models and their interpretability in biomedicine.
RESULTS
Language models for healthcare and biomedical research
In this subsection, we classify the biomedical corpora used to train the language models into six types and present each category in detail
(as shown in Figure 2C). Table 1 provides an overview of the examined studies across these categories.
Table 1. Overview of the examined studies: corpora, models, performance, application tasks, and interpretability techniques

| Authors, year | Biomedical corpora | Model name | Open source (model) | Model performance | Application tasks | Interpretability technique |
|---|---|---|---|---|---|---|
| Zhang et al., 2019 (17) | EHR | VetTag | https://github.com/yuhui-zh15/VetTag | CSU test data: F1 (66.2%), Precision (72.1%), Recall (63.1%), ExactMatch (26.2%) | text classification | saliency method |
| Liu et al., 2022 (18) | EHR | MedM-PLM | https://git.openi.org.cn/liusc/3-6-liusicen-multi-modal-pretrain | 2010-i2b2: F1 (86.29%); medication recommendation: AUC (95.57%); 30-day readmission prediction: AUC (74.7%); ICD coding: AUC (87.46%) | information extraction; classification | – |
| Huang et al., 2019 (15) | EHR | ClinicalBERT | https://github.com/kexinhuang12345/clinicalBERT | clinical word similarity: Pearson correlation (67.0%); 30-day readmission prediction: AUC (71.4%) | semantic textual similarity; classification | attention weights |
| Si et al., 2019 (19) | EHR | BERT-base, BERT-large | https://huggingface.co/models?sort=trending&search=bert | i2b2 2010: F1 (90.25%); i2b2 2012: F1 (80.91%); SemEval 2014 Task 7: F1 (80.74%); SemEval 2015 Task 14: F1 (81.65%) | information extraction | – |
| Zhu et al., 2018 (20) | EHR, online medical knowledge sources | Clinical ELMo | https://github.com/noc-lab/clinical_concept_extraction | 2010 i2b2/VA: Precision (89.34%), Recall (87.87%), F1 (88.60%) | information extraction | – |
| Alsentzer et al., 2019 (21) | EHR | Clinical BERT, Discharge Summary BERT | https://github.com/EmilyAlsentzer/clinicalBERT | i2b2 2010: Exact F1 (87.8%); i2b2 2012: Exact F1 (78.9%); MedNLI: Accuracy (82.7%) | information extraction; natural language inference | – |
| Shang et al., 2019 (13) | EHR | G-BERT | https://github.com/jshang123/G-Bert | Jaccard (45.65%), PR-AUC (69.60%), F1 (61.52%) | classification | – |
| Rasmy et al., 2021 (22) | EHR | Med-BERT | https://github.com/ZhiGroup/Med-BERT | DHF-Cerner: AUC (85.39%); PaCa-Cerner: AUC (82.23%); PaCa-Truven: AUC (80.57%) | classification | attention weights |
| Li et al., 2020 (23) | EHR | BEHRT | – | AUC (90.4%), average precision score (21.6%) | classification | attention weights |
| Lewis et al., 2020 (24) | EHR, scientific literature | Bio-LM | https://github.com/facebookresearch/bio-lm | I2B2-2010: F1 (89.7%); HOC: Macro-F1 (86.6%); MedNLI: Accuracy (88.5%) | information extraction; classification; natural language inference | – |
| Peng et al., 2019 (25) | EHR, scientific literature | BlueBERT | https://github.com/ncbi-nlp/bluebert | MedSTS: Pearson (84.8%); BC5CDR: F1 (93.5%); i2b2 2010: F1 (76.4%); HOC: F1 (87.3%); MedNLI: Accuracy (84.0%) | semantic textual similarity; information extraction; text classification; natural language inference | – |
| Agrawal et al., 2022 (26) | EHR | GPT-3+R | – | biomedical evidence extraction: Accuracy (85%), F1 (61%); medication status classification: Conditional Accuracy (89%), Conditional Macro F1 (71%) | information extraction; classification | – |
| Chang et al., 2020 (27) | EHR | Clinical BERT | https://github.com/dchang56/chief_complaints | top-5 accuracies of 0.92 and 0.94 on datasets comprising 434 and 188 labels, respectively | classification | – |
| Yang et al., 2022 (28) | EHR, scientific literature | GatorTron | https://github.com/uf-hobi-informatics-lab/GatorTron | 2010 i2b2: F1 (89.96%); 2018 n2c2: F1 (96.27%); 2019 n2c2: Pearson correlation (89.03%); MedNLI: Accuracy (90.20%); emrQA medication: F1 (74.08%), Exact Match (31.55%) | information extraction; semantic textual similarity; natural language inference; question answering | – |
| Huang et al., 2019 (29) | EHR | Clinical XLNet | https://github.com/lindvalllab/clinicalXLNet | prolonged mechanical ventilation: AUC (66.3%); 90-day mortality: AUC (77.9%) | classification | – |
| Zhou et al., 2022 (14) | EHR | CancerBERT | https://github.com/zhang-informatics/CancerBERT | macro F1 of 0.876 (95% CI, 0.873–0.879) and 0.904 (95% CI, 0.902–0.906) for exact match and lenient match, respectively | information extraction | – |
| Michalopoulos et al., 2020 (30) | EHR, online medical knowledge sources | UmlsBERT | https://github.com/gmichalo/UmlsBERT | MedNLI: Accuracy (83.0%); i2b2 2010: F1 (88.6%) | natural language inference; information extraction | – |
| Kades et al., 2021 (31) | EHR | Enhanced BERT | – | 2019 n2c2: Pearson correlation (88.3%) | semantic textual similarity | – |
| Yang et al., 2020 (32) | EHR | RoBERTa-MIMIC | https://github.com/uf-hobi-informatics-lab/ClinicalTransformerNER | 2010 i2b2: F1 (89.94%); 2012 i2b2: F1 (80.53%); 2018 n2c2: F1 (89.07%) | information extraction | – |
| Meng et al., 2021 (12) | EHR | BRLTM | https://github.com/lanyexiaosa/brltm | depression prediction: PR-AUC (76%) | classification | attention weights |
| Chen et al., 2020 (33) | EHR | AlphaBERT | https://github.com/wicebing/AlphaBERT | AUC (94.7%); ROUGE-L (69.3%) | text summarization | – |
| Wang et al., 2021 (34) | EHR | CHMBERT | – | disease prediction: Top-1 F1 (61.95%), Top-5 F1 (91.58%), Top-10 F1 (96.83%) | classification | – |
| Zhang et al., 2020 (35) | EHR | MC-BERT | https://github.com/alibaba-research/ChineseBLUE | cEHRNER: F1 (90%); cMedQA: F1 (82.3%); cMedTC: F1 (82.1%) | information extraction; question answering; text classification | – |
| Kraljevic et al., 2021 (36) | EHR | MedGPT | – | NER+L: F1 (93%) | information extraction | saliency method |
| Khin et al., 2018 (37) | EHR | ELMo | – | i2b2-PHI: F1 (89.87%–98.74%) | information extraction | – |
| Yang et al., 2020 (38) | EHR | RoBERTa | https://github.com/uf-hobi-informatics-lab/2019_N2C2_Track1_ClinicalSTS | Pearson correlation (90.65%) | semantic textual similarity | attention weights |
| Xiong et al., 2020 (39) | EHR | BERT-based model | – | 2019 n2c2: Pearson correlation (86.8%) | semantic textual similarity | – |
| Mahajan et al., 2020 (40) | EHR | ClinicalBERT | – | 2019 n2c2: Pearson correlation (90.1%) | semantic textual similarity | – |
| Yan et al., 2022 (41) | EHR | RadBERT | – | abnormal sentence classification: Accuracy (96.1%), F1 (95.6%); report coding: Accuracy (96.1%), F1 (96.0%); report summarization: ROUGE-1 (16.18%) | text summarization; classification | – |
| Lau et al., 2022 (42) | EHR | BERTrad | https://github.com/wilsonlau-uw/BERT-EE | 90.9%–93.4% F1 for finding triggers; 72.0%–85.6% F1 for argument role extraction | information extraction | – |
| Meng et al., 2020 (43) | EHR | BERT-based model | – | Precision (97.0%), Recall (93.3%), F-measure (95.1%) | classification | – |
| Bressem et al., 2021 (44) | EHR | FS-BERT & RAD-BERT | https://github.com/rAIdiance/bert-for-radiology | chest radiograph reports: AUC (97%–99%); CT reports: pooled AUC/AUPRC of 88%/80% | classification | – |
| Naseem et al., 2022 (45) | biomedical image-text pairs | TraP-VQA | – | overall: Accuracy (64.82%); open-ended: Accuracy (37.72%); close-ended: Accuracy (93.57%) | visual question answering | Gradient-weighted Class Activation Mapping (Grad-CAM), Shapley additive explanations (SHAP), attention weights |
| Li et al., 2020 (46) | biomedical image-text pairs | V+L models | https://github.com/YIKUAN8/Transformers-VQA | OpenI: averaged AUC (98.5%) | visual question answering | visualization of attention maps |
| Khare et al., 2021 (47) | biomedical image-text pairs | MMBERT | https://github.com/VirajBagal/MMBERT | VQA-Med 2019: overall Accuracy (67.2%), BLEU (69%); VQA-RAD: overall Accuracy (72%) | visual question answering | visualization of attention maps |
| Moon et al., 2022 (48) | biomedical image-text pairs | MedViLL | https://github.com/SuperSupermoon/MedViLL | diagnosis classification (Open-I): AUC (89.2%), F1 (40.7%); VQA-RAD: accuracy of 59.5%/77.7% for open-ended and close-ended questions, respectively | visual question answering; classification | visualization of attention maps |
| Chen et al., 2022 (49) | biomedical image-text pairs | Med-VLP | https://github.com/zhjohnchan/ARL | VQA-2019: overall Accuracy (80.32%); VQA-RAD: overall Accuracy (79.16%); MELINDA: Accuracy (80.51%) | visual question answering; classification | – |
| Chen et al., 2022 (50) | biomedical image-text pairs | M3AE | https://github.com/zhjohnchan/M3AE | VQA-RAD: overall Accuracy (77.01%); VQA-2019: overall Accuracy (79.87%); MELINDA: Accuracy (78.50%) | visual question answering; classification | – |
| Monajatipoor et al., 2022 (51) | biomedical image-text pairs | BERTHop | https://github.com/masoud-monajati/BERTHop | OpenI: AUC (98.12%) | classification | – |
| Boecking et al., 2022 (52) | biomedical image-text pairs | BioViL | https://huggingface.co/microsoft/BiomedVLP-BioViL-T | RadNLI: Accuracy (65.21%) | natural language inference | – |
| Lee et al., 2020 (53) | scientific literature | BioBERT | https://github.com/dmis-lab/biobert | 2010 i2b2: F1 (86.73%); NCBI disease: F1 (89.71%); BC5CDR: F1 (93.47%); BC2GM: F1 (84.72%); ChemProt: F1 (76.46%); BioASQ 5b: Strict Accuracy (46%) | information extraction; question answering | – |
| Shin et al., 2020 (54) | scientific literature | BioMegatron | https://github.com/NVIDIA/NeMo | BC5CDR-chem: F1 (92.9%); BC5CDR-disease: F1 (88.5%); NCBI-disease: F1 (87.8%); ChemProt: F1 (77.0%); BioASQ-7b-factoid: Strict Accuracy (47.4%) | information extraction; question answering | – |
| Gu et al., 2021 (55) | scientific literature | PubMedBERT | – | BC5-chem: F1 (93.33%); BC5-disease: F1 (85.62%); NCBI-disease: F1 (87.82%); BC2GM: F1 (84.52%); ChemProt: Micro F1 (77.24%); DDI: Micro F1 (82.36%); BIOSSES: Pearson (92.30%); HoC: Micro F1 (82.32%); PubMedQA: Accuracy (55.84%); BioASQ: Accuracy (87.56%) | information extraction; text classification; question answering; semantic textual similarity | – |
| Luo et al., 2022 (56) | scientific literature | BioGPT | https://github.com/microsoft/BioGPT | KD-DTI: F1 (38.42%); BC5CDR: F1 (46.17%); DDI: F1 (40.76%); PubMedQA: Accuracy (78.2%); HoC: F1 (85.12%) | information extraction; text classification; question answering | – |
| Kanakarajan et al., 2021 (57) | scientific literature | BioELECTRA | https://github.com/kamalkraj/BioELECTRA | BC5-chem: F1 (93.60%); BC5-disease: F1 (85.84%); NCBI-disease: F1 (89.38%); BC2GM: F1 (84.69%); ChemProt: Micro F1 (78.20%); DDI: Micro F1 (82.76%); BIOSSES: Pearson (92.49%); HoC: Micro F1 (83.50%); PubMedQA: Accuracy (64.02%); BioASQ: Accuracy (88.57%); MedNLI: Accuracy (86.34%) | information extraction; text classification; natural language inference; question answering; semantic textual similarity | – |
| Yasunaga et al., 2022 (58) | scientific literature | BioLinkBERT | https://github.com/michiyasunaga/LinkBERT | BC5-chem: F1 (94.04%); BC5-disease: F1 (86.39%); NCBI-disease: F1 (88.76%); BC2GM: F1 (85.18%); ChemProt: Micro F1 (79.98%); DDI: Micro F1 (83.35%); BIOSSES: Pearson (93.63%); HoC: Micro F1 (84.87%); PubMedQA: Accuracy (72.18%); BioASQ: Accuracy (94.82%) | information extraction; text classification; question answering; semantic textual similarity | – |
| Tinn et al., 2021 (64) | scientific literature | PubMedELECTRA | https://huggingface.co/microsoft | BC5-chem: F1 (93.32%); BC5-disease: F1 (85.16%); NCBI-disease: F1 (87.73%); BC2GM: F1 (83.79%); ChemProt: F1 (76.74%); DDI: F1 (81.09%); BIOSSES: Pearson (92.01%); HoC: F1 (82.57%); BioASQ: Accuracy (92.07%); PubMedQA: Accuracy (67.64%) | information extraction; text classification; question answering; semantic textual similarity | – |
| Ozyurt, 2020 (65) | scientific literature | Bio-ELECTRA | https://github.com/SciCrunch/bio_electra | BioASQ 8b-factoid: Exact Match (57.93%); BC4CHEMD: F1 (83.80%); BC2GM: F1 (72.55%); NCBI disease: F1 (81.13%); LINNAEUS: F1 (85.02%); BioASQ 5b: MRR (33.5%); GAD: F1 (80.96%); ChemProt: F1 (64.22%) | information extraction; question answering | – |
| Moradi et al., 2020 (66) | scientific literature | BERT-based-Summ | https://github.com/BioTextSumm/BERT-based-Summ | ROUGE-1 (75.04%), ROUGE-2 (33.12%) | text summarization | – |
| Xie et al., 2022 (67) | scientific literature | KeBioSum | – | CORD-19: ROUGE-1 (32.04%); PubMed-Long: ROUGE-1 (36.39%); S2ORC: ROUGE-1 (37.44%); PubMed-Short: ROUGE-1 (43.98%) | text summarization | – |
| Du et al., 2020 (68) | scientific literature | BioBERTSum | – | PubMed: ROUGE-1 (37.45%); CNN/DailyMail: ROUGE-1 (43.13%) | text summarization | attention visualization |
| Wallace et al., 2021 (69) | scientific literature | BART-based model | – | XSUM: ROUGE-L (26.5%); Pretrain: ROUGE-L (26.9%); Decorate: ROUGE-L (26.6%); Sort by N·RoB: ROUGE-L (26.7%); Decorate and sort: ROUGE-L (26.5%) | text summarization | – |
| Guo et al., 2021 (70) | scientific literature | BART-based model | https://github.com/qiuweipku/Plain_language_summarization | ROUGE-1 (53.02%), ROUGE-2 (22.06%), ROUGE-L (50.24%) | text summarization | – |
| Kieuvongngam et al., 2020 (71) | scientific literature | BERT & GPT-2 based model | https://github.com/VincentK1991/BERT_summarization_1 | extractive summary: ROUGE-1 (20%–70%); abstractive summary: ROUGE-1 (20%–45%) | text summarization | attention visualization |
| Chakraborty et al., 2020 (72) | scientific literature | BioMedBERT | https://github.com/BioMedBERT/biomedbert | GAD: F1 (79.92%); SQuAD v1.1: F1 (92.46%), EM (86.12%); NCBI disease: F1 (87.51%); BC5CDR-disease: F1 (87.51%); BC5CDR-chem: F1 (92.21%); BC4CHEMD: F1 (86.41%); BC2GM: F1 (82.32%) | information extraction; question answering | – |
| Oniani & Wang, 2020 (73) | scientific literature | GPT-2-based model | https://github.com/oniani/covid-19-chatbot | overall average rating score: 4.023 | question answering | – |
| Ji et al., 2021 (82) | social media | MentalBERT & MentalRoBERTa | https://huggingface.co/mental | eRisk T1: F1 (93.38%); CLPsych T1: F1 (69.71%); Depression Reddit: F1 (94.23%); UMD: F1 (58.58%); T-SID: F1 (89.01%); SWMH: F1 (72.16%); SAD: F1 (68.44%); Dreaddit: F1 (81.76%) | classification | – |
| Papanikolaou et al., 2020 (83) | scientific literature | DARE (GPT-2) | https://openai.com/research/gpt-2-1-5b-release | CDR: F1 (73%); DDI2013: F1 (78%); ChemProt: F1 (73%) | information extraction | – |
| Papanikolaou et al., 2019 (84) | scientific literature | BERT model | – | CDR: F1 (62.2%); GAD: F1 (69.8%); EUADR: F1 (81.2%); Healx CD: F1 (81.4%) | information extraction | – |
| Wang et al., 2020 (85) | scientific literature | GLRE | https://github.com/nju-websoft/GLRE | CDR: F1 (68.5%); DocRED: F1 (57.4%) | information extraction | – |
| Cabot et al., 2021 (86) | scientific literature | REBEL (BART) | https://github.com/babelscape/rebel | CONLL04: F1 (71.97%); NYT: F1 (91.76%); DocRED: F1 (41.84%); ADE: F1 (81.69%); Re-TACRED: F1 (90.39%) | information extraction | – |
| Weber et al., 2022 (87) | scientific literature | transformer-based LM | https://github.com/leonweber/drugprot | F1 (79.73%) on the hidden DrugProt test set | information extraction | – |
| Heinzinger et al., 2019 (88) | biological sequences | SeqVec | https://github.com/Rostlab/SeqVec | per-residue predictions: CASP12: Accuracy (76.5%); TS115: Accuracy (82.4%); CB513: Accuracy (80.7%) | proteins/DNA prediction | – |
| Rives et al., 2021 (89) | biological sequences | ESM-1b Transformer | https://github.com/facebookresearch/esm | CB513: Accuracy (71.6%); CASP13: Accuracy (72.5%) | proteins/DNA prediction | – |
| Xiao et al., 2021 (90) | biological sequences | ProteinLM | https://github.com/THUDM/ProteinLM | contact prediction: P@L/5 (75%); remote homology: Top-1 Accuracy (30%); secondary structure: Accuracy (79%); fluorescence: Spearman's rho (68%) | proteins/DNA prediction | – |
| Brandes et al., 2022 (91) | biological sequences | ProteinBERT | https://github.com/nadavbra/protein_bert | secondary structure (3-state): Accuracy (74%); remote homology: Accuracy (22%); fluorescence: Spearman's rho (66%) | proteins/DNA prediction | attention visualization |
| Weissenow et al., 2022 (92) | biological sequences | EMBER2 | https://doi.org/10.5281/zenodo.6412497 | SetTst29: TM score (50%) | proteins/DNA prediction | – |

Electronic health records
Chang et al.27 aimed to derive a compact and computationally useful representation for free-text chief complaints by using the clinical
BERT pre-trained on the MIMIC corpus. Kraljevic et al.36 developed MedGPT with MIMIC-III and other EHR data for predicting the next dis-
order in a patient’s timeline. Liu et al.18 proposed to pre-train the model of MedM-PLM on the MIMIC-III dataset and evaluate its effectiveness
on clinical tasks of medication recommendation, readmission prediction, and ICD coding. Other language models21,24–26,30,32 have also been
developed on the MIMIC-III dataset.
In addition to MIMIC-III, there are many works using private sources of EHR data for pre-training language models.12,14,17,22,23,31,33–35,37–40,97,98
For example, Li et al.23 introduced the model of BEHRT to predict the likelihood of 301 conditions in one’s future visits. Wang et al.98 proposed the
MEB model based on BERT for medication recommendation. Meng et al.12 proposed the BRLTM model to predict future diagnoses of depression
in mental health. Wang et al.34 developed a Chinese BERT model for disease prediction and department recommendation tasks. Rasmy et al.22
proposed the Med-BERT model to predict the diseases, such as diabetes, heart failure, and pancreatic cancer, by leveraging the structured EHR
data. Danilov et al.97 used neurosurgical data to predict the inpatient length of stay. Zhou et al.14 proposed the CancerBERT model in order to
extract breast cancer phenotypes from EHR data. Besides, there is some work using radiology reports as the corpus for pre-training the language
models.20,41–44
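As a concrete sketch of this pre-train-then-adapt recipe (our hedged illustration, not code from the cited studies), the snippet below loads a publicly released Clinical BERT checkpoint and takes one fine-tuning step on a binary prediction task; the note text and label are invented placeholders, and the optimizer loop is omitted:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "emilyalsentzer/Bio_ClinicalBERT"  # public Clinical BERT release
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

notes = ["Patient admitted with acute chest pain, discharged after 3 days."]
labels = torch.tensor([1])  # placeholder label, e.g., readmitted within 30 days

batch = tokenizer(notes, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # gradients for one fine-tuning step
```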
Social media
Users often post information on social media platforms and recent studies have shown that health-related social media data are useful in many
applications to provide better health-related services. For example, Twitter is a social media platform where users post and interact with
messages known as ‘‘tweets.’’ Müller et al.80 proposed the COVID-Twitter-BERT model by pre-training on a large corpus of COVID-19-related
tweets. Zhang et al.101 pre-trained language models on HPV vaccine-related tweets for the sentiment analysis of the HPV vaccination
task. Naseem et al.79 proposed the PHS-BERT model for tasks related to public health surveillance on social media by pre-training on
health-related tweets. Reddit is a social news aggregation, web content rating, and discussion website. Ji et al.82 proposed MentalBERT
and MentalRoBERTa for depression detection and other mental disorders classification with the mental health posts on Reddit. Besides,
Tutubalina et al.81 proposed the RuDR-BERT model for drug reactions and effectiveness detection by pre-training the model on the
health-related user-generated texts collected from social media in Russian.
Scientific literature
As valuable knowledge is discovered from biomedical literature, biomedical researchers begin to develop pre-trained language models to
handle biomedical text. PubMed and PubMed Central (PMC) are the two most popular sources of biomedical text. PubMed contains only
citations and abstracts of the biomedical literature, while PMC contains full-text articles. There is a large portion of work pre-training the
proposed model on the corpus from PubMed and PMC25,63,53–62,64,65 for biomedical information extraction. Moradi et al.66 proposed a
BERT-based model for biomedical text summarization with pre-training on PubMed, PMC, and Wiki. Du et al.68 proposed the
BioBERTSum model to better capture token-level and sentence-level contextual representation for extractive summarization tasks in the
biomedical domain. Wallace et al.69 and Guo et al.70 both proposed BART-based models for biomedical text summarization with pre-training
on the corpus of Cochrane systematic reviews indexed in PubMed.
BREATHE is another large and diverse dataset collection of biomedical research articles that contains titles, abstracts, and full-body texts.
The primary advantage of the BREATHE dataset is its source diversity, including BMJ, arXiv, medRxiv, bioRxiv, CORD-19, Springer Nature,
NCBI, JAMA, and BioASQ. Kieuvongngam et al.71 proposed to use BERT and GPT-2 for the text summarization of COVID-19 medical research
articles from CORD-19. Chakraborty et al.72 proposed the BioMedBERT model for the task of question-answering by pre-training the model
on the BREATHE dataset. Oniani et al.73 proposed a GPT-2-based model for the task of question-answering for COVID-19 with pre-training on
the corpus of CORD-19. Xie et al.67 proposed the KeBioSum model for biomedical text summarization with the corpus of CORD-19 and
PubMed. Taylor et al.60 developed the Galactica model pre-trained on a large scientific corpus of papers that can perform the task of medical
question answering. Besides, there are some works pre-training the models on the corpus of chemical disease relation or drug and adverse
effects for the task of biomedical relation extraction.83–87
Biological sequences
In addition to the text or image data, the biological sequence data can be another corpus for pre-training language models. For example, the
structure of each protein is fully determined by a sequence of amino acids; however, these amino acids are from a limited-size amino acid
vocabulary, of which 20 are commonly observed. This is similar to text, which is composed of words from a lexicon. The Pfam dataset
is a large collection of protein families, in which each family is represented by multiple sequence alignments and hidden Markov models.
Xiao et al.90 proposed the model of ProteinLM for the protein prediction task with the preprocessed Pfam. Heinzinger et al.88 proposed the
SeqVec model to predict the protein function and structure from sequences and they further presented the ProstT5 model by combining 1D
sequence with 3D structure.96 Rives et al.89 proposed to use the language model for the tasks of remote homology detection, prediction of
secondary structure, long-range residue-residue contacts, and mutational effect for protein sequences. Brandes et al.91 proposed the
ProteinBERT model for protein sequences designed to capture local and global representations of proteins in a natural way. Weissenow
et al.92 proposed the EMBER2 model for protein structure prediction without requiring any multiple sequence alignments. Besides, Ji
et al.93 proposed the DNABERT model to predict the promoters, splice sites, and transcription factor-binding sites with the DNA sequence.
Yamada et al.94 proposed the BERT-RBP model to predict RNA and RNA-binding protein interactions by adapting the BERT architecture pre-
trained on a human reference genome. Mock et al.95 proposed the BERTax model to taxonomically classify the superkingdom and phylum of
DNA sequences.
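A recurring design choice in these models is how to turn a biological sequence into ‘‘words.’’ DNABERT, for instance, tokenizes DNA into overlapping k-mers; a minimal sketch of that idea (our illustration):

```python
def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers so that standard
    language model tokenization applies to genomic text."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(kmer_tokenize("ATGCGTAC", k=3))
# ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC']
```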
In the following, we categorize various biomedical downstream tasks, as shown in Figure 2C.
Information extraction
Information extraction plays an important role in automatically extracting structured biomedical information from unstructured biomedical
text data ranging from biomedical scientific literature, and EHR data, to biomedical-related social media corpus, etc. It generally refers to
several important sub-tasks in this review, including named entity recognition and relation extraction. For instance, named entity recognition
is the first step in unlocking valuable information in unstructured text data that aims to identify the concept or entity names in biomedical texts.
Extracting clinical concepts, such as types of diagnosis, test, treatment, clinical department, medication, adverse drug events, etc., is useful for
EHR corpus,14,20,19,21,24–26,30,32,35,42,28 while extracting biomedical entities, such as disease entity, drug-chemical entity, drug-protein entity,
species entity, etc., is meaningful to discover knowledge in scientific literature,25,63,53–55,57–59,61,62,64,65,102 online medical knowledge
corpus,30,63,75–77 or social media posts.81 Relation extraction aims to identify the relationship or semantic correlation between biomedical en-
tities mentioned in texts; it is generally considered a classification problem predicting the possible relation type of two identified entities
in a given sentence.25,42,77,76,63,53–59,62,64,65,83–87
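A minimal sketch of how such a named entity recognition model is typically invoked with the transformers library (the checkpoint name is a hypothetical placeholder, not one of the models in Table 1):

```python
from transformers import pipeline

ner = pipeline("token-classification",
               model="your-org/biomedical-ner-model",  # hypothetical checkpoint
               aggregation_strategy="simple")          # merge word pieces into entities

text = "The patient was started on metformin for type 2 diabetes."
for ent in ner(text):
    print(ent["entity_group"], ent["word"], round(ent["score"], 3))
```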
Text classification
Text classification aims to assign one of the predefined labels to variable-length texts like phrases, sentences, paragraphs, or documents in
the corpus like EHR data,24–26,35,41,44 biomedical scientific literature,55–58 and social media data.80,79,81,82,64
Question answering
Question answering (QA) aims to extract answers for the given queries. QA can facilitate seeking information in clinical notes,28,35 biomedical
scientific literature,53–60,62,64,72,73 biomedical image-text corpus,28,45–51 and online medical knowledge corpus,74,78 and thus save time for the
clinicians and biomedical researchers.
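A comparable sketch for extractive QA over a clinical note (again, the checkpoint name is a hypothetical placeholder):

```python
from transformers import pipeline

qa = pipeline("question-answering", model="your-org/biomedical-qa-model")

answer = qa(question="What medication was the patient prescribed?",
            context="The patient was prescribed lisinopril for hypertension.")
print(answer["answer"], answer["score"])
```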
Text summarization
Clinical notes, scientific literature, and radiology reports can be lengthy, yet clinicians and biomedical researchers need to go through a
large number of biomedical documents, which is time-consuming. In this context, automatic text summarization is needed to reduce the
effort and time required. Text summarization falls into two broad categories: extractive summarization,33,66,67,68,71 which identifies the most
relevant sentences in the document, and abstractive summarization,41,56,69–71 which generates new text that represents the summary of the document.
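The extractive variant can be illustrated with a deliberately simple frequency-based scorer (our toy example; real systems rank sentences with learned representations):

```python
from collections import Counter

def extractive_summary(sentences, top_k=2):
    """Keep the sentences whose words are most frequent in the document,
    preserving their original order."""
    doc_counts = Counter(w for s in sentences for w in s.lower().split())
    scored = [(sum(doc_counts[w] for w in s.lower().split()) / len(s.split()), i)
              for i, s in enumerate(sentences)]
    keep = sorted(i for _, i in sorted(scored, reverse=True)[:top_k])
    return [sentences[i] for i in keep]

report = ["Chest X-ray shows bilateral infiltrates.",
          "The heart size is normal.",
          "Bilateral infiltrates are consistent with pneumonia."]
print(extractive_summary(report))
```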
Proteins/DNA prediction
Proteins are involved in almost every life process. Consequently, analyzing the biological structures and properties of protein sequences
and understanding their functions88–92,96 becomes crucial to the study of life science as well as disease detection and drug discovery. Since
only a fraction of all species are available in today’s databases, it is important to accurately assign DNA sequences to their origin particularly
when there are no closely related species in databases.95 Deciphering the language of non-coding DNA is also one of the fundamental prob-
lems in genome research.93 Besides, identifying RNA and RNA-binding protein interactions94 can help to understand the biological roles in
regulating cellular functions.
attention architecture to decouple the complexity of explanation and the decision-making process. Rigotti et al.113 proposed the gener-
alization of attention from low-level input features to high-level concepts as a mechanism to ensure the interpretability of attention scores.
In particular, they designed the ConceptTransformer that exposes explanations of the output of a model in which it is embedded in terms
of attention over user-defined high-level concepts.
Shapley additive explanation (SHAP) computes Shapley values for each combination of features (the power set of the features) by
training a linear surrogate model. However, it would be computationally expensive to train $2^M$ models for $M$ features, so practical implementations rely on approximations. For example, Attanasio et al.114
investigated the SHAP-based explainability approach on Transformer-based models.
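In practice, the shap library approximates Shapley values rather than enumerating all $2^M$ coalitions; a hedged usage sketch on a text classifier (the checkpoint name is a placeholder):

```python
import shap
from transformers import pipeline

clf = pipeline("text-classification", model="your-org/clinical-classifier")

explainer = shap.Explainer(clf)  # treats the pipeline as a black box
shap_values = explainer(["Patient reports severe chest pain and dyspnea."])
print(shap_values.values[0])     # per-token contributions to the prediction
```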
Visualization plays an essential role in understanding how a neural model works.115 It can be applied with any of the feature importance-
based methods. With visualization, we can project the feature importance weights using heatmap, partial dependency plot, etc. Saliency has
been primarily used to visualize the importance scores of different types of elements in XAI learning systems,36,17 such as showing input-
output word alignment,116 highlighting words in input text,117 or displaying extracted relations.118 Ding and Koehn119 investigated the
gradient-based saliency methods on different language models based on the perspective of plausibility and faithfulness. Malkiel et al.120 pro-
posed the BTI approach to explain paragraph similarities inferred by pre-trained BERT models. Specifically, the proposed approach can iden-
tify important words that dictate each paragraph’s semantics, match between the words, and retrieve the most important pairs by utilizing
activation and saliency maps. Natural language explanation is verbalized in human-readable natural language. The natural language can
be generated using sophisticated deep learning models, e.g., by training a language model with human natural language explanations
and coupling with a deep generative model.121 It can also be generated by using simple template-based approaches.122 Brand et al.123 devel-
oped the E-BART model by jointly making a veracity prediction and providing an explanation within the same model. Sammani et al.124 pro-
posed the NLX-GPT that can simultaneously predict an answer and explain it by formulating the answer prediction as a text generation task
along with the explanation. Besides, there are other visualization techniques for the purpose of interpretability. For example, Dunn et al.125
proposed a context-sensitive visualization method with Leave-N-Out that leads to heatmaps that include more of the relevant information
pertaining to the classification, as well as more accurately highlighting the most important words from the input text. Li et al.126 developed
a visual analysis method to enable a unified understanding of models for text classification. Specifically, the mutual information-based mea-
sure was used to provide quantitative explanations on how each layer of a model maintains the information of input words in a sample.
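The basic recipe behind gradient-based saliency can be shown on a toy model (our illustration, not the cited systems): backpropagate a prediction score to the input embeddings and take a per-token gradient norm, which can then be rendered as a heatmap over the text:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 8), nn.Tanh(), nn.Linear(8, 1))

tokens = ["patient", "denies", "chest", "pain"]
embeddings = torch.randn(len(tokens), 16, requires_grad=True)  # stand-in embeddings

score = model(embeddings).sum()         # scalar prediction score
score.backward()                        # gradients flow back to the inputs
saliency = embeddings.grad.norm(dim=1)  # one importance score per token

for tok, s in zip(tokens, saliency.tolist()):
    print(f"{tok:>8s}  {s:.3f}")
```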
There are also some works that aim to improve the interpretability of the Transformer-based vision and language (multimodal) model. For
example, Naseem et al.45 aimed to develop a model that can answer a medical question posed by pathology images. They proposed
TraP-VQA that embeds the image and question features, coupled with domain-specific contextual information, via a transformer for
PathVQA. Grad-Cam and SHAP were used to interpret the retrieved answers visually to indicate which area of the image contributed to
the predicted answer. Visualization of the transformers’ attention showed that the proposed model assigns more weight to the relevant words and
explains the reason for the retrieved answer. Aflalo et al.127 proposed the VL-InterpreT method, which can provide interactive visualizations
for interpreting the attention and hidden representations in multimodal transformers.
DISCUSSION
Language models, particularly pre-trained language models, provide great promise in their ability to learn a generally useful representation
from the knowledge encoded in the corpora by being repurposed with minimal effort for diverse downstream tasks in the biomedical do-
mains. Interpreting the decision mechanism of a pre-trained language model can help understand the rationale behind its success and its
limitations. In this section, we further discuss the challenges of the aforementioned explanation methods and uncover the gaps and future
research directions toward interpretability in language models.
space to improve the model’s robustness, and the Boundary Match Constraint was designed to locate rationales more precisely with the guidance of
boundary information.
Neurosymbolic methods can produce an answer to a complex query by chaining modular operations together, passing inputs from one
module to another. This has the benefit of producing an interpretable trace of intermediate computations, in contrast to the ‘‘black box’’ com-
putations common to end-to-end deep learning approaches. Creswell et al.133 proposed a selection inference framework that exploits
pre-trained large LMs as general processing modules, and alternates between selection and inference to generate a series of interpretable,
symbolic reasoning steps leading to the final answer.
Layer-wise relevance propagation is another way to attribute relevance to features computed in any intermediate layer of a neural
network (NN). Definitions are available for most common NN layers including fully connected layers, convolution layers, and recurrent
layers. Layer-wise relevance propagation has been used to, for example, enable feature importance explainability134 and example-based
explainability.135 van Aken et al.136 presented a layer-wise analysis of BERT’s hidden states to understand their internal functioning. They
focused on models fine-tuned on the task of QA as an example of a downstream task and inspected how QA models transform token
vectors in order to find the correct answer. van Aken et al.137 proposed VisBERT, which can visualize the contextual token representations
within BERT for the task of (multi-hop) QA. Interpretability can be provided by observing how the semantic representations are trans-
formed throughout the layers of the model. Sevastjanova et al.138 aimed to explain models by exploring the continuum between function
and content words with respect to contextualization in BERT. Specifically, they utilized the similarity-based score to measure contextual-
ization and integrate it into a visual analytics technique, presenting the model’s layers simultaneously and highlighting intra-layer proper-
ties and inter-layer differences.
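The core redistribution step of layer-wise relevance propagation can be sketched for a single fully connected layer using the epsilon rule (a simplified illustration of the standard formulation):

```python
import numpy as np

def lrp_linear(a, W, R_out, eps=1e-6):
    """Redistribute output relevance R_j to inputs in proportion to each
    input's contribution a_i * w_ij (epsilon term stabilizes small z_j)."""
    z = a @ W                            # pre-activations z_j = sum_i a_i w_ij
    s = R_out / (z + eps * np.sign(z))   # relevance per unit of pre-activation
    return a * (W @ s)                   # R_i = a_i * sum_j w_ij s_j

np.random.seed(0)
a = np.array([0.5, 1.0, -0.2])          # activations entering the layer
W = np.random.randn(3, 4)               # weights: 3 inputs -> 4 outputs
R_out = np.array([0.1, 0.4, 0.3, 0.2])  # relevance arriving at the outputs

R_in = lrp_linear(a, W, R_out)
print(R_in, R_in.sum())                 # relevance is (approximately) conserved
```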
Counterfactual intervention methods explain the causal effect between a feature/concept/example and the prediction by erasing or per-
turbing it and observing the change in the prediction. Counterfactual examples, therefore, refer to the outcome of perturbations. Although
counterfactual examples and adversarial examples look similar in the robustness literature, they differ in this context: (i) the goal of the former
is to explain the model’s reasoning mechanism, while that of the latter is to examine model robustness; (ii) the former should be meaningfully
different in the perturbed feature to the original example while the latter should be similar to or even indistinguishable from it; and (iii) the
former can lead to changes in the ground truth label, whereas the latter should not.145 However, generating high-quality counterfactual ex-
amples is non-trivial, as they need to simultaneously accord with the counterfactual target label, be semantically coherent, and only differ from
the original example in the intended feature. In existing work, the most reliable (yet expensive) approach to collecting counterfactual exam-
ples is still manual creation.145,146 Besides, counterfactual intervention can directly happen on the level of examples, such as the methods of
influence functions. Influence functions are based on counterfactual reasoning – if a training example were absent or slightly changed, then
how would the prediction change? Since it is impractical to retrain the model after erasing/perturbing every single training example, influence
functions approximate the effect of removing or perturbing a training example on the loss directly from the trained model, using gradient and Hessian information. However, it is found in existing work147 that influence func-
tions can become fragile and the approximation accuracy can vary significantly depending on a variety of factors, such as network architec-
ture, depth, width, the extent of model parameterization and regularization techniques, and the examined checkpoints, as models become
more complex. Counterfactual intervention can also happen in the feature representations in the model, such as the work of Amnesic Prob-
ing148 and CausalLM.130 They both aim to answer the more insightful question – is some high-level feature, e.g., syntax tree, used in predic-
tion? They exploit different algorithms to erase the target feature from the model representation and then measure the change in the
prediction. The larger the change, the more strongly it indicates that the feature has been used by the original model. In terms of faithfulness,
only CausalLM is validated with a white-box evaluation, whereas no explicit evaluation is provided for Amnesic Probing. Causal inference can
also be used for interpretability. However, it requires a more rigorous formalization of the causal framework, e.g., a causal model, which is
usually task- or even dataset-specific and needs to be designed by domain experts. Therefore, there are still important challenges, such as
how to automatically derive causal models from data and how to make them more generalizable across tasks. Overall, counterfactual inter-
ventions can capture causal relationships instead of mere correlational effects between inputs and outputs and are more often explicitly eval-
uated in terms of faithfulness. However, counterfactual intervention is relatively more expensive in computational cost, normally requiring
multiple forward passes or modifications to the model representation. Searching for the right targets to intervene in can also be costly. In-
terventions are often overly specific to the particular example and this calls for more insights into the scale of such explanations.149 Counter-
factual intervention may suffer from hindsight bias, which questions the foundation of counterfactual reasoning.150
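The simplest counterfactual intervention, deleting one token at a time and watching the prediction move, can be sketched as follows (placeholder checkpoint; as discussed above, such crude deletions are not semantically valid counterfactuals, which is exactly why manual creation remains the most reliable approach):

```python
from transformers import pipeline

clf = pipeline("text-classification", model="your-org/clinical-classifier")

text = "Patient reports severe chest pain and dyspnea."
base = clf(text)[0]["score"]
words = text.split()
for i, tok in enumerate(words):
    perturbed = " ".join(words[:i] + words[i + 1:])
    delta = base - clf(perturbed)[0]["score"]
    print(f"{tok:>10s}  delta = {delta:+.3f}")  # large drop => token mattered
```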
Surrogate models for post hoc interpretability: SHAP is one of the widely adopted surrogate-model-based methods that can be thought of
as using additive surrogate models as an explanation. Shapley values are theoretically shown to be locally faithful, but there is no empirical
evidence on whether this property is maintained after the SHAP approximation. Subsequent work also finds other limitations: linear surrogate
models have limited expressivity. For example, if the decision boundary is a circle and the target example is inside the circle, it is impossible to
derive a locally faithful linear approximation. Besides, they can result in nonsensical inputs or representations, which sometimes allow adver-
saries to manipulate the explanation.151 More importantly, one major concern with using SHAP in the medical domain is that the Shapley
value was originally derived from economics, where the cost is additive. However, clinical features are usually heterogeneous, and the
Shapley values derived from the model may not be meaningful.152
Overall, explanation methods should be selected and evaluated both to help model developers (data scientists and machine learning practitioners) understand how their
models behave and to assist clinicians and biomedical researchers to understand the rationale for predictions produced by the model.
ACKNOWLEDGMENTS
The work was supported by the U.S. National Science Foundation under Grant Numbers 1750326 and 2212175 and the U.S. National Institutes
of Health under Grant Numbers R01MH124740, R01AG076448, R01AG080991, R01AG076234, RF1AG072449, and R01AG080624.
AUTHOR CONTRIBUTIONS
F.W. contributed to the conceptualization and reviewing and editing of the manuscript; D.L. contributed to the investigation, drafting, and
reviewing and editing of the manuscript; X.W. contributed to the visualization and reviewing and editing of the manuscript; Y.C. contributed
to reviewing and editing of the manuscript.
DECLARATION OF INTERESTS
The authors declare no competing interests.
REFERENCES
1. Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving Language Understanding by Generative Pre-training.
2. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at arXiv. https://doi.org/10.48550/arXiv.1810.04805.
3. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog 1, 9.
4. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep Contextualized Word Representations. Held in New Orleans, Louisiana (Association for Computational Linguistics), pp. 2227–2237.
5. Conneau, A., and Lample, G. (2019). Cross-lingual language model pretraining. Adv. Neural Inf. Process. Syst. 32.
6. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P.J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 5485–5551.
7. Song, K., Tan, X., Qin, T., Lu, J., and Liu, T.-Y. (2019). MASS: masked sequence to sequence pre-training for language generation. Preprint at arXiv. https://doi.org/10.48550/arXiv.1905.02450.
8. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., and Le, Q.V. (2019). XLNet: generalized autoregressive pretraining for language understanding. Adv. Neural Inf. Process. Syst. 32.
9. Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. (2019). BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. Preprint at arXiv. https://doi.org/10.48550/arXiv.1910.13461.
10. Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M., Lewis, M., and Zettlemoyer, L. (2020). Multilingual denoising pre-training for neural machine translation. Trans. Assoc. Comput. Ling. 8, 726–742.
11. Doshi-Velez, F., and Kim, B. (2017). Towards a rigorous science of interpretable machine learning. Preprint at arXiv. https://doi.org/10.48550/arXiv.1702.08608.
12. Meng, Y., Speier, W., Ong, M.K., and Arnold, C.W. (2021). Bidirectional representation learning from transformers using multimodal electronic health record data to predict depression. IEEE J. Biomed. Health Inform. 25, 3121–3129. https://doi.org/10.1109/jbhi.2021.3063721.
13. Shang, J., Ma, T., Xiao, C., and Sun, J. (2019). Pre-training of graph augmented transformers for medication recommendation. Preprint at arXiv. https://doi.org/10.24963/ijcai.2019/825.
14. Zhou, S., Wang, N., Wang, L., Liu, H., and Zhang, R. (2022). CancerBERT: a cancer domain-specific language model for extracting breast cancer phenotypes from electronic health records. J. Am. Med. Inform. Assoc. 29, 1208–1216.
15. Huang, K., Altosaar, J., and Ranganath, R. (2019). ClinicalBERT: modeling clinical notes and predicting hospital readmission. Preprint at arXiv. https://doi.org/10.48550/arXiv.1904.05342.
16. Jin, D., Jin, Z., Zhou, J.T., and Szolovits, P. (2020). Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. Proc. AAAI Conf. Artif. Intell. 34, 8018–8025.
17. Zhang, Y., Nie, A., Zehnder, A., Page, R.L., and Zou, J. (2019). VetTag: improving automated veterinary diagnosis coding via large-scale language modeling. NPJ Digit. Med. 2, 35.
18. Liu, S., Wang, X., Hou, Y., Li, G., Wang, H., Xu, H., Xiang, Y., and Tang, B. (2023). Multimodal data matters: language model pre-training over structured and unstructured electronic health records. IEEE J. Biomed. Health Inform. 27, 504–514.
19. Si, Y., Wang, J., Xu, H., and Roberts, K. (2019). Enhancing clinical concept extraction with contextual embeddings. J. Am. Med. Inform. Assoc. 26, 1297–1304.
20. Zhu, H., Paschalidis, I.C., and Tahmasebi, A. (2018). Clinical concept extraction with contextual word embedding. Preprint at arXiv. https://doi.org/10.48550/arXiv.1810.10566.
21. Alsentzer, E., Murphy, J.R., Boag, W., Weng, W.-H., Jin, D., Naumann, T., and McDermott, M. (2019). Publicly available clinical BERT embeddings. Preprint at arXiv. https://doi.org/10.48550/arXiv.1904.03323.
22. Rasmy, L., Xiang, Y., Xie, Z., Tao, C., and Zhi, D. (2021). Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ Digit. Med. 4, 86.
23. Li, Y., Rao, S., Solares, J.R.A., Hassaine, A., Ramakrishnan, R., Canoy, D., Zhu, Y., Rahimi, K., and Salimi-Khorshidi, G. (2020). BEHRT: transformer for electronic health records. Sci. Rep. 10, 7155.
24. Lewis, P., Ott, M., Du, J., and Stoyanov, V. (2020). Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State-Of-The-Art, pp. 146–157.
25. Peng, Y., Yan, S., and Lu, Z. (2019). Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. Preprint at arXiv. https://doi.org/10.48550/arXiv.1906.05474.
26. Agrawal, M., Hegselmann, S., Lang, H., Kim, Y., and Sontag, D. (2022). Large Language Models Are Few-Shot Clinical Information Extractors, pp. 1998–2022.
27. Chang, D., Hong, W.S., and Taylor, R.A. (2020). Generating contextual embeddings for emergency department chief complaints. JAMIA Open 3, 160–166.
28. Yang, X., Chen, A., PourNejatian, N., Shin, H.C., Smith, K.E., Parisien, C., Compas, C., Martin, C., Costa, A.B., Flores, M.G., et al. (2022). A large language model for electronic health records. NPJ Digit. Med. 5, 194.
29. Huang, K., Singh, A., Chen, S., Moseley, E.T., Deng, C.-Y., George, N., and Lindvall, C. (2019). Clinical XLNet: modeling sequential clinical notes and predicting prolonged mechanical ventilation. Preprint at arXiv. https://doi.org/10.48550/arXiv.1912.11975.
30. Michalopoulos, G., Wang, Y., Kaka, H., Chen, H., and Wong, A. (2020). UmlsBERT: clinical domain knowledge augmentation of contextual embeddings using the unified medical language system metathesaurus. Preprint at arXiv. https://doi.org/10.18653/v1/2021.naacl-main.139.
31. Kades, K., Sellner, J., Koehler, G., Full, P.M., Lai, T.Y.E., Kleesiek, J., and Maier-Hein, K.H. (2021). Adapting bidirectional encoder representations from transformers (BERT) to assess clinical semantic textual similarity: algorithm development and validation study. JMIR Med. Inf. 9, e22795.
32. Yang, X., Bian, J., Hogan, W.R., and Wu, Y. (2020). Clinical concept extraction using transformers. J. Am. Med. Inform. Assoc. 27, 1935–1942.
33. Chen, Y.-P., Chen, Y.-Y., Lin, J.-J., Huang, C.-H., and Lai, F. (2020). Modified bidirectional encoder representations from transformers extractive summarization model for hospital information systems based on character-level tokens (AlphaBERT): development and performance evaluation. JMIR Med. Inf. 8, e17787.
34. Wang, J., Zhang, G., Wang, W., Zhang, K., and Sheng, Y. (2021). Cloud-based intelligent self-diagnosis and department recommendation service using Chinese medical BERT. J. Cloud Comput. 10, 1–12.
35. Zhang, N., Jia, Q., Yin, K., Dong, L., Gao, F., and Hua, N. (2020). Conceptualized representation learning for Chinese biomedical text mining. Preprint at arXiv.
36. Kraljevic, Z., Shek, A., Bean, D., Bendayan, R., Teo, J., and Dobson, R. (2021). MedGPT: medical concept prediction from clinical narratives. Preprint at arXiv. https://doi.org/10.48550/arXiv.2107.03134.
37. Khin, K., Burckhardt, P., and Padman, R. (2018). A deep learning architecture for de-identification of patient notes: implementation and evaluation. Preprint at arXiv. https://doi.org/10.48550/arXiv.1810.01570.
38. Yang, X., He, X., Zhang, H., Ma, Y., Bian, J., and Wu, Y. (2020). Measurement of semantic textual similarity in clinical texts: comparison of transformer-based models. JMIR Med. Inf. 8, e19735.
39. Xiong, Y., Chen, S., Chen, Q., Yan, J., and Tang, B. (2020). Using character-level and entity-level representations to enhance bidirectional encoder representation from transformers-based clinical semantic textual similarity model: ClinicalSTS modeling study. JMIR Med. Inf. 8, e23357.
40. Mahajan, D., Poddar, A., Liang, J.J., Lin, Y.-T., Prager, J.M., Suryanarayanan, P., Raghavan, P., and Tsou, C.-H. (2020). Identification of semantically similar sentences in clinical notes: iterative intermediate training using multi-task learning. JMIR Med. Inf. 8, e22508.
41. Yan, A., McAuley, J., Lu, X., Du, J., Chang, E.Y., Gentili, A., and Hsu, C.-N. (2022). RadBERT: adapting transformer-based language models to radiology. Radiol. Artif. Intell. 4, e210258.
42. Lau, W., Lybarger, K., Gunn, M.L., and Yetisgen, M. (2023). Event-based clinical finding extraction from radiology reports with pre-trained language model. J. Digit. Imaging 36, 91–104.
43. Meng, X., Ganoe, C.H., Sieberg, R.T., Cheung, Y.Y., and Hassanpour, S. (2020). Self-supervised contextual language representation of radiology reports to improve the identification of communication urgency. AMIA Jt. Summits Transl. Sci. Proc. 2020, 413–421.
44. Bressem, K.K., Adams, L.C., Gaudin, R.A., Tröltzsch, D., Hamm, B., Makowski, M.R., Schüle, C.Y., Vahldiek, J.L., and Niehues, S.M. (2020). Highly accurate classification of chest radiographic reports using a deep learning natural language model pre-trained on 3.8 million text reports. Bioinformatics 36, 5255–5261.
45. Naseem, U., Khushi, M., and Kim, J. (2022). Vision-language transformer for interpretable pathology visual question answering. IEEE J. Biomed. Health Inform. 27, 1681–1690.
46. Li, Y., Wang, H., and Luo, Y. (2020). A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports (IEEE), pp. 1999–2004.
47. Khare, Y., Bagal, V., Mathew, M., Devi, A., Priyakumar, U.D., and Jawahar, C. (2021). MMBERT: Multimodal BERT Pretraining for Improved Medical VQA, pp. 1033–1036.
48. Moon, J.H., Lee, H., Shin, W., Kim, Y.-H., and Choi, E. (2022). Multi-modal understanding and generation for medical images and text via vision-language pre-training. IEEE J. Biomed. Health Inform. 26, 6070–6080.
49. Chen, Z., Li, G., and Wan, X. (2022). Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge.
clinical BERT embeddings. Preprint at arXiv. biomedical text mining. Preprint at arXiv. Vision-And-Language Pre-training with
https://doi.org/10.48550/arXiv.1904.03323. https://doi.org/10.48550/arXiv.2008.10813. Knowledge, pp. 5152–5161.
22. Rasmy, L., Xiang, Y., Xie, Z., Tao, C., and Zhi, 36. Kraljevic, Z., Shek, A., Bean, D., Bendayan, 50. Chen, Z., Du, Y., Hu, J., Liu, Y., Li, G., Wan, X.,
D. (2021). Med-BERT: pretrained R., Teo, J., and Dobson, R. (2021). MedGPT: and Chang, T.-H. (2022). Multi-modal
contextualized embeddings on large-scale medical concept prediction from clinical Masked Autoencoders for Medical
50. Chen, Z., Du, Y., Hu, J., Liu, Y., Li, G., Wan, X., and Chang, T.-H. (2022). Multi-modal Masked Autoencoders for Medical Vision-And-Language Pre-training (Springer), pp. 679–689.
51. Monajatipoor, M., Rouhsedaghat, M., Li, L.H., Jay Kuo, C.-C., Chien, A., and Chang, K.-W. (2022). Berthop: An Effective Vision-And-Language Model for Chest X-Ray Disease Diagnosis (Springer), pp. 725–734.
52. Boecking, B., Usuyama, N., Bannur, S., Castro, D.C., Schwaighofer, A., Hyland, S., Wetscherek, M., Naumann, T., Nori, A., and Alvarez-Valle, J. (2022). Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing (Springer), pp. 1–21.
53. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., and Kang, J. (2020). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240.
54. Shin, H.-C., Zhang, Y., Bakhturina, E., Puri, R., Patwary, M., Shoeybi, M., and Mani, R. (2020). Biomegatron: larger biomedical domain language model. Preprint at arXiv. https://doi.org/10.48550/arXiv.2010.06060.
55. Gu, Y., Tinn, R., Cheng, H., Lucas, M., Usuyama, N., Liu, X., Naumann, T., Gao, J., and Poon, H. (2021). Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. 3, 1–23.
56. Luo, R., Sun, L., Xia, Y., Qin, T., Zhang, S., Poon, H., and Liu, T.-Y. (2022). BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief. Bioinform. 23, bbac409.
57. Kanakarajan, K.R., Kundumani, B., and Sankarasubbu, M. (2021). BioELECTRA: Pretrained Biomedical Text Encoder Using Discriminators, pp. 143–154.
58. Yasunaga, M., Leskovec, J., and Liang, P. (2022). Linkbert: pretraining language models with document links. Preprint at arXiv. https://doi.org/10.48550/arXiv.2203.15827.
59. Miolo, G., Mantoan, G., and Orsenigo, C. (2021). Electramed: A new pre-trained language representation model for biomedical nlp. Preprint at arXiv. https://doi.org/10.48550/arXiv.2104.09585.
60. Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A., Saravia, E., Poulton, A., Kerkez, V., and Stojnic, R. (2022). Galactica: a large language model for science. Preprint at arXiv. https://doi.org/10.48550/arXiv.2211.09085.
61. Jin, Q., Dhingra, B., Cohen, W.W., and Lu, X. (2019). Probing biomedical embeddings from language models. Preprint at arXiv. https://doi.org/10.48550/arXiv.1904.02181.
62. Naseem, U., Dunn, A.G., Khushi, M., and Kim, J. (2022). Benchmarking for biomedical natural language processing tasks with a domain specific albert. BMC Bioinf. 23, 144.
63. Yuan, Z., Liu, Y., Tan, C., Huang, S., and Huang, F. (2021). Improving biomedical pretrained language models with knowledge. Preprint at arXiv. https://doi.org/10.48550/arXiv.2104.10344.
64. Tinn, R., Cheng, H., Gu, Y., Usuyama, N., Liu, X., Naumann, T., Gao, J., and Poon, H. (2023). Fine-tuning large neural language models for biomedical natural language processing. Patterns 4, 100729.
65. Ozyurt, I.B. (2020). On the effectiveness of small, discriminatively pre-trained language representation models for biomedical text mining. Preprint at bioRxiv. https://doi.org/10.18653/v1/2020.sdp-1.12.
66. Moradi, M., Dorffner, G., and Samwald, M. (2020). Deep contextualized embeddings for quantifying the informative content in biomedical text summarization. Comput. Methods Programs Biomed. 184, 105117.
67. Xie, Q., Bishop, J.A., Tiwari, P., and Ananiadou, S. (2022). Pre-trained language models with domain knowledge for biomedical extractive summarization. Knowl. Base Syst. 252, 109460.
68. Du, Y., Li, Q., Wang, L., and He, Y. (2020). Biomedical-domain pre-trained language model for extractive summarization. Knowl. Base Syst. 199, 105964.
69. Wallace, B.C., Saha, S., Soboczenski, F., and Marshall, I.J. (2021). Generating (factual?) narrative summaries of rcts: Experiments with neural multi-document summarization. AMIA Jt. Summits Transl. Sci. Proc. 2021, 605–614.
70. Guo, Y., Qiu, W., Wang, Y., and Cohen, T. (2021). Automated Lay Language Summarization of Biomedical Scientific Reviews, 1, pp. 160–168.
71. Kieuvongngam, V., Tan, B., and Niu, Y. (2020). Automatic text summarization of covid-19 medical research articles using bert and gpt-2. Preprint at arXiv. https://doi.org/10.48550/arXiv.2006.01997.
72. Chakraborty, S., Bisong, E., Bhatt, S., Wagner, T., Elliott, R., and Mosconi, F. (2020). BioMedBERT: A Pre-trained Biomedical Language Model for QA and IR, pp. 669–679.
73. Oniani, D., and Wang, Y. (2020). A Qualitative Evaluation of Language Models on Automatic Question-Answering for Covid-19, pp. 1–9.
74. Liévin, V., Hother, C.E., and Winther, O. (2022). Can large language models reason about medical questions? Preprint at arXiv. https://doi.org/10.48550/arXiv.2207.08143.
75. He, Y., Zhu, Z., Zhang, Y., Chen, Q., and Caverlee, J. (2020). Infusing disease knowledge into bert for health question answering, medical inference and disease name recognition. Preprint at arXiv. https://doi.org/10.48550/arXiv.2010.03746.
76. Hao, B., Zhu, H., and Paschalidis, I.C. (2020). Enhancing Clinical Bert Embedding Using a Biomedical Knowledge Base.
77. Liu, F., Shareghi, E., Meng, Z., Basaldella, M., and Collier, N. (2020). Self-alignment pretraining for biomedical entity representations. Preprint at arXiv. https://doi.org/10.48550/arXiv.2010.11784.
78. Singhal, K., Azizi, S., Tu, T., Mahdavi, S.S., Wei, J., Chung, H.W., Scales, N., Tanwani, A., Cole-Lewis, H., and Pfohl, S. (2022). Large language models encode clinical knowledge. Preprint at arXiv. https://doi.org/10.48550/arXiv.2212.13138.
79. Naseem, U., Lee, B.C., Khushi, M., Kim, J., and Dunn, A.G. (2022). Benchmarking for public health surveillance tasks on social media with a domain-specific pretrained language model. Preprint at arXiv. https://doi.org/10.18653/v1/2022.nlppower-1.3.
80. Müller, M., Salathé, M., and Kummervold, P.E. (2023). Covid-twitter-bert: A natural language processing model to analyse covid-19 content on twitter. Front. Artif. Intell. 6, 1023281.
81. Tutubalina, E., Alimova, I., Miftahutdinov, Z., Sakhovskiy, A., Malykh, V., and Nikolenko, S. (2021). The Russian Drug Reaction Corpus and neural models for drug reactions and effectiveness detection in user reviews. Bioinformatics 37, 243–249.
82. Ji, S., Zhang, T., Ansari, L., Fu, J., Tiwari, P., and Cambria, E. (2021). Mentalbert: publicly available pretrained language models for mental healthcare. Preprint at arXiv. https://doi.org/10.48550/arXiv.2110.15621.
83. Papanikolaou, Y., and Pierleoni, A. (2020). Dare: Data augmented relation extraction with gpt-2. Preprint at arXiv. https://doi.org/10.48550/arXiv.2004.13845.
84. Papanikolaou, Y., Roberts, I., and Pierleoni, A. (2019). Deep bidirectional transformers for relation extraction without supervision. Preprint at arXiv. https://doi.org/10.48550/arXiv.1911.00313.
85. Wang, D., Hu, W., Cao, E., and Sun, W. (2020). Global-to-local neural networks for document-level relation extraction. Preprint at arXiv. https://doi.org/10.48550/arXiv.2009.10359.
86. Cabot, P.-L.H., and Navigli, R. (2021). REBEL: Relation Extraction by End-To-End Language Generation, pp. 2370–2381.
87. Weber, L., Sänger, M., Garda, S., Barth, F., Alt, C., and Leser, U. (2022). Chemical–protein relation extraction with ensembles of carefully tuned pretrained language models. Database 2022, baac098.
88. Heinzinger, M., Elnaggar, A., Wang, Y., Dallago, C., Nechaev, D., Matthes, F., and Rost, B. (2019). Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinf. 20, 1–17.
89. Rives, A., Meier, J., Sercu, T., Goyal, S., Lin, Z., Liu, J., Guo, D., Ott, M., Zitnick, C.L., Ma, J., and Fergus, R. (2021). Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. USA 118, e2016239118.
90. Xiao, Y., Qiu, J., Li, Z., Hsieh, C.-Y., and Tang, J. (2021). Modeling protein using large-scale pretrain language model. Preprint at arXiv. https://doi.org/10.48550/arXiv.2108.07435.
91. Brandes, N., Ofer, D., Peleg, Y., Rappoport, N., and Linial, M. (2022). ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics 38, 2102–2110.
92. Weissenow, K., Heinzinger, M., and Rost, B. (2022). Protein language-model embeddings for fast, accurate, and alignment-free protein structure prediction. Structure 30, 1169–1177.e4.
93. Ji, Y., Zhou, Z., Liu, H., and Davuluri, R.V. (2021). DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics 37, 2112–2120.
94. Yamada, K., and Hamada, M. (2022). Prediction of RNA–protein interactions using a nucleotide language model. Bioinform. Adv. 2, vbac023.
95. Mock, F., Kretschmer, F., Kriese, A., Böcker, S., and Marz, M. (2022). Taxonomic classification of DNA sequences beyond sequence similarity using deep neural networks. Proc. Natl. Acad. Sci. USA 119, e2122636119.
96. Heinzinger, M., Weissenow, K., Sanchez, J.G., Henkel, A., Steinegger, M., and Rost, B. (2023). ProstT5: Bilingual language model for protein sequence and structure. Preprint at bioRxiv. https://doi.org/10.1101/2023.07.23.550085.
97. Danilov, G., Kotik, K., Shevchenko, E., Usachev, D., Shifrin, M., Strunina, Y., Tsukanova, T., Ishankulov, T., Lukshin, V., and Potapov, A. (2022). Predicting the length of stay in neurosurgery with RuGPT-3 language model. Stud. Health Technol. Inform. 295, 555–558.
98. Wang, M., Chen, J., and Lin, S. (2021). Medication recommendation based on a knowledge-enhanced pre-training model, pp. 290–294.
99. Wang, F., Zhou, Y., Wang, S., Vardhanabhuti, V., and Yu, L. (2022). Multi-granularity cross-modal alignment for generalized medical visual representation learning. Adv. Neural Inf. Process. Syst. 35, 33536–33549.
100. Kaur, N., and Mittal, A. (2022). RadioBERT: A deep learning-based system for medical report generation from chest X-ray images using contextual embeddings. J. Biomed. Inform. 135, 104220.
101. Zhang, L., Fan, H., Peng, C., Rao, G., and Cong, Q. (2020). Sentiment Analysis Methods for HPV Vaccines Related Tweets Based on Transfer Learning, 3 (MDPI), p. 307.
102. Naseem, U., Khushi, M., Reddy, V., Rajendran, S., Razzak, I., and Kim, J. (2021). Bioalbert: A Simple and Effective Pre-trained Language Model for Biomedical Named Entity Recognition (IEEE), pp. 1–7.
103. Jain, S., and Wallace, B.C. (2019). Attention is not explanation. Preprint at arXiv. https://doi.org/10.48550/arXiv.1902.10186.
104. Wiegreffe, S., and Pinter, Y. (2019). Attention is not not explanation. Preprint at arXiv. https://doi.org/10.48550/arXiv.1908.04626.
105. Hao, Y., Dong, L., Wei, F., and Xu, K. (2021). Self-attention attribution: Interpreting information interactions inside transformer, 35, pp. 12963–12971.
106. Córdova Sáenz, C.A., and Becker, K. (2021). Assessing the Use of Attention Weights to Interpret BERT-Based Stance Classification, pp. 194–201.
107. Shi, T., Zhang, X., Wang, P., and Reddy, C.K. (2021). Corpus-level and concept-based explanations for interpretable document classification. ACM Trans. Knowl. Discov. Data 16, 1–17. Article 48. https://doi.org/10.1145/3477539.
108. Chrysostomou, G., and Aletras, N. (2021). Improving the faithfulness of attention-based explanations with task-specific information for text classification. Preprint at arXiv. https://doi.org/10.48550/arXiv.2105.02657.
109. Bacco, L., Cimino, A., Dell'Orletta, F., and Merone, M. (2021). Explainable sentiment analysis: a hierarchical transformer-based extractive summarization approach. Electronics 10, 2195.
110. Niu, S., Yin, Q., Song, Y., Guo, Y., and Yang, X. (2021). Label dependent attention model for disease risk prediction using multimodal electronic health records, pp. 449–458.
111. Tutek, M., and Snajder, J. (2022). Toward practical usage of the attention mechanism as a tool for interpretability. IEEE Access 10, 47011–47030. https://doi.org/10.1109/ACCESS.2022.3169772.
112. Liu, D., Greene, D., and Dong, R. (2022). A novel perspective to look at attention: bi-level attention-based explainable topic modeling for news classification. Preprint at arXiv. https://doi.org/10.18653/v1/2022.findings-acl.178.
113. Rigotti, M., Miksovic, C., Giurgiu, I., Gschwind, T., and Scotton, P. (2021). Attention-based Interpretability with Concept Transformers.
114. Attanasio, G., Nozza, D., Pastor, E., and Hovy, D. (2022). Benchmarking Post-hoc Interpretability Approaches for Transformer-Based Misogyny Detection (Association for Computational Linguistics).
115. Li, J., Chen, X., Hovy, E., and Jurafsky, D. (2015). Visualizing and understanding neural models in NLP. Preprint at arXiv. https://doi.org/10.48550/arXiv.1506.01066.
116. Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. Preprint at arXiv. https://doi.org/10.48550/arXiv.1409.0473.
117. Mullenbach, J., Wiegreffe, S., Duke, J., Sun, J., and Eisenstein, J. (2018). Explainable prediction of medical codes from clinical text. Preprint at arXiv. https://doi.org/10.48550/arXiv.1802.05695.
118. Xie, Q., Ma, X., Dai, Z., and Hovy, E. (2017). An interpretable knowledge transfer model for knowledge base completion. Preprint at arXiv. https://doi.org/10.48550/arXiv.1704.05908.
119. Ding, S., and Koehn, P. (2021). Evaluating saliency methods for neural language models. Preprint at arXiv. https://doi.org/10.48550/arXiv.2104.05824.
120. Malkiel, I., Ginzburg, D., Barkan, O., Caciularu, A., Weill, J., and Koenigstein, N. (2022). Interpreting BERT-based Text Similarity via Activation and Saliency Maps. In Proceedings of the ACM Web Conference 2022 (Association for Computing Machinery).
121. Rajani, N.F., McCann, B., Xiong, C., and Socher, R. (2019). Explain yourself! leveraging language models for commonsense reasoning. Preprint at arXiv. https://doi.org/10.48550/arXiv.1906.02361.
122. Abujabal, A., Roy, R.S., Yahya, M., and Weikum, G. (2017). Quint: Interpretable Question Answering over Knowledge Bases, pp. 61–66.
123. Brand, E., Roitero, K., Soprano, M., Rahimi, A., and Demartini, G. (2022). A neural model to jointly predict and explain truthfulness of statements. J. Data Inf. Qual. 15, 1–19. Article 4. https://doi.org/10.1145/3546917.
124. Sammani, F., Mukherjee, T., and Deligiannis, N. (2022). NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks, pp. 8312–8322.
125. Dunn, A., Inkpen, D., and Andonie, R. (2021). Context-Sensitive Visualization of Deep Learning Natural Language Processing Models (IEEE), pp. 170–175.
126. Li, Z., Wang, X., Yang, W., Wu, J., Zhang, Z., Liu, Z., Sun, M., Zhang, H., and Liu, S. (2022). A unified understanding of deep NLP models for text classification. IEEE Trans. Vis. Comput. Graph. 28, 4980–4994. https://doi.org/10.1109/TVCG.2022.3184186.
127. Aflalo, E., Du, M., Tseng, S.Y., Liu, Y., Wu, C., Duan, N., and Lal, V. (2022). VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers, pp. 21374–21383.
128. Yan, X., Jian, F., and Sun, B. (2021). SAKG-BERT: enabling language representation with knowledge graphs for chinese sentiment analysis. IEEE Access 9, 101695–101701. https://doi.org/10.1109/ACCESS.2021.3098180.
129. Islam, S.M., and Bhattacharya, S. (2022). AR-BERT: aspect-relation enhanced aspect-level sentiment classification with multi-modal explanations. In Proceedings of the ACM Web Conference 2022 (Association for Computing Machinery). https://doi.org/10.48550/arXiv.2108.11656.
130. Feder, A., Oved, N., Shalit, U., and Reichart, R. (2021). Causalm: Causal model explanation through counterfactual language models. Comput. Ling. 47, 333–386.
131. Taylor, N., Sha, L., Joyce, D.W., Lukasiewicz, T., Nevado-Holgado, A., and Kormilitzin, A. (2021). Rationale production to support clinical decision-making. Preprint at arXiv. https://doi.org/10.48550/arXiv.2111.07611.
132. Li, D., Hu, B., Chen, Q., Xu, T., Tao, J., and Zhang, Y. (2022). Unifying model explainability and robustness for joint text classification and rationale extraction, 36, pp. 10947–10955.
133. Creswell, A., Shanahan, M., and Higgins, I. (2022). Selection-inference: exploiting large language models for interpretable logical reasoning. Preprint at arXiv. https://doi.org/10.48550/arXiv.2205.09712.
134. Poerner, N., Roth, B., and Schütze, H. (2018). Evaluating neural network explanation methods using hybrid documents and morphological agreement. Preprint at arXiv. https://doi.org/10.18653/v1/P18-1032.
135. Croce, D., Rossini, D., and Basili, R. (2018). Explaining Non-linear Classifier Decisions within Kernel-Based Deep Architectures, pp. 16–24.
136. Aken, B.v., Winter, B., Löser, A., and Gers, F.A. (2019). How does BERT answer questions? a layer-wise analysis of transformer representations. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (Association for Computing Machinery). https://doi.org/10.1145/3357384.3358028.
137. Aken, B.v., Winter, B., Löser, A., and Gers, F.A. (2020). VisBERT: hidden-state visualizations for transformers. In Companion Proceedings of the Web Conference 2020 (Association for Computing Machinery). https://doi.org/10.48550/arXiv.2011.04507.
138. Sevastjanova, R., Kalouli, A.-L., Beck, C., Schäfer, H., and El-Assady, M. (2021). Explaining Contextualization in Language Models Using Visual Analytics, pp. 464–476.
139. Janizek, J.D., Sturmfels, P., and Lee, S.-I. (2021). Explaining explanations: axiomatic feature interactions for deep networks. J. Mach. Learn. Res. 22, Article 104.
140. Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.R., and Samek, W. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS One 10, e0130140.
141. Shrikumar, A., Greenside, P., Shcherbina, A., and Kundaje, A. (2016). Not just a black box: learning important features through propagating activation differences. Preprint at arXiv. https://doi.org/10.48550/arXiv.1605.01713.
142. Feng, S., Wallace, E., Grissom, A., II, Iyyer, M., Rodriguez, P., and Boyd-Graber, J. (2018). Pathologies of neural models make interpretations difficult. Preprint at arXiv. https://doi.org/10.18653/v1/D18-1407.
143. Ghorbani, A., Abid, A., and Zou, J. (2019). Interpretation of neural networks is fragile, 33, pp. 3681–3688.
144. Martins, A., and Astudillo, R. (2016). From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification (PMLR), pp. 1614–1623.
145. Kaushik, D., Hovy, E., and Lipton, Z.C. (2019). Learning the difference that makes a difference with counterfactually-augmented data. Preprint at arXiv. https://doi.org/10.48550/arXiv.1909.12434.
146. Abraham, E.D., D'Oosterlinck, K., Feder, A., Gat, Y., Geiger, A., Potts, C., Reichart, R., and Wu, Z. (2022). CEBaB: Estimating the causal effects of real-world concepts on NLP model behavior. Adv. Neural Inf. Process. Syst. 35, 17582–17596.
147. Basu, S., Pope, P., and Feizi, S. (2020). Influence functions in deep learning are fragile. Preprint at arXiv. https://doi.org/10.48550/arXiv.2006.14651.
148. Elazar, Y., Ravfogel, S., Jacovi, A., and Goldberg, Y. (2021). Amnesic probing: Behavioral explanation with amnesic counterfactuals. Trans. Assoc. Comput. Ling. 9, 160–175.
149. Wallace, E., Gardner, M., and Singh, S. (2020). Interpreting Predictions of NLP Models, pp. 20–23.
150. De Cao, N., Schlichtkrull, M., Aziz, W., and Titov, I. (2020). How do decisions emerge across layers in neural models? interpretation with differentiable masking. Preprint at arXiv. https://doi.org/10.48550/arXiv.2006.14651.
151. Slack, D., Hilgard, S., Jia, E., Singh, S., and Lakkaraju, H. (2020). Fooling Lime and Shap: Adversarial Attacks on Post Hoc Explanation Methods, pp. 180–186.
152. Kovalerchuk, B., Ahmad, M.A., and Teredesai, A. (2021). Survey of Explainable Machine Learning with Visual and Granular Methods beyond Quasi-Explanations. Interpretable Artificial Intelligence: A Perspective of Granular Computing, pp. 217–267.
153. DeYoung, J., Jain, S., Rajani, N.F., Lehman, E., Xiong, C., Socher, R., and Wallace, B.C. (2019). ERASER: A benchmark to evaluate rationalized NLP models. Preprint at arXiv. https://doi.org/10.48550/arXiv.1911.03429.
154. Jacovi, A., and Goldberg, Y. (2020). Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? Preprint at arXiv. https://doi.org/10.48550/arXiv.2004.03685.
155. Weerts, H.J., van Ipenburg, W., and Pechenizkiy, M. (2019). A human-grounded evaluation of shap for alert processing. Preprint at arXiv. https://doi.org/10.48550/arXiv.1907.03324.
156. Bhatt, U., Xiang, A., Sharma, S., Weller, A., Taly, A., Jia, Y., Ghosh, J., Puri, R., Moura, J.M.F., and Eckersley, P. (2020). Explainable machine learning in deployment. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (Association for Computing Machinery). https://doi.org/10.48550/arXiv.1909.06342.
157. Holzinger, A., Keiblinger, K., Holub, P., Zatloukal, K., and Müller, H. (2023). AI for life: Trends in artificial intelligence for biotechnology. N. Biotechnol. 74, 16–24.
158. Muller, H., Mayrhofer, M.T., Van Veen, E.-B., and Holzinger, A. (2021). The ten commandments of ethical medical AI. Computer 54, 119–123.
159. Kargl, M., Plass, M., and Müller, H. (2022). A literature review on ethics for AI in biomedical research and biobanking. Yearb. Med. Inform. 31, 152–160.
160. Müller, H., Holzinger, A., Plass, M., Brcic, L., Stumptner, C., and Zatloukal, K. (2022). Explainability and causability for artificial intelligence-supported medical image analysis in the context of the European In Vitro Diagnostic Regulation. N. Biotechnol. 70, 67–72.
161. Zhou, J., Müller, H., Holzinger, A., and Chen, F. (2023). Ethical ChatGPT: concerns, challenges, and commandments. Preprint at arXiv. https://doi.org/10.48550/arXiv.2305.10646.
162. Mozannar, H., and Sontag, D. (2020). Consistent Estimators for Learning to Defer to an Expert (PMLR), pp. 7076–7087.
163. Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., Cheng, M., Glaese, M., Balle, B., and Kasirzadeh, A. (2021). Ethical and social risks of harm from language models. Preprint at arXiv. https://doi.org/10.48550/arXiv.2112.04359.