
ARTICLE

https://doi.org/10.1038/s43856-023-00263-3 OPEN

A multi-institutional study using artificial intelligence to provide reliable and fair feedback to surgeons

Dani Kiyasseh1 ✉, Jasper Laca2, Taseen F. Haque2, Brian J. Miles3, Christian Wagner4, Daniel A. Donoho5, Animashree Anandkumar1 & Andrew J. Hung2 ✉

Abstract

Background Surgeons who receive reliable feedback on their performance quickly master the skills necessary for surgery. Such performance-based feedback can be provided by a recently-developed artificial intelligence (AI) system that assesses a surgeon's skills based on a surgical video while simultaneously highlighting aspects of the video most pertinent to the assessment. However, it remains an open question whether these highlights, or explanations, are equally reliable for all surgeons.

Methods Here, we systematically quantify the reliability of AI-based explanations on surgical videos from three hospitals across two continents by comparing them to explanations generated by human experts. To improve the reliability of AI-based explanations, we propose the strategy of training with explanations (TWIX), which uses human explanations as supervision to explicitly teach an AI system to highlight important video frames.

Results We show that while AI-based explanations often align with human explanations, they are not equally reliable for different sub-cohorts of surgeons (e.g., novices vs. experts), a phenomenon we refer to as an explanation bias. We also show that TWIX enhances the reliability of AI-based explanations, mitigates the explanation bias, and improves the performance of AI systems across hospitals. These findings extend to a training environment where medical students can be provided with feedback today.

Conclusions Our study informs the impending implementation of AI-augmented surgical training and surgeon credentialing programs, and contributes to the safe and fair democratization of surgery.

Plain language summary

Surgeons aim to master skills necessary for surgery. One such skill is suturing, which involves connecting objects together through a series of stitches. Mastering these surgical skills can be improved by providing surgeons with feedback on the quality of their performance. However, such feedback is often absent from surgical practice. Although performance-based feedback can be provided, in theory, by recently-developed artificial intelligence (AI) systems that use a computational model to assess a surgeon's skill, the reliability of this feedback remains unknown. Here, we compare AI-based feedback to that provided by human experts and demonstrate that they often overlap with one another. We also show that explicitly teaching an AI system to align with human feedback further improves the reliability of AI-based feedback on new videos of surgery. Our findings outline the potential of AI systems to support the training of surgeons by providing feedback that is reliable and focused on a particular skill, and guide programs that give surgeons qualifications by complementing skill assessments with explanations that increase the trustworthiness of such assessments.

1 Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA. 2 Center for Robotic Simulation and

Education, Catherine & Joseph Aresty Department of Urology, University of Southern California, Los Angeles, CA, USA. 3 Department of Urology, Houston
Methodist Hospital, Houston, TX, USA. 4 Department of Urology, Pediatric Urology and Uro-Oncology, Prostate Center Northwest, St. Antonius-Hospital,
Gronau, Germany. 5 Division of Neurosurgery, Center for Neuroscience, Children’s National Hospital, Washington DC, USA. ✉email: danikiy@hotmail.com;
ajhung@gmail.com

COMMUNICATIONS MEDICINE | (2023)3:42 | https://doi.org/10.1038/s43856-023-00263-3 | www.nature.com/commsmed 1



Surgeons seldom receive feedback on how well they perform surgery1,2, despite evidence that it accelerates their acquisition of skills (e.g., suturing)3–7. Such feedback can be automated, in theory, by artificial intelligence systems8,9. To that end, we recently developed a surgical artificial intelligence system (SAIS) that assesses the skill of a surgeon based on a video of intraoperative activity and simultaneously highlights video frames of pertinent activity. We demonstrated that SAIS reliably automates surgeon skill assessment10 and also examined the fairness of its assessments11. In pursuit of developing a trustworthy AI system capable of safely automating the provision of feedback to surgeons, we must ensure that SAIS' video-frame highlights, which we refer to as AI-based explanations, align with the expectations of expert surgeons (i.e., be reliable)12,13 and be equally reliable for all surgeons (i.e., be fair)14. However, it remains an open question whether AI-based explanations are reliable and fair. If left unchecked, misguided feedback can hinder the professional development of surgeons and unethically disadvantage one surgeon sub-cohort over another (e.g., novices vs. experts).

Although an explanation can take on many forms, it often manifests as the relative importance of data (attention scores) in disciplines where the attention-based transformer architecture15 is used, such as natural language processing16 and protein modelling17,18. An element with a higher attention score is assumed to be more important than one with a lower score. Similarly, with a transformer architecture which operates on videos, SAIS10 generates an attention score for each frame in a surgical video, with high-attention frames assumed to be more relevant to the surgeon skill assessment. When such an AI system's explanations align with those provided by humans, it can direct surgeons to specific aspects of their operating technique that require improvement while simultaneously enhancing its trustworthiness19. As such, there is a pressing need to quantify the reliability of explanations generated by surgical AI systems.

Previous studies have investigated the reliability of AI-based explanations that highlight, for example, important patches of a medical image20–22 or clinical variables23. However, these studies remain qualitative and thereby do not systematically investigate whether explanations are consistently reliable across data points. Studies that quantitatively evaluate AI-based explanations often exclude a comparison to human explanations24,25, a drawback that extends to the preliminary studies aimed at also assessing the fairness of such explanations26,27. Notably, previous work has not quantitatively compared AI-based explanations to human explanations in the context of surgical videos, nor has it proposed a strategy to enhance the reliability and fairness of such explanations.

In this study, we quantify the reliability and fairness of explanations generated by a surgical AI system (SAIS10) that we previously developed and which was shown to reliably assess the skill level of surgeons from videos. Through evaluation on data from three geographically-diverse hospitals, we show that SAIS' attention-based explanations often align, albeit imperfectly, with human explanations. We also find that SAIS generates different quality explanations for different surgeon sub-cohorts (e.g., novices vs. experts), which we refer to as an explanation bias. To address this misalignment between AI-based and human explanations, we devise a general strategy of training with explanations (TWIX), which uses human explanations as supervision to explicitly teach an AI system to highlight important video frames. We find that TWIX enhances the reliability of AI-based explanations, mitigates the explanation bias, and improves the performance of skill assessment systems across hospitals. With SAIS likely to provide feedback to medical students in the near future, we show that our findings extend to a training environment. Our study suggests that SAIS, when used alongside TWIX, has the potential to provide surgeons with accurate feedback on how to improve their operating technique.

Methods

Surgical video samples. In a previous study, we trained and deployed an AI system (SAIS) on videos of a surgical procedure known as a robot-assisted radical prostatectomy (RARP). The purpose of this procedure is to treat cancer by removing a cancerous prostate gland from the body of a patient. In general, to complete a surgical procedure, a surgeon must often perform a sequence of steps. We focus on one particular step of the RARP procedure, referred to as vesico-urethral anastomosis (VUA), in which the bladder and the urethra are connected to one another through a series of stitches. To perform a single stitch, a surgeon must first grasp the needle with a robotic arm (needle handling), and push that needle through tissue (needle driving), before following through and withdrawing the needle from the tissue to prepare for the next stitch (needle withdrawal). In this study, we leveraged SAIS' ability to assess the skill-level of needle handling and needle driving when provided with a video sample depicting these activities.

Live surgical procedure. To obtain video samples from videos of surgical procedures, we adopted the following strategy. Given a video of the VUA step (≈20 min), we first identified the start and end time of each stitch (up to 24 stitches) involved in completing this step. We subsequently identified the start and end time of the needle handling and needle driving activities performed as part of each stitch. A single video sample reflects one such needle handling (or needle driving) activity (≈20−30 s). As such, each VUA step may result in approximately 24 video samples for each activity (see Table 1 for total number of video samples).

Training environment. We also designed a realistic training environment in which medical students, without any prior robotic experience, sutured a gel-like model of the bladder and urethra with a robot that is used by surgeons to perform live surgical procedures. To control for the number of stitches performed by each participant, we marked the gel-like model with 16 target entry/exit points, resulting in 16 video samples from each participant. Although each stitch involved a needle handling and needle driving activity (among others), we only focused on needle handling for the purpose of our analysis (see Table 1 for number of video samples).

Ethics approval. All datasets (data from University of Southern California, St. Antonius Hospital, and Houston Methodist Hospital) were collected under Institutional Review Board (IRB) approval from the University of Southern California in which informed consent was obtained (HS-17-00113). We have also obtained informed consent to present surgical video frames in figures.

Skill assessment annotations. We assembled a team of trained human raters to annotate video samples with skill assessments based on a previously-developed skill assessment taxonomy (also known as an end-to-end assessment of suturing expertise, or EASE28). EASE was formulated through a rigorous Delphi process involving five expert surgeons that identified a strict set of criteria for assessing multiple skills related to suturing (e.g., needle handling, needle driving, etc.). Our team of raters comprised medical students and surgical residents who either helped devise the original skill assessment taxonomy or had been intimately aware of its details.
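The sampling scheme described under "Live surgical procedure" — slicing each annotated activity interval of a VUA video into its own video sample — can be sketched as follows. The data layout and function name here are illustrative assumptions, not the authors' actual pipeline.

```python
# Sketch of turning per-stitch activity annotations into video samples.
# Timestamps are in seconds from the start of the VUA step.

def extract_video_samples(stitches):
    """Flatten per-stitch activity intervals into a list of sample records.

    `stitches` is a list of dicts, one per stitch, mapping an activity name
    (e.g., "needle_handling") to its (start_s, end_s) interval.
    """
    samples = []
    for stitch_idx, activities in enumerate(stitches):
        for activity, (start_s, end_s) in activities.items():
            samples.append({
                "stitch": stitch_idx,
                "activity": activity,
                "start_s": start_s,
                "end_s": end_s,
                "duration_s": end_s - start_s,
            })
    return samples

# Example: a VUA step with two stitches yields one sample per activity.
vua = [
    {"needle_handling": (12.0, 35.5), "needle_driving": (35.5, 60.0)},
    {"needle_handling": (75.0, 98.0), "needle_driving": (98.0, 121.5)},
]
samples = extract_video_samples(vua)
```

With up to 24 stitches per VUA step, this yields the roughly 24 samples per activity mentioned above.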


Table 1 Total number of videos and video samples associated with each of the hospitals and tasks.

Task              Activity  Details          Hospital  Videos  Video samples  Surgeons  Generalizing to
skill assessment  suturing  needle handling  USC       78      912            19        videos
                                             SAH       60      240            18        hospitals
                                             HMH       20      184            5         hospitals
                                             LAB       69      328            38        modality
                            needle driving   USC       78      530            19        videos
                                             SAH       60      280            18        hospitals
                                             HMH       20      220            5         hospitals

Note that we train our model, SAIS, exclusively on the USC data, following a 10-fold Monte Carlo cross-validation setup. For an exact breakdown of the number of video samples in each fold and training, validation, and test split, please refer to Supplementary Tables 2–7. The data from the remaining hospitals are exclusively used for inference. SAIS is always trained and evaluated on a class-balanced set of data whereby each category (e.g., low skill and high skill) contains the same number of samples. This prevents SAIS from being negatively affected by a sampling bias during training, and allows for a more intuitive appreciation of the evaluation results. USC University of Southern California, SAH St. Antonius Hospital, HMH Houston Methodist Hospital.
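The class balancing mentioned in the table note — each skill category contributing the same number of samples — can be sketched by subsampling the majority class. The sample layout and seed handling are illustrative assumptions.

```python
import random

def balance_classes(samples, label_key="skill", seed=0):
    """Subsample so that every label contributes the same number of samples.

    A minimal sketch of the class balancing described in the table note,
    not the authors' actual data pipeline.
    """
    rng = random.Random(seed)
    by_label = {}
    for s in samples:
        by_label.setdefault(s[label_key], []).append(s)
    n = min(len(group) for group in by_label.values())  # minority-class count
    balanced = []
    for label, group in sorted(by_label.items()):
        balanced.extend(rng.sample(group, n))
    return balanced

# Example: 30 low-skill and 50 high-skill samples -> 30 of each.
data = [{"skill": "low"}] * 30 + [{"skill": "high"}] * 50
balanced = balance_classes(data)
```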

Exact criteria for skill assessment annotations. The skill-level of needle handling is assessed by observing the number of times a surgeon had to reposition their grasp of the needle. Fewer repositions imply a high skill-level, as it is indicative of improved surgeon dexterity and intent. The skill-level of needle driving is assessed by observing the smoothness with which a surgeon pushes the needle through tissue. Smoother driving implies a high skill-level, as it is less likely to cause physical trauma to the tissue.

Mitigating noise in skill assessment annotations. We took several steps to mitigate the degree of noise in the skill assessment annotations. First, EASE outlines a strict set of criteria related to the visual and motion content of a video sample, thereby making it straightforward to identify whether such criteria are satisfied (or violated) upon watching a video sample. This reduces the level of expertise that a rater must ordinarily have in order to annotate a video sample. Second, the raters involved in the annotation process were either a part of the development of the EASE taxonomy or intimately aware of its details. This implied that they were comfortable with the criteria outlined in EASE. Third, and understanding that raters can be imperfect, we subjected them to a training process whereby raters were provided with a training set of video samples and asked to annotate them independently of one another. This process continued until the agreement of their annotations, which was quantified via inter-rater reliability, exceeded 80%. We chose this threshold based on (a) the level of agreement first reported in the study developing the EASE taxonomy and (b) an appreciation that natural variability is likely to exist from one rater to the next in, for example, the amount of attention they place on certain content within a video sample.

After completing their training process, raters were asked to annotate the video samples in this study with binary skill assessments (low vs. high skill). In the event of disagreements in the annotations, we followed the same strategy adopted in the original study10, where the lowest of all scores is considered as the final annotation.

Skill explanation annotations. We assembled a team of two trained human raters to annotate each video sample with segments of time (or equivalently, spans of frames) deemed relevant for a particular skill assessment. We define segments of time as relevant if they reflect the strict set of criteria (or their violation) outlined in the EASE skill assessment taxonomy28. In practice, we asked raters to exclusively annotate video samples tagged as low skill from a previous study (for motivation, see later section). To recap, for the activity of needle handling, a low skill assessment is characterized by three or more grasps of the needle by the surgical instrument. For the activity of needle driving, a low skill assessment is characterized by either four or more adjustments of the needle when being driven through tissue or its complete removal from tissue in the opposite direction to which it was inserted. As such, raters had to identify both visual and motion cues in the surgical field of view in order to annotate segments of time as relevant. We reflect important time segments with a value of 1 and all other time segments with a value of 0.

Visualising explanation annotations. Consider that for a given video sample 30 s in duration, a human rater might annotate the first five seconds (0−5 s) as most important for the skill assessment. Based on our annotations, we show that ≈30% of a single video sample is often identified as important (see Supplementary Table 1). To visualise these annotations, we considered unique video samples in the test set of each of the 10 Monte Carlo folds. Since each video sample may vary in duration, and to facilitate a comparison of the heatmaps across hospitals, we first normalized the time index of each explanation annotation such that it ranged from 0 (beginning of video sample) to 1 (end of video sample). For needle handling, this translates to the beginning and end of needle handling, respectively. A value of 0.20 would therefore refer to the first 20% of the video sample. We subsequently averaged the explanation annotations, whose values are either 0 (irrelevant frame) or 1 (relevant frame), across the video samples for this normalized time index. We repeated the process for all hospitals and the skills of needle handling and needle driving (see Fig. 1).

Training the raters. Before providing such explanation annotations, however, the raters underwent a training process akin to the one conducted for skill assessment annotations. First, raters were familiarized with the criteria outlined in the skill assessment taxonomy. In practice, and to mitigate potential noise in the explanation annotations, our assembled team of raters had, in the past, been involved in providing skill assessment annotations while using the same exact taxonomy. The raters were then provided with a training set of low-skill video samples and asked to independently annotate them with segments of time that they believed were important to that skill assessment. During this time, raters were encouraged to abide by the strict set of criteria outlined in the skill assessment taxonomy. This training process continued until the agreement in their annotations, which was quantified via the intersection over union, exceeded 0.80. This implies that, on average, each segment of time highlighted by one rater exhibited an 80% overlap with that provided by another rater. This value was chosen, as with the skill assessment annotation process, with the appreciation that natural variability in the annotation process is likely to occur. Raters may disagree, for example, on when an important segment of time ends even when both of their explanation annotations capture the bulk of the relevant activity.
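The agreement criterion used when training the raters — the intersection over union (IoU) between two raters' important-time-segment annotations exceeding 0.80 — can be computed on the per-frame 0/1 masks described above. A minimal sketch; the segment format and the 1 frame-per-second resolution are assumptions for illustration.

```python
def segments_to_mask(segments, duration_s, fps=1):
    """Convert [(start_s, end_s), ...] segments into a per-frame 0/1 mask."""
    n = int(duration_s * fps)
    mask = [0] * n
    for start_s, end_s in segments:
        for t in range(int(start_s * fps), min(int(end_s * fps), n)):
            mask[t] = 1
    return mask

def intersection_over_union(mask_a, mask_b):
    """IoU between two binary frame-importance masks."""
    inter = sum(a & b for a, b in zip(mask_a, mask_b))
    union = sum(a | b for a, b in zip(mask_a, mask_b))
    return inter / union if union else 1.0

# Two raters annotating a 30 s clip at 1 frame per second:
rater1 = segments_to_mask([(0, 5), (10, 20)], duration_s=30)
rater2 = segments_to_mask([(0, 6), (11, 20)], duration_s=30)
iou = intersection_over_union(rater1, rater2)  # 14/16 = 0.875, below 0.80? No: above.
```

Here the two raters disagree only at segment boundaries, so their IoU (0.875) clears the 0.80 training threshold.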


Fig. 1 Heatmap of the ground-truth explanation annotations across hospitals. We average the explanation annotations for the a, needle handling and
b, needle driving video samples in the test set of the Monte Carlo folds (see Supplementary Table 2 for total number of samples), and present them over a
normalized time index, where 0 and 1 reflect the beginning and end of a video sample, respectively. A darker shade (which ranges from 0 to 1 as per the
colour bars) implies that a segment of time is of greater importance.
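The averaging behind these heatmaps — binary explanation annotations resampled onto a normalized [0, 1] time index and averaged across video samples — can be sketched as follows. The bin count and nearest-frame resampling are illustrative assumptions, not the authors' exact procedure.

```python
def explanation_heatmap(annotations, n_bins=100):
    """Average binary explanation annotations over a normalized time index.

    Each annotation is a 0/1 list over that sample's frames; since durations
    differ, each is resampled onto n_bins points in [0, 1] before averaging.
    """
    heat = [0.0] * n_bins
    for mask in annotations:
        T = len(mask)
        for b in range(n_bins):
            # Map normalized index b / n_bins back to a frame index.
            frame = min(int(b / n_bins * T), T - 1)
            heat[b] += mask[frame]
    return [h / len(annotations) for h in heat]

# Two samples of different durations, both important in their first 20%:
a = [1, 1] + [0] * 8          # 10 frames
b = [1] * 4 + [0] * 16        # 20 frames
heat = explanation_heatmap([a, b], n_bins=10)
```

A darker shade in Fig. 1 corresponds to a higher averaged value in the resulting vector.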

Aggregating explanation annotations. Upon completing the training process, raters were asked to provide explanation annotations for the video samples used in this study. They were informed that each video sample had been annotated in the past as low skill, and were therefore aware of the specific criteria in the taxonomy to look out for. In the event of disagreements in the explanation annotations, we considered the intersection of the annotations. This ensures that we avoid identifying potentially superfluous video frames as relevant and makes us more confident in the segments of time that overlapped amongst the raters' annotations. Although we experimented with other strategies for aggregating the explanation annotations, such as considering their union, we found this to have a minimal effect on our findings.

Motivation behind focusing on low-skill activity. In this study, our goal was to provide feedback for video samples exclusively depicting low-skill activity. A binary skill assessment system is therefore well aligned with this goal. We focused on low-skill activity for two reasons. First, from a practical standpoint, it is relatively more straightforward to provide an explanation annotation for a video sample depicting low-skill activity than it is for one depicting high-skill activity. This is because human raters simply have to look for segments of time in the video sample during which one (or more) of the criteria outlined in EASE are violated. Second, from an educational standpoint, studies in the domain of educational psychology have demonstrated that corrective feedback following an error is instrumental to learning [1]. As such, our focus on low-skill activity (akin to an error) provides a ripe opportunity for the provision of feedback. We do appreciate, however, that feedback can also be useful when provided for video samples depicting high-skill activity (e.g., through positive reinforcement). We leave this as an extension of our work for the future.

Metrics for evaluating the reliability of explanations. To evaluate the reliability of AI-based explanations, we compared them to human-based ground-truth explanations. After normalizing AI-based explanations (between 0 and 1), we introduced a threshold such that frames with explanations that exceed this threshold are considered important, and unimportant otherwise. For each threshold, we calculated the precision: the proportion of frames deemed important by the AI system that are also identified as important by the human; and the recall: the proportion of all frames identified as important by the human that the AI system also identified as important. By iterating over a range of thresholds, we can generate a precision-recall curve.

The precision-recall curve reflects the trade-off between the precision of AI-based explanations (the proportion of frames identified as important by the AI system which are actually important) and the recall of such explanations (the proportion of all important frames identified as such by the AI system). For example, recall = 1 implies that 100% of the frames identified as important by a human are also identified as such by the AI system. However, an imperfect system can only achieve this outcome by flagging all frames in a video sample as important, irrespective of their ground-truth importance. Naturally, this is an undesirable outcome, as the resultant feedback would no longer be temporally-localized and would thus be less informative. We use the area under the precision-recall curve (AUPRC) as a measure of the reliability of AI-based explanations, as reported in previous studies25.

Metrics for evaluating the bias of explanations. Algorithmic bias is often defined as a discrepancy in the performance of an AI system across sub-cohorts of stakeholders. In this study, we define explanation bias as a discrepancy in the reliability of AI-based explanations across sub-cohorts of surgeons. The intuition is that such a discrepancy implies that a particular sub-cohort of surgeons would systematically receive less reliable feedback, and is thus disadvantaged at a greater rate than other sub-cohorts. To mitigate this explanation bias, we look to improve the reliability of AI-based explanations generated for the disadvantaged sub-cohort of surgeons, referred to as the worst-case AUPRC.

Choice of surgeon groups. When dealing with live surgical videos, we measured the explanation bias against surgeons operating on prostate glands with different volumes, prostate glands with different cancer severity levels (Gleason Score), and against surgeons with a different caseload (total number of robotic surgeries performed during their lifetime). To decide on these groups, we


consulted with a urologist (A.J.H.) about their relevance and opted for groups whose clinical meta-information was most complete in our database, in an effort to increase the number of samples available for analysis. While it may seem out of the ordinary to define a surgeon group based on the prostate volume of a patient being operated on, we note that the distribution of such volumes can also differ across hospitals (e.g., due to patient demographics) (see Supplementary Fig. 1). When dealing with videos from the training environment, we measured the explanation bias against medical students of a different gender.

Choice of surgeon sub-cohorts. When dealing with live surgical videos, we converted each of the surgeon groups into categorical sub-cohorts. Specifically, for the prostate volume group, we decided on the two sub-cohorts of prostate volume ≤ 49 ml and > 49 ml. We chose this for practical reasons, as it was the population median of the patients at USC, thereby providing us with a relatively balanced number of video samples from each sub-cohort, and for clinical reasons, with some evidence illustrating that operating on prostate glands of a larger size can increase operating times29,30. As for the surgeon caseload group, we decided on the two sub-cohorts of caseload ≤ 100 and > 100, based on previous studies31.

SAIS is an AI system for skill assessment. We recently developed SAIS to decode the intraoperative activity of surgeons based exclusively on surgical videos10. Specifically, we demonstrated state-of-the-art performance in assessing the skill-level of surgical activity, such as needle handling and needle driving, across multiple hospitals. In light of these capabilities, we used SAIS as the core AI system throughout this study.

Components of SAIS. We outline the basic components of SAIS here and refer readers to the original study for more details10. In short, SAIS takes two data modalities as input: RGB frames, and optical flow, which measures motion in the field of view over time and which is derived from neighbouring RGB frames. Spatial information is extracted from each of these frames through a vision transformer pre-trained in a self-supervised manner on ImageNet. To capture the temporal information across frames, SAIS learns the relationship between subsequent frames through an attention mechanism (see next section). Greater attention, or importance, is placed on frames deemed more important for the ultimate skill assessment. Repeating this process for all data modalities, SAIS arrives at modality-specific video representations. SAIS aggregates these representations to arrive at a single video representation that summarizes the content of the video sample. This video representation is then used to output a probability distribution over the two skill categories (low vs. high skill).

Generating explanations with SAIS. To summarize a video sample with a single representation, SAIS adopts an approach often observed with transformer networks used for the purpose of classification; it learns a classification token embedding and treats its corresponding representation after the N transformer encoders as the video representation, v (for one of the modalities, e.g., RGB frames). As the attention mechanism still applies to this video representation, we are able to measure its dependence on all frames in the video sample. The higher the dependence on a particular frame, the more important it is for the assessment of surgeon skill. We hypothesized that the temporal relationship between frames at the final layer of the transformer encoder is most strongly associated with the surgeon skill assessment, and as such, we extracted the attention placed on these frames. This method of explanations is referred to as attention in the Results section.

Training and evaluating SAIS. To train and evaluate SAIS, we adopted the same strategy presented in the original study10. Specifically, we trained SAIS using 10-fold Monte Carlo cross-validation on data exclusively from USC. To ensure that we evaluated SAIS in a rigorous manner, each fold was split into a training, validation, and test set, where each set contained samples from surgical videos not present in any of the other sets. Having formulated skill assessment as a binary classification task, we balance the number of video samples from each class (low vs. high skill) in every data split (training, validation, and test). While doing so during training ensures that the model's performance is not biased towards the majority class, balancing the classes during evaluation (e.g., on the test set) allows for a better appreciation of the evaluation metrics we report. For evaluation on data from other hospitals, we deployed all 10 SAIS models (from the 10 folds). As such, we always report metrics as an average across these 10 folds.

TWIX is a module for generating AI-based explanations. Although there exist various ways to incorporate human-based explanations into the learning process of an AI system24,32, we took inspiration from studies demonstrating the effectiveness of using explanations as a target variable33–36. To that end, we propose a strategy entitled training with explanations (TWIX), which explicitly teaches an AI system to generate explanations that match those provided by human experts (see Fig. 2). The intuition is that by incorporating the reasoning used by humans into the learning process of the AI system, it can begin to focus on relevant frames in the video sample and generate explanations that better align with the expectations of humans. To achieve this, we made a simple modification to SAIS (appending a classification module) enabling it to identify the binary importance (important vs. not important) of each frame in the video sample in a supervised manner. Note that TWIX is a general strategy in that it can be used with any architecture, regardless of whether it is attention-based or not.

Outlining the mechanics of TWIX. When dealing with the RGB frames of a video sample, SAIS first extracts spatial features from each frame and then captures the temporal relationship between these frames (see previous section for the attention mechanism). Upon doing so, SAIS outputs both a single D-dimensional video representation, v ∈ ℝ^D, that summarizes the entire video sample and, importantly, the D-dimensional representation of each frame, h_t ∈ ℝ^D, at time-point t ∈ [1, T] in a video sample of duration T seconds. For the purpose of surgeon skill assessment, the video representation suffices and is fed into downstream modules. As part of TWIX, however, whose details are presented below, we operate directly on the representation of each frame. Specifically, we keep the core architecture of SAIS unchanged and simply append a classification module, p_ω : h_t ∈ ℝ^D → ŷ ∈ [0, 1], to map each frame representation, h_t, to a scalar frame-importance variable, ŷ ∈ [0, 1]. The classification module learns which frames are important in a supervised manner based on ground-truth labels, y, provided by humans indicating the binary importance of each frame (important vs. not important).

Training AI systems with TWIX. We retrain SAIS on data exclusively from USC as reported in the original study10. In short, this involves 10-fold Monte Carlo cross-validation where each training fold consists of an equal number of samples from each class (low skill, high skill). The number of samples is provided in Supplementary Table 2. In this study, the difference is that we train SAIS in an end-to-end manner with the TWIX classification module. Specifically, for a single video sample, we optimize the

COMMUNICATIONS MEDICINE | (2023)3:42 | https://doi.org/10.1038/s43856-023-00263-3 | www.nature.com/commsmed 5



Fig. 2 Quantifying the alignment of AI-based explanations with human explanations. A surgical artificial intelligence system (SAIS) can assess the skill of
a surgeon based on a surgical video and generate an explanation for such an assessment by highlighting the relative importance of video frames (e.g., via
an attention mechanism). Human experts annotate video frames most important for the skill assessment. TWIX is a module which uses human
explanations as supervision to explicitly teach an AI system to predict the importance of video frames. We show the alignment of attention and TWIX with
human explanations.
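The attention-based frame importance described above can be sketched numerically: a classification token attends over the frame representations, and the attention it places on each frame is read out as that frame's importance. The dimensions, random weights, and single attention layer below are illustrative assumptions, not the authors' SAIS implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, T = 16, 8                                      # embedding size, number of frames

frames = rng.normal(size=(T, D))                  # frame representations h_1..h_T
cls_token = rng.normal(size=(1, D))               # learned classification token
x = np.concatenate([cls_token, frames], axis=0)   # (T+1, D)

# One head of scaled dot-product self-attention (the paper stacks N
# transformer encoders; a single layer suffices to show the readout).
W_q = rng.normal(size=(D, D))
W_k = rng.normal(size=(D, D))
scores = (x @ W_q) @ (x @ W_k).T / np.sqrt(D)     # (T+1, T+1)
scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# Row 0 is the classification token; its attention over columns 1..T is
# the per-frame importance reported as the "attention" explanation.
frame_importance = attn[0, 1:]                    # (T,)
```

In SAIS, these weights come from the final encoder layer of a trained model rather than random matrices; the readout itself is the same.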

supervised InfoNCE loss, ℒ_InfoNCE, reported in the original study, and the binary cross-entropy loss for the classification module, which we refer to as the importance loss, ℒ_importance. Notably, since skill explanation annotations are only provided for the low-skill video samples, the importance loss is only calculated for those video samples.

ℒ = ℒ_InfoNCE(θ) + ℒ_importance(ω)    (1)

ℒ_importance(ω) = −∑_{t=1}^{T} [(1 − y_t) log ŷ_t + y_t log(1 − ŷ_t)]

Predicting frame importance with TWIX. When SAIS is deployed on unseen data, the classification module, p_ω, can now directly output the importance, ŷ ∈ [0, 1], of each frame in a video sample, and thus act as an alternative to the attention mechanism as an indicator of the relative importance of such frames. We refer to this method as TWIX throughout the paper.

Note that the attention mechanism is still a core element of SAIS and continues to capture the temporal relationships between frames irrespective of whether TWIX is adopted or not. In fact, the method entitled attention (w/ TWIX) in the Results section refers to the attention placed on the frames in a video sample, as per usual, however after having adopted TWIX (i.e., after optimizing eq. (1)). While it may seem that evaluating explanations again based on these attention values is redundant (i.e., akin to attention before the adoption of TWIX), we show that such attention values can in fact change as a result of optimizing eq. (1). This makes sense since attention is a function of the parameters θ, which, in turn, are now affected by the inclusion of the importance loss, ℒ_importance.

Reporting summary. Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
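A minimal numerical sketch of the combined objective in eq. (1): a linear-plus-sigmoid head stands in for p_ω, the importance loss is kept term-by-term as printed, and it is added to the skill loss only for low-skill samples. The toy dimensions and the zero stand-in for ℒ_InfoNCE are assumptions for illustration, not the authors' training code.

```python
import numpy as np

rng = np.random.default_rng(0)
D, T = 16, 8

h = rng.normal(size=(T, D))                   # frame representations h_t
w, b = rng.normal(size=D), 0.0                # TWIX head p_omega: linear + sigmoid
y_hat = 1.0 / (1.0 + np.exp(-(h @ w + b)))    # predicted frame importance in (0, 1)

y = rng.integers(0, 2, size=T).astype(float)  # human binary importance labels

# Importance loss, summed over the T frames (term-by-term as printed).
eps = 1e-9
L_importance = -np.sum((1 - y) * np.log(y_hat + eps) + y * np.log(1 - y_hat + eps))

# Explanation annotations exist only for low-skill samples, so the term
# is masked otherwise; L_infonce stands in for the contrastive skill loss.
is_low_skill = True
L_infonce = 0.0
L_total = L_infonce + (L_importance if is_low_skill else 0.0)
```

In end-to-end training, both terms would be backpropagated jointly, which is why the attention values of SAIS itself can shift once the importance loss is included.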


Fig. 3 TWIX can improve the reliability of AI-based explanations across hospitals. Precision-recall curves reflecting the alignment of different AI-based
explanations with those provided by humans when assessing the skill-level of a, needle handling and b, needle driving. Note that SAIS is trained exclusively
on data from USC and then deployed on data from USC, SAH, and HMH. The solid lines and shaded areas represent the mean and standard deviation,
respectively, across 10 Monte Carlo cross-validation folds.
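The 10 Monte Carlo cross-validation folds referenced above draw, for each repetition, a training fold containing an equal number of samples from each skill class (see Methods). A sketch under assumed toy sample counts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy label table: 60 low-skill (0) and 100 high-skill (1) video samples.
labels = np.array([0] * 60 + [1] * 100)

def balanced_fold(labels, rng, n_per_class=50):
    """One Monte Carlo training fold: sample n_per_class indices from
    each class, drawn independently per repetition (no fixed partition)."""
    picks = [rng.choice(np.flatnonzero(labels == c), size=n_per_class, replace=False)
             for c in (0, 1)]
    return np.concatenate(picks)

folds = [balanced_fold(labels, rng) for _ in range(10)]   # 10 repetitions
```

Unlike k-fold cross-validation, Monte Carlo folds are resampled independently, which is why results are reported as a mean and spread over the 10 repetitions.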

Results

SAIS generates explanations that often align with human explanations. We quantified the reliability of SAIS' explanations (referred to as attention) by comparing them to those generated by human experts. To do so, we first trained and deployed SAIS to perform skill assessment on data from the University of Southern California (USC). We evaluated its explanations for the needle handling and needle driving activities by using the precision-recall (PR) curve [25] (Fig. 3; see Methods for intuition behind PR curves). We found that SAIS' explanations often align, albeit imperfectly, with human explanations. This is evident by AUPRC = 0.488 and 0.428 for the needle handling (Fig. 3a) and needle driving (Fig. 3b) activities, respectively.

Reliability of explanations is inconsistent across hospitals. With AI systems often trained on data from one hospital and deployed on data from another hospital, it is important that their behaviour remains consistent across hospitals. Consistent explanations can, for example, improve the trustworthiness of AI systems [37]. To measure this consistency, we trained SAIS on data from USC and deployed it on data from St. Antonius Hospital (SAH), Gronau, Germany, and Houston Methodist Hospital (HMH), TX, USA (see Table 1 for the number of video samples). We present the precision-recall curves of AI-based explanations for needle handling (Fig. 3a) and needle driving (Fig. 3b). We found that the reliability of SAIS' explanations differs across hospitals. For example, the explanations for needle handling (Fig. 3a) are more reliable when SAIS is deployed on data from SAH and HMH than on data from USC. This is evident by the improved AUPRC = 0.488 → 0.629 at SAH and AUPRC = 0.488 → 0.551 at HMH. We hypothesize that this finding is due to the higher degree of variability in surgical activity depicted in the USC videos relative to that in the other hospitals. This variability might be driven by the larger number of novice surgeons, who can exhibit a wider range of surgical activity compared to expert surgeons.

SAIS exhibits explanation bias against surgeons. We also investigated whether AI-based explanations are equally reliable for different sub-cohorts of surgeons within the same hospital (Fig. 4; see Supplementary Tables 3-4 for the number of samples). We found that SAIS exhibits an explanation bias against surgeon sub-cohorts, whereby its explanations are more reliable for one sub-cohort than another. This is evident, for example, when SAIS assessed the skill-level of needle handling (Fig. 4a) for surgeons operating on prostate glands of different volumes: whereas AUPRC ≈ 0.54 for prostate volumes ≤ 49 ml, AUPRC ≈ 0.47 for prostate volumes > 49 ml. We observed a similar explanation bias when SAIS assessed the skill-level of needle driving (Fig. 4b).

Explanation bias is inconsistent across hospitals. We were further motivated to investigate whether SAIS' explanation bias was consistent across hospitals. To that end, we trained SAIS on data from USC and deployed it on data from SAH and HMH, stratifying the reliability of its explanations across surgeon sub-cohorts (Fig. 4a, b; Supplementary Tables 5-6 outline the number of samples). We found that the explanation bias is inconsistent across hospitals. For example, when SAIS assessed the skill-level of needle handling (Fig. 4a), focusing on surgeons operating on prostate glands of different volumes, we observed an explanation bias at USC (AUPRC ≈ 0.54 vs. 0.47), no bias at SAH (AUPRC ≈ 0.63 vs. 0.63), and an explanation bias against the opposite sub-cohort at HMH (AUPRC ≈ 0.58 vs. 0.63). We found a similarly inconsistent explanation bias for the remaining surgeon groups, and even when SAIS assessed the skill-level of needle driving (Fig. 4b).
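The AUPRC values reported here summarize how well AI frame-importance scores retrieve the frames that humans marked as important. A self-contained sketch of one common way to summarize a PR curve, average precision, with invented toy labels and scores:

```python
import numpy as np

def average_precision(labels, scores):
    """Mean precision at each rank where a truly important frame is
    retrieved, ranking frames by descending AI importance score."""
    order = np.argsort(-scores)
    labels = labels[order]
    hits = np.cumsum(labels)
    ranks = np.arange(1, labels.size + 1)
    return (hits[labels == 1] / ranks[labels == 1]).mean()

# Toy example: humans mark frames 2 and 5 as important; the AI ranks
# frame 2 first and frame 5 third.
human = np.array([0, 0, 1, 0, 0, 1, 0, 0])
ai = np.array([0.1, 0.2, 0.9, 0.3, 0.1, 0.6, 0.7, 0.0])
ap = average_precision(human, ai)     # (1/1 + 2/3) / 2 ≈ 0.833
```

The exact AUPRC estimator used in the study may differ in interpolation details; this sketch only conveys the intuition behind the metric.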


Fig. 4 TWIX effectively mitigates explanation bias exhibited by SAIS against surgeons. Reliability of attention-based explanations stratified across
surgeon sub-cohorts when assessing the skill-level of a, needle handling and b, needle driving (see Supplementary Tables 3-6 for number of samples in
each sub-cohort). We do not report caseload for SAH due to insufficient samples from one sub-cohort. Effect of TWIX on the reliability of AI-based
explanations for the disadvantaged surgeon sub-cohort (worst-case AUPRC) when assessing the skill-level of c, needle handling and d, needle driving. AI-
based explanations come in the form of attention placed on frames by SAIS or through the direct estimate of frame importance by TWIX (see Methods).
We do not report caseload for SAH due to insufficient samples from one sub-cohort. Note that SAIS is trained exclusively on data from USC and then
deployed on data from USC, SAH, and HMH. Results are an average across 10 Monte Carlo cross-validation folds, and error bars reflect the 95%
confidence interval.
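The worst-case AUPRC in the panels above is simply the minimum per-sub-cohort alignment score, and the gap to the best sub-cohort quantifies the explanation bias. A sketch with synthetic sub-cohorts (the cohort names, sizes, and noise levels are invented for illustration):

```python
import numpy as np

def average_precision(labels, scores):
    # Mean precision at the rank of each truly important frame.
    order = np.argsort(-scores)
    labels = labels[order]
    hits = np.cumsum(labels)
    ranks = np.arange(1, labels.size + 1)
    return (hits[labels == 1] / ranks[labels == 1]).mean()

rng = np.random.default_rng(1)

# Synthetic frame-importance data for two surgeon sub-cohorts, e.g.
# prostate volume <= 49 ml vs > 49 ml; noisier scores -> lower alignment.
cohorts = {}
for name, noise in [("volume<=49ml", 0.2), ("volume>49ml", 0.8)]:
    labels = rng.integers(0, 2, size=200)
    scores = labels + noise * rng.normal(size=200)
    cohorts[name] = average_precision(labels, scores)

worst_case_auprc = min(cohorts.values())
bias_gap = max(cohorts.values()) - worst_case_auprc   # explanation-bias size
```

Reporting the worst-case score, rather than only the average, is what surfaces a disadvantaged sub-cohort in the first place.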

generate explanations that more closely align with human explanations (Fig. 2, right column). This strategy, which we refer to as training with explanations (TWIX), directly estimates the importance of each video frame (see Methods). We trained SAIS to assess the skill-level of needle handling and needle driving, while adopting TWIX, on data exclusively from USC and deployed it on data from USC, SAH, and HMH. We present the reliability of AI-based explanations, which take on the form of either the attention placed on frames (attention w/ TWIX) or the direct estimate of the importance of frames (TWIX) (Fig. 3).

We found that TWIX improves the reliability of SAIS' attention-based explanations across hospitals. This is evident by the higher AUPRC achieved by attention w/ TWIX. For example, when SAIS assessed the skill-level of needle handling (Fig. 3a), the reliability of attention improved from AUPRC = 0.488 → 0.595 at USC, AUPRC = 0.629 → 0.687 at SAH, and AUPRC = 0.551 → 0.617 at HMH. We did not observe such an improvement when SAIS assessed the skill-level of needle driving (Fig. 3b). One hypothesis for this is that needle driving exhibits a greater degree of variability than needle handling, and therefore assessing its skill level may require the AI system to focus on a more diverse range


of video frames. We also found that TWIX can be more reliable than attention-based explanations, as evident by its relatively higher AUPRC. For example, AUPRC = 0.595 → 0.677 at USC, AUPRC = 0.629 → 0.724 at SAH, and AUPRC = 0.551 → 0.697 at HMH (Fig. 3a). Although TWIX had a minimal benefit on the reliability of explanations when SAIS assessed the skill-level of needle driving (Fig. 3b), it still improved skill assessment performance (see next sections).

TWIX can effectively mitigate explanation bias across hospitals. With TWIX improving the reliability of AI-based explanations on average for all surgeons, we wanted to identify whether these improvements also applied to the surgeon sub-cohort(s) that previously experienced an explanation bias. Such an improvement would translate to a mitigation of the bias. To do so, we quantified the reliability of AI-based explanations for the disadvantaged sub-cohort of surgeons (worst-case AUPRC) after having adopted TWIX when assessing the skill-level of needle handling (Fig. 4c) and needle driving (Fig. 4d).

We found that TWIX effectively mitigates the explanation bias across hospitals. This is evident by the marked increase in the reliability of SAIS' explanations when assessing the skill-level of needle handling (Fig. 4c) for the previously disadvantaged sub-cohort of surgeons. For example, focusing on surgeons operating on prostate glands of different volumes, the worst-case AUPRC ≈ 0.50 → 0.60 at USC, AUPRC ≈ 0.62 → 0.75 at SAH, and AUPRC ≈ 0.64 → 0.80 at HMH. We observed similarly effective bias mitigation for the remaining surgeon groups. On the other hand, we found that TWIX was not as effective in mitigating the explanation bias when SAIS assessed the skill-level of needle driving (Fig. 4d). We believe that this lack of improvement is due to the high degree of variability in needle driving, implying that the importance of frames in one video sample may not transfer to that in another sample.

TWIX often improves AI-based skill assessments across hospitals. Although TWIX was designed to better align AI-based explanations with human explanations, we hypothesized that it might also improve the performance of AI skill assessment systems. The intuition is that by learning to focus on the relevant aspects of the video, SAIS is less likely to latch onto spurious features. To investigate this, we present the performance of SAIS, both with and without TWIX (w/o TWIX), when assessing the skill-level of needle handling and needle driving (Table 2).

Table 2 | TWIX often improves AI-based skill assessments across hospitals.

Skill            Hospital   w/o TWIX       TWIX
needle handling  USC        0.849 (0.06)   0.859 (0.05)
needle handling  SAH        0.873 (0.24)   0.885 (0.02)
needle handling  HMH        0.795 (0.19)   0.794 (0.03)
needle driving   USC        0.822 (0.05)   0.850 (0.04)
needle driving   SAH        0.800 (0.04)   0.837 (0.03)
needle driving   HMH        0.728 (0.05)   0.757 (0.03)

Effect of TWIX on SAIS' ability to assess the skill-level of needle handling and needle driving. Values in bold reflect improvements in performance. Note that SAIS is trained exclusively on data from USC and then deployed on data from USC, SAH, and HMH. Results are an average (standard deviation) across 10 Monte Carlo cross-validation folds.

As expected, we found that TWIX improves AI-based skill assessments across hospitals. This is evident by the higher AUC values with TWIX than without it. For example, when SAIS assessed the skill-level of needle driving (Table 2), it achieved AUC = 0.822 → 0.850 at USC, AUC = 0.800 → 0.837 at SAH, and AUC = 0.728 → 0.757 at HMH. These findings illustrate that TWIX, which was adopted when SAIS was trained on data exclusively from USC, can still have positive ramifications on performance even when SAIS is deployed on data from an entirely different hospital. In the case of needle handling, we continued to observe the benefits of TWIX on performance, albeit more marginally.

Ablation study. Throughout this study, we used the same configuration (e.g., data modalities, problem setup) of SAIS as that presented in the original study [10]. This was motivated by the promising capabilities demonstrated by SAIS and its impending deployment for the provision of feedback. Here, we show how variants of SAIS affect the reliability of explanations (Fig. 5a) and the explanation bias (Fig. 5b), and whether TWIX continues to confer benefits in such settings. Specifically, in addition to training SAIS as per normal (RGB + Flow), we also withheld a data modality known as optical flow (RGB), and performed multi-class skill assessment (low vs. intermediate vs. high) (Multi-Skill). We found that TWIX consistently improves the reliability of explanations and mitigates the explanation bias irrespective of the experimental setting in which it is deployed. For example, in the Multi-Skill setting, which is becoming an increasingly preferred way to assess surgeons, the average AUPRC = 0.48 → 0.67 (Fig. 5a) and the worst-case AUPRC = 0.50 → 0.68 (Fig. 5b). These findings demonstrate the versatility of TWIX.

Providing feedback today in a training environment. Our study builds the foundation for the future implementation of AI-augmented surgical training programs. It is very likely that, in the short run, SAIS will be used to assess the skills of surgical trainees and provide them with feedback on their performance. As with practicing surgeons, it is equally important that such trainees are also not disadvantaged by AI systems. We therefore deployed SAIS on videos from a training environment to assess, and generate explanations for, the skill-level of the needle handling activity performed by participants in control of the same robot otherwise used in live surgical procedures (see Methods) (Fig. 5c-f).

We discovered that our findings from when SAIS was deployed on video samples from live surgical procedures transfer to the training environment. We found that AI-based explanations often align with those provided by human experts, and that TWIX enhances the reliability of these explanations (Fig. 5c). New to this setting, we found that SAIS exhibits an explanation bias against male surgical trainees (Fig. 5d), an analysis typically precluded by the imbalance in the gender demographics of urology surgeons (national average: > 90% male [38]). Consistent with previous findings, we found that TWIX mitigates the explanation bias, as evident by the improvement in the worst-case AUPRC (Fig. 5e), and improves SAIS' skill assessment capabilities, with an improvement in the AUC (Fig. 5f).

Discussion
Surgical AI systems can now reliably assess surgeon skills while simultaneously generating an explanation for their assessments. With such explanations likely to inform the provision of feedback to surgeons, it is critical that they align with the expectations of humans and treat all surgeons fairly. However, it has remained an open question whether AI-based explanations exhibit these characteristics.

In this study, we quantified the reliability of explanations generated by surgical AI systems by comparing them to human explanations, and investigated whether such systems generate different quality explanations for different surgeon sub-cohorts.


Fig. 5 TWIX’ benefits persist across different experimental settings. We present the effect of TWIX, in different experimental settings (ablation studies),
on a, the reliability of explanations generated by SAIS, quantified via the AUPRC, and b, the explanation bias, quantified via improvements in the worst-case
AUPRC (see Supplementary Tables 3-6 for number of samples in each sub-cohort). The default experimental setting is RGB + Flow and was used
throughout this study. Other settings include withholding optical flow from SAIS (RGB) and formulating a multi-class skill assessment task (Multi-Skill).
c–f SAIS can be used today to provide feedback to surgical trainees. c AI-based explanations often align with those provided by human experts. d SAIS
exhibits an explanation bias against male surgical trainees. e TWIX mitigates the explanation bias by improving the reliability of explanations provided to
male surgical trainees and f improves SAIS' performance in assessing the skill-level of needle handling. Note that SAIS is trained exclusively on live data
from USC and then deployed on data from the training environment. Results are shown for all 10 Monte Carlo cross-validation folds.

We showed that while AI-based explanations often align with those generated by humans, they can exhibit a bias against surgeon sub-cohorts. To remedy this, we proposed a strategy, TWIX, which uses human explanations as supervision to explicitly teach an AI system to highlight important video frames. We demonstrated that TWIX can improve the reliability and fairness of AI-based explanations, and the overall performance of AI skill assessment systems.

Our study addresses several open questions in the literature. First, the degree of alignment between AI-based explanations and human explanations, and thus their reliability, has thus far remained unknown for video-based surgical AI systems. In previous work, AI-based explanations are often evaluated based on how effectively they guide a human in identifying the content of an image [39,40] and facilitate the detection of errors committed by a model [41,42]. Second, it has also remained unknown whether AI-based explanations exhibit a bias, where their reliability differs across surgeon sub-cohorts. Although preliminary studies have begun to explore the intersection of bias and explanations [26,27,43], they do not leverage human expert explanations, are limited to non-surgical domains, and do not present findings for video-based AI systems. Third, the development of a strategy that consistently improves the reliability and fairness of explanations has been underexplored. Although previous studies have incorporated human explanations into the algorithmic learning process [25,44], they are primarily limited to the discipline of natural language processing and, importantly, do not demonstrate its effectiveness in improving the fairness of AI-based explanations.

Without first quantifying the reliability and fairness of AI-based explanations, it becomes difficult to evaluate the preparedness of an AI system for the provision of feedback to surgeons. The implications of misguided feedback can be grave, affecting both surgeons and the patients they eventually operate on. From the surgeon's perspective, receiving unreliable feedback can hinder their professional development, unnecessarily lengthen their learning curve, and prevent them from mastering surgical technical skills. These are acutely problematic given that learning curves for certain procedures can span up to 1000 surgeries [45] and that surgeon performance correlates with postoperative patient outcomes [46,47]. Quantifying the discrepancy in the quality of feedback is equally important. A discrepancy, which we referred to as an explanation bias, results in the provision of feedback that is more reliable for one surgeon sub-cohort than another. Given that feedback can accelerate a surgeon's acquisition of skills, an explanation bias can unintentionally widen the disparity in the skill-set of surgeons. Combined, these implications can complicate the ethical integration of AI systems into surgical training and surgeon credentialing programs. Nonetheless, we believe our framework for quantifying and subsequently improving the alignment of AI-based explanations can benefit other disciplines involving assessments and feedback based on videos, such as childhood education [48] and workplace training [49].

There are certain limitations to our study. We have only measured the reliability of explanations and the effectiveness of TWIX on a single type of surgical activity, namely suturing. However, surgeons must often master a suite of technical skills, also including tissue dissection, to proficiently and independently complete an entire surgical procedure. An AI-augmented surgical training program will likely benefit from reliable assessments of distinct surgical activities and corresponding explanations of


those assessments. Furthermore, TWIX requires human-based explanations, which, in the best-case scenario, are difficult and time-consuming to retrieve from experts and, in the worst-case scenario, ambiguous and subjective to provide. Our explanation annotations avoided this latter scenario since they were dependent on a strict set of criteria [28] associated with both visual and motion cues present in the surgical videos. We therefore believe that our approach can be useful in other settings that share these characteristics, where expectations from an AI system can be codified by humans.

We have also made the assumption that AI-based explanations are considered reliable only when they align with human explanations. This interpretation has two potential drawbacks. First, it overlooks the myriad ways in which explanations can be viewed as reliable. For example, they may align with a time period of high blood loss during surgery, which could be consistent with poor postoperative patient outcomes. Evaluating explanations in this way is promising as it would obviate the need for ground-truth human explanations. Instead, the ground-truth importance of video frames can be derived from the context of the surgery (e.g., what and where surgical activity is taking place), which can automatically be decoded by systems like SAIS. Second, constraining AI-based explanations to match human explanations overlooks their promise for the discovery of novel aberrant (or optimal) surgeon behaviour, contributing to the scientific body of knowledge and informing future surgical protocols. Although such discovery is beyond the scope of the present work, it is likely to yield value, for example, when associating intraoperative surgical activity with postoperative patient outcomes.

Several open questions remain unaddressed. First, it remains unknown whether SAIS' explanations accelerate the acquisition of skills by surgical trainees. To investigate this, we plan to conduct a prospective trial amongst medical students in a controlled training environment. Second, despite attempts to define optimal feedback [50,51], in which explanations play an essential role [52], its embodiment remains elusive. In pursuit of that definition, recent frameworks such as the feedback triangle [53] may hold promise, emphasizing the cognitive [54,55], affective, and structural dimensions of feedback. Third, while we have demonstrated that SAIS generates explanations whose reliability differs for different surgeon sub-cohorts, it remains to be seen whether this discrepancy will result in notable harmful consequences. After all, a discrepancy may only translate to an explanation bias if it is unjustified and harmful to surgeons [56].

Surgical training programs continue to adopt the 20th century Halstedian model of "see one, do one, teach one" [57] in reference to learning how to perform surgical procedures. In contrast, AI-augmented surgical training programs can democratize the acquisition of surgical skills on a global scale [58,59] and improve the long-term postoperative outcomes of patients.

Data availability
As the data contain protected health information, the videos of live surgical procedures and the patients' corresponding demographic information from the University of Southern California, St. Antonius Hospital, and Houston Methodist Hospital are not publicly available. However, since the data from the training environment do not involve patients, those videos and annotations are available on Zenodo (https://zenodo.org/record/7221656#.Y-ZIfi_MI2y) upon reasonable request from the authors. Source data for Fig. 1 is in Supplementary Data 1. Source data for Fig. 3 is in Supplementary Data 2. Source data for Fig. 4 is in Supplementary Data 3 and 4. Source data for Fig. 5 is in Supplementary Data 5.

Code availability
While SAIS, the underlying AI system, can be accessed at https://github.com/danikiyasseh/SAIS, the code for the existing study can be found at https://github.com/danikiyasseh/TWIX.

Received: 30 September 2022; Accepted: 17 February 2023;

References
1. Ende, J. Feedback in clinical medical education. JAMA 250, 777–781 (1983).
2. Roberts, K. E., Bell, R. L. & Duffy, A. J. Evolution of surgical skills training. World J. Gastroenterol. 12, 3219 (2006).
3. Karam, M. D. et al. Surgical coaching from head-mounted video in the training of fluoroscopically guided articular fracture surgery. JBJS 97, 1031–1039 (2015).
4. Singh, P., Aggarwal, R., Tahir, M., Pucher, P. H. & Darzi, A. A randomized controlled study to evaluate the role of video-based coaching in training laparoscopic skills. Ann. Surg. 261, 862–869 (2015).
5. Yule, S. et al. Coaching non-technical skills improves surgical residents' performance in a simulated operating room. J. Surg. Educ. 72, 1124–1130 (2015).
6. Bonrath, E. M., Dedy, N. J., Gordon, L. E. & Grantcharov, T. P. Comprehensive surgical coaching enhances surgical skill in the operating room. Ann. Surg. 262, 205–212 (2015).
7. Hu, Y.-Y. et al. Complementing operating room teaching with video-based coaching. JAMA Surg. 152, 318–325 (2017).
8. Gunning, D. et al. XAI - explainable artificial intelligence. Sci. Robot. 4, eaay7120 (2019).
9. Yuan, L. et al. In situ bidirectional human-robot value alignment. Sci. Robot. 7, eabm4183 (2022).
10. Kiyasseh, D. et al. A vision transformer for decoding surgeon activity from surgical videos. Nat. Biomed. Eng. https://doi.org/10.1038/s41551-023-01010-8 (2023).
11. Kiyasseh, D. et al. Human visual explanations mitigate bias in AI-based assessment of surgeon skills. npj Digit. Med. https://doi.org/10.1038/s41746-023-00766-2 (2023).
12. Doshi-Velez, F. & Kim, B. Towards a rigorous science of interpretable machine learning. Stat 1050, 2 (2017).
13. Kim, B. & Doshi-Velez, F. Machine learning techniques for accountability. AI Mag. 42, 47–52 (2021).
14. Cirillo, D. et al. Sex and gender differences and biases in artificial intelligence for biomedicine and healthcare. npj Digit. Med. 3, 1–11 (2020).
15. Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).
16. Wiegreffe, S. & Marasovic, A. Teach me to explain: a review of datasets for explainable natural language processing. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1) (2021).
17. Vig, J. et al. BERTology meets biology: interpreting attention in protein language models. In International Conference on Learning Representations (2020).
18. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
19. Liang, W. et al. Advances, challenges and opportunities in creating data for trustworthy AI. Nat. Mach. Intell. 4, 669–677 (2022).
20. Lundberg, S. M. et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat. Biomed. Eng. 2, 749–760 (2018).
21. Hooker, S., Erhan, D., Kindermans, P.-J. & Kim, B. A benchmark for interpretability methods in deep neural networks. Adv. Neural Inf. Process. Syst. 32, 9734–9745 (2019).
22. Barnett, A. J. et al. A case-based interpretable deep learning model for classification of mass lesions in digital mammography. Nat. Mach. Intell. 3, 1061–1070 (2021).
23. Lauritsen, S. M. et al. Explainable artificial intelligence model to predict acute critical illness from electronic health records. Nat. Commun. 11, 1–11 (2020).
24. Zaidan, O., Eisner, J. & Piatko, C. Using "annotator rationales" to improve machine learning for text categorization. In Conference of the North American Chapter of the Association for Computational Linguistics, 260–267 (2007).
25. DeYoung, J. et al. ERASER: a benchmark to evaluate rationalized NLP models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4443–4458 (2020).
26. Dai, J., Upadhyay, S., Aivodji, U., Bach, S. H. & Lakkaraju, H. Fairness via explanation quality: evaluating disparities in the quality of post hoc explanations. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society, 203–214 (2022).
27. Balagopalan, A. et al. The road to explainability is paved with bias: measuring the fairness of explanations. In 2022 ACM Conference on Fairness, Accountability, and Transparency, 1194–1206 (2022).
28. Haque, T. F. et al. Development and validation of the end-to-end assessment of suturing expertise (EASE). J. Urol. 207, e153 (2022).

COMMUNICATIONS MEDICINE | (2023)3:42 | https://doi.org/10.1038/s43856-023-00263-3 | www.nature.com/commsmed 11


ARTICLE COMMUNICATIONS MEDICINE | https://doi.org/10.1038/s43856-023-00263-3

29. Martinez, C. H. et al. Effect of prostate gland size on the learning curve for 55. Noetel, M. et al. Multimedia design for learning: An overview of reviews with
robot-assisted laparoscopic radical prostatectomy: does size matter initially? J. meta-meta-analysis. Rev. Educ. Res. 92, 413–454 (2021).
Endourol. 24, 261–266 (2010). 56. Barocas, S., Hardt, M. & Narayanan, A.Fairness and Machine Learning
30. Goldstraw, M. et al. Overcoming the challenges of robot-assisted radical (fairmlbook.org, 2019). http://www.fairmlbook.org.
prostatectomy. Prostate Cancer Prostatic Dis. 15, 1–7 (2012). 57. Romero, P. et al. Halsted’s “see one, do one, and teach one” versus peyton’s
31. Hung, A. J. et al. Face, content and construct validity of a novel robotic four-step approach: a randomized trial for training of laparoscopic suturing
surgery simulator. J. Urology. 186, 1019–1025 (2011). and knot tying. J. Surg. Educ. 75, 510–515 (2018).
32. Ross, A. S., Hughes, M. C. & Doshi-Velez, F. Right for the right reasons: 58. Ajao, O. G. & Alao, A. Surgical residency training in developing countries:
training differentiable models by constraining their explanations. In West african college of surgeons as a case study. J. Natl Med. Assoc. 108,
Proceedings of the 26th International Joint Conference on Artificial Intelligence, 173–179 (2016).
2662–2670 (2017). 59. Ng-Kamstra, J. S. et al. Global surgery 2030: a roadmap for high income
33. Hind, M. et al. Ted: Teaching AI to explain its decisions. In Proceedings of the country actors. BMJ Global Health. 1, e000011 (2016).
2019 AAAI/ACM Conference on AI, Ethics, and Society, 123–129 (2019).
34. Kailkhura, B., Gallagher, B., Kim, S., Hiszpanski, A. & Han, T. Reliable and
explainable machine-learning methods for accelerated material discovery. NPJ Acknowledgements
Computational Materials. 5, 1–9 (2019). Research reported in this publication was supported by the National Cancer Institute
35. Rieger, L., Singh, C., Murdoch, W. & Yu, B. Interpretations are useful: under Award No. R01CA251579-01A1.
penalizing explanations to align neural networks with prior knowledge. In
International Conference on Machine Learning, 8116–8126 (PMLR, 2020). Author contributions
36. Lampinen, A. K. et al. Tell me why! explanations support learning relational D.K. contributed to the conception of the study and the study design, developed the deep
and causal structure. In International Conference on Machine Learning, learning models, and wrote the manuscript. J.L. collected the data from the training
11868–11890 (PMLR, 2022). environment. D.K., J.L., T.H., and M.O. provided annotations for the video samples.
37. Jacovi, A., Marasović, A., Miller, T. & Goldberg, Y. Formalizing trust in D.A.D. provided feedback on the manuscript. C.W. collected data from St. Antonius-
artificial intelligence: Prerequisites, causes and goals of human trust in AI. In Hospital and B.J.M. collected data from Houston Methodist Hospital, and provided
Proceedings of the 2021 ACM Conference on Fairness, Accountability, and feedback on the manuscript. A.J.H., and A.A. provided supervision and contributed to
Transparency, 624–635 (2021). edits of the manuscript.
38. Nam, C. S., Daignault-Newton, S., Kraft, K. H. & Herrel, L. A. Projected us
urology workforce per capita, 2020-2060. JAMA Network Open. 4,
e2133864–e2133864 (2021). Competing interests
39. Nguyen, G., Kim, D. & Nguyen, A. The effectiveness of feature attribution The authors declare the following competing interests: D.K. is a paid consultant of
methods and its correlation with automatic evaluation scores. Adv. Neural Flatiron Health and an employee of Vicarious Surgical. C.W. is a paid consultant of
Inform. Process. Sys. 34, 26422–26436 (2021). Intuitive Surgical. A.A. is an employee of Nvidia. A.J.H is a consultant of Intuitive
40. Poursabzi-Sangdeh, F., Goldstein, D. G., Hofman, J. M., Wortman Vaughan, J. Surgical. The remaining authors declare no competing interests.
W. & Wallach, H. Manipulating and measuring model interpretability. In
Proceedings of the 2021 CHI Conference on Human Factors in Computing
Systems, 1–52 (2021).
Additional information
Supplementary information The online version contains supplementary material
41. Adebayo, J., Muelly, M., Liccardi, I. & Kim, B. Debugging tests for model
available at https://doi.org/10.1038/s43856-023-00263-3.
explanations. Adv. Neural Inform. Process. Sys. 33, 700–712 (2020).
42. Adebayo, J., Muelly, M., Abelson, H. & Kim, B. Post hoc explanations may be
Correspondence and requests for materials should be addressed to Dani Kiyasseh or
ineffective for detecting unknown spurious correlation. In International
Andrew J. Hung.
Conference on Learning Representations (2021).
43. Agarwal, C. et al. Openxai: Towards a transparent evaluation of model
Peer review information Communications Medicine thanks Danail Stoyanov, Shlomi
explanations. In Thirty-sixth Conference on Neural Information Processing
Laufer and the other, anonymous, reviewer(s) for their contribution to the peer review of
Systems Datasets and Benchmarks Track (2022).
this work. Peer reviewer reports are available.
44. Zhong, R., Shao, S. & McKeown, K. Fine-grained sentiment analysis with
faithful attention. arXiv preprint arXiv:1908.06870 (2019).
Reprints and permission information is available at http://www.nature.com/reprints
45. Abboudi, H. et al. Learning curves for urological procedures: a systematic
review. BJU Int. 114, 617–629 (2014). Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in
46. Birkmeyer, J. D. et al. Surgical skill and complication rates after bariatric published maps and institutional affiliations.
surgery. New England J. Med. 369, 1434–1442 (2013).
47. Stulberg, J. J. et al. Association between surgeon technical skills and patient
outcomes. JAMA Surgery. 155, 960–968 (2020).
48. Noetel, M. et al. Video improves learning in higher education: A systematic Open Access This article is licensed under a Creative Commons
review. Rev. Educ. Res. 91, 204–236 (2021). Attribution 4.0 International License, which permits use, sharing,
49. Saedon, H., Salleh, S., Balakrishnan, A., Imray, C. H. & Saedon, M. The role of adaptation, distribution and reproduction in any medium or format, as long as you give
feedback in improving the effectiveness of workplace based assessments: a appropriate credit to the original author(s) and the source, provide a link to the Creative
systematic review. BMC Med. Educ. 12, 1–8 (2012). Commons license, and indicate if changes were made. The images or other third party
50. Black, P. & Wiliam, D. Developing the theory of formative assessment. material in this article are included in the article’s Creative Commons license, unless
Educational Assessment, Evaluation and Accountability (formerly: J Personnel indicated otherwise in a credit line to the material. If material is not included in the
Evaluation Educ.) 21, 5–31 (2009). article’s Creative Commons license and your intended use is not permitted by statutory
51. Archer, J. C. State of the science in health professional education: effective regulation or exceeds the permitted use, you will need to obtain permission directly from
feedback. Med. Educ. 44, 101–108 (2010). the copyright holder. To view a copy of this license, visit http://creativecommons.org/
52. Hattie, J. & Timperley, H. The power of feedback. Rev. Educ. Res. 77, 81–112 licenses/by/4.0/.
(2007).
53. Yang, M. & Carless, D. The feedback triangle and the enhancement of dialogic
feedback processes. Teaching Higher Educ. 18, 285–297 (2013). © The Author(s) 2023
54. Farquharson, A., Cresswell, A., Beard, J. & Chan, P. Randomized trial of the
effect of video feedback on the acquisition of surgical skills. J. British Surg.
100, 1448–1453 (2013).

12 COMMUNICATIONS MEDICINE | (2023)3:42 | https://doi.org/10.1038/s43856-023-00263-3 | www.nature.com/commsmed

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy