Reporting guideline for the early stage clinical evaluation of
decision support systems driven by artificial intelligence:
DECIDE-AI
Baptiste Vasey,1,2,3 Myura Nagendran,4 Bruce Campbell,5,6 David A Clifton,2 Gary S Collins,7
Spiros Denaxas,8,9,10,11 Alastair K Denniston,12,13,14 Livia Faes,14 Bart Geerts,15
Mudathir Ibrahim,1,16 Xiaoxuan Liu,12,13 Bilal A Mateen,8,17,18 Piyush Mathur,19
Melissa D McCradden,20,21 Lauren Morgan,22 Johan Ordish,23 Campbell Rogers,24
Suchi Saria,25,26 Daniel S W Ting,27,28 Peter Watkinson,3,29 Wim Weber,30 Peter Wheatstone,31
Peter McCulloch,1 on behalf of the DECIDE-AI expert group
For numbered affiliations see end of the article.
Correspondence to: B Vasey, Nuffield Department of Surgical Sciences, University of Oxford, Oxford OX3 9DU, UK; baptiste.vasey@gmail.com (ORCID 0000-0002-0017-8891)
Additional material is published online only. To view please visit the journal online.
Cite this as: BMJ 2022;377:e070904 http://dx.doi.org/10.1136/bmj-2022-070904
Accepted: 26 April 2022

A growing number of artificial intelligence (AI)-based clinical decision support systems are showing promising performance in preclinical, in silico, evaluation, but few have yet demonstrated real benefit to patient care. Early stage clinical evaluation is important to assess an AI system's actual clinical performance at small scale, ensure its safety, evaluate the human factors surrounding its use, and pave the way to further large scale trials. However, the reporting of these early studies remains inadequate. The present statement provides a multistakeholder, consensus-based reporting guideline for the Developmental and Exploratory Clinical Investigations of DEcision support systems driven by Artificial Intelligence (DECIDE-AI). We conducted a two round, modified Delphi process to collect and analyse expert opinion on the reporting of early clinical evaluation of AI systems. Experts were recruited from 20 predefined stakeholder categories. The final composition and wording of the guideline was determined at a virtual consensus meeting. The checklist and the Explanation & Elaboration (E&E) sections were refined based on feedback from a qualitative evaluation process. 123 experts participated in the first round of Delphi, 138 in the second, 16 in the consensus meeting, and 16 in the qualitative evaluation. The DECIDE-AI reporting guideline comprises 17 AI specific reporting items (made of 28 subitems) and 10 generic reporting items, with an E&E paragraph provided for each. Through consultation and consensus with a range of stakeholders, we have developed a guideline comprising key items that should be reported in early stage clinical studies of AI-based decision support systems in healthcare. By providing an actionable checklist of minimal reporting items, the DECIDE-AI guideline will facilitate the appraisal of these studies and replicability of their findings.
Summary points
• DECIDE-AI is a stage specific reporting guideline for the early, small scale and live clinical evaluation of decision support systems based on artificial intelligence (AI)
• The DECIDE-AI checklist presents 27 items considered as minimum reporting standards. It is the result of a consensus process involving 151 experts from 18 countries and 20 stakeholder groups
• DECIDE-AI aims to improve the reporting around four key aspects of early stage live AI evaluation: proof of clinical utility at small scale, safety, human factors evaluation, and preparation for larger scale summative trials

The prospect of improved clinical outcomes and more efficient health systems has fuelled a rapid rise in the development and evaluation of artificial intelligence (AI) systems over the last decade. Because most AI systems within healthcare are complex interventions designed as clinical decision support systems, rather than autonomous agents, the interactions between the AI systems, their users and the implementation environments are defining components of the AI interventions' overall potential effectiveness. Therefore, bringing AI systems from mathematical performance to clinical utility needs an adapted, stepwise implementation and evaluation pathway, addressing the complexity of this collaboration between two independent forms of intelligence, beyond measures of effectiveness alone.1 Despite indications that some AI-based algorithms now match the accuracy of human experts within preclinical in silico studies,2 there is little high quality evidence for improved clinician performance or patient outcomes in clinical studies.3 4 Reasons proposed for this so called AI-chasm5 are lack of necessary expertise needed for translating a tool into practice, lack of funding available for translation, a general underappreciation of clinical research as a translation mechanism,6 and more specifically a disregard for the potential value of the early stages of clinical evaluation and the analysis of human factors.7

The challenges of early stage clinical AI evaluation (see box 1) are similar to those of complex interventions, as reported by the Medical Research Council dedicated guidance,1 and surgical innovation, as described by the IDEAL Framework.8 9 For example, in all three cases, the evaluation needs to consider the potential for iterative modification of the interventions and the characteristics of the operators (or users) performing them. In this regard, the IDEAL framework offers readily implementable and stage specific recommendations for the evaluation of surgical innovations under development. IDEAL stages 2a/2b, for example, are described as development and exploratory stages, during which the intervention is refined, operators' learning curves analysed, and the influence of patient and operator variability on effectiveness are explored prospectively, prior to large scale efficacy testing.

Box 1: Methodological challenges of the artificial intelligence (AI)-based decision support system evaluation
• The clinical evaluation of AI-based decision support systems presents several methodological challenges, all of which will likely be encountered at early stage. These are the needs to:
◦ account for the complex intervention nature of these systems and evaluate their integration within existing ecosystems
◦ account for user variability and the added biases occurring as a result
◦ consider two collaborating forms of intelligence (human and AI system) and therefore integrate human factors considerations as a core component
◦ consider both physical patients and their data representations
◦ account for the changing nature of the intervention (either due to early prototyping, version updates, or continuous learning design) and to analyse related performance changes
◦ minimise the potential of this technology to embed and reproduce existing health inequality and systemic biases
◦ estimate the generalisability of findings across sites and populations
◦ enable reproducibility of the findings in the context of a dynamic innovation field and intellectual property protection

Early stage clinical evaluation of AI systems should also place a strong emphasis on validation of performance and safety, in a similar manner to phase 1 and 2 pharmaceutical trials, before efficacy evaluation at scale in phase 3. For example, small changes in the distribution of the underlying data between the algorithm training and clinical evaluation populations (so called dataset shift) can lead to significant variation in clinical performance and expose patients to potential unexpected harm.10 11
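To make the idea of dataset shift concrete, the sketch below compares the distribution of a single input variable between the training population and the population encountered during early clinical use. It is a minimal illustration rather than anything from the DECIDE-AI work: the file names and the "age" feature are hypothetical, and a two-sample Kolmogorov-Smirnov test is only one of several possible screening approaches.

```python
# Minimal sketch (not from the DECIDE-AI authors): screening one input feature for
# dataset shift between the algorithm's training population and the patients seen
# during early clinical evaluation. File names and the "age" column are hypothetical.
import pandas as pd
from scipy.stats import ks_2samp

train = pd.read_csv("training_population.csv")    # data the algorithm was trained on
live = pd.read_csv("evaluation_population.csv")   # data collected during early clinical use

stat, p_value = ks_2samp(train["age"].dropna(), live["age"].dropna())
print(f"KS statistic = {stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The feature's distribution differs between training and evaluation data;")
    print("preclinical performance estimates may not hold in this clinical setting.")
```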
Human factors (or ergonomics) evaluations are commonly conducted in safety critical fields such as aviation, the military and energy sectors.12-14 Their assessments evaluate the impact of a device or procedure on their users' physical and cognitive performance, and vice versa. Human factors, such as usability evaluation, are an integral part of the regulatory process for new medical devices15 16 and their application to AI specific challenges is attracting growing attention in the medical literature.17-20 However, few clinical AI studies report on the evaluation of human factors,3 and usability evaluation of related digital health technology is often performed with inconstant methodology and reporting.21

Other areas of suboptimal reporting of clinical AI studies have also recently been highlighted,3 22 such as implementation environment, user characteristics and selection process, training provided, underlying algorithm identification, and disclosure of funding sources. Transparent reporting is necessary for informed study appraisal and reproducibility of study results.
Fig 1 | Comparison of development pathways for drug therapies, artificial intelligence (AI) in healthcare, and surgical innovation. The coloured lines
represent reporting guidelines, some of which are study design specific (TRIPOD-AI, STARD-AI, SPIRIT/CONSORT, SPIRIT/CONSORT-AI), others stage
specific (DECIDE-AI, IDEAL). Depending on the context, more than one study design can be appropriate for each stage. *Only apply to AI in healthcare
Table 1 | Overview of existing and upcoming artificial intelligence (AI) reporting guidelines

Name | Stage | Study design | Comment
TRIPOD-AI | Preclinical development | Prediction model evaluation* | Extension of TRIPOD. Used to report prediction models (diagnostic or prognostic) development, validation and updates. Focuses on model performance
STARD-AI | Preclinical development, offline validation | Diagnostic accuracy studies* | Extension of STARD. Used to report diagnostic accuracy studies, either at development stage or as an offline validation in clinical settings. Focuses on diagnostic accuracy
DECIDE-AI | Early live clinical evaluation* | Various (prospective cohort studies, non-randomised controlled trials, ...)† with additional features such as modification of intervention, analysis of prespecified subgroups, or learning curve analysis | Stand alone guideline. Used to report the early evaluation of AI systems as an intervention in live clinical settings (small scale, formative evaluation), independently of the study design and AI system modality (diagnostic, prognostic, therapeutic). Focuses on clinical utility, safety, and human factors
SPIRIT-AI | Comparative prospective evaluation | Randomised controlled trials (protocol)* | Extension of SPIRIT. Used to report the protocols of randomised controlled trials evaluating AI systems as interventions
CONSORT-AI | Comparative prospective evaluation | Randomised controlled trials* | Extension of CONSORT. Used to report randomised controlled trials evaluating AI systems as interventions (large scale, summative evaluation), independently of the AI system modality (diagnostic, prognostic, therapeutic). Focuses on effectiveness and safety

*Primary target of the guidelines, either a specific stage or a specific study design.
†Although reporting guidelines exist for some of these study designs (eg, STROBE for cohort studies), none of them cover all the core aspects of AI system early stage evaluation and none would fit all possible study designs; DECIDE-AI was therefore developed as a new stand alone reporting guideline for these studies.
In a relatively new and dynamic field such as clinical AI, comprehensive reporting is also key to construct a common and comparable knowledge base to build upon.

Guidelines already exist, or are under development, for the reporting of preclinical, in silico, studies of AI systems, their offline validation, and for their evaluation in large comparative studies23-26; but there is an important stage of research between these, namely studies focussing on the initial clinical use of AI systems, for which no such guidance currently exists (fig 1 and table 1). This early clinical evaluation provides a crucial scoping evaluation of clinical utility, safety, and human factors challenges in live clinical settings. By investigating the potential obstacles to clinical evaluation at scale and informing protocol design, these studies are also important stepping stones toward definitive comparative trials.

To address this gap, we convened an international, multistakeholder group of experts in a Delphi exercise to produce the DECIDE-AI reporting guideline. Focusing on AI systems supporting, rather than replacing, human intelligence, DECIDE-AI aims to improve the reporting of studies describing the evaluation of AI-based decision support systems during their early, small scale implementation in live clinical settings (ie, the supported decisions have an actual impact on patient care). Whereas TRIPOD-AI, STARD-AI, SPIRIT-AI, and CONSORT-AI are specific to particular study designs, DECIDE-AI is focused on the evaluation stage and does not prescribe a fixed study design.

Methods
The DECIDE-AI guideline was developed through an international expert consensus process and in accordance with the EQUATOR Network's recommendations for guideline development.27 A Steering Group was convened to oversee the guideline development process. Its members were selected to cover a broad range of expertise and ensure a seamless integration with other existing guidelines. We conducted a modified Delphi process,28 with two rounds of feedback from participating experts and one virtual consensus meeting. The project was reviewed by the University of Oxford Central University Research Ethics Committee (approval number R73712/RE003) and registered with the EQUATOR Network. Informed consent was obtained from all participants in the Delphi process and consensus meeting.

Initial item list generation
An initial list of candidate items was developed based on expert opinion informed by a systematic literature review focusing on the evaluation of AI-based diagnostic decision support systems,3 an additional literature search about existing guidance for AI evaluation in clinical settings (search strategy available on the Open Science Framework29), literature recommended by Steering Group members,19 22 30-34 and institutional documents.35-38

Expert recruitment
Experts were recruited through five different channels: invitation to experts recommended by the Steering Group, invitation to authors of the publications identified through the initial literature searches, a call to contribute published in a commentary article in a medical journal,7 consideration of any expert contacting the Steering Group of their own initiative, and invitation to experts recommended by the Delphi participants (snowballing). Before starting the recruitment process, 20 target stakeholder groups were defined, namely: administrators/hospital management, allied health professionals, clinicians, engineers/computer scientists, entrepreneurs, epidemiologists, ethicists, funders, human factors specialists, implementation scientists, journal editors, methodologists, patient representatives, payers/commissioners, policy makers/official institution representatives, private sector representatives, psychologists, regulators, statisticians, and trialists.
One hundred and thirty eight experts agreed to participate in the first round of Delphi, of whom 123 (89%) completed the questionnaire (83 identified from Steering Group recommendation, 12 from their publications, 21 contacting the Steering Group of their own initiative, and seven through snowballing). One hundred and sixty two experts were invited to take part in the second round of Delphi, of whom 138 completed the questionnaire (85%). 110 had also completed the first round (continuity rate of 89%)39 and 28 were new participants. The participating experts represented 18 countries and spanned all 20 of the defined stakeholder groups (see supplementary notes 1 and supplementary tables 1 and 2).

Delphi process
The Delphi surveys were designed and distributed via the REDCap web application.40 41 The first round consisted of four open ended questions on aspects viewed by the Delphi participants as necessary to be reported during early stage clinical evaluation. The participating experts were then asked to rate, on a 1 to 9 scale, the importance of items in the initial list proposed by the research team. Ratings of 1 to 3 on the scale were defined as "not important," 4 to 6 as "important but not critical," and 7 to 9 as "important and critical." Participants were also invited to comment on existing items and to suggest new items. An inductive thematic analysis of the narrative answers was performed independently by two reviewers (BV and MN) and conflict was resolved by consensus.42 The themes identified were used to correct any omissions in the initial list and to complement the background information about proposed items. Summary statistics of the item scores were produced for each stakeholder group, by calculating the median score, interquartile range, and the percentage of participants scoring an item 7 or higher, as well as 3 or lower, which were the prespecified inclusion and exclusion cut-offs, respectively. A revised item list was developed based on the results of the first round.
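As an illustration of how such per-item summaries can be computed, the following is a minimal sketch assuming a hypothetical ratings table with one row per participant-item score; it is not the study's own analysis code (which, as noted below, used NVivo and Python).

```python
# Minimal sketch (hypothetical data, not the study's analysis code): per-item Delphi
# score summaries as described above - median, interquartile range, and the percentage
# of participants rating an item 7 or higher (inclusion cut-off) or 3 or lower
# (exclusion cut-off), computed for each stakeholder group.
import pandas as pd

ratings = pd.DataFrame({
    "stakeholder_group": ["clinician", "clinician", "regulator", "regulator"],
    "item": ["intended_use", "safety", "intended_use", "safety"],
    "score": [8, 9, 5, 7],
})

def summarise(scores: pd.Series) -> pd.Series:
    return pd.Series({
        "median": scores.median(),
        "iqr": scores.quantile(0.75) - scores.quantile(0.25),
        "pct_scoring_7_plus": 100 * (scores >= 7).mean(),
        "pct_scoring_3_minus": 100 * (scores <= 3).mean(),
    })

summary = ratings.groupby(["stakeholder_group", "item"])["score"].apply(summarise).unstack()
print(summary)
```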
In the second round, the participants were shown the results of the first round and invited to rate and comment on the items in the revised list. The detailed survey questions of the two rounds of Delphi can be found on the Open Science Framework (OSF).29 All analyses of item scores and comments were performed independently by two members of the research team (BV and MN), using NVivo (QSR International, v1.0) and Python (Python Software Foundation, v3.8.5). Conflicts were resolved by consensus.

The initial item list contained 54 items. 120 sets of responses were included in the analysis of the first round of Delphi (one set of responses was excluded due to a reasonable suspicion of scale inversion, two due to completion after the deadline). The first round yielded 43 986 words of free text answers to the four initial open ended questions, 6419 item scores, 228 comments, and 64 proposals for new items. The thematic analysis identified 109 themes. In the revised list, nine items remained unchanged, 22 were reworded/completed, 21 reorganised (merged/split, becoming 13 items), two items dropped, and nine new items added, for a total of 53 items. The two items dropped were related to health economic assessment. They were the only two items with a median score below 7 (median 6, interquartile range 2-9 for both) and received numerous comments describing them as an entirely separate aspect of evaluation. The revised list was reorganised into items and subitems. 136 sets of answers were included in the analysis of the second round of Delphi (one set of answers was excluded due to lack of consideration for the questions, one due to completion after the deadline). The second round yielded 7101 item scores and 923 comments. The results of the thematic analysis, the initial and revised item lists, as well as per item narrative and graphical summaries of the feedback received in both rounds can be found on OSF.29

Consensus meeting
A virtual consensus meeting was held on three occasions between the 14 and 16 of June 2021, to debate and agree the content and wording of the DECIDE-AI reporting guideline. The 16 members of the Consensus Group (see supplementary notes 1 and supplementary tables 2a and 2b) were selected to ensure a balanced representation of the key stakeholder groups, as well as geographic diversity. All items from the second round of Delphi were discussed and voted on during the consensus meeting. For each item, the results of the Delphi process were presented to the Consensus Group members and a vote was carried out anonymously using the Vevox online application (www.vevox.com). A prespecified cut-off of 80% of the Consensus Group members (excluding blank votes and abstentions) was necessary for an item to be included. To highlight the new, AI specific reporting items, the Consensus Group divided the guidelines into two item lists: an AI specific items list, which represents the main novelty of the DECIDE-AI guideline, and a second list of generic reporting items, which achieved high consensus but are not AI specific and could apply to most types of study. The Consensus Group selected 17 items (made of 28 subitems in total) for inclusion in the AI specific list and 10 items for inclusion in the generic reporting item list. Supplementary table 3 provides a summary of the Consensus Group meeting votes.
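For clarity on how the cut-off operates, here is a small hypothetical helper, not the tooling used at the meeting, showing that blank votes and abstentions are removed from the denominator before the 80% threshold is applied.

```python
# Hypothetical helper (not the meeting's actual tooling): applying the prespecified
# 80% inclusion cut-off, with blank votes and abstentions excluded from the denominator.
def item_included(votes_for: int, votes_against: int, cutoff: float = 0.80) -> bool:
    valid_votes = votes_for + votes_against  # blanks and abstentions already excluded
    return valid_votes > 0 and votes_for / valid_votes >= cutoff

# Example: 13 of 15 valid votes in favour (86.7%) meets the 80% cut-off.
print(item_included(votes_for=13, votes_against=2))  # True
```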
Qualitative evaluation
The drafts of the guideline and of the Explanation and Elaboration (E&E) sections were sent for qualitative evaluation to a group of 16 selected experts with experience in AI system implementation or in the peer reviewing of literature related to AI system evaluation (see supplementary notes 1), all of whom were independent of the Consensus Group. These 16 experts were asked to comment on the clarity and applicability of each AI specific item, using a custom form (available on OSF29). Item wording amendments and modifications to the E&E sections were conducted based on the feedback from the qualitative evaluation, which was independently analysed by two reviewers (BV and MN) and with conflicts resolved by consensus. A glossary of terms (see box 2) was produced to clarify key concepts used in the guideline. The Consensus Group approved the final item lists including any changes made during the qualitative evaluation. Supplementary figures 1 and 2 provide graphical representations of the two item lists' (AI specific and generic) evolution.
Box 2: Glossary of terms
AI system
• Decision support system incorporating AI and consisting of: (i) the artificial intelligence or machine learning algorithm; (ii) the supporting software
platform; and (iii) the supporting hardware platform
AI system version
• Unique reference for the form of the AI system and the state of its components at a single point in time. Allows for tracking changes to the AI system
over time and comparing between different versions
Algorithm
• Mathematical model responsible for learning from data and producing an output
Artificial intelligence (AI)
• “Science of developing computer systems which can perform tasks normally requiring human intelligence”26
Bias
• “Systematic difference in treatment of certain objects, people, or groups in comparison to others”43
Care pathway
• Series of interactions, investigations, decision making and treatments experienced by patients in the course of their contact with a healthcare
system for a defined reason
Clinical
• Relating to the observation and treatment of actual patients rather than in silico or scenario-based simulations
Clinical evaluation
• Set of ongoing activities, analysing clinical data and using scientific methods, to evaluate the clinical performance, effectiveness and/or safety of
an AI system, when used as intended35
Clinical investigation
• Study performed on one or more human subjects to evaluate the clinical performance, effectiveness and/or safety of an AI system.44 This can be
performed in any setting (eg, community, primary care, hospital)
Clinical workflow
• Series of tasks performed by healthcare professionals in the exercise of their clinical duties
Decision support system
• System designed to support human decision making by providing person specific and situation specific information or recommendations, to
improve care or enhance health
Exposure
• State of being in contact with, and having used, an AI system or similar digital technology.
Human-computer interaction
• Bidirectional influence between human users and digital systems through a physical and conceptual interface
Human factors
• Also called ergonomics. “The scientific discipline concerned with the understanding of interactions among humans and other elements of a
system, and the profession that applies theory, principles, data and methods to design in order to optimise human well-being and overall system
performance.” (International Ergonomics Association)
Indication for use
• Situation and reason (medical condition, problem, and patient group) where the AI system should be used
In silico evaluation
• Evaluation performed via computer simulation outside the clinical settings
Intended use
• Use for which an AI system is intended, as stated by its developers, and which serves as the basis for its regulatory classification. The intended use
includes aspects of: the targeted medical condition, patient population, user population, use environment, mode of action
Learning curves
• Graphical plotting of user performance against experience.45 By extension, analysis of the evolution of user performance with a task as exposure to
the task increases. The measure of performance often uses other context specific metrics as a proxy
Live evaluation
• Evaluation under actual clinical conditions, in which the decisions made have a direct impact on patient care. As opposed to “offline” or “shadow
mode” evaluation where the decisions do not have a direct impact on patient care
Machine learning
• “Field of computer science concerned with the development of models/algorithms that can solve specific tasks by learning patterns from data,
rather than by following explicit rules. It is seen as an approach within the field of AI”26
Participant
• Subject of a research study, on which data will be collected and from whom consent is obtained (or waived). The DECIDE-AI guideline considers that
both patients and users can be participants
Patient
• Person (or the digital representation of this person) receiving healthcare attention or using health services, and who is the subject of the decision
made with the support of the AI system. NB: DECIDE-AI uses the term “patient” pragmatically to simplify the reading of the guideline. Strictly
speaking, a person with no health conditions who is the subject of a decision made about them by an AI-based decision support tool to improve
their health and wellbeing or for a preventative purpose is not necessarily a “patient” per se
Patient involvement in research
• Research carried out “with” or “by” patients or members of the public rather than “to”, “about” or “for” them. (Adapted from the INVOLVE definition
of public involvement)
Standard practice
• Usual care currently received by the intended patient population for the targeted medical condition and problem. This may not necessarily be
synonymous with the state-of-the-art practice
Usability
• “Extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified
context of use”46
User
• Person interacting with the AI system to inform their decision making. This person could be a healthcare professional or a patient
The definitions given pertain to the specific context of DECIDE-AI and the use of the terms in the guideline. They are not necessarily generally
accepted definitions and might not always be fully applicable to other areas of research
Recommendations
Reporting item checklist
The DECIDE-AI guideline should be used for the reporting of studies describing the early stage live clinical evaluation of AI-based decision support systems, independently of the study design chosen (fig 1 and table 1). Depending on the chosen study design and if available, authors may also wish to complete the reporting according to a study type specific guideline (eg, STROBE for cohort studies).47 Table 2 presents the DECIDE-AI checklist, comprising the 17 AI specific reporting items and 10 generic reporting items selected by the Consensus Group. Each item comes with an E&E paragraph to explain why and how reporting is recommended (see supplementary appendix 1). A downloadable version of the checklist, designed to help researchers and reviewers check compliance when preparing or reviewing a manuscript, is available as supplementary appendix 2. Reporting guidelines are a set of minimum reporting recommendations and not intended to guide research conduct. Although familiarity with DECIDE-AI might be useful to inform some aspects of the design and conduct of studies within the guideline's scope,48 adherence to the guideline alone should not be interpreted as an indication of methodological quality (which is the realm of methodological guidelines and risk of bias assessment tools). With increasingly complex AI interventions and evaluations, it might become challenging to report all the required information within a single primary manuscript, in which case references to the study protocol, open science repositories, related publications, and supplementary materials are encouraged.

Discussion
The DECIDE-AI guideline is the result of an international consensus process involving a diverse group of experts spanning a wide range of professional background and experience. The level of interest across stakeholder groups and the high response rate amongst the invited experts speak to the perceived need for more guidance in the reporting of studies presenting the development and evaluation of clinical AI systems, and to the growing value placed on comprehensive clinical evaluation to guide implementation. The emphasis placed on the role of human-in-the-loop decision making was guided by the Steering Group's belief that AI will, at least in the foreseeable future, augment rather than replace human intelligence in clinical settings. In this context, thorough evaluation of the human-computer interaction and the roles played by the human users will be key to realising the full potential of AI.

The DECIDE-AI guideline is the first stage specific AI reporting guideline to be developed. This stage specific approach echoes recognised development pathways for complex interventions,1 8 9 49 and aligns conceptually with proposed frameworks for clinical AI,6 50 51 52 although no commonly agreed nomenclature or definition has so far been published for the stages of evaluation in this field. Given the current state of clinical AI evaluation, and the apparent deficit in reporting guidance for the early clinical stage, the DECIDE-AI Steering Group considered it important to crystallise current expert opinion into a consensus, to help improve reporting of these studies. Besides this primary objective, the DECIDE-AI guideline will hopefully also support authors during study design, protocol drafting, and study registration, by providing them with clear criteria around which to plan their work.
Table 2 | DECIDE-AI checklist
Items 1-17 are AI specific reporting items; items I-X are generic reporting items. For each item, authors should also state the page on which it is reported.

Item No | Theme | Recommendation

Title and abstract
1 | Title | Identify the study as an early clinical evaluation of a decision support system based on AI or machine learning, specifying the problem addressed
I | Abstract | Provide a structured summary of the study. Consider including: intended use of the AI system, type of underlying algorithm, study setting, number of patients and users included, primary and secondary outcomes, key safety endpoints, human factors evaluated, main results, conclusions

Introduction
2 | Intended use | a) Describe the targeted medical condition(s) and problem(s), including the current standard practice, and the intended patient population(s). b) Describe the intended users of the AI system, its planned integration in the care pathway, and the potential impact, including patient outcomes, it is intended to have
II | Objectives | State the study objectives

Methods
III | Research governance | Provide a reference to any study protocol, study registration number, and ethics approval
3 | Participants | a) Describe how patients were recruited, stating the inclusion and exclusion criteria at both patient and data level, and how the number of recruited patients was decided. b) Describe how users were recruited, stating the inclusion and exclusion criteria, and how the intended number of recruited users was decided. c) Describe steps taken to familiarise the users with the AI system, including any training received prior to the study
4 | AI system | a) Briefly describe the AI system, specifying its version and type of underlying algorithm used. Describe, or provide a direct reference to, the characteristics of the patient population on which the algorithm was trained and its performance in preclinical development/validation studies. b) Identify the data used as inputs. Describe how the data were acquired, the process needed to enter the input data, the pre-processing applied, and how missing/low-quality data were handled. c) Describe the AI system outputs and how they were presented to the users (an image may be useful)
5 | Implementation | a) Describe the settings in which the AI system was evaluated. b) Describe the clinical workflow/care pathway in which the AI system was evaluated, the timing of its use, and how the final supported decision was reached and by whom
IV | Outcomes | Specify the primary and secondary outcomes measured
6 | Safety and errors | a) Provide a description of how significant errors/malfunctions were defined and identified. b) Describe how any risks to patient safety or instances of harm were identified, analysed, and minimised
7 | Human factors | Describe the human factors tools, methods or frameworks used, the use cases considered, and the users involved
V | Analysis | Describe the statistical methods by which the primary and secondary outcomes were analysed, as well as any prespecified additional analyses, including subgroup analyses and their rationale
8 | Ethics | Describe whether specific methodologies were utilised to fulfil an ethics-related goal (such as algorithmic fairness) and their rationale
VI | Patient involvement | State how patients were involved in any aspect of: the development of the research question, the study design, and the conduct of the study

Results
9 | Participants | a) Describe the baseline characteristics of the patients included in the study, and report on input data missingness. b) Describe the baseline characteristics of the users included in the study
10 | Implementation | a) Report on the user exposure to the AI system, on the number of instances the AI system was used, and on the users' adherence to the intended implementation. b) Report any significant changes to the clinical workflow or care pathway caused by the AI system
VII | Main results | Report on the prespecified outcomes, including outcomes for any comparison group if applicable
VIII | Subgroups analysis | Report on the differences in the main outcomes according to the prespecified subgroups
11 | Modifications | Report any changes made to the AI system or its hardware platform during the study. Report the timing of these modifications, the rationale for each, and any changes in outcomes observed after each of them
12 | Human-computer agreement | Report on the user agreement with the AI system. Describe any instances of and reasons for user variation from the AI system's recommendations and, if applicable, users changing their mind based on the AI system's recommendations
13 | Safety and errors | a) List any significant errors/malfunctions related to: AI system recommendations, supporting software/hardware, or users. Include details of: (i) rate of occurrence, (ii) apparent causes, (iii) whether they could be corrected, and (iv) any significant potential impacts on patient care. b) Report on any risks to patient safety or observed instances of harm (including indirect harm) identified during the study
14 | Human factors | a) Report on the usability evaluation, according to recognised standards or frameworks. b) Report on the user learning curves evaluation

Discussion
15 | Support for intended use | Discuss whether the results obtained support the intended use of the AI system in clinical settings
16 | Safety and errors | Discuss what the results indicate about the safety profile of the AI system. Discuss any observed errors/malfunctions and instances of harm, their implications for patient care, and whether/how they can be mitigated
IX | Strengths and limitations | Discuss the strengths and limitations of the study

Statements
17 | Data availability | Disclose if and how data and relevant code are available
X | Conflicts of interest | Disclose any relevant conflicts of interest, including the source of funding for the study, the role of funders, any other roles played by commercial companies, and personal conflicts of interest for each author

AI=artificial intelligence. AI specific items are numbered in Arabic numerals, generic items in Roman numerals.
As with other reporting guidelines, it is important to note that the overall impact on the standard of reporting will need to be assessed in due course, once the wider community has had a chance to use the checklist and explanatory documents, which is likely to prompt modification and fine tuning of the DECIDE-AI guideline, based on its real world use. While the outcome of this process cannot be prejudged, there is evidence that the adoption of consensus-based reporting guidelines (such as CONSORT) does indeed improve the standard of reporting.53

The Steering Group paid special attention to the integration of DECIDE-AI within the broader scheme of AI guidelines (eg, TRIPOD-AI, STARD-AI, SPIRIT-AI, and CONSORT-AI). It also focussed on DECIDE-AI being applicable to all types of decision support modalities (ie, detection, diagnostic, prognostic, and therapeutic). The final checklist items should be considered minimum scientific reporting standards; they do not preclude reporting additional information, nor are they a substitute for other regulatory reporting or approval requirements. The overlap between scientific evaluation and regulatory processes was a core consideration during the development of the DECIDE-AI guideline. Early stage scientific studies can be used to inform regulatory decisions (eg, based on the stated intended use within the study), and are part of the clinical evidence generation process (eg, clinical investigations). The initial item list was aligned with information commonly required by regulatory agencies and regulatory considerations are introduced in the E&E paragraphs. However, given the somewhat different focuses of scientific evaluation and regulatory assessment,54 as well as differences between regulatory jurisdictions, it was decided to make no reference to specific regulatory processes in the guideline, nor to define the scope of DECIDE-AI within any particular regulatory framework. The primary focus of DECIDE-AI is scientific evaluation and reporting, for which regulatory documents often provide little guidance.

Several topics led to more intense discussion than others, both during the Delphi process and Consensus Group discussion. Regardless of whether the corresponding items were included or not, these represent important issues that the AI and healthcare communities should consider and continue to debate.

Firstly, we discussed at length whether users (see glossary of terms) should be considered as study participants. The consensus reached was that users are a key study population, about whom data will be collected (eg, reasons for variation from the AI system recommendation, user satisfaction, etc), who might logically be consented as study participants, and therefore should be considered as such. Because user characteristics (eg, experience) can affect intervention efficacy, both patient and user variability should be considered when evaluating AI systems, and reported adequately.

Secondly, the relevance of comparator groups in early stage clinical evaluation was considered. Most studies retrieved in the literature search described a comparator group (commonly the same group of clinicians without AI support). Such comparators can provide useful information for the design of future large scale trials (eg, information on the potential effect size). However, comparator groups are often unnecessary at this early stage of clinical evaluation, when the focus is on issues other than comparative efficacy. Small scale clinical investigations are also usually underpowered to make statistically significant conclusions about efficacy, accounting for both patient and user variability. Moreover, the additional information gained from comparator groups in this context can often be inferred from other sources, like previous data on unassisted standard of care in the case of the expected effect size. Comparison groups are therefore mentioned in item VII but considered optional.

Thirdly, output interpretability is often described as important to increase user and patient trust in the AI system, to contextualise the system's outputs within the broader clinical information environment,19 and potentially for regulatory purposes.55 However, some experts argued that an output's clinical value may be independent of its interpretability, and that the practical relevance of evaluating interpretability is still debatable.56 57 Furthermore, there is currently no generally accepted way of quantifying or evaluating interpretability. For this reason, the Consensus Group decided not to include an item on interpretability at the current time.

Fourthly, the notion of users' trust in the AI system, and its evolution with time, was discussed. As users accumulate experience with, and receive feedback from, the real world use of AI systems, they will adapt their level of trust in its recommendations. Whether appropriate or not, this level of trust will influence, as recently demonstrated by McIntosh et al,58 how much impact the systems have on the final decision making and therefore influence the overall clinical performance of the AI system. Understanding how trust evolves is essential for planning user training and determining the optimal timepoints at which to start data collection in comparative trials. However, as for interpretability, there is currently no commonly accepted way to measure trust in the context of clinical AI. For this reason, the item about user trust in the AI system was not included in the final guideline. The fact that interpretability and trust were not included highlights the tendency of consensus-based guideline development towards conservatism, because only widely agreed upon concepts reach the level of consensus needed for inclusion. However, changes of focus in the field as well as new methodological developments can be integrated into subsequent guideline iterations. From this perspective, the issues of interpretability and trust are far from irrelevant to future AI evaluations, and their exclusion from the current guideline reflects less a lack of interest than a need for further research into how we can best operationalise these metrics for the purposes of evaluation in AI systems.
Fifthly, the notion of modifying the AI system (the intervention) during the evaluation received mixed opinions. During comparative trials, changes made to the intervention during data collection are questionable unless the changes are part of the study protocol; some authors even consider them as impermissible, on the basis that they would make valid interpretation of study results difficult or impossible. However, the objectives of early clinical evaluation are often not to make definitive conclusions on effectiveness. Iterative design evaluation cycles, if performed safely and reported transparently, offer opportunities to tailor an intervention to its users and beneficiaries, and augment chances of adoption of an optimised, fixed version during later summative evaluation.8 9 59 60

Sixthly, several experts noted the benefit of conducting human factors evaluation prior to clinical implementation and considered that human factors should therefore be reported separately. However, even robust preclinical human factors evaluation will not reliably characterise all the potential human factors issues which might arise during the use of an AI system in a live clinical environment, warranting a continued human factors evaluation at the early stage of clinical implementation. The Consensus Group agreed that human factors play a fundamental role in AI system adoption in clinical settings at scale and that the full appraisal of an AI system's clinical utility can only happen in the context of its clinical human factors evaluation.

Finally, several experts raised concerns that the DECIDE-AI guideline prescribes an evaluation too exhaustive to be reported within a single manuscript. The Consensus Group acknowledged the breadth of topics covered and the practical implications. However, reporting guidelines aim to promote transparent reporting of studies, rather than mandating that every aspect covered by an item must have been evaluated within the studies. For example, if a learning curves evaluation has not been performed, then fulfilment of item 14b would be to simply state that this was not done, with an accompanying rationale. The Consensus Group agreed that appropriate AI evaluation is a complex endeavour necessitating the interpretation of a wide range of data, which should be presented together as far as possible. It was also felt that thorough evaluation of AI systems should not be limited by a word count and that publications reporting on such systems might benefit from special formatting requirements in the future. The information required by several items might already be reported in previous studies or in the study protocol, which could be cited, rather than described in full again. The use of references, online supplementary materials, and open access repositories (eg, OSF) is recommended to allow the sharing and connecting of all required information within one main published evaluation report.

There are several limitations to our work which should be considered. Firstly, there is the issue of potential biases, which apply to any consensus process; these include anchoring or participant selection biases.61 The research team tried to mitigate bias through the survey design, using open ended questions analysed through a thematic analysis, and by adapting the expert recruitment process, but it is unlikely that it was eliminated entirely. Despite an aim for geographical diversity and several actions taken to foster it, representation was skewed towards Europe and more specifically the United Kingdom. This could be explained in part by the following factors: a likely selection bias in the Steering Group's expert recommendations, a higher interest in our open invitation to contribute coming from European/UK scientists (25 out of 30 experts approaching us, 83%), and a lack of control over the response rate and self-reported geographical location of participating experts. Considerable attention was also paid to diversity and balance between stakeholder groups, even though clinicians and engineers were the most represented, partly due to the profile of researchers who contacted us spontaneously after the public announcement of the project. Stakeholder group analyses were performed to identify any marked disagreements from underrepresented groups. Finally, as also noted by the authors of the SPIRIT-AI and CONSORT-AI guidelines,25 26 few examples of studies reporting on the early stage clinical evaluation of AI systems were available at the time we started developing the DECIDE-AI guideline. This might have impacted the exhaustiveness of the initial item list created from the literature review. However, the wide range of stakeholders involved and the design of the first round of Delphi allowed identification of several additional candidate items, which were added in the second iteration of the item list.

The introduction of AI into healthcare needs to be supported by sound, robust and comprehensive evidence generation and reporting. This is essential both to ensure the safety and efficacy of AI systems, and to gain the trust of patients, practitioners, and purchasers, so that this technology can realise its full potential to improve patient care. The DECIDE-AI guideline aims to improve the reporting of early stage live clinical evaluation of AI systems, which lays the foundations for both larger clinical studies and later widespread adoption.

Author affiliations
1 Nuffield Department of Surgical Sciences, University of Oxford, Oxford, UK
2 Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford, UK
3 Critical Care Research Group, Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, UK
4 UKRI Centre for Doctoral Training in AI for Healthcare, Imperial College London, London, UK
5 University of Exeter Medical School, Exeter, UK
6 Royal Devon and Exeter Hospital, Exeter, UK
7 Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford, UK
8 Institute of Health Informatics, University College London, London, UK
9
British Heart Foundation Data Science Centre, London, UK Beddoe (Sheffield Teaching Hospital), Nicole Bilbro (Maimonides
10
Health Data Research UK, London, UK Medical Center), Neale Marlow (Oxford University Hospitals), Elliott
11 Taylor (Nuffield Department of Surgical Science, University of
UCL Hospitals Biomedical Research Centre, London, UK Oxford), and Stephan Ursprung (Department for Radiology, Tübingen
12
University Hospitals Birmingham NHS Foundation Trust, University Hospital) for their support in the initial stage of the project.
Birmingham, UK The views expressed in this guideline are those of the authors,
13
Academic Unit of Ophthalmology, Institute of Inflammation and Delphi participants, and experts who participated in the qualitative
Ageing, College of Medical and Dental Sciences, University of evaluation of the guidelines. These views do not necessarily reflect
Birmingham, Birmingham, UK those of their institutions or funders.
14
Moorfields Eye Hospital NHS Foundation Trust, London, UK Contributors: BV, MN, and PMcC designed the study. BV and MI
15 conducted the literature searches. Members of the DECIDE-AI Steering
Healthplus.ai-R&D, Amsterdam, Netherlands Group (BV, DC, GSC, AKD, LF, BG, XL, PMa, LM, SS, PWa, PMc) provided
16
Department of Surgery, Maimonides Medical Center, New York, methodological input and oversaw the conduct of the study. BV and
NY, USA MN conducted the thematic analysis, Delphi rounds analysis, and
17
Wellcome Trust, London, UK produced the Delphi round summaries. Members of the DECIDE-AI
18 Consensus Group (BV, GSC, SP, BG, XL, BAM, PM, MM, LM, JO, CR, SS,
Alan Turing Institute, London, UK DSWT, WW, PWh, PMc) selected the final content and wording of the
19
Department of General Anesthesiology, Anesthesiology Institute, guideline. BC chaired the consensus meeting. BV, MN, and BC drafted
Cleveland Clinic, Cleveland, OH, USA the final manuscript and E&E sections. All authors reviewed and
20
Hospital for Sick Children, Toronto, ON, Canada commented on the final manuscript and E&E sections. All members
21 of the DECIDE-AI expert group collaborated in the development of
Dalla Lana School of Public Health, University of Toronto, Toronto,
the DECIDE-AI guidelines by participating in the Delphi process, the
ON, Canada
qualitative evaluation of the guidelines, or both. The corresponding
22
Morgan Human Systems, Shrewsbury, UK author attests that all listed authors meet authorship criteria and
23
The Medicines and Healthcare products Regulatory Agency, that no others meeting the criteria have been omitted. PMcC is the
London, UK guarantor of this work.
24
HeartFlow, Redwood City, CA, USA Funding: This work was supported by the IDEAL Collaboration. BV
25 is funded by a Berrow Foundation Lord Florey scholarship. MN is
Departments of Computer Science, Statistics, and Health Policy, and
Division of Informatics, Johns Hopkins University, Baltimore, MD, USA
26 Bayesian Health, New York, NY, USA
27 Singapore National Eye Center, Singapore Eye Research Institute, Singapore
28 Duke-NUS Medical School, National University of Singapore, Singapore
29 NIHR Biomedical Research Centre Oxford, Oxford University Hospitals NHS Trust, Oxford, UK
30 The BMJ, London, UK
31 School of Medicine, University of Leeds, Leeds, UK

This article is being simultaneously published in May 2022 by The BMJ and Nature Medicine.

Members of the DECIDE-AI expert group (also see supplementary file for a full list, including affiliations) are as follows: Aaron Y Lee, Alan G Fraser, Ali Connell, Alykhan Vira, Andre Esteva, Andrew D Althouse, Andrew L Beam, Anne de Hond, Anne-Laure Boulesteix, Anthony Bradlow, Ari Ercole, Arsenio Paez, Athanasios Tsanas, Barry Kirby, Ben Glocker, Carmelo Velardo, Chang Min Park, Charisma Hehakaya, Chris Baber, Chris Paton, Christian Johner, Christopher J Kelly, Christopher J Vincent, Christopher Yau, Clare McGenity, Constantine Gatsonis, Corinne Faivre-Finn, Crispin Simon, Danielle Sent, Danilo Bzdok, Darren Treanor, David C Wong, David F Steiner, David Higgins, Dawn Benson, Declan P O’Regan, Dinesh V Gunasekaran, Dominic Danks, Emanuele Neri, Evangelia Kyrimi, Falk Schwendicke, Farah Magrabi, Frances Ives, Frank E Rademakers, George E Fowler, Giuseppe Frau, H D Jeffry Hogg, Hani J Marcus, Heang-Ping Chan, Henry Xiang, Hugh F McIntyre, Hugh Harvey, Hyungjin Kim, Ibrahim Habli, James C Fackler, James Shaw, Janet Higham, Jared M Wohlgemut, Jaron Chong, Jean-Emmanuel Bibault, Jérémie F Cohen, Jesper Kers, Jessica Morley, Joachim Krois, Joao Monteiro, Joel Horovitz, John Fletcher, Jonathan Taylor, Jung Hyun Yoon, Karandeep Singh, Karel G M Moons, Kassandra Karpathakis, Ken Catchpole, Kerenza Hood, Konstantinos Balaskas, Konstantinos Kamnitsas, Laura Militello, Laure Wynants, Lauren Oakden-Rayner, Laurence B Lovat, Luc J M Smits, Ludwig C Hinske, M Khair ElZarrad, Maarten van Smeden, Mara Giavina-Bianchi, Mark Daley, Mark P Sendak, Mark Sujan, Maroeska Rovers, Matthew DeCamp, Matthew Woodward, Matthieu Komorowski, Max Marsden, Maxine Mackintosh, Michael D Abramoff, Miguel Ángel Armengol de la Hoz, Neale Hambidge, Neil Daly, Niels Peek, Oliver Redfern, Omer F Ahmad, Patrick M Bossuyt, Pearse A Keane, Pedro N P Ferreira, Petra Schnell-Inderst, Pietro Mascagni, Prokar Dasgupta, Pujun Guan, Rachel Barnett, Rawen Kader, Reena Chopra, Ritse M Mann, Rupa Sarkar, Saana M Mäenpää, Samuel G Finlayson, Sarah Vollam, Sebastian J Vollmer, Seong Ho Park, Shakir Laher, Shalmali Joshi, Siri L van der Meijden, Susan C Shelmerdine, Tien-En Tan, Tom JW Stocker, Valentina Giannini, Vince I Madai, Virginia Newcombe, Wei Yan Ng, Wendy A Rogers, William Ogallo, Yoonyoung Park, Zane B Perkins

We thank all Delphi participants and experts who participated in the guideline qualitative evaluation. BV would also like to thank Benjamin

supported by the UKRI CDT in AI for Healthcare (http://ai4health.io - grant No P/S023283/1). DC receives funding from Wellcome Trust, AstraZeneca, RCUK and GlaxoSmithKline. GSC is supported by the NIHR Biomedical Research Centre, Oxford, and Cancer Research UK (programme grant: C49297/A27294). MI is supported by a Maimonides Medical Center research fellowship. XL receives funding from the Wellcome Trust, the National Institute for Health Research/NHSX/Health Foundation, the Alan Turing Institute, the MHRA, and NICE. BAM is a fellow of The Alan Turing Institute supported by EPSRC grant EP/N510129/, and holds a Wellcome Trust funded honorary post at University College London for the purposes of carrying out independent research. MM receives funding from the Dalla Lana School of Public Health and the Leong Centre for Healthy Children. JO is employed by the Medicines and Healthcare products Regulatory Agency, the competent authority responsible for regulating medical devices and medicines within the UK. Elements of the work relating to the regulation of AI as a medical device are funded via grants from NHSX and the Regulators’ Pioneer Fund (Department for Business, Energy, and Industrial Strategy). SS receives grants from the National Science Foundation, the American Heart Association, the National Institutes of Health, and the Sloan Foundation. DSWT is supported by the National Medical Research Council, Singapore (NMRC/HSRG/0087/2018; MOH-000655-00), the National Health Innovation Centre, Singapore (NHIC-COV19-2005017), SingHealth Fund Limited Foundation (SHF/HSR113/2017), Duke-NUS Medical School (Duke-NUS/RSF/2021/0018; 05/FY2020/EX/15-A58), and the Agency for Science, Technology and Research (A20H4g2141; H20C6a0032). PWa is supported by the NIHR Biomedical Research Centre, Oxford, and holds grants from the NIHR and Wellcome. PMc receives grants from Medtronic (unrestricted educational grant to Oxford University for the IDEAL Collaboration) and the Oxford Biomedical Research Centre. The funders had no role in the study design or in the collection, analysis, or interpretation of data, the writing of the report, or the decision to submit the article for publication.

Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf and declare: MN consults for Cera Care, a technology-enabled homecare provider. BC was a Non-Executive Director of the UK Medicines and Healthcare products Regulatory Agency (MHRA) from September 2015 until 31 August 2021. DC receives consulting fees from Oxford University Innovation, Biobeats, and Sensyne Health, and has an advisory role with Bristol Myers Squibb. BG has received consultancy and research grants from Philips NV and Edwards Lifesciences, and is owner and board member of Healthplus.ai BV and its subsidiaries. XL has advisory roles with the National Screening Committee UK, the WHO/ITU focus group for AI in health, and the AI in Health and Care Award Evaluation Advisory Group (NHSX, AAC). PMa is the cofounder of BrainX and BrainX Community. MM reports consulting fees from AMS Healthcare, and honorariums from the Osgoode Law School and Toronto Pain Institute. LM is director and owner of Morgan Human Systems. JO holds an honorary post as an Associate of Hughes Hall, University of Cambridge. CR is an employee of HeartFlow, including
salary and equity. SS has received honorariums from several universities and pharmaceutical companies for talks on digital health and AI. SS has advisory roles in Child Health Imprints, Duality Tech, Halcyon Health, and Bayesian Health. SS is on the board of Bayesian Health. This arrangement has been reviewed and approved by Johns Hopkins in accordance with its conflict-of-interest policies. DSWT holds patents linked to AI driven technologies and is a co-founder and equity holder of EyRIS. PWa declares grants, consulting fees, and stocks from Sensyne Health and holds patents linked to AI driven technologies. PMc has an advisory role for WEISS International and the technology incubator PhD programme at University College London. BV, GSC, AKD, LF, MI, BAM, SD, PWh, and WW declare no financial relationships with any organisations that might have an interest in the submitted work in the previous three years and no other relationships or activities that could appear to have influenced the submitted work.

Data sharing: All data generated during this study (pseudonymised where necessary) are available upon justified request to the research team for a duration of three years after publication of this manuscript. Translation of this guideline into different languages is welcomed and encouraged, as long as the authors of the original publication are included in the process and resulting publication. All code produced for data analysis during this study is available upon justified request to the research team for a duration of three years after publication of this manuscript.

Patient and public involvement: Patient representatives were invited to the Delphi process and participated in it. A patient representative (PWh) was a member of the Consensus Group, participated in the selection of the reporting items, edited their wording, and reviewed the related E&E paragraphs.

Provenance and peer review: Not commissioned; externally peer reviewed.

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

1 Skivington K, Matthews L, Simpson SA, et al. A new framework for developing and evaluating complex interventions: update of Medical Research Council guidance. BMJ 2021;374:n2061. doi:10.1136/bmj.n2061
2 Liu X, Faes L, Kale AU, et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit Health 2019;1:e271-97. doi:10.1016/S2589-7500(19)30123-2
3 Vasey B, Ursprung S, Beddoe B, et al. Association of Clinician Diagnostic Performance With Machine Learning-Based Decision Support Systems: A Systematic Review. JAMA Netw Open 2021;4:e211276. doi:10.1001/jamanetworkopen.2021.1276
4 Freeman K, Geppert J, Stinton C, et al. Use of artificial intelligence for image analysis in breast cancer screening programmes: systematic review of test accuracy. BMJ 2021;374:n1872. doi:10.1136/bmj.n1872
5 Keane PA, Topol EJ. With an eye to AI and autonomous diagnosis. NPJ Digit Med 2018;1:40.
6 McCradden MD, Stephenson EA, Anderson JA. Clinical research underlies ethical integration of healthcare artificial intelligence. Nat Med 2020;26:1325-6. doi:10.1038/s41591-020-1035-9
7 Vasey B, Clifton DA, Collins GS, et al, DECIDE-AI Steering Group. DECIDE-AI: new reporting guidelines to bridge the development-to-implementation gap in clinical artificial intelligence. Nat Med 2021;27:186-7. doi:10.1038/s41591-021-01229-5
8 McCulloch P, Altman DG, Campbell WB, et al, Balliol Collaboration. No surgical innovation without evaluation: the IDEAL recommendations. Lancet 2009;374:1105-12. doi:10.1016/S0140-6736(09)61116-8
9 Hirst A, Philippou Y, Blazeby J, et al. No Surgical Innovation Without Evaluation: Evolution and Further Development of the IDEAL Framework and Recommendations. Ann Surg 2019;269:211-20. doi:10.1097/SLA.0000000000002794
10 Finlayson SG, Subbaswamy A, Singh K, et al. The Clinician and Dataset Shift in Artificial Intelligence. N Engl J Med 2021;385:283-6. doi:10.1056/NEJMc2104626
11 Subbaswamy A, Saria S. From development to deployment: dataset shift, causality, and shift-stable models in health AI. Biostatistics 2020;21:345-52. doi:10.1093/biostatistics/kxz041
12 Kapur N, Parand A, Soukup T, Reader T, Sevdalis N. Aviation and healthcare: a comparative review with implications for patient safety. JRSM Open 2015;7:2054270415616548.
13 Corbridge C, Anthony M, McNeish D, Shaw G. A New UK Defence Standard For Human Factors Integration (HFI). Proc Hum Factors Ergon Soc Annu Meet 2016;60:1736-40. doi:10.1177/1541931213601398
14 Stanton NA, Salmon P, Jenkins D, Walker G. Human factors in the design and evaluation of central control room operations. CRC Press, 2009. doi:10.1201/9781439809921
15 US Food and Drug Administration. Applying Human Factors and Usability Engineering to Medical Devices - Guidance for Industry and Food and Drug Administration Staff. 2016.
16 Medicines & Healthcare products Regulatory Agency (MHRA). Guidance on applying human factors and usability engineering to medical devices including drug-device combination products in Great Britain. 2021.
17 Asan O, Choudhury A. Research Trends in Artificial Intelligence Applications in Human Factors Health Care: Mapping Review. JMIR Hum Factors 2021;8:e28236. doi:10.2196/28236
18 Felmingham CM, Adler NR, Ge Z, Morton RL, Janda M, Mar VJ. The Importance of Incorporating Human Factors in the Design and Implementation of Artificial Intelligence for Skin Cancer Diagnosis in the Real World. Am J Clin Dermatol 2021;22:233-42. doi:10.1007/s40257-020-00574-4
19 Sujan M, Furniss D, Grundy K, et al. Human factors challenges for the safe use of artificial intelligence in patient care. BMJ Health Care Inform 2019;26:e100081. doi:10.1136/bmjhci-2019-100081
20 Sujan M, Baber C, Salmon P, Pool R, Chozos N. Human Factors and Ergonomics in Healthcare AI. Chartered Institute of Ergonomics & Human Factors, 2021. https://ergonomics.org.uk/resource/human-factors-in-healthcare-ai.html
21 Wronikowska MW, Malycha J, Morgan LJ, et al. Systematic review of applied usability metrics within usability evaluation methods for hospital electronic healthcare record systems: Metrics and Evaluation Methods for eHealth Systems. J Eval Clin Pract 2021;27:1403-16. doi:10.1111/jep.13582
22 Nagendran M, Chen Y, Lovejoy CA, et al. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ 2020;368:m689. doi:10.1136/bmj.m689
23 Collins GS, Moons KGM. Reporting of artificial intelligence prediction models. Lancet 2019;393:1577-9. doi:10.1016/S0140-6736(19)30037-6
24 Sounderajah V, Ashrafian H, Aggarwal R, et al. Developing specific reporting guidelines for diagnostic accuracy studies assessing AI interventions: The STARD-AI Steering Group. Nat Med 2020;26:807-8. doi:10.1038/s41591-020-0941-1
25 Cruz Rivera S, Liu X, Chan AW, Denniston AK, Calvert MJ, SPIRIT-AI and CONSORT-AI Working Group, SPIRIT-AI and CONSORT-AI Steering Group, SPIRIT-AI and CONSORT-AI Consensus Group. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Nat Med 2020;26:1351-63. doi:10.1038/s41591-020-1037-7
26 Liu X, Cruz Rivera S, Moher D, Calvert MJ, Denniston AK, SPIRIT-AI and CONSORT-AI Working Group. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat Med 2020;26:1364-74. doi:10.1038/s41591-020-1034-x
27 Moher D, Schulz KF, Simera I, Altman DG. Guidance for developers of health research reporting guidelines. PLoS Med 2010;7:e1000217. doi:10.1371/journal.pmed.1000217
28 Dalkey N, Helmer O. An Experimental Application of the DELPHI Method to the Use of Experts. Manage Sci 1963;9:458-67. doi:10.1287/mnsc.9.3.458
29 Vasey B, Nagendran M, McCulloch P. DECIDE-AI 2022. Open Science Framework, 2022. doi:10.17605/OSF.IO/TP9QV
30 Vollmer S, Mateen BA, Bohner G, et al. Machine learning and artificial intelligence research for patient benefit: 20 critical questions on transparency, replicability, ethics, and effectiveness. BMJ 2020;368:l6927. doi:10.1136/bmj.l6927
31 Bilbro NA, Hirst A, Paez A, et al, IDEAL Collaboration Reporting Guidelines Working Group. The IDEAL reporting guidelines: a Delphi consensus statement stage specific recommendations for reporting the evaluation of surgical innovation. Ann Surg 2021;273:82-5. doi:10.1097/SLA.0000000000004180
32 Morley J, Floridi L, Kinsey L, Elhalal A. From What to How: An Initial Review of Publicly Available AI Ethics Tools, Methods and Research to Translate Principles into Practices. Sci Eng Ethics 2020;26:2141-68.
33 Xie Y, Gunasekeran DV, Balaskas K, et al. Health Economic and Safety Considerations for Artificial Intelligence Applications in Diabetic Retinopathy Screening. Transl Vis Sci Technol 2020;9:22. doi:10.1167/tvst.9.2.22
34 Norgeot B, Quer G, Beaulieu-Jones BK, et al. Minimum information about clinical artificial intelligence modeling: the MI-CLAIM checklist. Nat Med 2020;26:1320-4. doi:10.1038/s41591-020-1041-y
35 IMDRF Medical Device Clinical Evaluation Working Group. Clinical Evaluation. 2019. Report No.: WG/N56FINAL:2019.
36 IMDRF Software as Medical Device (SaMD) Working Group. Software as a Medical Device. Possible Framework for Risk Categorization and Corresponding Considerations, 2014.
37 National Institute for Health and Care Excellence (NICE). Evidence standards framework for digital health technologies. 2019.
38 European Commission, Directorate-General for Communications Networks, Content and Technology. Ethics guidelines for trustworthy AI, Publications Office, 2019. https://data.europa.eu/doi/10.2759/346720
39 Boel A, Navarro-Compán V, Landewé R, van der Heijde D. Two different invitation approaches for consecutive rounds of a Delphi survey led to comparable final outcome. J Clin Epidemiol 2021;129:31-9. doi:10.1016/j.jclinepi.2020.09.034
40 Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG. Research electronic data capture (REDCap) - a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform 2009;42:377-81. doi:10.1016/j.jbi.2008.08.010
41 Harris PA, Taylor R, Minor BL, et al, REDCap Consortium. The REDCap consortium: Building an international community of software platform partners. J Biomed Inform 2019;95:103208. doi:10.1016/j.jbi.2019.103208
42 Nowell LS, Norris JM, White DE, Moules NJ. Thematic Analysis: Striving to Meet the Trustworthiness Criteria. Int J Qual Methods 2017;16:1609406917733847. doi:10.1177/1609406917733847
43 International Organization for Standardization. Information technology - Artificial intelligence (AI) - Bias in AI systems and AI aided decision making (ISO/IEC TR 24027:2021). 2021.
44 IMDRF Medical Device Clinical Evaluation Working Group. Clinical Investigation. 2019. Report No.: WG/N57FINAL:2019.
45 Hopper AN, Jamison MH, Lewis WG. Learning curves in surgical practice [abstract]. Postgrad Med J 2007;83:777-9. https://pmj.bmj.com/content/83/986/777. doi:10.1136/pgmj.2007.057190
46 International Organization for Standardization. Ergonomics of human-system interaction - Part 11: Usability: Definitions and concepts (ISO 9241-11:2018). 2018. Report No.: ISO 9241-11:2018.
47 von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP, STROBE Initiative. Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. BMJ 2007;335:806-8. doi:10.1136/bmj.39335.541782.AD
48 Page MJ, McKenzie JE, Bossuyt PM, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 2021;372:n71. doi:10.1136/bmj.n71
49 Sedrakyan A, Campbell B, Merino JG, Kuntz R, Hirst A, McCulloch P. IDEAL-D: a rational framework for evaluating and regulating the use of medical devices. BMJ 2016;353:i2372. doi:10.1136/bmj.i2372
50 Park Y, Jackson GP, Foreman MA, Gruen D, Hu J, Das AK. Evaluating artificial intelligence in medicine: phases of clinical research. JAMIA Open 2020;3:326-31. doi:10.1093/jamiaopen/ooaa033
51 Higgins D, Madai VI. From Bit to Bedside: A Practical Framework for Artificial Intelligence Product Development in Healthcare. Adv Intell Syst 2020;2:2000052. doi:10.1002/aisy.202000052
52 Sendak MP, D’Arcy J, Kashyap S, Gao M, Nichols M, Corey K, et al. A path for translation of machine learning products into healthcare delivery. EMJ Innov 2020. doi:10.33590/emjinnov/19-00172
53 Moher D, Jones A, Lepage L, CONSORT Group (Consolidated Standards for Reporting of Trials). Use of the CONSORT statement and quality of reports of randomized trials: a comparative before-and-after evaluation. JAMA 2001;285:1992-5. doi:10.1001/jama.285.15.1992
54 Park SH. Regulatory Approval versus Clinical Validation of Artificial Intelligence Diagnostic Tools. Radiology 2018;288:910-1. doi:10.1148/radiol.2018181310
55 US Food and Drug Administration (FDA). Clinical Decision Support Software - Draft Guidance for Industry and Food and Drug Administration Staff. 2019. www.fda.gov/media/109618/download
56 Lipton ZC. The Mythos of Model Interpretability. Commun ACM 2018;61:36-43. doi:10.1145/3233231
57 Ghassemi M, Oakden-Rayner L, Beam AL. The false hope of current approaches to explainable artificial intelligence in health care. Lancet Digit Health 2021;3:e745-50. doi:10.1016/S2589-7500(21)00208-9
58 McIntosh C, Conroy L, Tjong MC, et al. Clinical integration of machine learning for curative-intent radiation treatment of patients with prostate cancer. Nat Med 2021;27:999-1005. doi:10.1038/s41591-021-01359-w
59 International Organization for Standardization. Ergonomics of human-system interaction - Part 210: Human-centred design for interactive systems (ISO 9241-210:2019). 2019.
60 Norman DA. User Centered System Design. 1st ed. CRC Press, 1986. doi:10.1201/b15703
61 Winkler J, Moser R. Biases in future-oriented Delphi studies: A cognitive perspective. Technol Forecast Soc Change 2016;105:63-76. doi:10.1016/j.techfore.2016.01.021

Supplementary appendix 1: Explanation and Elaboration (E&E)
Supplementary appendix 2: DECIDE-AI reporting item checklist
Supplementary materials: Additional figures (1 and 2), tables (1-3), and notes
Supplementary information: Full list of DECIDE-AI expert group members and their affiliations