Yang Li
Otmar Hilliges Editors
Artificial
Intelligence for
Human Computer
Interaction:
A Modern Approach
Human–Computer Interaction Series
Editor-in-Chief
Jean Vanderdonckt
Louvain School of Management, Université catholique de Louvain,
Louvain-La-Neuve, Belgium
The Human–Computer Interaction Series, launched in 2004, publishes books that
advance the science and technology of developing systems which are effective and
satisfying for people in a wide variety of contexts. Titles focus on theoretical
perspectives (such as formal approaches drawn from a variety of behavioural
sciences), practical approaches (such as techniques for effectively integrating user
needs in system development), and social issues (such as the determinants of utility,
usability and acceptability).
HCI is a multidisciplinary field focused on the human aspects of the
development of computer technology. As technology becomes increasingly
pervasive, the need to take a human-centred approach in the design and
development of computer-based systems becomes ever more important.
Titles published within the Human–Computer Interaction Series are included in
Thomson Reuters’ Book Citation Index, The DBLP Computer Science
Bibliography and The HCI Bibliography.
Artificial Intelligence
for Human Computer
Interaction: A Modern
Approach
Editors
Yang Li
Google Research (United States)
Mountain View, CA, USA

Otmar Hilliges
Advanced Interactive Technologies Lab
ETH Zurich
Zurich, Switzerland
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Switzerland AG 2021
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Foreword for Artificial Intelligence for Human Computer Interaction: A Modern Approach
From its earliest days, Artificial Intelligence has pursued two goals: to emulate human
behavior, and to achieve optimal performance regardless of method. On one side,
researchers argue that humans are the best example we have of intelligent systems,
so AI should focus on understanding and replicating the human mind and brain.
On the other side, researchers contend that selecting a good course of action is an
optimization problem, so AI should focus on mathematical equations and algorithms
that achieve or approximate optimality.
This book focuses on a middle ground that brings together these two approaches:
• AI should be seen as a tool that allows humans and computers to work better
together.
• Sometimes this means a high level of automation, but a human should still have
a high level of overall control and confidence in the system.
• Other times a human is immersed in the inner loop and effective human/computer
dialog is crucial to success.
• Understanding human intent and mental state is a precondition of effective dialog,
as is the ability for a computer to explain itself in terms a human can understand.
• Prior to any optimization, human and computer need to come to an agreement on
exactly what it is that should be optimized.
In recent years we have seen the rise of powerful new AI-enabled applications in
computer vision, speech recognition, natural language understanding, recommendation
systems, robotics, and other subfields. Getting the full benefit from these systems
requires joint HCI/AI research.
For example, hundreds of millions of people use voice assistants regularly. Breakthroughs
in deep learning AI made the recognition rate acceptable, and it is HCI that
makes the whole interface work. But many challenges remain:
• HCI research gave us the WIMP interface for devices with screens, but some
voice assistants have no screen. They will need a new design language. This
will require a partnership with AI providing ever-increasing capabilities–better
speech recognition, better models of user intent, more actions that the assistant
can perform–and HCI answering “how do we give users mental models that will
allow them to discover and understand the ever-increasing capabilities.”
• Voice assistants could grow as a platform to become as significant as the PC or
mobile platforms, but only if it is easy for developers to create applications that
take advantage of the power of AI with a convenient user interface. Today, many
voice assistant applications are little better than the annoying phone services that
say "press 1 for …" because that is the easiest methodology for developers who
are unskilled in machine learning to express the options. The full promise of AI
will only be achieved if the ability to easily develop powerful innovative systems
is democratized, not restricted to only PhD-level researchers.
• In traditional software, UX designers come up with guidelines that are implemented
by human programmers. But what happens when the "programmer" is
a machine learning system? How can UX guidelines be codified as inputs to a
system that is continually learning and evolving, and doing personalization for
each user?
• As AI moves from the research labs to real-world applications, we are finding that
some applications can unfairly disadvantage one subgroup of users over another,
due to disparities in data or to carelessness in setting the goals of the system. To
achieve fairness while maintaining privacy and efficacy will require HCI tools
that allow us to explore the data more thoroughly.
Of course there are many other AI applications besides voice assistants that will
also benefit from a partnership with HCI: autonomous vehicles, computer vision,
robots, recommendation systems, healthcare, drug discovery, elder care, and so on.
Each specialized application will require in-depth understanding of the use cases for
each distinct user population.
As is often the case when two fields collide, the effects are felt in both directions.
HCI gains new interaction modalities and new tools to make interactions more
effective. But AI gains too. Designing machine learning software is fundamentally
different from designing traditional software, in that it is more a process of training
than programming. Thus, the interaction between human AI developers and the
machine-learning tools they rely on is a type of HCI that we are just beginning to
study and learn how to improve.
This book is not the only champion of the combination of HCI and AI. There
are other recent books such as Ben Shneiderman’s Human-Centered AI and Stuart
Russell’s Human Compatible. There are new institutes for Human-Centered AI
at universities including Stanford, Maryland, Berkeley, Utrecht, and the Technical
University of Denmark. The field is taking root.
What is unique about this book is its breadth, covering topics such as behavior
modeling, input modalities, data sets, crowdsourcing, and machine learning toolkits.
Specific application areas include drawing, natural language, GUIs, medical imaging,
and sound generation. The breadth extends to the authors as well as the topics–
we have authors who typically publish in AI natural language conferences such as
EMNLP, in computer vision conferences such as CVPR and SIGGRAPH, and in HCI
conferences such as CHI. You may discover someone working on similar problems
who comes from a very different background.
I believe that in a few years we will look back on today as a critical time in the
development of a new richer interaction between humans and AI systems, and we
will see this book as an important step in a better understanding of the breadth and
potential impact of this emerging joint field.
Peter Norvig
Introduction
and reason about what humans “want” (i.e., their intent) are increasingly pressing
research problems to which a significant amount of work across industry and academia
is dedicated.
Finally, the question of what tools, interfaces, and processes humans use when
designing machine-learning-powered systems is becoming a research topic in its own
right. Due to the complexity of AI systems and the difficulty humans have in
understanding and analyzing their inner workings, this line of work is a crucial
building block toward a world in which (i) engineers can efficiently build AI-based
systems, (ii) designers can create interfaces that work for humans, and (iii) developers
have tools at hand that allow them to understand and explain how their algorithms
and systems work.
These and many other interesting questions at the intersection of AI and HCI
are what this book explores. In it we present a collection of work spanning HCI and
AI. We selected work that studies this exciting and challenging area of research from
many different angles and perspectives. These include work that leverages modern
approaches in AI, particularly data-driven and deep learning methods, to tackle
important HCI problems that were previously challenging or impossible to
address due to the limitations of traditional technologies. Other perspectives include
those of researchers who aim to understand the effect that data-driven techniques
have on the interaction between AI and humans, and those who propose new tools for
the design of such systems. Through these perspectives, we hope to bridge
the often disjoint areas of AI and HCI and to provide new insights, research
questions, and hopefully some useful answers to important questions for researchers
and practitioners from both areas. We thus hope to accelerate the identification of
new opportunities to advance both HCI and AI, to foster cross-field
collaboration between HCI and AI, and to disseminate best practices, resources, and
knowledge from both fields.
Background
The rise of deep learning [9] and the recent advance of data-driven methods [3] have
fundamentally transformed the field of AI. This impact can be felt well beyond the
field of AI itself, with deep learning methods aiding scientific discoveries such as
solving protein structures [12] and identifying exoplanets [13]. More closely related
to this book, user-facing systems such as speech-based interfaces, self-driving cars, and
personal robots have become not only feasible but actually usable. Thus, new opportunities
for research in HCI, and at its intersection with AI, emerge when the machine is no
longer an explicitly programmed data-processing automaton but a complex system
capable of carrying out tasks that were previously squarely in the domain of
humans, and beyond.
Overview
A rich spectrum of work has emerged that leverages the strengths of both HCI
and AI. In this section, we give an overview of the trends in the research area
and of the opportunities and challenges that have been and are being raised. Based on
these, we have selected a rich set of contributed articles from world-class authors who are
active in the area. We now provide a brief overview of the structure of the book
and hope thereby to give some guidance to the interested reader.
The first part of the book is devoted to human behavior modeling in the context of
interaction tasks. This is a classic problem in the field of HCI. Not only is it important
to predict how humans will interact with an interface (how long a task will take, or how
error-prone it is), but such modeling is also increasingly becoming a fundamental issue in AI itself,
as part of AI's quest for computational modeling of human intelligence. In addition
to advancing the scientific understanding of human behaviors, these models can aid
interaction designers in determining how usable an interface is without having to test
it exhaustively with real users, which can be expensive and laborious. Furthermore,
such technologies may allow for online adaptation of an interface or user input
recognition system, ensuring optimal interaction flow and control based on factors
such as user capability and goals.
In the chapter Human Performance Modeling with Deep Learning, Yuan, Pfeuffer,
and Li discuss a series of three case studies on how to use a data-driven deep
learning approach for human performance modeling. These projects set out to
extend classic approaches [4, 8] by allowing more complex behaviors to be modeled
in an end-to-end fashion. They reduce the effort needed for feature engineering and the
need to make potentially limiting assumptions. These approaches are extensible and can
address new or emerging behaviors. Meanwhile, instead of using a deep model as
a black box, the work shows how deep models can be analyzed to gain analytical
insights and used for optimizing an interface.
The chapter Optimal Control to Support High-Level User Goals in Human-Computer
Interaction by Gebhardt and Hilliges offers a distinct perspective on interaction
modeling based on control theory. They formulate the flow of HCI as a dynamic
system in which both human and computer adapt to achieve a high-level goal. More
specifically, they offer two approaches for optimizing such a dynamic system: one
based on classic control theory using model predictive control and the other based
on a reinforcement learning view of control. The authors first focus on the specific
domain of human-robot interaction and subsequently apply their methods to mixed
reality UIs. Compared to Chap. 1, this chapter is more focused on adaptive user
interfaces in which both the system and the human evolve.
The second part of the book is focused on another classic topic in HCI—input. There
is a rich body of literature on how to expand the input bandwidth between
human and machine. This is often achieved by enabling new input modalities and
is aimed at reducing user effort. In light of proactive AI systems, sensing user
actions and, through subsequent analysis, inferring intent is becoming an increasingly
important topic.
Systems that interact with users always rely on mechanisms for humans to
specify their intentions. Traditional techniques, including mice, keyboards, and
touch screens, require the user to explicitly provide inputs and commands. However,
modern deep learning-based approaches are now robust enough to the inherent ambiguity
and noise in real-world data to make it feasible to analyze and reason about
natural human behavior, including speech and motion but also more subtle activities
such as gaze patterns or biophysical responses. Such approaches allow us to
go beyond simple gesture recognition [14] and pattern matching approaches [16],
which still require the user to memorize a set of specific commands, and to
analyze complex human activity in a more continuous and holistic fashion. Examples
include understanding fine-grained hand articulation for use in VR and AR [6],
understanding and modeling natural handwritten text [2], and estimating [10] or
even synthesizing human gaze data [7]. Such methods then form the building blocks
for novel types of interactive systems in which the human and the machine interact
in a more immediate fashion, raising new questions for HCI in terms of how to
design such UIs. However, AI-based techniques can be used not only to sense user
input but also to learn high-level concepts such as user preference or, more generally
speaking, to analyze the usage context in order to adapt the UI and present information
proactively, given the estimated user intention (e.g., [5]).
The four chapters in this section are representative of the endeavor to use AI
to achieve these goals, covering gaze estimation, text entry, and gesture input.
Zhang, Park, and Feit open the section by presenting their work on gaze estima-
tion and how to enable gaze as an additional input modality. They give a thorough
survey of gaze estimation methods, and discuss how learning-based approaches can
significantly advance the topic. The authors employ AI techniques such as Convolutional
Neural Nets (CNNs) and few-shot learning for gaze estimation. Staying true
to the cross-disciplinary spirit of the book, the authors also discuss how gaze can be
integrated into interactive systems and provide guidelines and application examples.
Zhang et al. next address the important research topic of text entry. Text entry has
been extensively investigated over several decades, and mobile text entry is a rich
yet challenging research arena. Specifically, the authors focus on intelligent error
correction during mobile text entry, and formulate error correction as an encoding–
decoding process, which can be naturally realized using a common deep learning
architecture of an encoder–decoder with attention. In addition, the authors show how
interaction techniques can be designed to seamlessly work with an AI model.
Quinn, Feng, and Zhai then present their work on using deep models to detect
touch gestures on the capacitive touchscreens of mobile devices. They show how deep
models, built from CNN backbones and LSTM cells, can detect richer expressions of
finger touch than simple pointing from touch sensor image sequences, which
can be translated into micro-gestures or variations such as pressure. The work presents
original insights into how touch surface biomechanics provide unique opportunities
for rich gesture expression and how a deep model can leverage these signals. The
authors then discuss how interaction techniques can be designed on top of such capabilities
and share first-hand knowledge of integrating such a model into a mobile system.
Continuing on the topic of gesture input, Matulic and Vogel present their work on
enhancing digital pen interaction by detecting grip postures. They investigated three
sensing mechanisms and developed deep neural networks to recognize hand postures
and gestures. The work specifically compares deep models with traditional machine
learning methods, and investigates a variety of task setups and difficulty levels as well
as how such a model can be used in a realistic interaction scenario. This work again
shows how CNNs can be useful for processing sensor input for gesture recognition.
Specific Domains
We conclude the book with a section covering a collection of work that showcases
new frontiers in how modern AI approaches enable new avenues for HCI research.
The topics range from interaction design to language-based interaction, sound
individualization, and medical imaging. These topics have started to quickly gain
traction in the research community and industry alike.
Huang et al. present a collection of work they conducted to enable sketching as
a communication medium for creative tasks such as UI design and drawing. They
used deep models to capture rich expressions of natural sketches, which would have been
difficult to achieve with traditional machine learning methods. The chapter
presents the set of approaches involved in this research, including data collection, model
development, and interactive system design, as well as evaluation, providing a
useful example of how to conduct research in this area.
Aksan and Hilliges continue this discussion on ink-based forms of communication
(handwriting, sketching, diagrams)—the most flexible and freeform means of human
communication. It is exactly the flexibility and versatility of a quick doodle that
makes it hard to model via machine learning techniques. Such algorithms need to
extract underlying structure and separate it from other formative variables such as
the handwriting style of a particular individual. The chapter brings together an HCI
and an ML perspective and discusses both the technical challenges of modeling ink
in a generative fashion and the applications in interactive tools.
Li, Zhou, and Li present their work on bridging two dominant communication
media: natural language and graphical user interfaces. The former is the major
form of communication in our everyday life, and the latter is the de facto standard
for conversing with a computer system such as a smartphone. The authors elaborate
on two lines of work to bridge the two: (1) natural language grounding, which maps natural
language instructions to executable actions on GUIs, and (2) natural language generation,
which describes a graphical UI element in prose so that it can be conveyed
to the user.
Continuing on the topic of combining natural language and GUIs, Li, Mitchell,
and Myers then present their efforts on developing Sugilite, a multi-modal,
conversational task learning agent for mobile devices that uses both the user's natural
language instructions and demonstrations on GUIs. Sugilite addresses multiple
important issues in the area such as robustness, usability, and generalizability.
The chapter also highlights the authors' more recent work on screen representation
learning, which embeds the semantics of GUI screens and is an important building
block for mobile task learning.
Using modern AI methods to enhance medical practice has drawn increasing
interest in the field. In their chapter, Liang, He, and Chen share their work on bringing
modern AI to medical imaging, a highly specialized profession that requires a physician
to manually examine complex medical images. To aid physicians in this
time-consuming and error-prone task, the authors discuss their approach for
bringing AI into the process by engaging patients in self-assessment and enabling
physicians to collaborate with AI to perform a diagnosis. Based on this work, the
authors offer a broad view of human-centered AI.
Lastly, Yamamoto and Igarashi introduce a method to adapt the output of a
generative model to a specific user for 3D spatial sound individualization, a process
that has traditionally been time-consuming and expensive, requiring specialized devices.
The authors first train a deep model for sound generation and then tease apart the
individualization component from the general behavior using tensor decomposition. The
chapter makes a great case study of how to personalize an interactive system using
modern machine learning techniques. The authors generalize their findings to many
other HCI problems that would benefit from personalization.
Outlook
There have been many successes in applying modern AI methods to HCI as we have
seen in these chapters. We also see examples of HCI work that helps to advance AI in
return. With more and more work in the field starting to embrace AI-based methods
for solving HCI problems, we see a fundamental shift of HCI methodologies toward
xvi Introduction
more data driven and model centric viewpoints. With these methods, we also see
many hard HCI problems that can now be solved to a certain degree. We now discuss
challenges for research at the intersection of AI and HCI, and and at the same time
opportunities for impact.
Data challenges. Machine learning and AI techniques hold great promise for
shifting how we interact with machines from an explicit input model to a more implicit
interaction paradigm in which the machine observes and interprets our actions. To
achieve such a paradigm shift, many challenges need to be overcome. First and foremost,
deep learning methods are often data hungry, yet acquiring data of human
activity is much more difficult than in other domains such as computer vision or
NLP. Hence, new ways to collect data and to make use of smaller datasets are of
central importance to HCI–AI research. We have seen examples of methods along this
direction, such as human-in-the-loop AI systems and few-shot learning. Progress
in this area can be made through close collaboration between AI and HCI researchers and
practitioners.
Inferring unobservable user state. Novel algorithms to capture and model high-level
user state and behavior, including cognitive activity and user intent, could drastically
change what the UI is. Much progress has been made in HCI, computer vision,
and machine learning on understanding the observable aspects of human activity such
as speech, gestures, body position, and its spatial configuration. Some examples are
discussed in this book in detail. However, it remains a very hard challenge to infer
the underlying source of human activity, that is, our needs, plans, and wants: our
intent. The difficulty stems from the fact that these states are purely cognitive and
hence not directly observable. However, we are convinced that research in this
direction will ultimately be very fruitful, since it would allow for the design of interactive
systems and UIs that truly adapt to users' needs and learn how to
behave so as to reduce or fully eradicate user frustration and dissatisfaction.
Interpretability. Having an analytical understanding of machine intelligence is an
important topic in both the AI and the HCI field. Deep models with millions or even
billions of parameters are particularly difficult to analyze. While better modeling
accuracy is of great benefit, interpretability of a model is crucial for HCI researchers
to gain new knowledge and to advance the field. We feel that progress on this topic
will benefit both fields tremendously. Last but not least, model interpretability
research can help answer many questions that arise about how such intelligent systems
can be made usable and discoverable in the real world. It informs AI-based system
designers and developers on how to mitigate issues around privacy, user autonomy,
and user control.
Yang Li
Otmar Hilliges
Contents

Modeling

Human Performance Modeling with Deep Learning
Arianna Yuan, Ken Pfeuffer, and Yang Li

Optimal Control to Support High-Level User Goals in Human-Computer Interaction
Christoph Gebhardt and Otmar Hilliges

Modeling Mobile Interface Tappability Using Crowdsourcing and Deep Learning
Amanda Swearngin and Yang Li

Input

Eye Gaze Estimation and Its Applications
Xucong Zhang, Seonwook Park, and Anna Maria Feit

AI-Driven Intelligent Text Correction Techniques for Mobile Text Entry
Mingrui Ray Zhang, He Wen, Wenzhe Cui, Suwen Zhu, H. Andrew Schwartz, Xiaojun Bi, and Jacob O. Wobbrock

Deep Touch: Sensing Press Gestures from Touch Image Sequences
Philip Quinn, Wenxin Feng, and Shumin Zhai

Deep Learning-Based Hand Posture Recognition for Pen Interaction Enhancement
Fabrice Matulic and Daniel Vogel

Specific Domains

Sketch-Based Creativity Support Tools Using Deep Learning
Forrest Huang, Eldon Schoop, David Ha, Jeffrey Nichols, and John Canny

Generative Ink: Data-Driven Computational Models for Digital Ink
Emre Aksan and Otmar Hilliges

Bridging Natural Language and Graphical User Interfaces
Yang Li, Xin Zhou, and Gang Li

Demonstration + Natural Language: Multimodal Interfaces for GUI-Based Interactive Task Learning Agents
Toby Jia-Jun Li, Tom M. Mitchell, and Brad A. Myers

Human-Centered AI for Medical Imaging
Yuan Liang, Lei He, and Xiang ‘Anthony’ Chen

3D Spatial Sound Individualization with Perceptual Feedback
Kazuhiko Yamamoto and Takeo Igarashi
Part I
Modeling
Human Performance Modeling with
Deep Learning
Arianna Yuan and Ken Pfeuffer conducted the work during an internship at Google Research.
A. Yuan
Stanford University, 450 Serra Mall, Stanford, CA, USA
e-mail: xfyuan@stanford.edu
K. Pfeuffer
Aarhus University, Nordre Ringgade 1, Aarhus, Denmark
e-mail: ken@cs.au.dk
Y. Li (B)
Google Research, Mountain View, CA, USA
e-mail: liyang@google.com
1 Introduction
Modeling human visual attention in interaction tasks has been a long-standing challenge
in the HCI field [5, 37, 41–43]. Building models that can accurately estimate the
difficulty of various visual search tasks has significant importance to user interface
(UI) design and development. Traditional approaches for examining visual search
involve usability tests with real human users, and a predictive model can save time
and cost for UI practitioners by offering them insights into visual search difficulty
before testing with real users.
Many classic results on visual search have been established across the fields of
HCI and cognitive science. For instance, researchers have differentiated two types of
visual search: feature search and conjunction search. Feature search is a visual search
process in which participants look for a given target surrounded by distractors that
differ from the target by a unique visual feature, such as orientation, color and shape,
e.g., searching a triangle among squares. On the other hand, in conjunction search,
participants look for a previously given target among distractors that share one or
more visual features with the target [35], such as searching a red triangle among
red squares and yellow triangles. Previous studies have shown that the efficiency
(reaction time and accuracy) of feature search does not depend on the number of
distractors [30], whereas the efficiency of conjunction search is dependent on the
number of distractors present—as the number of distractors increases, the reaction
time increases and the accuracy decreases [40]. In addition, using well-controlled
visual stimuli, previous studies showed that prior knowledge of the target greatly
influences visual search time [46], indicating the interaction between top-down
and bottom-up processing in the visual search process.
Despite the robustness and simplicity of these early findings, they are not very
practical, and they only focus on modeling search time in abstracted settings. For
example, the effect of the number of distractors on visual search time in conjunction
search is observed with oversimplified stimuli, e.g., geometric shapes or Gabor
patches, which are far removed from everyday visual search tasks, e.g., finding a booking
button on a busy hotel home page. Later modeling studies have attempted to
simulate visual search in more realistic contexts, such as searching for a target in
web pages, menus, and other graphical interfaces [16, 21, 38]. However, those studies
usually require the extraction of a set of predefined visual features of the target
and the candidate visual stimuli. Although those predefined, handcrafted features
are indicative of search time, there is often useful information in the visual scene
that cannot be captured by such handcrafted features and is therefore missing from
previous computational models.
As mentioned in the Introduction, deep learning allows us to take a data-driven
approach and saves some of the effort of intensive feature engineering. It has become
increasingly popular in the HCI field recently. For instance, Zheng et al. [47] proposed
a convolutional neural network to predict where people look on a web page
under different task conditions. In particular, they predefined five tasks, i.e., signing
up (email), information browsing (product promotion), form filling (file sharing, job
searching), shopping (shopping), and community joining (social networking), and
they pre-trained their model on a synthetic task-driven saliency dataset. In the work
presented in the following sections, Yuan and Li [45] introduced a more
generalizable deep learning method for visual search prediction, in the sense that it
does not assume the nature of the search task or a prior on the search pattern. It
takes advantage of both traditional modeling approaches and the popular deep learning
method. In particular, they combine existing heuristic-based, structured features
that have long been used to model visual search tasks, together with the unstructured
image features—raw pixels of a web page—to predict human visual search time.
Yuan and Li focused on goal-driven visual search behaviors of human users on web
pages. For each task, a human user is first given a target to search for, and then locates
the item on a web page by clicking on it. The task resembles common interaction
behaviors on web pages. They base their research on a dataset that was collected via
crowdsourcing. The web pages used in the data collection were randomly selected
from the top 1 billion English web pages across 24 vertical classification categories.
For each task, an element on a web page was randomly selected as the target, and
the human worker was asked to find it on the web page. A target element is of one
of the following five types: image, text, link, button and input_field.
Each trial starts by showing the human worker a screen that contains only the
target prompt and a “Begin” button at the top-left corner of the screen. Once the
worker clicks on the “Begin” button, a web page is revealed and the target prompt
remains on the screen in case the worker forgets the target to look for. The trial is
finished once the worker finds and clicks on the target, or clicks on the “I can’t find
it” button on the top-right corner of the screen to skip the trial. If the target does
not contain any text, its thumbnail image is shown to the user as the prompt for
the search target. Otherwise, the text content is displayed to the human worker for
finding an element that contains such text. In either case, the target is ensured to
be a unique element on the page. There are 28,581 task trials in the dataset, which
were performed by 1,887 human workers. In the dataset, each user has at least 8
data points. The dataset was randomly split for training (1,520 users, 22,591 trials),
validation (184 users, 3,151 trials), and testing (183 users, 2,839 trials). There is no
overlap in users among any of these datasets—the data of each user can only appear
in one of these splits.
To build the model, Yuan and Li draw inspiration from cognitive psychology and
neuroscience, and attempt to simulate the human visual search process with deep
neural networks. In neuroscience, researchers have shown that when subjects try
to selectively attend to a target in a visual scene, the neural representations of the
relevant objects, whether the real target or distractors that share similarity with the target,
are enhanced [13, 27, 29]. Based on this finding, Yuan and Li capture the human
attention pattern with a neural attention mechanism in deep learning. The attention
mechanism has recently been widely adopted in image captioning and visual
question answering [1], where it significantly boosts the accuracy of models by
allowing the model to selectively attend to specific areas in a scene given the target.
The neural attention mechanism also enhances the interpretability and trustworthiness
of the model because it allows us to draw an analytical understanding
of model behaviors by examining the attention map learned from the data.

Fig. 1 The architecture of the visual search performance model. The attention map is computed as
the alignment between the latent representations of the entire UI (the web page) and the target, which
is then concatenated with the structured features as the input for predicting human performance in
visual search. (The figure shows the raw pixels of the entire page and of the target as inputs, the
flattened attention map concatenated with the structured features, i.e., numerical features and
embeddings of categorical features, and the classification and regression outputs.)
The full architecture of Yuan and Li’s model is illustrated in Fig. 1. To process a
vast amount of information such as the raw pixels of a web page, they use convolutional
neural networks (CNNs) as their image feature extractors. CNNs have been widely used
in object recognition, object detection, and visual question answering [9, 22, 34].
This powerful architecture allows the model to capture the visual details on the web
page and the interaction between the web page and the target without deliberately
specifying any features. Because a target can be of variable size in different tasks,
they first resize each target to the same dimension of 64 × 64, and then use a 6-layer
convolutional neural network (CNN) to encode the target, which results in a
target embedding of dimensions 1 × 1 × 4. Similarly, for each task web page, Yuan
and Li first resize it to 512 × 512 and then use a 3-layer convolutional neural
network to encode the web page image, which results in an image embedding of
64 × 64 × 4. Therefore, there are in total 64 × 64 super-pixel representations, each
of which has dimension 4.
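To make the encoder shapes concrete, here is a minimal sketch in PyTorch (an assumption; the original paper does not publish its implementation, and the channel counts, kernel sizes, and strides below are illustrative). Only the input and output shapes follow the text: a 64 × 64 target image becomes a 1 × 1 × 4 embedding, and a 512 × 512 page image becomes a 64 × 64 × 4 grid of super-pixel embeddings.

```python
import torch
import torch.nn as nn


class TargetEncoder(nn.Module):
    """6-layer CNN sketch: (B, 3, 64, 64) target crop -> (B, 4, 1, 1) embedding."""

    def __init__(self):
        super().__init__()
        channels = [3, 8, 8, 8, 8, 4, 4]  # illustrative widths; 6 stride-2 convs halve 64 -> 1
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1), nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


class PageEncoder(nn.Module):
    """3-layer CNN sketch: (B, 3, 512, 512) page screenshot -> (B, 4, 64, 64) super-pixels."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # 512 -> 256
            nn.Conv2d(8, 8, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # 256 -> 128
            nn.Conv2d(8, 4, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # 128 -> 64
        )

    def forward(self, x):
        return self.net(x)
```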
Computing Alignment Scores. With the embeddings of both the target (the goal) and
those of the web page images computed, the model next performs a multiplicative
attention mechanism [36, 44] over these embeddings. In particular, Yuan and Li
compute the cosine similarity between the target embedding vector and each super-
pixel representation to get a 2-D attention map. They evaluated three design choices
of the attention map: (1) using the original attention map; (2) applying the attention
map to the image embedding to get the attention-modulated image representation;
(3) normalizing the attention map with a softmax(·) function. In all these cases,
they treat the attention map as the representation of unstructured webpage-target
features (referred to as the webpage-target embedding), which is combined with heuristic-based,
structured features (see the next section) to predict search time. Yuan and Li find
that the first method yields the best performance; thus, the results in the following
experiments are based on the first formulation of the attention map. The
mathematical formula for computing the attention map A ∈ R^{64×64} is the following:

    A_{i,j} = \sum_{k=1}^{4} I_{i,j,k} T_k        (1)

where I ∈ R^{64×64×4} is the output from the last convolutional layer of the web-page
CNN and T ∈ R^{4} is the output from the last convolutional layer of the target CNN.
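Continuing the PyTorch sketch above, the alignment of Eq. 1 reduces to a dot product between the 4-dimensional target embedding and each super-pixel embedding, which broadcasting makes a one-liner:

```python
def attention_map(page_emb: torch.Tensor, target_emb: torch.Tensor) -> torch.Tensor:
    """Eq. 1: A[i, j] = sum_k I[i, j, k] * T[k].

    page_emb:   (B, 4, 64, 64) output of PageEncoder (I in the text)
    target_emb: (B, 4, 1, 1)   output of TargetEncoder (T in the text)
    returns:    (B, 64, 64)    attention map A
    """
    return (page_emb * target_emb).sum(dim=1)  # broadcast T over all 64x64 super-pixels
```

The flattened map is what the text refers to as the webpage-target embedding; per the figure and caption, it is concatenated with the structured features before the regression and classification heads.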
Combining with Heuristic-Based Features. Yuan and Li performed an analytical
examination of the dataset by looking at the correlations between search time and several
well-established features from the visual search literature. They found that the
y-coordinate of the top-left corner of the target is positively correlated with the search
time: the bigger the y-coordinate, the longer it takes participants to find the target.
This implies a dominant top-down vertical search strategy employed by the human
workers, since the starting position for each trial in the dataset is at the top of the
screen. Regarding the size features, not surprisingly, both the width and the height of
the target negatively correlate with the search time, i.e., the bigger the target is, the
shorter the search time is. A linear model that uses the target's vertical position
(y-coordinate) as one of the predictors confirms this positive correlation with the search
time. Yuan and Li also analyze the influence of the number of candidate items (page
complexity) on the search time. In addition to looking at how the total number of
items affects search time, they conduct a finer-granularity analysis by investigating
how the number of candidate items in each object category affects search time.
They find that the numbers of certain object types are strong predictors of the search
time: images are the easiest to find whereas text is the hardest. Finally, the target
type is also an indicator of search time.
To incorporate these findings, Yuan and Li add several heuristic-based, structured
features to the model. These features can be easily computed and do not require
domain expertise in performance modeling to use. The features include (1)
the positional features, i.e., the (x, y) coordinates of the top-left corner of the target
bounding box and the Euclidean distance between the top-left corner of the target and
the top-left corner of the screen; (2) the size features, i.e., the width, the height, and
the area of the target; (3) the number of candidates or distractors, i.e., the
number of leaf nodes in the DOM representation of the web page; and finally (4) the
target type. There are five possible target types: image, text, link, button and
input_field.
Except for the target type, all the other features are treated as numerical variables.
For the target type, a categorical variable, they use real-valued embedding vectors to
encode the different target types, and these embeddings are learned from the data.
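A sketch of how the structured features might be assembled (same PyTorch assumption as above; the embedding width for the target type is illustrative, since the paper's exact value is not given here):

```python
TARGET_TYPES = ["image", "text", "link", "button", "input_field"]


class StructuredFeatures(nn.Module):
    """Concatenates the numerical features with a learned embedding of the target type."""

    def __init__(self, type_emb_dim: int = 4):  # embedding width is an assumption
        super().__init__()
        self.type_emb = nn.Embedding(len(TARGET_TYPES), type_emb_dim)

    def forward(self, numeric, type_idx):
        # numeric:  (B, 7) -> x, y, distance-to-origin, width, height, area, num_candidates
        # type_idx: (B,)   -> integer index into TARGET_TYPES
        return torch.cat([numeric, self.type_emb(type_idx)], dim=-1)  # (B, 7 + type_emb_dim)
```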
2.3 Experiments
Yuan and Li use the validation dataset to determine the model architecture and
hyperparameters such as the optimal stopping time and learning rate. Previous modeling
work in HCI or cognitive psychology often reports how well a model can fit the data.
Since a deep model has a vast number of parameters, it is trivial to fit
the training data perfectly. Therefore, it is important to report the modeling accuracy
on the test dataset, which the model has no access to during training. The
test accuracy truly reflects the model's capability of capturing human behavior.
Although the primary goal is predicting the search time, a regression problem,
they also evaluate their model on two additional objectives: a classification and a ranking
task. For the classification task, they categorize a visual search task into five difficulty
levels, Very Easy, Easy, Neutral, Hard and Very Hard, by bucketizing a continuous
search time into one of these five categories according to percentiles within
participants. Specifically, for each participant, a data point is categorized as Very Easy
if it is smaller than the 20th percentile of that participant's time performance,
Easy if it is between the 20th and the 40th percentile, Neutral if it is between the
40th and the 60th percentile, Hard if it is between the 60th and the 80th percentile,
and finally Very Hard if it is above the 80th percentile. For the classification task, the
output layer is replaced with a 5-unit softmax layer, representing the probability of
a search task belonging to each of the five difficulty levels. For the ranking task, the
model is presented with a pair of randomly selected search tasks and needs to decide
which of them requires a shorter search time. The output of the regression model is
used to rank the difficulty of a search task. The details of the model configuration,
hyperparameters, and training procedures can be found in the original paper [45].
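The per-participant bucketization can be expressed in a few lines of NumPy; this is a sketch under the assumption that search times have already been grouped by participant, and the names are illustrative:

```python
import numpy as np

LEVELS = ["Very Easy", "Easy", "Neutral", "Hard", "Very Hard"]


def difficulty_labels(times_per_user):
    """Map each search time to one of five difficulty levels using that
    participant's own 20th/40th/60th/80th percentiles.

    times_per_user: dict mapping a user id to a 1-D array of search times.
    Returns a dict of arrays of indices into LEVELS.
    """
    labels = {}
    for user, times in times_per_user.items():
        cuts = np.percentile(times, [20, 40, 60, 80])
        labels[user] = np.digitize(times, cuts)  # values 0..4
    return labels
```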
For the regression task, the work reports the R² metric for both within-user and cross-user
cases. For the within-user case, R² is computed over the trials of each user, and the
score is then averaged across all users. For the cross-user case, R² is
directly computed over all the trials from all the users. Both measure the correlation
between the predicted and the ground-truth visual search times. Table 1 also
shows the classification and the ranking accuracy.
Table 1 The performances of different models. The first eight rows (all but the last two rows) report
the results of linear models using a single feature (indicated by the Model column). The second
last row (structured-all) refers to the baseline linear model that uses all the structured features.
The last row reports the performance of Yuan and Li’s full model, which uses both structured and
unstructured inputs. The performance of the deep model is significantly better than the structured-all
baseline model. ** p < 0.01, *** p < 0.001
Model                 Within-user R²   Cross-user R²   Classification   Ranking
x-coordinate          0.113            0.011           0.254            0.542
y-coordinate          0.338            0.229           0.346            0.669
Target width          0.097            0.015           0.235            0.524
Target height         0.124            0.025           0.279            0.601
Distance              0.177            0.089           0.294            0.622
Target area           0.106            0.037           0.266            0.567
Target type           0.103            0.012           0.250            0.551
Total candidates      0.137            0.051           0.276            0.578
Structured-all        0.373            0.267           0.355            0.691
Deep net+structured   0.384**          0.288***        0.366***         0.699***
There was no repeat in performing each search task, both within and across users,
and all these metrics are computed based on the performance of each unique task.
This metric computation is in contrast to the traditional HCI approach of averaging
performance metrics across multiple trials. Thus, in their setup, it is extremely
challenging to obtain high accuracy due to noise and individual differences.
We train different models multiple times with various random seeds that control
the train-test split and parameter initialization, and then report the averaged
performance metrics in Table 1. We can see that among all the single-feature baseline
models, y-coordinate has the highest performance. The baseline model that
uses all the structured features (structured-all) performs the best among all the
baseline models. The deep model (the last row in the table), which combines both
structured and unstructured input, outperforms all the baseline models that use only
structured features, across all three metrics. In particular, the within-user R²
of the deep model is significantly greater than that of the structured-all baseline model,
t(4) = 4, p < 0.01. The cross-user R² of the deep model is also significantly greater,
t(4) = 8, p < 0.001. The 5-way classification accuracy is significantly greater than
that of the structured-all model, t(4) = 9, p < 0.001. The same holds true for the ranking
accuracy, t(4) = 8, p < 0.001.
2.4 Analysis
To better understand the model's predictions, Yuan and Li examine the attention map
learned by the model, a latent representation computed via Eq. 1. They found that
the model is able to detect the content on the web page regardless of the page's
background color. As we can see in Fig. 2, the target bounding box usually falls into
one of the highlighted regions in the attention map. Interestingly, the attention map
tends to capture the distractors that are visually similar to the target. For instance,
in Fig. 2a, the target is an image and the model learns to highlight the images and
ignore the text on the page, whereas in Fig. 2b, when the target is text, the model
learns to highlight text but suppress images in the attention map.
To gain a deeper understanding of the model's behavior, Yuan and Li analyzed
the embeddings for each of the five target types learned by the model, by projecting
them onto a 2-dimensional space using Principal Component Analysis
(PCA) for visualization (see Fig. 3). Along the first principal component, i.e., the x-axis
in Fig. 3, the link type is closest to the text type because both are text-based. It
is followed by button, which often, but not always, contains text. These are then
followed by input_field and finally image. In other words, there is a sensible
transition from text-like stimuli to image-like stimuli. This property emerged
from training for both the regression and the classification tasks.
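The same kind of inspection can be reproduced with scikit-learn's PCA; the sketch below builds on the StructuredFeatures sketch earlier, and in practice the module would be taken from the trained model rather than freshly initialized:

```python
from sklearn.decomposition import PCA

structured = StructuredFeatures()  # in practice, the trained module from the full model
type_embeddings = structured.type_emb.weight.detach().cpu().numpy()  # (5, type_emb_dim)

coords = PCA(n_components=2).fit_transform(type_embeddings)
for name, (pc1, pc2) in zip(TARGET_TYPES, coords):
    print(f"{name}: PC1={pc1:+.3f}, PC2={pc2:+.3f}")
```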
To summarize, this work shows that combining both structured and unstructured
features gives the model a unique advantage in both fitting the training data and
generalizing to unseen data. It outperforms baseline models that use only structured
features. Note that during training, the model was not explicitly supervised to form
the embedding relationships in Fig. 3. Yet these patterns naturally occur as the model
learns from the data, because targets of specific types tend to manifest similar difficulties
in search tasks and are thus closer in the embedding space. Yuan and Li's
methodology can be readily extended to incorporate other cognitive findings as well
as emerging deep learning methods, combining the strengths of both.
In this case study, we look at target selection from a vertical list or menu, which has
been a predominant task on modern graphical user interfaces such as smartphones.
Like the web page target acquisition discussed in the previous section (Sect. 2), this
task requires a model to address multiple performance components, such as
visual search and motor control. Yet a unique aspect of this task is the learning
effect, which is not involved in the tasks of the previous section. Specifically, a user
is expected to acquire a target repeatedly, and intuitively the targets that are often accessed
by the user will become easier or faster to acquire over time.
A typical approach used by previous work is to explicitly combine multiple performance
components, often via addition, with each component designed based on
a theoretical model. Specifically, Cockburn et al. [12] model menu selection time
by summing pointing time (using Fitts' law), decision time (using Hick's law), and
visual search time, weighted by expertise. The visual search time is simply
modeled as a linear function of the number of items present in the menu. The
expertise with an item, i.e., the learning effect, is designed as a rational function
such that the expertise increases with the number of accesses to the item and
decreases with the number of items in the menu. In the same vein, Bailly et al. [3]
proposed a more complex model formulated based on gaze distribution for
menu selection tasks, in which the total performance time is the sum of serial and
direct visual search time and the Fitts' law time. Serial search carries less weight
and direct search gains more weight as expertise increases.
While previous methods, which are mostly empirically tuned models based on
theoretical assumptions, have made substantial progress in modeling menu tasks,
they are not easily extensible to accommodate various aspects of user interfaces
and human factors. The learning effect is a profound factor that affects every aspect
of human performance and has complex interactions with other factors, such as
the visual saliency of an item; many of these interactions cannot be easily
articulated. In addition, new generations of computing devices such as touchscreen
smartphones have introduced many factors that are not covered by traditional models.
While it is possible to further expand existing models, doing so poses many challenges
and requires a tremendous amount of effort.
In this work [26], Li et al. took a departure from traditional methods by using
a data-driven approach based on recent advances in deep learning for sequence
modeling. The work uses a novel hierarchical deep architecture for human menu
performance modeling. A recurrent neural net, an LSTM [19], is used to encode UI
attributes and tasks at each target item selection. One LSTM is used to represent a
menu with variable length and to incorporate UI attributes such as visual appearance
and semantics. Another LSTM is used to capture learning effects, a major component of
human performance in repetitive tasks. The entire model is learned end-to-end using
stochastic gradient descent. The model outperforms existing analytical methods in
various settings for predicting selection time. Importantly, it is easily extensible for
accommodating new UI features and human factors involved in an interaction task.
Li et al. designed their models [26] based on two important capabilities of recurrent
neural nets (RNNs) [17]. First, an RNN is capable of “reading” in a variable-length sequence
of information and encoding it as a fixed-length representation. This is important because an
interaction task often involves variable-length information; for example, the number
of items in a menu can vary from one application to another. Second, the model is
capable of mimicking users' behavior by learning to both acquire and “remember”
new experience, and to discard (or “forget”) what it has learned if the experience is too dated.
While learning effects are a major component of human performance, prior work
primarily uses a frequency count as the measure of the user's expertise. In contrast,
Li et al.'s model relies on the LSTM [19], which offers a mechanism that is more natural
for mimicking human behaviors.
Encoding a Single Selection Task. At each step, a user selects a target item in
a vertical menu. As revealed by previous work, multiple factors in the
task affect human performance, including the number of items in the menu, the
location of the target item in the menu, the visual salience of each item, and whether
there are semantic cues assisting visual search. For each element in the UI, an
item in the menu in this context, Li et al. represent it as a concatenation of a list of
attributes (see Eq. 2). They use 1 or 0 to represent whether it is the target item for the
current step. To capture the visual salience of an item, they use the length of the item
name; an item that is especially short or long in comparison to the rest of the items
on the menu tends to be easier to spot. To capture the semantics of an item, they
represent the meaning of the item name with a continuous vector acquired
from Word2vec [32], which projects a word onto a continuous vector space where
similar words are close together. m_j^s denotes the vector representation of the jth item at step s.
To encode the selection task that involves a list of items in the menu, Li et al.
feed the vector representation of each item sequentially to a recurrent neural net [17]
(see Fig. 4a). e_j^s represents the hidden state of the recurrent net after reading the jth
item and seeing the previous items through e_{j-1}^s. n denotes the number of items in
the menu. This recurrent net acts as a task encoder (thereafter referred to as
the encoder net), and it does not have an output layer. The final hidden state of the
recurrent net, e_n^s, a fixed-length vector, represents the selection task at step s. The
model then concatenates a one-hot vector indicating whether the menu items are
semantically grouped, alphabetically sorted, or unsorted, resulting in e^s. The task
encoder can accommodate a menu of any length n and its UI attributes.
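A sketch of the encoder net, continuing the PyTorch assumption above; the per-item vector packs the target indicator, the item-name length, and a Word2vec embedding of the name, and the dimensions are illustrative:

```python
class MenuTaskEncoder(nn.Module):
    """Reads a variable-length menu item by item and returns a fixed-length task
    representation e^s: the final LSTM hidden state plus a menu-organization one-hot."""

    def __init__(self, word_emb_dim: int = 50, hidden_dim: int = 32):
        super().__init__()
        item_dim = 2 + word_emb_dim  # [is_target, name_length, word2vec(name)]
        self.lstm = nn.LSTM(item_dim, hidden_dim, batch_first=True)

    def forward(self, items, organization):
        # items:        (B, n, item_dim) for a menu with n items
        # organization: (B, 3) one-hot -- semantically grouped / alphabetical / unsorted
        _, (h_n, _) = self.lstm(items)
        return torch.cat([h_n[-1], organization], dim=-1)  # e^s: (B, hidden_dim + 3)
```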
Modeling A Sequence of Selection Tasks With the interaction task at each step of a
sequence represented as es , it can now feed the sequence into another recurrent neural
net (see Fig. 4b), which is referred to as the prediction net. Note that es in Fig. 4b
represents the encoder net, which is a recurrent neural net itself whose outcome is fed
to the prediction net. The task at each step can vary simply because the user might
need to select a different target item. The UI at each step can also be different, e.g.,
an adaptive interface might decide to change the appearance of an item such as its
size [11] to make it easier to acquire.
The recurrent neural net predicts human performance time at each step, ti . The
predictions are based on not only the task at the current step but also the hidden state
of the previous step that captures the human experience performing previous tasks.
Previous work in deep learning has shown that adding more layers in a deep net can
improve the capacity for modeling complex behaviors [24]. To give the model more
capacity, Li et al. add a hidden layer, with ReLU [31] as the activation function,
Human Performance Modeling with Deep Learning 17
after the recurrent layer, denoted as nonlinear projection in Fig. 4b. Finally, the time prediction t_i is computed as a linear combination of the outputs of this nonlinear transformation layer.
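Continuing the sketch above, the hierarchical structure, an encoder net per step feeding a prediction net with a ReLU projection and a linear time output, could look as follows. Layer sizes are again illustrative assumptions rather than the published configuration.

```python
# Continuation of the sketch above: per-step task encodings feed a prediction
# LSTM, followed by a ReLU projection and a linear output for the time t_i.
# Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SequenceTimePredictor(nn.Module):
    def __init__(self, item_dim, enc_dim=64, n_org=3, pred_dim=64, proj_dim=32):
        super().__init__()
        self.encoder = TaskEncoder(item_dim, enc_dim)
        self.pred_lstm = nn.LSTM(enc_dim + n_org, pred_dim, batch_first=True)
        self.proj = nn.Sequential(nn.Linear(pred_dim, proj_dim), nn.ReLU())
        self.time_out = nn.Linear(proj_dim, 1)

    def forward(self, step_items, step_orgs):
        # step_items: list over steps of (batch, n_items, item_dim) tensors.
        # step_orgs:  list over steps of (batch, n_org) one-hot indicators.
        e_s = torch.stack(
            [self.encoder(i, o) for i, o in zip(step_items, step_orgs)], dim=1)
        h, _ = self.pred_lstm(e_s)          # hidden state carries past experience
        return self.time_out(self.proj(h)).squeeze(-1)   # predicted t_i per step
```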
Model Learning and Loss Function. It is straightforward to compute the time
prediction with the feedforward process of a neural net. The two recurrent neural
nets involved in the model are trained jointly, end to end from the data by feeding in
sequences of selection tasks as input and observed performance times as the target
output (the ground truth), using stochastic gradient descent.
For time performance modeling, one common measure of prediction quality in the literature has been R^2 (e.g., [2, 4, 11, 15]). It measures how well predicted times match observed ones in capturing relative task difficulty or human performance across task conditions and progression. For general time series modeling of continuous values, other metrics are often used, such as Root Mean Square Error (RMSE) or Mean Absolute Error (MAE). Mathematically, R^2 measures the agreement between the sequence of observed times, y_i, and the sequence of predicted times, t_i (see Eq. 3). |S| represents the length of the sequence, and \bar{y} is the mean of y_i.
R^2 = 1 - \frac{\sum_{i=1}^{|S|} (y_i - t_i)^2}{\sum_{i=1}^{|S|} (y_i - \bar{y})^2}    (3)
\sum_{i=1}^{|S|} (y_i - \bar{y})^2 reflects the variance of the observations in each sequence, which is independent of the model. Thus, it is a known constant for each sequence in the training dataset, which we refer to as c_s. To maximize R^2, we want to minimize the squared error term \sum_{i=1}^{|S|} (y_i - t_i)^2, scaled by the sequence-specific constant c_s, which defines the loss function (see Eq. 4). The scaling effectively adapts the learning rate to the variance of each sequence when training the deep neural net. Intuitively, for each training sequence, the more variance the sequence has, the smaller the learning rate we should apply for updating the model parameters, and vice versa.
L_t = \frac{\sum_{i=1}^{|S|} (y_i - t_i)^2}{c_s}    (4)
With the loss function defined, the model can be trained using Backpropagation
Through Time (BPTT) [17], a typical method for training recurrent neural nets (see
more details in the following sections).
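A hedged sketch of how the variance-scaled loss of Eq. 4 and a basic SGD/BPTT training step could be implemented for the model sketched above; the optimizer settings and the placeholder item dimension are assumptions.

```python
# Sketch of the variance-scaled loss (Eq. 4) and one SGD/BPTT training step
# for the model sketched above. Optimizer settings and the item dimension
# are placeholders.
import torch

def sequence_loss(pred_t, obs_y):
    # pred_t, obs_y: (steps,) tensors for a single training sequence.
    c_s = ((obs_y - obs_y.mean()) ** 2).sum()       # per-sequence constant c_s
    return ((obs_y - pred_t) ** 2).sum() / c_s      # minimizing this maximizes R^2

model = SequenceTimePredictor(item_dim=52)          # item_dim is a placeholder
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

def train_step(step_items, step_orgs, obs_y):
    opt.zero_grad()
    pred_t = model(step_items, step_orgs).squeeze(0)   # assumes batch size 1
    loss = sequence_loss(pred_t, obs_y)
    loss.backward()                                    # BPTT through both recurrent nets
    opt.step()
    return loss.item()
```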
3.3 Experiments
To evaluate the model, for each dataset, Li et al. randomly split the data among users
with half of the users for training and the other half of the users for testing the model.
The experiment results were obtained based on the test dataset for which the model
was not trained on, which truly shows how well the learned model can generalize to
Fig. 5 The accuracy and analysis of the menu selection models [26]. (a) Model accuracy over blocks on the smartphone dataset for menus of different lengths, with predicted times in solid lines and observed times in dashed lines. (b) The Jacobian of Li et al.'s deep net indicates how the time performance for selecting a target item is affected by past experience selecting that item, in response to different menu organizations. The X axis is the trial index in the sequence and the Y axis shows the magnitude of the derivative, i.e., the impact.
They report the accuracy of the model using both target-level and menu-level R^2, as used in previous work [2]. Both measure the correlation between predicted and observed performance times. Target-level R^2 examines performance at each target position in a menu with a varying amount of practice (blocks). Menu-level R^2 examines the average performance over all target positions in a menu with a varying amount of practice. For target-level R^2, the model achieved 0.75 on the public dataset for the overall correlation across menu organizations. In particular, R^2 for alphabetically ordered (A), semantically grouped (S) and unsorted menus (U) are 0.78, 0.62 and 0.80, respectively. Note that the single deep model here predicts for all menu organizations, whereas Bailly et al. previously tuned and tested their model for each menu organization separately; their R^2 results were reported as 0.64 (A), 0.52 (S) and 0.56 (U) [2]. For menu-level R^2, Li et al.'s model achieved 0.87 for the overall correlation, with 0.85 (A), 0.88 (S) and 0.94 (U), whereas previous work reported 0.91 (A), 0.86 (S) and 0.87 (U) [2]. Similarly, Li et al.'s model achieved competitive performance on the smartphone dataset, which involves only unordered menus: target-level R^2 = 0.76 and menu-level R^2 = 0.95. It accurately predicts the time performance for each menu length (see Fig. 5a).
3.4 Analysis
While it is generally challenging to analyze what a deep model learns, Li et al. offer
several analyses of the model behaviors and discuss how they match our intuition
about user behavior. To understand the behavior of their deep net model, Li et al.
compute its Jacobian, i.e., the partial derivatives of the network output with respect to a specific set of inputs (see Fig. 5b), which indicates how sensitive the time prediction
is to the change in the input at each step. In particular, they want to find out how users’
past experience with selecting a target affects the users’ performance for selecting the
target item again. Figure 5b is generated by taking the Jacobian of the deep net output,
i.e., the time performance, with regard to the target feature in Eq. 2. We see that the
more recent the experience is with selecting a target item, the more influence it has
on the current trial for selecting the target item again. Intuitively, it might be because
the user remembers where the item is on the list, as found in previous work [2, 11].
However, such an effect eventually wears off as the experience becomes dated, which
is quite consistent with how human memory works. Previous work uses a frequency count to represent the user's expertise with an item, which is insufficient to capture subtler aspects of human short-term memory such as the forgetting effect and the interaction of learning effects with other factors. For example, Li et al. found that the degree to which current performance relies on past experience differs across menu organizations. As shown in Fig. 5b, such sequence dependency has
the largest effect on unordered menus and less effect on semantically or alphabetically
organized menus. A sensible explanation for this phenomenon is that semantic and
alphabetic menus provide additional cues for users to locate an item, which results
in less dependency on memory for completing the task.
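The Jacobian analysis can be reproduced in spirit with automatic differentiation. The sketch below assumes the model sketched earlier and a hypothetical position of the target-indicator feature, and it aggregates gradient magnitudes per past step; it approximates the analysis rather than reproducing the authors' exact procedure.

```python
# Sketch of the Jacobian analysis: gradient of the predicted time at one step
# with respect to the target-indicator feature at every step. The feature index
# and the aggregation over items are simplifying assumptions.
import torch

def target_feature_influence(model, step_items, step_orgs, step_index, target_feature=0):
    items = [x.clone().requires_grad_(True) for x in step_items]
    times = model(items, step_orgs)                 # (batch, steps)
    times[0, step_index].backward()                 # d t_i / d inputs
    # Magnitude of the derivative w.r.t. the target flag at each (past) step.
    return [x.grad[..., target_feature].abs().sum().item() for x in items]
```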
To recap what we learn from this case study: a deep model significantly outperformed previous methods on repetitive menu selection tasks where human experience matters. Importantly, such a deep model can easily capture essential components such as the learning effect and incorporate additional UI attributes such as visual appearance and content semantics without changing the model architecture. By examining how a deep learning model learns from human behavior, the model can also serve as a vehicle to discover new patterns in human behavior that advance analytical modeling.
In the previous case studies, we have shown how deep learning can be used to model human behaviors, taking a dramatic departure from traditional methods that rely on extensive feature engineering and theoretical assumptions. In this section, we present a case study in which deep learning is used to enhance traditional analytical modeling.
Most prior performance models of menus have focused on desktop computers [2, 6, 7, 11, 23], leaving mobile devices underexplored despite their wide adoption in everyday life. Mobile devices provide a distinct interaction surface, with a small display and direct touch input. Pfeuffer and Li [33] focus on a scrollable two-dimensional grid UI, a common interface on touchscreen mobile devices for presenting a large number of items on a small form-factor device, e.g., apps in a launcher, photos in a gallery or items in a shopping collection.
Fig. 6 The experimental study setup (a) and a close-up view of the grid UI used in the study (b)
At the top level, Pfeuffer and Li's model is very similar to previous models based on the linear combination of individual performance components [3, 11]. The overall selection time T_i for each item is the sum of the navigation time T_nav, the visual search time T_vs and the pointing time T_point:

T_i = T_{nav} + T_{vs} + T_{point}    (5)
To model each of these components for a realistic grid task, the work uses deep
learning techniques to address aspects that lack a solid theoretical basis. By analyzing
the dataset, Pfeuffer and Li found an unexpected effect of the vertical grid position
on task completion time, which initially increases with the row position, and then
steadily decreases toward the bottom of the grid. They found that navigation strategy
is the main reason for the decrease in the time required. In 80% of trials, the user
navigates from the top of the grid continuously downwards (Top-Down), until the
target is found. In the remaining 20% of trials, the user performs a flick gesture to scroll to the bottom of the grid and then navigates upward (Bottom-Up). As shown in Fig. 7, the time for selecting a target increases approximately linearly with the row position of the target (decreasing for bottom-up). In essence, the bottom-up strategy resembles the inverse of the top-down strategy plus the initial scroll-down gesture.
Several factors seem to have strong influences on strategy usage, including the
first letter of the target name. For most letters, users went for the top-down strategy,
but when the initial letter is positioned toward the end of the alphabet, the user tends
to use the bottom-up strategy more often. As a user becomes more experienced, the
use of the bottom-up strategy increases. A shorter grid length also seems to encourage
bottom-up strategy use.
Based on these observations, Pfeuffer and Li designed a novel probabilistic esti-
mation of navigation strategy to capture the switching between the top-down and
bottom-up navigation. The navigation time is a probabilistic combination of the top-down (N_tdn) and the bottom-up (N_bup) navigation costs:

T_{nav} = (1 - S_{btm}) N_{tdn} + S_{btm} N_{bup}    (6)

where S_btm is the probability that the user navigates from the bottom of the grid.
The time for both navigation costs can be modeled simply as a linear function of the target's row position pos_row. The difference is that they use different intercept terms, because the bottom-up navigation involves the effort of reaching the end of the grid, i.e., the initial swipe-down step. The bottom-up navigation counts the row position from the bottom upwards instead of from the top downwards: N_tdn = pos_row T_row + b_tdn and N_bup = (len_row - pos_row) T_row + b_bup, where T_row denotes the time required for each row, and b_tdn and b_bup are bias terms to be learned from the data. The time needed for inspecting each row, T_row, decays with increasing expertise, modeled by the exponential function T_row = a_r exp(-b_r t) + c_r, similar to Bailly et al.'s model [2], where a_r, b_r and c_r are parameters to be learned from the data.
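The navigation component can be summarized in a few lines of Python. The parameter values below are made-up placeholders (in the model they are fitted to data), and the mixture in the last line assumes the probabilistic combination form of Eq. 6.

```python
# Sketch of the navigation-time components described above. Parameter values
# are made-up placeholders (in the model they are fitted to data); the mixture
# in the last line follows the probabilistic combination of Eq. 6.
import numpy as np

def t_row(t, a_r=0.9, b_r=0.25, c_r=0.35):
    # Per-row inspection time decays with the number of previous encounters t.
    return a_r * np.exp(-b_r * t) + c_r

def nav_time(pos_row, len_row, t, s_btm, b_tdn=0.4, b_bup=0.9):
    n_tdn = pos_row * t_row(t) + b_tdn                      # top-down cost
    n_bup = (len_row - pos_row) * t_row(t) + b_bup          # bottom-up cost
    return (1.0 - s_btm) * n_tdn + s_btm * n_bup            # Eq. 6
```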
The strategy switching model has three cases determined by the grid length and the viewport size, i.e., the top rows, the bottom rows of the grid, and the rows in between:

S_{btm} = \begin{cases} 0 & \text{if } pos_{row} < view_{row} \\ 1 & \text{if } pos_{row} > len_{row} - view_{row} \\ S_{prob} & \text{otherwise} \end{cases}    (7)
len_row and l are normalized to ±0.5 by considering the range of these values in the training dataset. The sigmoid function offers a suitable numerical range of 0–1 for regulating the navigation cost. S_exp, the expertise of using a strategy, is computed as

S_{exp} = \mathrm{sigmoid}(e_0 + e_1 t)    (9)

which is a sigmoid function taking a linear transformation of the number of previous encounters t as input. The sigmoid maps an unbounded value to the range from 0 to 1. It is appropriate here in that expertise eventually saturates with practice, asymptotically approaching 1. s_i and e_i are the parameters to be learned from the data.
The entire strategy model is equivalent to a standard feedforward neural network
using sigmoid as the activation function, and these parameters can be learned using
general stochastic gradient descent.
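A rough sketch of the strategy-switching logic follows. Because the exact form of S_prob (Eq. 8) is not reproduced in this chapter, the sigmoid combining normalized grid length, first-letter position and expertise is an illustrative assumption only, as are all parameter values.

```python
# Illustrative sketch of the strategy switching. The exact form of S_prob (Eq. 8)
# is not reproduced in this chapter, so the sigmoid combining normalized grid
# length, first-letter position and expertise below is an assumption, as are
# all parameter values.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def s_exp(t, e0=-2.0, e1=0.3):
    return sigmoid(e0 + e1 * t)                 # expertise saturates toward 1 (Eq. 9)

def s_btm(pos_row, len_row, view_row, letter_norm, t, s=(-0.5, 1.5, 2.0, 1.0)):
    if pos_row < view_row:                      # target visible without scrolling
        return 0.0
    if pos_row > len_row - view_row:            # target reachable from the bottom
        return 1.0
    len_norm = len_row / 40.0 - 0.5             # crude +/-0.5 normalization (assumed)
    return sigmoid(s[0] + s[1] * len_norm + s[2] * letter_norm + s[3] * s_exp(t))
```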
In addition to the navigation strategies, Pfeuffer and Li brought deep learning
to other components of the model. For example, to model the pointing time, one
challenge is that the absolute position of a target on the screen is undetermined
due to the human scrolling behavior. To address this issue, they use a probabilistic
combination of Fitts’ law time across all the vertical positions. The target’s Y position
on the screen is estimated using a Gaussian distribution as inspired by the study
finding, while X is given by pos_col. They discretize the Y position as a fixed number
Fig. 8 Observed and predicted selection time for various grid performance factors
of rows in the viewport, view_row, and then compute the weighted average of the cost for each row, j, to estimate the pointing time:

T_{point} = \sum_{j=1}^{view_{row}} P_j \, T_{point_j}    (10)
The time for each row in the viewport is calculated by a regular Fitts' law model: T_{point_j} = a_f + b_f \log_2(1 + d([pos_{col}, j], view_{ctr}) / W), where d is the Euclidean distance between a given target position and the viewport center, view_ctr, and a_f and b_f are learned from the data. The probability of the target being on each row j is given by the probability density of a normal distribution that reflects how Y positions are distributed across the screen.
P_j = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left(-\frac{(j / view_{row} - \mu)^2}{2\sigma^2}\right)    (11)
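The pointing-time estimate of Eqs. 10 and 11 can be sketched as a Gaussian-weighted average of per-row Fitts' law costs. All parameter values are placeholders, and normalizing the Gaussian weights is an added assumption so that they form a proper weighted average.

```python
# Sketch of the pointing-time estimate (Eqs. 10 and 11): Fitts' law cost per
# on-screen row, weighted by a Gaussian over the target's post-scroll position.
import numpy as np

def pointing_time(pos_col, view_row, view_ctr, W=1.0,
                  a_f=0.2, b_f=0.15, mu=0.5, sigma=0.2):
    times, weights = [], []
    for j in range(1, view_row + 1):
        d = np.linalg.norm(np.array([pos_col, j]) - np.array(view_ctr))
        times.append(a_f + b_f * np.log2(1.0 + d / W))               # Fitts' law per row
        weights.append(np.exp(-((j / view_row - mu) ** 2) / (2 * sigma ** 2))
                       / (sigma * np.sqrt(2 * np.pi)))               # Eq. 11
    weights = np.array(weights)
    return float(np.dot(weights / weights.sum(), np.array(times)))   # Eq. 10
```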
4.3 Experiments
4.4 Analysis
In this analysis, instead of looking at model behaviors as previous case studies do,
we analyze how the predictive model can help reduce user effort in a mobile grid
interface. For example, in the App Launcher on Google Pixel devices, five suggestions
of the next apps to use are presented at the top of the screen based on the user’s current
context. If a target app is among these predictions, the user can immediately launch it
without searching the entire grid. A common way for deciding what items to suggest
at a given step, t, is based on the probability distribution over all possible items, P_t, which is determined by an event prediction engine that is outside the scope of this chapter. The event prediction engine may score each item (representing an
action) based on a range of external signals such as time of the day or user location.
Note that event prediction is different from the time performance prediction that our
model is designed for. For an example of an event prediction model, please see [25].
A probability-based method typically selects a given number of items (e.g., 5) that have the highest probabilities and places them at a convenient location, such as a prediction bar at the top of the screen, for easy access by the user. Let us formulate
the time cost for accessing item i at trial t as the following:
cost_t^i = \begin{cases} C & \text{if } i \in Top5(P_t) \\ G(i, t, g) & \text{otherwise} \end{cases}    (12)
where Top5 picks the 5 items that have the highest probabilities, and G(i, t, g) is the time cost model proposed by Pfeuffer and Li, which predicts the time performance for accessing item i given the trial t and a grid configuration g. C is a constant time for accessing an item in the prediction bar.
Instead of only considering the probability of an item being the next app to use, one can potentially further reduce user effort by also considering the time cost for accessing each item in the grid interface. The expected utility for suggesting an item is the product of its probability and its cost:

U_t = P_t \odot G(t, g)    (13)

G(t, g) computes the cost for each item in the grid based on the time performance model described above, which results in a vector, and \odot represents the pairwise product between the probability distribution and the cost vector. Thus, the utility-based optimization can be formulated as follows:
cost_t^i = \begin{cases} C & \text{if } i \in Top5(U_t) \\ G(i, t, g) & \text{otherwise} \end{cases}    (14)
This equation is similar to Eq. 12, except that Top5 selects items based on utilities instead of probabilities.
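The two suggestion policies differ only in the score used to pick the top five items, as the sketch below illustrates. The constant C and the input shapes are assumptions.

```python
# Sketch of the two suggestion policies: the score used to pick the top five
# items is either the raw probability P_t or the utility P_t * G(t, g).
import numpy as np

def top5(scores):
    return set(np.argsort(scores)[-5:])

def access_cost(i, probs, grid_cost, C=0.5, use_utility=True):
    # probs: P_t over all items; grid_cost: G(t, g), per-item time cost in the grid.
    scores = probs * grid_cost if use_utility else probs   # Eq. 13 vs. plain probability
    return C if i in top5(scores) else grid_cost[i]        # Eq. 12 / Eq. 14
```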
Pfeuffer and Li validated their hypothesis by evaluating the two methods, with or without the expected utility computed from the grid performance model, on task sequences generated from the uniform and the Zipf [48] task distributions, which
have been used in previous work for studying adaptive user interfaces [11]. In a Zipf
distribution, a small number of items are frequently used while the rest in the grid are
rarely used. At each trial, a probability distribution is drawn over all possible items
from a Dirichlet that is seeded with one of these target distributions. The target for
the trial is then drawn by sampling the probability distribution. This approach has
several benefits. First, the probability distribution is valid because the target is drawn
from it. Second, the resulting task sequence is mostly consistent with the desired
target distribution. Last, the Dirichlet can be parameterized to simulate an item predictor with different accuracies, i.e., how often the item with the largest probability is indeed the target.
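The task-sequence generation can be sketched with NumPy's Dirichlet sampler. The Zipf exponent and the concentration value are placeholders; in the study, the concentration was varied to simulate predictors of different accuracy.

```python
# Sketch of the task-sequence generator: seed a Dirichlet with a target (e.g. Zipf)
# distribution, then draw a per-trial item distribution and the actual target.
import numpy as np

def zipf_distribution(n_items, s=1.0):
    ranks = np.arange(1, n_items + 1)
    p = ranks ** (-s)
    return p / p.sum()

def generate_trials(n_trials, n_items, alpha=50.0, seed=0):
    rng = np.random.default_rng(seed)
    base = zipf_distribution(n_items)
    trials = []
    for _ in range(n_trials):
        p_t = rng.dirichlet(alpha * base)          # per-trial probability over items
        target = rng.choice(n_items, p=p_t)        # target drawn from that distribution
        trials.append((p_t, target))
    return trials
```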
Based on 1600 interaction sequences generated from the two task distributions,
with 4 grid lengths, a range of event prediction accuracy (from 10% to 70%), and
the two optimization methods, they compute the total time needed for completing a
sequence using a given optimization method, grid length, target distribution as well
as item prediction accuracy. Figure 9a–c shows the results of the simulation for both
conditions. They found the utility-based optimization outperformed the probability-
based method in reducing task completion time (F71 = 35, p < 0.0006). Across all
Fig. 9 Simulation of utility versus probability-based optimization for factor Grid, Distribution and
Prediction Accuracy. Across these factors, the inclusion of utility reduces task completion times
conditions, the utility method is faster than the probability-based method (Fig. 9).
Particularly, when the grid has more items, the advantage of the utility-based method is more pronounced (a), since targets that are further down the grid receive higher utility for being placed on the prediction bar. By discretizing item prediction accuracy into three equal bins, Low, Mid and High accuracy (Fig. 9c), we can see that when the accuracy of item prediction is low, the advantage of the utility-based method over the probability-based one is also more pronounced.
5 Discussion
Despite the great capacity of deep learning models, several challenges remain in using
such a data-driven approach in HCI. One of the common criticisms for deep learning
models is the “black-box” property of the models, i.e., these models are often hardly
interpretable. In the case studies we reviewed in this chapter, we see that specific
techniques are used such as examining the learned attention map (Sect. 2), or the
Jacobian of the network (the partial derivatives of the network output with respect
to a specific set of inputs) (Sect. 3) to gain better interpretability. However, these explorations are preliminary, and future research is required to investigate this issue more deeply.
Another challenge is related to the bias in data collection. Because deep learning
approaches are primarily data-driven, they are very sensitive to the statistics in the
training data. In other words, the models learn what they are trained on. That implies
that we need to be cautious about the potential bias in our data collection. For instance,
in Yuan and Li’s work [45], the data they used to train the model was collected from
tasks in which the “Begin” button is always positioned at the top-left corner of the
screen. Consequently, the starting position of the search is the same across trials,
which is different from real-life scenarios. Although the model is still useful as it can
be used to examine the relative performance of alternative designs via A/B testing
when the starting position is controlled, such bias needs to be corrected by collecting
more training trials with different starting positions in future research.
A third challenge is to model human cognitive processes as accurately as possible
using deep learning models. Take the visual search modeling on web pages [45]
as an example. Despite being biologically inspired, Yuan and Li’s model is not
intended to replicate the real human visual search process. For one thing, it does not
involve any sequential attention shifts, which is different from human behaviors. In
the future, we could utilize deep learning models such as recurrent neural networks
to capture the sequential aspects of visual search, which would allow us to compare
model predictions and human data on a more detailed level. Similarly, in the menu
modeling work, there are opportunities to extend the model for multilevel analysis
of interaction by considering behaviors of finer granularity such as gazes and manual
input paths.
Despite being a challenging issue, modeling human cognitive processes in HCI
tasks using machine learning models is still a great opportunity. For instance, in the
menu selection modeling work [26], Li et al. analyzed how human learning effects are
captured and mimicked by a deep model, and how the learning effect differs across
different menu organizations. The Jacobian responds differently for different menu
organizations, which reflects the memory effect as manifested in the data. These
findings can advance our understanding of how humans behave and thus may inspire
others to design new analytical models to capture these effects. Rather than using
deep models only for predictions, the deep learning approach can be seen as a vehicle
to discover new patterns about human behaviors to advance analytical research.
Cognitive process modeling related to menu selection can be further expanded by considering manual navigation in addition to visual search. Our third use case showed that users
navigate with two strategies—starting from the top or the bottom of a scrollable grid.
The analytical understanding is incorporated through the probabilistic estimation
of the navigation strategies. This finding offers a first explanation of the last-item effect [2] and yields a modeling approach for the phenomenon. However, in general, we find
fixed UI locations such as the top or the bottom of a list can act as spatial cues to
the user and decrease the need for visual search over time. In contrast, dynamic UI
locations such as objects in a scrollable list lead to recurring visual search processes.
Thus, considering user strategies with regard to static and dynamic UI elements
can become an integral part of modeling the user performance.
Although deep learning models are not as interpretable as traditional heuristic-
based models, we could still bring a lot of prior knowledge into the model by defining
the appropriate model architecture. We have seen modeling work that uses hierar-
chically organized recurrent neural nets to capture the performance of a sequence
of UI tasks [26], and convolutional neural networks for processing images for mod-
eling search time [45]. The choice of those model types reflects researchers’ prior
knowledge of the nature of those UI tasks. Future work should investigate how the
choice of different network architectures influences model performance.
There are several other directions for future work. In all the case studies we
reviewed, although the deep models outperform the baseline models, there is still a
lot of room for improvement in terms of model accuracy. In addition, in future work,
6 Conclusion
In this chapter, we discuss how data-driven, deep learning approaches can revolu-
tionize human performance modeling in interaction tasks. In particular, we review
three projects from our group, including visual search modeling on arbitrary web
pages [45], repetitive menu selection on desktop and mobile devices [26], and target
selection on a scrollable mobile grid [33]. Each of these projects addresses a unique
interaction task and showcases different modeling strategies and techniques. These
methods, based on deep learning models, significantly outperform traditional mod-
eling techniques. Importantly, these methods are highly extensible for incorporating
new factors, reducing the effort of extensive feature engineering and allowing
end-to-end modeling. These methods also offer new avenues for modeling complex
human behavior in interaction tasks and are promising for making new analytical
findings that advance the science.
References
8. Card SK, Moran TP, Newell A (1980) The keystroke-level model for user performance time
with interactive systems. Commun ACM 23(7):396–410
9. Chen K, Wang J, Chen L-C, Gao H, Xu W, Nevatia R (2015) ABC-CNN: an attention based
convolutional neural network for visual question answering. arXiv:1511.05960
10. Chen X, Bailly G, Brumby DP, Oulasvirta A, Howes A (2015) The emergence of interac-
tive behaviour: a model of rational menu search. In: CHI’15 Proceedings of the 33rd annual
ACM conference on human factors in computing systems, vol 33. Association for Computing
Machinery (ACM), pp 4217–4226
11. Cockburn A, Gutwin C, Greenberg S (2007) A predictive model of menu performance. In:
Proceedings of the SIGCHI conference on human factors in computing systems (CHI ’07).
ACM, New York, NY, USA, pp 627–636. http://dx.doi.org/10.1145/1240624.1240723
12. Cockburn A, Gutwin C, Greenberg S (2007) A predictive model of menu performance. In:
Proceedings of the SIGCHI conference on human factors in computing systems. ACM, pp
627–636
13. Corbetta M, Shulman GL (2002) Control of goal-directed and stimulus-driven attention in the
brain. Nat Rev Neurosci 3(3):201
14. Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional
transformers for language understanding. arXiv:1810.04805
15. Fitts PM (1954) The information capacity of the human motor system in controlling the ampli-
tude of movement. J Exper Psychol 47(6):381
16. Fu W-T, Pirolli P (2007) SNIF-ACT: A cognitive model of user navigation on the World Wide
Web. Human-Comput. Int. 22(4):355–412
17. Graves A (2012) Supervised sequence labelling with recurrent neural networks. Springer,
Studies in computational intelligence
18. Hick WE (1952) On the rate of gain of information. Q J Exp Psychol 4(1):11–26
19. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–
1780
20. Johnson M, Schuster M, Le QV, Krikun M, Wu Y, Chen Z, Thorat N, Viégas F, Wattenberg M, Corrado G et al (2017) Google's multilingual neural machine translation system: enabling zero-shot translation. Trans Assoc Comput Linguist 5:339–351
21. Jokinen JPP, Wang Z, Sarcar S, Oulasvirta A, Ren X (2020) Adaptive feature guidance: modelling visual search with graphical layouts. Int J Human-Comput Stud 136:102376
22. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional
neural networks. In: Advances in neural information processing systems, pp 1097–1105
23. Lane DM, Napier HA, Batsell RR, Naman JL (1993) Predicting the skilled use of hierarchical
menus with the keystroke-level model. Hum-Comput Interact 8(2):185–192. http://dx.doi.org/
10.1207/s15327051hci0802_4
24. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436
25. Li Y (2014) Reflection: enabling event prediction as an on-device service for mobile interaction.
In: Proceedings of the 27th annual ACM symposium on user interface software and technology
(UIST ’14). Association for Computing Machinery, New York, NY, USA, pp 689–698. http://
dx.doi.org/10.1145/2642918.2647355
26. Li Y, Bengio S, Bailly G (2018) Predicting human performance in vertical menu selection using
deep learning. In: Proceedings of the 2018 CHI conference on human factors in computing
systems, pp 1–7
27. Liu T, Larsson J, Carrasco M (2007) Feature-based attention modulates orientation-selective
responses in human visual cortex. Neuron 55(2):313–323
28. MacKenzie IS, Buxton W (1992) Extending Fitts’ law to two-dimensional tasks. In: Proceed-
ings of the SIGCHI conference on human factors in computing systems, pp 219–226
29. Martinez-Trujillo JC, Treue S (2004) Feature-based attention increases the selectivity of pop-
ulation responses in primate visual cortex. Curr Biol 14(9):744–751
30. McElree B, Carrasco M (1999) The temporal dynamics of visual search: evidence for parallel
processing in feature and conjunction searches. J Exp Psychol: Human Percept Perf 25(6):1517
31. Nair V, Hinton GE (2010) Rectified linear units improve restricted Boltzmann machines. In:
ICML: Proceedings of the 27th international conference on machine learning
32. Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation.
In: Proceedings of the 2014 conference on empirical methods in natural language processing
(EMNLP), pp 1532–1543
33. Pfeuffer K, Li Y (2018) Analysis and modeling of grid performance on touchscreen mobile
devices. In: Proceedings of the 2018 CHI conference on human factors in computing systems,
pp 1–12
34. Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with
region proposal networks. In: Advances in neural information processing systems, pp 91–99
35. Shen J, Reingold EM, Pomplun M (2003) Guidance of eye movements during conjunctive
visual search: the distractor-ratio effect. Can J Exp Psychol 57(2):76
36. Shih KJ, Singh S, Hoiem D (2016) Where to look: focus regions for visual question answering.
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4613–
4621
37. Tehranchi F, Ritter FE (2018) Modeling visual search in interactive graphic interfaces: adding
visual pattern matching algorithms to ACT-R. In: Proceedings of 16th international conference
on cognitive modeling. University of Wisconsin Madison, WI, pp 162–167
38. Teo L-H, John B, Blackmon M (2012) CogTool-Explorer: a model of goal-directed user explo-
ration that considers information layout. In: Proceedings of the SIGCHI conference on human
factors in computing systems. ACM, pp 2479–2488
39. Todi K, Jokinen J, Luyten K, Oulasvirta A (2019) Individualising graphical layouts with pre-
dictive visual search models. ACM Trans Int Intell Syst (TiiS) 10(1):1–24
40. Treisman AM, Gelade G (1980) A feature-integration theory of attention. Cogn Psychol
12(1):97–136
41. van der Meulen H, Varsanyi P, Westendorf L, Kun AL, Shaer O (2016) Towards understanding
collaboration around interactive surfaces: exploring joint visual attention. In: Proceedings of
the 29th annual symposium on user interface software and technology. ACM, pp 219–220
42. Walter R, Bulling A, Lindlbauer D, Schuessler M, Müller J (2015) Analyzing visual attention
during whole body interaction with public displays. In: Proceedings of the 2015 ACM interna-
tional joint conference on pervasive and ubiquitous computing. ACM, New York, NY, USA,
pp 1263–1267
43. Wu X, Gedeon T, Wang L (2018) The analysis method of visual information searching in
the human-computer interactive process of intelligent control system. In: Congress of the
international ergonomics association. Springer, pp 73–84
44. Xu H, Saenko K (2016) Ask, attend and answer: exploring question-guided spatial attention for
visual question answering. In: European conference on computer vision. Springer, pp 451–466
45. Yuan A, Li Y (2020) Modeling human visual search performance on realistic webpages using
analytical and deep learning methods. In: Proceedings of the 2020 CHI conference on human
factors in computing systems, pp 1–12
46. Zhaoping L, Frith U (2011) A clash of bottom-up and top-down processes in visual search: the
reversed letter effect revisited. J Exp Psychol: Human Perc Perf 37(4):997
47. Zheng Q, Jiao J, Cao Y, Lau RWH (2018) Task-driven webpage saliency. In: Proceedings of
the European conference on computer vision (ECCV), pp 287–302
48. Zipf GK (1949) Human behavior and the principle of least effort: an introduction to human
ecology. Addison-Wesley Press, Boston. http://psycnet.apa.org/record/2005-10806-009
Optimal Control to Support High-Level
User Goals in Human-Computer
Interaction
1 Introduction
their goals. For instance, robots enable users to accomplish tasks that were previ-
ously out of reach or difficult to conduct. However, they are hard to control and can be dangerous if used incorrectly. Mixed reality (MR) systems promise to augment the real world with useful information, but it is conceivable that this can cause users to suffer information
overload if done naively [50]. Mobile devices provide access to information always
and everywhere. However, they are also constant sources of interruption, which can
have severe consequences for users such as inattention [54].
To address this dichotomy, user interfaces (UIs) of emerging technologies need to
reduce the complexity of interaction while preserving machine-provided capabilities.
Human-computer interaction (HCI) research has demonstrated that adaptive user
interfaces (AUIs) are capable of achieving this goal. Various works have shown that
AUIs are able to improve the usability of PC applications. Prominent examples even
achieved adoption in consumer software, e.g., the Microsoft Office Assistant [42]
and Autodesk’s Community Commands [64].
For an adaptive interface to be successful, the benefits of correct adaptations must
outweigh the costs, or usability side effects, of incorrect adaptations [24]. For existing
AUIs, the risk of incorrect adaptations is low since the non-adaptive interaction is
supplemented with recommendations, which users have to deliberately activate [46].
In contrast, we argue that UIs of emerging technologies need to continuously adapt
system input and output for users to be able to fully leverage them. The problem is
that these devices confront users with a large number of controllable elements that
change their state frequently. This is hard to cognitively process and, hence, users
struggle when using such devices without interaction support (see Sect. 2).
The necessity of continuous adaptations causes two unresolved challenges to
arise. First, inferring users’ intent is inherently hard. The task is to derive the intent
of users by observing their actions. However, observable actions only provide a
fragmentary and incomplete image of intent. In addition, users do not necessarily
perform actions that serve their goal. Thus, the same sequence of recorded actions
can correlate with a multitude of intentions, rendering the inference of user intent an
ill-posed inductive problem even when ground truth data is available. Secondly, user
intent is constantly changing. Conceptually, HCI can be seen as a loop [11] where
users specify commands using an input device and observe the effect of their actions
on a display. Based on the intermediate result, they refine their intent and iteratively
repeat this process until they are satisfied with the created outcome. This highlights
that user intent is not stationary but shaped by the closed-loop dynamics of HCI.
To address these challenges, we propose optimal control as a computational
framework to adapt system input and output in settings where permanent adapta-
tion is necessary. In its classical sense, optimal control steers a dynamical system
toward a goal state by relying on an underlying physical model [19]. It optimizes
the input to the system for the next time step while taking future time steps into
account. We employ optimal control to assist users in achieving their goal by relying
on an implicit model of their intent to adapt controls and displays. It considers the
closed-loop iterative nature of HCI by optimizing the UI’s underlying mechanisms
for a user’s next interaction step while taking their long-term goals into account.
Furthermore, it is robust against modeling errors of users’ high-level goals since
only the first step of the prediction horizon is implemented and the horizon is re-
optimized with each user input. Both inherent properties render optimal control an
ideal framework for overcoming the challenges raised above and, hence, a promising way to improve the usability of AUIs beyond standard user modeling techniques.
To illustrate the usage of optimal control in interactive systems, we examine
two examples of its application. First, we propose an optimization-based optimal
control approach to enable quadrotor-based aerial videography for end-users. Second,
we learn cooperative personalized policies that support users in visual search tasks
by identifying objects of interest from their gaze-object interactions and adapting a mixed-reality UI accordingly. Before diving into the case studies, we define the
problem setting in more detail, present related work and introduce the mathematical
background of optimal control.
2 Problem Setting
3 Related Work
Our work touches on the research areas of human-computer and human-robot interaction, control theory, and artificial intelligence. Due to the broadness of the
topic, we focus the presentation of related work on approaches that use MPC and
RL as user models or to adapt UIs.
Model predictive control was first introduced in the late 1970s to control chemical reactions in industrial plants [21, 82]. With the increase in available computational power, MPC methods have since been demonstrated to solve complex non-linear problems that go beyond simple reference tracking. For
instance, MPC methods solved over-constrained control problems for autonomous
driving [61] or robotic flight [73, 74]. This capability renders MPC an interest-
ing approach for adapting UIs according to users’ high-level goals. Such problems
resemble complex robotic tasks in that they are typically multi-objective, spanning
complex non-linear solution spaces for which no straightforward results exist. In
this section, we review related work in which MPC approaches are used to adapt
interfaces to users’ individual abilities or tasks to simplify the usage of the systems
these UIs control.
A series of works employs MPC to simplify the control of robotic systems. Such
works usually abstract the complex dynamics of a robot and allow users to control it by
using more intuitive higher level commands. In this spirit, Chipalkatty et al. propose
an MPC formulation that optimized inputs to a robotic system such that user intent is
preserved while enforcing state constraints of the low-level robotic task [16, 17]. In
another work, model predictive control was used to optimize the rendered forces of a
robotic system in physical human-robot interactions by predicting the performance
of the user [86]. Similarly, [48] used model predictive control with mixed-integer
constraints to generate human-aware control policies. These policies optimized the
assistive behavior of a robot by maximizing its productivity while minimizing human
workload. In another human-robot collaboration task, MPC was used to minimize
the variations between human and robot speeds and, hence, maximize trust between
man and machine [83].
Another application area in which MPC was used to adapt to human behavior is
therapeutic systems. [92] estimated a patient's joint torque from measured EMG signals and then used an MPC method to derive the deficient joint torque needed to generate the target movements, taking the patient's estimated joint torque into account. The optimized joint torque of the MPC is applied by a therapeutic robotic arm. Similarly,
[86] proposed an approach that leveraged MPC to optimize the rendered forces
of a therapeutic system according to the predicted performance of the user. User
performance is learned continuously to account for performance changes over time.
MPC was also employed for the adaptation of UIs outside of the context of human-
robot interaction. It demonstrated its capabilities as a user modeling technique for
applications in which the optimal current adaptation of a UI depends on the future
behavior of the user, which, in turn, can be modeled as a dynamical system. An
example of such a work is presented by [75] where a model predictive controller
The idea of using reinforcement learning to solve optimal control problems was first
introduced in the late 1980s and early 1990s [9, 95]. With the increase in computa-
tional power, RL's problem-solving capabilities grew, and it has been demonstrated to learn complex system-environment dynamics from experience and to use these models to solve hard stochastic sequential decision problems. For instance, it was used to solve complex robotic control tasks (e.g., [45]) or to achieve superhuman performance in difficult games (e.g., [88]). In this section, we present works
that use RL to adapt systems according to user behavior or learn from human data.
While RL learns a policy given a reward function and an environment, inverse
reinforcement learning (IRL) attempts to learn the reward function from a behavioral
policy of a human expert [76] or to directly learn a policy from the behavior of a
human expert [2]. This idea was successfully applied in the robotics domain [1, 20],
not only to model human routine behavior [7] but also for character control and
gaming [57]. Reference [53] extends the idea of IRL by learning sub-task structures from human demonstrations and additionally identifying reward functions per sub-task.
Another work accounted for the fact that humans do not always act optimally and
proposed cooperative inverse reinforcement learning to allow for active teaching and
learning for a more effective value alignment between the human and agent [38].
IRL was also extended to deep reinforcement learning (DRL) settings where human
feedback on agent behavior was used to learn deep reward functions [18].
A stream of research applied RL in combination with human motion capture
data to improve policies for character animation and control [58, 63, 65, 94]. These
works usually use an RL-agent to learn how to stitch captured motion fragments
such that natural character motion as a sequence of clips is attained. A more recent
work in character animation rewards the learned controller for producing motions
that resemble human reference data [79] or directly learns full-body RL-controllers
from monocular video [80]. In a similar fashion, Aytar et al. [5] use YouTube videos
of humans playing video games to specify guidance for learning in cases where the
normal reward function only provides sparse rewards. These works employ human
4 Background
In this section, we review problem formulations and methods of optimal control for
discrete-time dynamic systems. We begin by introducing the optimal control problem
(OCP) for cases where the model of the dynamic system is known. We then review
model predictive control to solve these problems. After that, we establish Markov
decision processes (MDPs) that are used to formulate OCPs for unknown system
models. Finally, we describe how reinforcement learning is used to solve MDPs.
We introduce the optimal control problem (OCP) formulation for the case of a discrete system model whose progression is optimized according to a cost function. Thus, consider the following discrete-time linear time-invariant system

x(t + 1) = A x(t) + B u(t)    (1)

where x ∈ R^n is the state vector, u ∈ R^m is the input vector, A ∈ R^{n×n} is the system matrix, B ∈ R^{n×m} is the input matrix and t ∈ N_0 is the discrete time. In the case of a non-linear system, the right-hand side of Eq. 1 is replaced by a function that describes the evolution of the dynamic system: x(t + 1) = F(x(t), u(t)).
Let x(t) be the state vector measured at discrete time t and x_{t+k} be the state vector predicted at discrete time t + k using Eq. 1 with initial condition x_t = x(t). Further, consider the constraints x(t) ∈ X ⊆ R^n, u(t) ∈ U ⊆ R^m, where X and U are polyhedral sets containing the origin in their interiors. The general form of the cost function is defined as

V_N(x(t)) = r_T(x_{t+N}, u_{t+N}) + \sum_{k=0}^{N-1} \gamma^k r(x_{t+k}, u_{t+k})    (2)

where r and r_T denote the stage and terminal cost, respectively, \gamma is the discount factor, Q ∈ R^{n×n} is the state weighting matrix, R ∈ R^{m×m} is the input weighting matrix and P ∈ R^{n×n} is the terminal weighting matrix.
The optimal control problem (finite horizon) is then defined as

V_N^*(x(t)) = \min_{U(t)} \; r_T(x_{t+N}, u_{t+N}) + \sum_{k=0}^{N-1} \gamma^k r(x_{t+k}, u_{t+k})    (3)

subject to  x_{t+k+1} = A x_{t+k} + B u_{t+k},  k = 0, ..., N - 1
            x_{t+k} ∈ X,  k = 1, ..., N
            u_{t+k} ∈ U,  k = 0, ..., N - 1
            x_t = x(t)
Model predictive control solves the OCP (see Eq. 3) for the measured state vector
x(t) to attain the predicted optimal input sequence U∗ (t). It then applies its first
element (u∗ (t)) to the system. This is repeated at each discrete time t with a receding
prediction horizon. Thus, MPC is also denoted as receding horizon control (RHC).
Note that MPC uses a discount factor of γ = 1.
The OCP can be reformulated as a quadratic programming (QP) problem (see
[36] for the derivation):
\min_X \; \frac{1}{2} X^T H X + f^T X    (4)

subject to  A_{ineq} X ≤ b_{ineq}
            A_{eq} X = b_{eq},

where X denotes the stacked state vectors x_t and inputs u_t for each time-point, H and f contain the quadratic and linear cost coefficients, respectively, which are defined by (2), A_{ineq}, b_{ineq} comprise the linear inequality constraints on states and inputs, and A_{eq}, b_{eq} are the linear equality constraints from our model (1) for each time-point k ∈ {0, ..., N}.
standard quadratic programming methods. For solving non-linear dynamic systems
and cost functions, non-linear programs (NLPs) can be used. In this work, we rely
on numerical solvers to solve MPC problems. For completeness, we mention that
OCPs can also be solved explicitly using multi-parametric quadratic programming
methods. This holds for linear [8, 37] and non-linear systems [47].
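For illustration, one receding-horizon step of the linear OCP in Eq. 3, posed as the QP of Eq. 4, can be written with the cvxpy modeling library as follows. The system matrices, weights, horizon and input bound are toy placeholders, not values from the case studies.

```python
# Toy sketch of one receding-horizon step for the linear OCP (Eq. 3) posed as
# the QP of Eq. 4, using the cvxpy modeling library.
import numpy as np
import cvxpy as cp

def mpc_step(x0, N=10):
    A = np.array([[1.0, 0.1], [0.0, 1.0]])
    B = np.array([[0.0], [0.1]])
    Q, R, P = np.eye(2), 0.1 * np.eye(1), 5.0 * np.eye(2)
    x = cp.Variable((2, N + 1))
    u = cp.Variable((1, N))
    cost, cons = 0, [x[:, 0] == x0]
    for k in range(N):
        cost += cp.quad_form(x[:, k], Q) + cp.quad_form(u[:, k], R)
        cons += [x[:, k + 1] == A @ x[:, k] + B @ u[:, k],   # dynamics (Eq. 1)
                 cp.abs(u[:, k]) <= 1.0]                     # input constraint set U
    cost += cp.quad_form(x[:, N], P)                         # terminal cost
    cp.Problem(cp.Minimize(cost), cons).solve()
    return u.value[:, 0]                                     # apply only the first input
```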
where t is the number of time units after the agent chooses action a in state s and
F(t|s, a) is the probability that the next decision epoch occurs within t time units.
We investigate the idea of using optimal control to facilitate system use in an appli-
cation which supports users in the creation of aerial videos using quadrotors. This is
a cognitively demanding use case as users need to control a quadrotor and a camera
simultaneously while considering the aesthetic aspects of filmmaking. During flight,
they need to control the five dimensions of the quadrotor camera (x, y, z, pitch and
yaw) while reacting to constantly changing state values of the moving robot.
We focus on supporting users in the creation of scenic aerial shots of cityscapes and landscapes where the environment is static, i.e., camera targets are not moving during the shot. This allows us to move the robotic control task into an offline setting where designed quadrotor flights can be simulated and the resulting trajectories can later be tracked with a real robot. Figure 2 illustrates our setting where the controlled
variable is the simulated video (resp. trajectory) as produced by our MPC-based
design tool. We simplify the hard control task by allowing users to specify sparse
keyframes that sketch their intended video in a realistic virtual environment (e.g.,
Google Earth). Based on these keyframes, our MPC method generates feedforward
reference trajectories for a quadrotor camera by using an explicit model of its dynam-
ics. It interpolates between user-specified keyframes with a function that defines the
quadrotor’s desired behavior according to aesthetic criteria of aerial film. In our set-
ting, the control horizon equals the prediction horizon as all of its stages are used to
adjust the controlled variable, i.e., the trajectory that produces the aerial video. The
control horizon describes the stages that are implemented as input to the controlled
entity. Due to the user closing the loop, the horizon is not implemented at a fixed
rate, like in systems control, but every time they decide to preview their designed
quadrotor flight plan in simulation. Based on the perceived match between the sim-
ulated and the intended video, users can decide to refine or specify additional inputs
to better communicate their intent.
Following the problem setting in Sect. 2, we leverage MPC to abstract the high
dimensionality of quadrotor camera reference trajectories (five dimensions × dis-
cretization step × temporal length of the video) by using the optimization to generate
them from sparse user-specified keyframes.
In this section, we present an optimal control approach to adapt a user-robot
interface that supports users in completing a robotic task. First, we identify the high-
level user goals of the problem domain. Then, we introduce an MPC formulation
that supports users in achieving these goals. Finally, we investigate the effectiveness
of our approach in terms of improving users’ efficacy and efficiency.
With a contextual analysis, we intend to find the main aesthetic criteria of aerial
videos. The goal is to model these criteria in our MPC formulation such that end-
users are capable of creating videos that adhere to them. Therefore, we conduct expert interviews, to elicit aesthetic criteria of aerial footage, and a user study, to examine whether existing tools [30, 49] enable the creation of videos that follow them. In the following, we present the two elicited main criteria and discuss whether participants were able to design videos that adhere to them using the investigated tools.1
Smooth Camera Motion: The key to aesthetically pleasing aerial video is described
by one videographer as “[...] the camera is always in motion and movements are
smooth”. Another expert stated that smoothness is considered the criterion for shots
with a moving camera (c.f. [4, 39]), whereas the dynamics of camera motion should
stay adjustable (c.f. [49]).
The user study showed that participants did struggle to create videos with smooth
camera motion using existing tools. The problem is that keyframe timings are kept
fixed in current optimization schemes and smooth motion can only be generated
subject to these hard constraints. To this end, users are required to specify a similar
ratio of distance in time to distance in space between all keyframes. This is a task most participants failed to accomplish, as keyframes are specified in 5D (3D position plus camera pitch and yaw) and imagining the resulting translational and rotational velocities is cognitively demanding.
Continuous and Precise Camera Target Framing: The ability to control and fine-
tune the framing of a filmed subject continuously and with high precision is an
essential aesthetic tool. The interviewees highlighted the importance of being able to
precisely position an object in the image plane subject to a compositional intention
(e.g., a simultaneously moving foreground and background). For this reason, aerial
video shots are usually taken by two operators, one piloting the quadrotor and one
controlling the camera, allowing the camera operator to focus on and constantly
fine-tune the subject framing.
The results of the user study indicated that current tools do not support precise and
continuous target framing, as the videos of nearly all participants featured camera
targets that moved freely in the image plane. The problem is that current algorithms
do not model camera targets or use overly simplified models. Our basic optimization
simply interpolates camera angles between the user-specified keyframes. In [49, 85],
camera targets are modeled as the intersection of a keyframe’s center ray with the
1 We refer the reader to [30] for details on the results and experimental design of both studies.
5.2 Method
Based on the results of the contextual analysis, we specify requirements for our MPC
formulation. It should (1) automatically generate globally smooth camera motion
over all specified spatial positions, while (2) giving users precise timing control
(established in [49]). In addition, it should (3) precisely and continuously control the image space position of a camera target in between keyframes and (4) help users to frame camera targets at desirable image space positions. In this section, we present
how we model these requirements in our optimization scheme.
We use the approximated quadrotor camera model of [29]. This discrete first-order dynamical system is used as an equality constraint in our optimization problem:

x_{i+1} = A x_i + B u_i + g

where x_i ∈ R^{24} are the quadrotor camera states and u_i ∈ R^6 are the inputs to the system at horizon stage i. Furthermore, r ∈ R^3 is the position of the quadrotor, ψ_q is the quadrotor's yaw angle, and ψ_g and φ_g are the yaw and pitch angles of the camera gimbal. The matrix A ∈ R^{24×24} propagates the state x forward, the matrix B ∈ R^{24×6} defines the effect of the input u on the state, and the vector g ∈ R^{24} that of gravity for one time-step. F is the force acting on the quadrotor, M_{ψq} is the torque along its z-axis, and M_{ψg}, M_{φg} are the torques acting on the pitch and yaw of the gimbal.
\Theta_{i+1} = C \Theta_i + D v_i,   0 ≤ v_i ≤ v_{max},    (7)

where \Theta_i = [θ_i, θ̇_i] is the state and v_i is the input of θ at step i, and C ∈ R^{2×2}, D ∈ R^{2×1} are the discrete system matrices. v_i approximates the quadrotor's acceleration, as θ is an approximation of the trajectory length.
With this extension of the dynamic model in place, we now formulate an objective to minimize the error between the desired quadrotor position r_d(θ) and the current quadrotor position r. Thus, we minimize the 3D-space approximations of the lag error \hat{\epsilon}_l and the contour error \hat{\epsilon}_c of [74]. Splitting positional reference tracking into a lag error (lag in time) and a contour error (deviation from the path) is important for time-free reference tracking [55], as this avoids behavior where the robot either lags behind the temporal reference or cannot trade off positional fit for smoother motion. The positional error term is then defined as

c_p(\theta, r_i) = \begin{bmatrix} \hat{\epsilon}_l(\theta) \\ \hat{\epsilon}_c(\theta) \end{bmatrix}^T Q \begin{bmatrix} \hat{\epsilon}_l(\theta) \\ \hat{\epsilon}_c(\theta) \end{bmatrix},    (8)
where Q is a diagonal positive definite weight matrix. Minimizing c_p will move the quadrotor along the user-defined spatial reference.
For the camera to smoothly follow the path, we need to ensure that θ progresses. By specifying an initial θ_0 and demanding that θ reach the end of the trajectory in the terminal state θ_N, the progress of θ can be enforced with an implicit cost term. We simply penalize the trajectory end time by minimizing the state-space variable T,

c_{end}(T) = T.    (9)
To give users precise timing control, we augment our objective function with an additional term for soft-constrained keyframe timings. Due to the variable horizon, we lack a fixed mapping between time and stage. To be able to map timings onto the spatial reference, we use the θ-parameterization of the reference spline. Reference timings hence need to be specified strictly increasing in θ. Based on the reference timings and the corresponding θ-values, we interpolate a spline through these points, which results in a timing reference function t_d(θ) that can be followed analogously to spatial references by minimizing the cost, where i is the current stage of the horizon and t is the discretization of the model.
The above extension enables the mimicry of timing control found in prior methods. However, the actual purpose of specifying camera timings in a video is to control or change camera velocity to achieve a desired effect. Since determining the timing of a shot explicitly is difficult, we propose a way for users to directly specify camera velocities. We extend the formulation of our method to accept reference velocities as input. Again, we use the θ-parameterization to assign velocities to the reference spline f_d. To minimize the difference between the velocity of the quadrotor and the user-specified velocity profile v_d(θ), we specify the cost,
2 This also prevents solutions of infinitely long trajectories in time where adding steps with u_i ≈ 0 is free w.r.t. Eq. (10).
where we project the current velocity of the quadrotor ṙ_i onto the normalized tangent vector n of the positional reference function.
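Since the cost expression itself is not reproduced above, the following sketch assumes a squared penalty between the projected quadrotor velocity and the reference velocity profile; the quadratic form is an assumption.

```python
# Hedged sketch of a velocity-tracking cost: project the quadrotor velocity onto
# the normalized path tangent and penalize the deviation from the reference
# profile v_d(theta). The quadratic penalty form is an assumption; the chapter's
# exact cost expression is not reproduced above.
import numpy as np

def velocity_cost(r_dot_i, tangent, v_d_theta):
    n = tangent / np.linalg.norm(tangent)        # normalized tangent of f_d at theta
    v_along_path = float(np.dot(r_dot_i, n))     # projected quadrotor velocity
    return (v_along_path - v_d_theta) ** 2
```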
For our target model, we assume that in scenic aerial video shots the quadrotor
camera focuses on one target per frame and targets change between frames. We
believe this is a valid assumption as human perceptual constraints prevent us from
focusing on more than one point at a time, which is also reflected in popular aerial
video techniques [15]. For each user-specified keyframe k j , our tool also provides
the assigned camera target t j (see [31] for details on the tool). To incorporate these
targets into our optimization, we fit a bounding cylinder with radius rt , height h t ,
and a center at position pt ∈ R3 . We use this primitive as its geometric properties
allow us to consider the 3D expansion of a target in our optimization scheme without
needing to sample vertices during optimization. Using the reference parameterization
of Eq. 7, we then compute a target reference spline that interpolates position, radius
and height of the consecutive targets t j of the video. The spline is defined as
$$f_t(\theta) = [\mathbf{p}_t(\theta), r_t(\theta), h_t(\theta)] \in \mathbb{R}^5. \tag{13}$$
It relates the sequence of camera targets to the θ -parameterized spline that specifies
the positional reference of the quadrotor camera.
With our method, we optimize the degrees of freedom of the quadrotor camera to
position 3D targets at desirable locations in the image frame. Therefore, we need
a reference function that specifies desirable image space locations for our target
reference spline ft (θ ) when observed from the according quadrotor position rd (θ ).
We start by specifying a set of vectors L in the camera clip space C that define
desirable image space positions according to videographic compositional rules. In
our implementation, L contains the center of the image plane (l_c = [0, 0, 0, 1]^T) and the intersection points of the Rules of Thirds (e.g., l = [0, 1/3, 1/3, 1]^T).3 Second, we
compute the directional vector mvj between the position rd, j of each keyframe and
the position of its target pt, j in camera view space V . This vector is computed as
3 These points can be seen in Fig. 3 and are the intersections of the blue dotted lines.
where p_j^v is the view-space position of t_j, ψ_{d,j} and φ_{d,j} are the pitch and yaw orientation of k_j, and R_{ψ,φ} ∈ SO(3) is the rotation matrix from the world frame I to the camera view space V. For each keyframe, we identify the vector in L that is closest to the actual directional vector toward t_j in k_j as
where v_j^v is the closest vector, R_c ∈ SO(4) is the camera matrix that performs the rotation from the camera's view space V to its clip space C, and w is the function that normalizes homogeneous coordinates to Cartesian coordinates. Using the θ-parameterization,
the reference directional vectors vvj are linearly interpolated to attain the function
vdv (θ ). If users want to keep the framing they specified, one can compute vdv (θ ) by
interpolating the mvj vectors. With the reference function for desired target framing
in place, we can now define the cost term to optimize camera target framing as
where pv is the position of the target in V and ri the position of the quadrotor at
a specific stage i of the optimization horizon. Figure 3 illustrates the cost term. It
optimizes the pose of the quadrotor camera such that the relative vector between
quadrotor and target pv aligns with the directional reference vector vdv (θ ) to position
the target at the desired image space position.
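The selection of the compositional reference vector for a keyframe could look roughly as follows; closest_composition_vector is a hypothetical helper, and the camera matrix R_c and the contents of L are placeholders rather than the authors' implementation.

```python
import numpy as np

def closest_composition_vector(m_v, R_c, L):
    """Hypothetical helper mirroring the selection step: pick the compositional
    reference vector in L (clip space) that is closest to the keyframe's actual
    direction vector m_v (view space) once both are expressed as image-space
    positions. w(.) is the usual homogeneous normalization.
    """
    m_clip = R_c @ np.append(m_v, 1.0)          # view space -> clip space
    m_img = m_clip[:3] / m_clip[3]              # w(.): homogeneous -> Cartesian

    def to_img(l):
        l = np.asarray(l, dtype=float)
        return l[:3] / l[3]

    return min(L, key=lambda l: np.linalg.norm(to_img(l) - m_img))
```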
With the framing optimization cost term in place, one can ensure that the camera target
is at the desired position in the image plane across the entire trajectory. However,
if the distance between the camera and camera target is small, it is possible that
although the center of the camera target is at the desired position on the image plane,
large parts of the target are not captured. While this is most likely not the case at
keyframes (they are specified by users), it can occur between them where reference
positions are attained via interpolation. To address this problem, we propose a second
cost term that ensures that camera targets are entirely visible in each frame of the
video. Figure 4 illustrates the intuition behind the objective term that introduces a
penalty when any part of the bounding cylinder of a camera target intersects with a
plane of the extended camera frustum and, hence, would not be visible.
For this cost term, we first calculate four points on the edges of the cylinder. As
the roll of the quadrotor camera is always zero, one can attain the points at the top
and bottom of the cylinder as
$$\mathbf{p}_t^v = \mathbf{p}^v + \left[0, \tfrac{h_t(\theta)}{2}, 0\right]^T \tag{17}$$
$$\mathbf{p}_b^v = \mathbf{p}^v - \left[0, \tfrac{h_t(\theta)}{2}, 0\right]^T. \tag{18}$$
To find the left and the right edges of the cylinder from the perspective of the
camera, we search the points that have the z value of pv , the distance rt from pv
and are on the plane that has pv as its normal. By substituting these values into the
point-normal form of this plane, we get the x-values of these points:
$$x = x_{p^v} + \frac{y_{p^v}^2 - y_{p^v}\,y}{x_{p^v}}, \tag{19}$$
where x pv , y pv are the respective x- and y-values of pv . Substituting x into the 2-norm
equation that computes the distance between pv and one of the points on the edge,
we attain the y-values as
$$y_{1,2} = y_{p^v} \pm \sqrt{\frac{r_t^2}{\frac{y_{p^v}^2}{x_{p^v}^2} + 1}}. \tag{20}$$
Substituting the respective y-values back into Eq. 19, we can then specify p_l^v = [x_2, y_2, z_{p^v}]^T and p_r^v = [x_1, y_1, z_{p^v}]^T.
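Putting Eqs. (17)–(20) together, the four cylinder edge points can be computed as in the sketch below. It assumes x_{p^v} ≠ 0 so that Eq. (19) is well defined.

```python
import numpy as np

def cylinder_edge_points(p_v, r_t, h_t):
    """Four points on the target's bounding cylinder, in camera view space.

    p_v : cylinder center in view space (3-vector), r_t : radius, h_t : height.
    Follows Eqs. (17)-(20); assumes x component of p_v is non-zero.
    """
    x, y, z = p_v
    p_top = p_v + np.array([0.0, h_t / 2.0, 0.0])            # Eq. (17)
    p_bot = p_v - np.array([0.0, h_t / 2.0, 0.0])            # Eq. (18)
    dy = np.sqrt(r_t ** 2 / (y ** 2 / x ** 2 + 1.0))         # Eq. (20)
    y1, y2 = y + dy, y - dy
    x_of = lambda yy: x + (y ** 2 - y * yy) / x              # Eq. (19)
    p_right = np.array([x_of(y1), y1, z])
    p_left = np.array([x_of(y2), y2, z])
    return p_top, p_bot, p_left, p_right
```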
To attain the planes that describe the camera frustum, we first project the corners
of the image plane into the camera view space:
where c1−4 are the four corners of the image plane defined in the camera clip space
and vtl , vtr , vbl and vbr are the top-left, top-right, bottom-left and bottom-right vectors
that form the edges of the camera frustum. We then compute the normals of the planes
of the frustum:
In addition, we define a visibility cost function that returns the squared minimum of
two distances and zero if they have different signs:
$$f_{vis}(d_1, d_2) = \begin{cases} 0 & \text{if } |d_1 + d_2| < |d_1| + |d_2| \\ \min(d_1, d_2)^2 & \text{otherwise.} \end{cases} \tag{23}$$
The visibility cost term is computed by summing the result of Eq. 23 for the distances
between cylinder edge points and planes:
$$f_{all}(\mathbf{n}_{t-r}, \mathbf{p}_{t-r}^v) = \sum_{\mathbf{p}_m \in M} f_{vis}(\mathbf{n}_t^T \mathbf{p}_m, \mathbf{n}_b^T \mathbf{p}_m) + f_{vis}(\mathbf{n}_l^T \mathbf{p}_m, \mathbf{n}_r^T \mathbf{p}_m)$$
$$c_{vis}(\mathbf{n}_{t-r}, \mathbf{p}_{t-r}^v, x_{p^v}) = \begin{cases} 0 & \text{if } x_{p^v} < 0 \\ f_{all}(\mathbf{n}_{t-r}, \mathbf{p}_{t-r}^v) & \text{otherwise,} \end{cases} \tag{24}$$
where M = {p_t^v, p_b^v, p_l^v, p_r^v} is the set of cylinder edge points and n^T p is the distance between a point p and a plane specified by the normal n. Setting the cost to zero if x_{p^v} < 0 ensures that the term is only active if the target is in front of the camera.
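The sketch below is a direct transcription of Eqs. (23) and (24). Representing the frustum normals as a small dictionary and taking the planes to pass through the view-space origin (so that n^T p is the signed distance) are implementation choices of this sketch, not details given in the chapter.

```python
import numpy as np

def f_vis(d1, d2):
    """Eq. (23): zero if the two signed distances have different signs (the point
    lies between the two opposing frustum planes), otherwise the squared minimum."""
    if abs(d1 + d2) < abs(d1) + abs(d2):       # different signs
        return 0.0
    return min(d1, d2) ** 2

def visibility_cost(normals, edge_points, x_pv):
    """Eq. (24): sum the penalty over all cylinder edge points and both plane
    pairs (top/bottom, left/right). normals: dict with keys 't', 'b', 'l', 'r'
    holding the frustum plane normals in view space (assumed to pass through
    the camera origin)."""
    if x_pv < 0:                               # target behind the camera
        return 0.0
    total = 0.0
    for p_m in edge_points:                    # M = [p_t, p_b, p_l, p_r]
        total += f_vis(normals['t'] @ p_m, normals['b'] @ p_m)
        total += f_vis(normals['l'] @ p_m, normals['r'] @ p_m)
    return total
```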
We construct the objective function by linearly combining the cost terms from Eqs.
(8), (9), (10), (11), (12), (16), (24) and a 2-norm minimization of v:
where the scalar weight parameters w p , w jerk , wend , wt , wvel , w f , wvis , wv > 0 are
adjusted for a good trade-off between positional and temporal fit as well as smooth-
ness. The final optimization problem is then
$$
\begin{aligned}
\underset{\mathbf{x},\,\mathbf{u},\,\boldsymbol{\Theta},\,\mathbf{v}}{\text{minimize}}\quad & \sum_{i=0}^{N} J_i && \hspace{4em} (26)\\
\text{subject to}\quad & \mathbf{x}_0 = \mathbf{k}_0 && \text{(initial state)}\\
& \theta_0 = 0 && \text{(initial progress)}\\
& \theta_N = L && \text{(terminal progress)}\\
& \mathbf{x}_{i+1} = A\mathbf{x}_i + B\mathbf{u}_i + \mathbf{g} && \text{(dynamical model)}\\
& \boldsymbol{\Theta}_{i+1} = C\,\boldsymbol{\Theta}_i + D\,v_i && \text{(progress model)}\\
& \mathbf{x}_{min} \le \mathbf{x}_i \le \mathbf{x}_{max} && \text{(state bounds)}\\
& \mathbf{u}_{min} \le \mathbf{u}_i \le \mathbf{u}_{max} && \text{(input limits)}\\
& 0 \le \boldsymbol{\Theta}_i \le \boldsymbol{\Theta}_{max} && \text{(progress bounds)}\\
& 0 \le v_i \le v_{max} && \text{(progress input limits).}
\end{aligned}
$$
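For illustration, the per-stage objective J_i of Eq. (25) is simply the weighted sum of the individual cost terms. The helper below assumes the terms have already been evaluated at stage i and uses a hypothetical dictionary of weights; the weight values themselves are problem-dependent and not specified here.

```python
def stage_cost(c_p, c_jerk, c_end, c_t, c_vel, c_f, c_vis_val, v_norm_sq, w):
    """Linear combination of the cost terms into the stage objective J_i
    (Eq. (25), whose explicit form is not reproduced above). w is a dict of
    positive scalar weights, e.g. w = {'p': 1.0, 'jerk': 10.0, ...}, tuned for
    a good trade-off between positional/temporal fit and smoothness."""
    return (w['p'] * c_p + w['jerk'] * c_jerk + w['end'] * c_end
            + w['t'] * c_t + w['vel'] * c_vel + w['f'] * c_f
            + w['vis'] * c_vis_val + w['v'] * v_norm_sq)
```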
5.3 Evaluation
The goal of our method is to enable end-users to plan and record aerial videos
that adhere to the aesthetic criteria of aerial film. As such criteria, we identified smooth camera motion and precise, continuous camera target framing. In the following subsections, we first compare our method with existing approaches [30, 49] on metrics that are representative of these criteria. After that, we investigate whether our approach leads to a perceptual difference in the produced videos compared to the same baselines.
Smooth Camera Motion: To assess quantitatively that our method generates smoother
camera motion, we compare the averaged squared jerk per horizon stage generated
with our method and with hard-constrained approaches [30, 49]. To this end, we use user-designed trajectories from the user study in [30]. Figure 5 shows lower jerk and angular jerk values for our optimization scheme compared to both baseline methods, across all trajectories.
Fig. 5 Comparing average squared jerk (in (m/s³)²) and squared angular jerk (in (°/s³)²) per horizon stage of different trajectories for our method and [29, 49] (note that the latter uses a different model)
Continuous and Precise Camera Target Framing: To evaluate if our method fulfills
our requirements with respect to target framing, we design a challenging shot and
generate trajectories with four different approaches. The first is generated with the
target framing method of [30] that interpolates angles between keyframes. For the
second trajectory, we use the target framing method of [49] that interpolates the
intersection of the center rays of keyframes with the environment. The third trajectory
is generated with our method using only the framing cost term (c f ). The last trajectory
is generated by using both cost terms (c f and cvis ). To ensure that variations in
trajectories stem from differences in target framing, we implemented the framing
objectives of [30, 49] in our time-free MPC method (Eqs. 16 and 24 were replaced
with the respective terms of [30, 49]).
Figure 6 shows a frame-by-frame comparison of the resulting videos of the four
trajectories. For [30] (a) and [49] (b), the camera target (the water tower) moves
freely around the image plane, partly disappearing in some frames. In contrast, the
video generated by using the cost term c f (c) corrects the camera orientation of the
keyframes (first and last frame) to align the target with the closest vertical line of
the Rules of Thirds (see dashed white lines). In between keyframes, the tower nicely
transitions between these two lines. However, when the camera approaches the target
closely, the tower is not entirely visible in some of the video frames. This is corrected
in the video of (d), where all cost terms of our method (c_f and c_vis) are active.
The technical comparison shows that our method generates smoother trajectories.
However, it has not been validated that the generated trajectories result in the
Fig. 6 The figure illustrates the effect of the framing term on videos generated from two keyframes.
For a [30] and b [49], the camera target (the water tower) moves freely on the image plane of the
videos. In c, generated with c f , the target is aligned with the (dashed white) vertical lines of the
Rules of Thirds at the keyframes and transitions between them in between keyframes. d, generated with c_f + c_vis, shows the same behavior and additionally corrects the framing such that the water
tower is visible through the entire shot. e displays a top-down view on the trajectories of (a–c) and
(d)
aesthetically more pleasing video. To this end, we conduct an online survey compar-
ing videos which follow user-specified timings, generated with our method and the
methods of [30, 49]. To this end, we compare user-designed trajectories from prior work [30, 49]. For each question, we take the user-specified keyframes of the original trajectory and generate a time-optimized trajectory of the same temporal duration
using our method. We then render videos for the original and time-optimized trajec-
tory using Google Earth. The two resulting videos are placed side-by-side, randomly
assigned to the left or right, and participants state which video they prefer on a forced
alternative choice 5-point Likert scale. Negative values mean that the original, timed
video is aesthetically more pleasing, 0 indicates no difference and a positive value
indicates a more aesthetically pleasing time-optimized video.4
Each of the 424 participants compared 14 videos. Results provide strong evidence
that our method has a positive effect on the aesthetic perception of aerial videos (see
Fig. 7). Furthermore, this effect is stronger for videos designed by non-experts. Looking at expert-created videos, the picture is different. These videos were
rated as more pleasant when generated with methods which respect user-specified
timings. This can be explained by the fact that experts explicitly leverage shot tim-
ings to create compositional effects. Optimizing for global smoothness removes this intention from the result. However, the significant positive effect of our method on all responses, and the larger effect size for the positive effect on non-expert-designed videos compared to the negative effect on expert-designed videos, indicate that smooth motion is a more important factor for the aesthetic perception of aerial videos than timing. In addition, we showed in a follow-up study that experts benefit from our soft-constrained instead of the baselines' hard-constrained timings [33]. Soft constraints allow the optimizer to trade off the temporal fit for a smoother or physically feasible trajectory.
Fig. 7 Mean and 95% confidence interval of the effect of optimization scheme on all, non-expert-designed and expert-designed videos. Significance notation is with respect to the null effect (zero)
Fig. 8 Mean and 95% confidence interval of the effect of optimization scheme on the perceived match with videographer intent and aesthetics. Significance notation is with respect to the null effect (zero)
In this study, we analyzed the effect of our method and the approaches of [30, 49] on
the match between generated and user-intended target framing as well as on the aes-
thetic perception of aerial videos. More specifically, we conducted another pairwise
perceptual comparison of videos generated with the mentioned approaches. To this end, we designed two questionnaires. With the first (INTENT), we investigated whether video viewers perceive differences in how well the videos of the different conditions match the intended
result of the videographer. The second questionnaire (AESTHETICS) examined the
effect of methods on the aesthetic perception of videos. To create the questionnaires,
we used user-designed trajectories from a study in [31]. For INTENT, we additionally
displayed the image space position at which the videographer intended to position
the target. These positions were specified according to the description of the partici-
pant who designed the trajectory and were shown as a thin cross. For AESTHETICS
the videos were displayed unmodified. In both surveys, the video of our method
was placed side-by-side with a video from one of the baseline conditions. Videos
were randomly assigned to the left or right, and participants stated which video they
preferred on a forced alternative choice 5-point Likert scale.
518 participants answered both questionnaires. Negative values mean that the
video of [30, 49] is perceived as more pleasing, 0 indicates no difference and a
positive value means that the video of our method is perceived as more pleasing.
The results of the study indicate that participants perceived the videos of our method to better match videographers' intent than those of [30, 49] (see Fig. 8). In
terms of the aesthetic perception of videos, results are not as clear-cut. Participants
found our method to produce aesthetically more pleasing videos than the framing
method of [49]. However, no significant differences are found in the comparison
with [30]. The aesthetic perception of framing also depends on factors other than the framing objectives formalized in the algorithm, e.g., the surroundings of the camera target. We assume that the null result can be explained by participants attributing differences in the videos to a variety of such unmodeled factors. Nevertheless, the positive result of our approach in comparison with [49], and the fact that the INTENT questionnaire revealed a distinct significant effect in favor of our method over [30] for the same videos, encourage us to suggest that our approach positively affects the aesthetic perception of videos.
We have shown that optimal control can help users to create robotic aerial videos. Our
proposed MPC approach supports users in a specific task they intend to accomplish
by using a system with (quasi-)deterministic dynamics that can be modeled explic-
itly. Other emerging technologies, like mixed-reality systems or mobile devices, aim
at supporting users in their everyday life. Hence, such systems are confronted with
users that will change their task, their goals or their context without explicitly com-
municating these changes to the system. Thus, such systems need to consider users’
state in their state space. To this end, the user state is approximated with sensors that
measure their context (e.g., location) or bodily signals (e.g., eye tracking). The prob-
lem is that the same measured, approximate user state can correspond to a multitude of true user states. This causes state transitions within the coupled user-system state space to be stochastic rather than deterministic. All of this results in complex dynamics
that cannot be modeled explicitly and, hence, model predictive control would fail to
compute a control policy to support users in such tasks.
An optimal control method that is capable of coping with such challenging settings is reinforcement learning. RL can learn state transition models for complex non-linear dynamics from experience and use these models in control policies to solve stochastic sequential decision problems. This matches the very nature of the aforementioned tasks. Thus, we pose such problems in the RL framework, where the user functions
as the RL-environment. Figure 9 illustrates our setting and contrasts it with standard
RL. Our agent makes observations about the user behavior (i.e., collected behavioral
data) to learn to support the user in a certain task. More specifically, the agent learns
to adapt system controls or displays based on a function that rewards its actions
given only the observed user behavior and no form of further explicit supervision.
This function determines the reward based on the discrepancy between the agent's exhibited and desired behavior.
In a first case study, we investigate the applicability of this general setting in the
context of a mixed-reality adaptive UI. Thus, we use RL to adapt augmentations
based on users’ preferences or tasks learned from their gaze interactions with a UI.
We propose a method that learns to label objects by observing users’ gaze interactions
with objects and labels. The introduced reward function models user intent and guides
the resulting policies to minimize the displayed labels in an MR-environment, without
hiding relevant information. This filters information based on semantic and spatial
relevance and, hence, supports users in their task by avoiding visual overload.5
In this section, we first introduce the behavioral data we used to build the RL-
environment. We then detail our method including the RL-environment and reward
function that are necessary to train MR-labeling policies. Finally, we demonstrate
the capability of the method in several experiments.
To train our RL-agent, we require gaze trajectories. We collected this data via eye
tracking in a well-defined visual search task. Participants were asked to identify
the targets among a set of objects, i.e., the objects that display the highest value on their label. The objects are 3D primitives positioned in a shelf-like virtual environment (see Fig. 10), and all labels are present. The visual search environment is implemented in the Unity
game engine and rendered in Virtual Reality. Participants could see the scene through
an HTC Vive headset with integrated Tobii Pro eye tracking. We logged participants’
gaze data relative to object and label positions, and their pupil dilation. All data were logged at 120 Hz (the operating frequency of the eye tracker). In post-processing, we ran the
eye-tracking event detection algorithm of [23] to estimate fixations and saccades. We
collected 1300 visual search trials from 14 participants.
5 For a more detailed version of this section, we refer the interested reader to [28].
6.2 Method
We propose an agent that observes gaze trajectories and learns to label or not to
label objects based on a function that rewards its actions given only the user gaze
behavior and no form of further explicit supervision. In the following subsections, we
will explain the individual components of our RL-method: the underlying decision
process, the state action space of the agent, the RL-environment, the reward function
and the learning procedure.
Fig. 11 a Angles α_{go}^{1−5} between objects o_{1−5} and the gaze ray n_g in the local coordinate frames of the objects from the perspective of the user. b Geometric relations between the gaze unit vector n_g, the position of the object p_o and of the user p_u, the gaze-object vector r_{go}, and the gaze-object angle α_{go}
discounted. This depends on the difference between the current time and the time they
are normally encountered (after visiting the current state). Note that we use model-
free RL to approximate the distributions P(s′|s, a) and F(t|s, a) of the SMDP's
Bellman equation.
The agent will need to decide whether to show a label for each object in the scene. A
naïve way to represent the agent’s state would be to take the geometric relations of
all objects with respect to the user’s gaze in the world coordinate frame of the virtual
scene. However, this would result in large state and action spaces, rendering
generalization to unseen scenes difficult. A more compact state-space representation
is given by the geometric relation of the gaze point with respect to the center of an
individual object. The agent then decides label visibility for all objects in the scene.
The state space is defined in the local coordinate frame of an object with its center
as the origin (see Fig. 11, a). More concretely, state and action spaces are given by
$$s = [\mathbf{b}_o, \alpha_{go}, \dot{\alpha}_{go}] \tag{27}$$
$$a \in \{\text{show}, \text{hide}\} \tag{28}$$
where α_{go} is the angle between the gaze unit vector n_g and the gaze-to-object-center vector r_{go} (see Fig. 11b), and α̇_{go} is the angular velocity calculated by taking finite differences between two consecutive values of α_{go}. b_o is a one-hot vector encoding for object
properties which in the particular case of our visual search task is a binary feature
to distinguish between Os and Qs or spheres and other primitives. The actions are to
show or to hide the label of an object. Euclidean distance is not included in the state
space as it caused results to deteriorate.
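A small sketch of how the per-object state of Eq. (27) could be assembled is shown below; using the logging interval for the finite-difference step is an implementation assumption of this sketch, not a detail stated in the chapter.

```python
import numpy as np

SHOW, HIDE = 0, 1                              # action space of Eq. (28)

def make_state(b_o, alpha_go, alpha_go_prev, dt):
    """Per-object state of Eq. (27): one-hot object features b_o, the
    gaze-object angle alpha_go and its angular velocity, estimated here by
    finite differences between two consecutive values of alpha_go."""
    alpha_dot = (alpha_go - alpha_go_prev) / dt
    return np.concatenate([np.asarray(b_o, dtype=float),
                           [alpha_go, alpha_dot]])
```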
6.2.3 Environment
We consider a label to be fixated if αgo is zero and if the algorithm of [23] detects
a fixation. All four if statements are necessary to avoid convergence to cases where
either all or no labels are shown. Empirically, we found that reasonable policies are attained with the reward values r_l = 10 and r_c = 1.
6.2.4 Learning Procedure
With the RL-environment and the reward function in place, we can now run standard algorithms like Q-learning and SARSA [?] to
learn an approximation of the state action value function. Due to the small state space
it is sufficient to represent the continuous state action value function q̂(st , at , wt ) with
an RBF-parameterized function approximator (cf. [84]). In our experiments, more
powerful function approximators, such as deep neural networks, did not yield per-
formance improvements. For SARSA, the function’s update rule is as follows:
where w_t is the parameter vector of the state action value function and ∇ denotes the
gradient of function q̂. In accordance with the underlying SMDP and to account for
the varying duration of saccades, an action at can be of different temporal length,
modeled by τt . Equation 30 corresponds to performing standard stochastic gradient
descent on the state action value function. Using epsilon-greedy exploration, the
agent then learns, for a particular state s_t, whether to show or to hide the label (action a_t) in the next state s_{t+1}, according to the reward provided by Equation 29 (see Fig. 12).
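The following sketch shows a semi-gradient SARSA learner with a linear RBF function approximator and an SMDP-style discount γ^τ for actions of temporal length τ. Since the update rule of Eq. (30) is not reproduced above, the exact form used here (standard semi-gradient SARSA) is an assumption, as are the RBF centers, widths, learning rate and discount factor.

```python
import numpy as np

class RBFSarsa:
    """Semi-gradient SARSA with a linear RBF function approximator,
    one weight vector per action (a sketch under stated assumptions)."""

    def __init__(self, centers, sigma, n_actions=2, alpha=0.05, gamma=0.95):
        self.centers = np.asarray(centers)     # RBF centers in state space
        self.sigma = sigma                     # RBF width (assumed shared)
        self.w = np.zeros((n_actions, len(self.centers)))
        self.alpha, self.gamma = alpha, gamma

    def features(self, s):
        d = np.linalg.norm(self.centers - np.asarray(s), axis=1)
        return np.exp(-0.5 * (d / self.sigma) ** 2)

    def q(self, s, a):
        return float(self.w[a] @ self.features(s))

    def update(self, s, a, r, s_next, a_next, tau):
        """SMDP-style update: the future value is discounted by gamma**tau to
        account for the variable temporal length tau of action a (e.g., the
        duration of a saccade)."""
        phi = self.features(s)
        td_error = r + self.gamma ** tau * self.q(s_next, a_next) - float(self.w[a] @ phi)
        self.w[a] += self.alpha * td_error * phi   # stochastic gradient step
```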
Fig. 13 Performance comparison between our method (in purple) and an SVM-based baseline (in green). Lines denote the average normalized reward on an unseen trial (y-axis) over the percentage of
experienced training samples (x-axis). The shaded area represents the standard deviation. Ours
attains higher rewards and continues to learn from experiencing more samples, whereas the baseline
converges to a low reward, displays high variance and does not improve with more samples
6.3 Evaluation
In this section, we evaluate the capability of our approach to support users in a visual
search task. Thus, we first report on the results of a comparison between our method
and a supervised baseline. We then present the results of a user study and, finally,
demonstrate the applicability of our method to more realistic use cases.
Fig. 14 The labeling of policies learned in our experiment (pink dot is the user's focus) for the highest-number task with a pre-attentive and b attentive object features, as well as the matching-string task with d pre-attentive and e attentive object features. The output of supervised policies learned for c pre-attentive and f attentive object features is also shown
variance in rewards than RL-policies and thus are less stable. Finally, we highlight
that supervised policies show zero improvement with an increasing number of train-
ing samples, indicating that the SVM does not fully capture the underlying decision
process of the user (see next paragraph).
To assess if learned policies are useful, we investigated their output with new
users on unseen trials in the VR environment. We perform this test with policies of
the SVM and our RL-method. Interestingly, policies of the supervised setting mostly
converged to showing labels within a certain angle around the current fixation point
(see Fig. 14b). In contrast, the RL-agent learns to distinguish between target and
distractor objects and only displays the labels of targets (see Fig. 14a).
The goal of our method is to learn policies that support users while minimizing infor-
mation overload. To evaluate the success of our approach, we conducted a user study
in which a new set of participants solved the visual search task of the data collection
with the help of labeling policies learned by our RL-method. We compare participants' task performance using our RL-method (1) with three other baselines: (2) showing the labels of all objects at all times (SA = "Show All"), (3) showing the label of the object with the single closest angular distance to the user's gaze ray (CO = "Closest Object") and (4) showing labels of objects according to the predictions of an SVM (SL = "Supervised Learning"). In this experiment, the tasks, object features and apparatus
are identical to those used during data collection.
Figure 15 summarizes the results of our study. The data of 12 participants provides
evidence that our method (RL) can learn policies that support users in their tasks
while reducing the amount of unnecessarily shown labels. Statistical testing did not
find significant differences in task execution time and perceived support between our method
and the baseline of showing all labels at all times (SA). Nonetheless, RL reduces
the amount of shown labels compared to SA by 87%. Likewise, the conditions CO
and SL only show a fraction of the labels of SA. However, participants perceived
Fig. 15 Mean and 95% confidence interval of a task execution time (in seconds), b perceived
support (Likert-item, higher number standing for more support) and c fraction of shown labels.
Significance notation is with respect to the condition RL
Fig. 16 Setting: a policy trained on data where the user was instructed to look for wine. b policy
trained on trials where users are looking for wine, water and juice. Results: a The policy correctly
displays only the label of a single item of interest (pink dot is user’s gaze). b The policy displays
the labels of the multiple items of interest, while hiding those of other drinks
they were significantly better supported by our policies compared to CO and SL.
We attribute this to the fact that our method decides to show the label of an object
not only based on spatial information (e.g., the closest distance to gaze ray) but also
learns and considers the semantic relevance of the object for the task.
Fig. 17 Apartment scenario: a a realistic apartment environment with label and object occlusions
makes for a difficult visual search task; b, c policy that identifies target objects and only shows their
labels; d policy that shows labels of objects which are close to user’s gaze
have a high visual similarity (see juice and milk cartons in Fig. 16) which causes
participants to confuse items.
Apartment Scenario: Higher Visual Fidelity We ask participants to find the object
with the highest number on its label in the rooms of a virtual apartment (bathroom,
kitchen, living room, etc.) in which objects and labels can be occluded (see Fig. 17a).
We conduct this experiment to see if the additional randomness in the participant’s
gaze behavior, introduced by salient features as well as object and label occlusions,
prevents the agent from learning meaningful policies. Policies tend to show the labels
of all objects close to participants’ point of gaze since occlusions make it difficult
for the agent to identify behavioral differences for target and distractor objects (see
Fig. 17d). However, even in these environments, learning can converge to policies
that identify target objects and only show the desired labels (see Fig. 17b, c).
Both scenarios have revealed that the quality of labeling policies depends on
the compliance of participant behavior with the specified task. If users, during data collection, regularly confuse target and distractor objects and check the labels of distractors, the cooperative policy will learn to label objects incorrectly. This can be
seen as a drawback of our approach which we discuss further in Sect. 7.
7 Discussion
Our mixed-reality case study has shown that it is possible to learn to assist users in
their task or reproduce their behavior solely from observing their interactions with
a UI. Policies were learned from logged interactions and no costly data labeling was
necessary. This is an interesting property for HCI, as ample interaction log files exist which RL could leverage to develop predictive user models.
In addition, our results indicate that RL predicts user behavior better than supervised user modeling techniques. This can be attributed to the underlying formalism of
reinforcement learning. In RL, behavioral objectives for a policy are modeled with a
reward function with which they can be weighted according to their importance for a
problem domain. Furthermore, it does not assume state transitions to be deterministic
and models them with a probabilistic function. This allows determining the optimal
action in a state even in stochastic environments. RL also considers the expected
rewards in future states when deciding for an action in the current state of the agent.
6 Myopic policies only consider the attainable reward in the next state and neglect other future states.
In this chapter, we presented first evidence on how optimal control can improve user
support in challenging human-computer interaction use cases. To realize optimal
control’s full potential in the context of HCI, further research is necessary for which
we propose interesting directions in the following paragraphs.
Extend the application of MPC in HCI: Prior research has proposed assignment problem formulations for a large variety of HCI tasks, e.g., [6, 78]. By using mixed-integer
MPC approaches [52], these could be ported into the optimal control framework.
This would allow modeling iterative dependencies in such tasks that emerge from
the closed-loop dynamics of HCI.
Learn to adapt for changing user goals: The ideal interface recognizes the goal or the
task a user is pursuing and acts to support them accordingly. Future research endeav-
ors should work on developing agents that are capable of such behavior. To tackle
this challenge, machine learning research has proposed different approaches to identify higher-level tasks from demonstrations [25, 53]. These could be used to develop
agents that infer possible user goals or tasks from human-interface interactions and
support users in accomplishing them. Similarly, one may investigate if multi-goal
adaptive UI agents can be learned via temporal integration of users’ reactions to
low-level adaptations in settings with shaped reward functions.
Extend adaptive capabilities of UI agents: Machine learning research has proposed RL algorithms for large discrete [22] and even continuous action spaces [60]. Building on this work,
future research should aim at extending the display adaptation capabilities of UI
agents. RL-based agents should not only decide on the assignment of content but
also manipulate the content’s location and properties to better suit users’ tasks. This
might ultimately result in agents whose actions represent pixels and that learn to
create optimal UIs for observed user behavior.
Combine MPC and RL to support users in HCI: We have shown that the capabilities
of MPC and RL in the context of supporting interaction are complementary. Thus,
research should investigate their combination to advance the utility of AUIs. For
instance, models for complex non-linear environment dynamics could be learned
and used in a model-predictive optimization to compute policies without requiring
retraining for changing objectives. Another example would be an MPC that provides
basic guidance in situations that have not been encountered previously according
to modeled user goals and a learning-based method that iteratively improves and
personalizes adaptations with the increasing amount of encountered states.
Embedding UI agents into the real world: One promise of artificial intelligence is agents that act in the real world to amplify and extend human abilities. UIs are an ideal
sandbox for this vision as their malfunctioning has minor consequences compared
to malfunctions of robots. Thus, RL-based UI agents can contribute to turning this
promise into reality. However, in realistic settings new challenges arise. Real-world
user behavior is subject to sidetracking, multitasking and changing preferences, and
RL-based user modeling techniques need to be able to recover from them to provide
useful UI adaptations. Future research also needs to develop policies that work in
settings where learning and inference are done online. This could be achieved through
off-policy learning and an agent that continuously updates a policy based on users’
reaction toward the UI adaptation of a behavioral policy. The behavioral policy would
regularly be updated.
9 Conclusion
References
1. Abbeel P, Dolgov D, Ng AY, Thrun S (2008) Apprenticeship learning for motion planning with
application to parking lot navigation. In: IEEE international conference on intelligent robots
and systems 2008. IROS ’08. IEEE, pp 1083–1090
2. Abbeel P, Ng AY (2004) Apprenticeship learning via inverse reinforcement learning, p 1
3. Kumaripaba A, Alan M, Antti O, Giulio J, Dorota G (2016) Beyond relevance: adapting explo-
ration/exploitation in information retrieval. Association for Computing Machinery, New York,
NY, USA
4. Audronis T (2014) How to get cinematic drone shots
5. Aytar Y, Pfaff T, Budden D, Le Paine T, Wang Z, de Freitas N (2018) Playing hard exploration
games by watching youtube. In: Advances in neural information processing systems
6. Gilles B, Antti O, Timo K, Sabrina H (2013) Menuoptimizer: interactive optimization of menu
systems. pp 331–342
7. Banovic N, Buzali T, Chevalier F, Mankoff J, Dey AK (2016) Modeling and understanding
human routine behavior. In: Proceedings of the 2016 CHI conference on human factors in
computing systems, CHI ’16. ACM, pp 248–260
8. Bemporad A, Morari M, Dua V, Pistikopoulos EN (2002) The explicit linear quadratic regulator
for constrained systems. Automatica 38(1):3–20
9. Bertsekas DP, Tsitsiklis JN (1995) Neuro-dynamic programming: an overview, vol 1. IEEE, pp 560–564
10. Bronner S, Shippen J (2015) Biomechanical metrics of aesthetic perception in dance. Exp Brain Res 233(12):3565–3581
11. Chapanis A (1976) Engineering psychology. Rand McNally, Chicago
12. Chen M, Beutel A, Covington P, Jain S, Belletti F, Chi H (eds) (2019) Top-k off-policy cor-
rection for a reinforce recommender system. In: Proceedings of the twelfth ACM international
conference on web search and data mining, WSDM ’19. ACM, pp 456–464
13. Chen X, Bailly G, Brumby DP, Oulasvirta A, Howes A (2015) The emergence of interactive
behavior: A model of rational menu search. In: Proceedings of the 33rd annual ACM confer-
ence on human factors in computing systems, CHI ’15, pp 4217-4226, New York, NY, USA.
Association for Computing Machinery
14. Xiuli C, Sandra Dorothee S, Chris B, Andrew H (2017). A cognitive model of how people make
decisions through interaction with visual displays. Association for Computing Machinery, New
York, NY, USA
15. Cheng E (2016) Aerial photography and videography using drones, vol 1. Peachpit Press
16. Chipalkatty R, Droge G, Egerstedt MB (2013) Less is more: mixed-initiative model-predictive
control with human inputs. IEEE Trans Rob 29(3):695–703
17. Chipalkatty R, Egerstedt M (2010) Human-in-the-loop: Terminal constraint receding horizon
control with human inputs. pp 2712–2717
18. Christiano PF, Leike J, Brown T, Martic M Legg S, Amodei D (2017) Deep reinforcement
learning from human preferences. In: Advances in neural information processing systems, pp
4299–4307
19. Clarke DW, Mohtadi C, Tuffs PS (1987) Generalized predictive control-part i. the basic algo-
rithm. Automatica 23(2):137–148
20. Coates A, Abbeel P, Ng AY (2009) Apprenticeship learning for helicopter control. Commun
ACM 52(7):97–105
21. Cutler CR, Ramaker BL (1980) Dynamic matrix control - a computer control algorithm. In:
Joint automatic control conference, vol 17, p 72
22. Dulac-Arnold G, Evans R, van Hasselt H, Sunehag P, Lillicrap T, Hunt J, Mann T, Weber
T, Degris T, Coppin B (2015). Deep reinforcement learning in large discrete action spaces.
arXiv:1512.07679
23. Engbert R, Kliegl R (2003) Microsaccades uncover the orientation of covert attention. Vis Res
43(9):1035–1045
24. Findlater L, Gajos KZ (2009) Design space and evaluation challenges of adaptive graphical
user interfaces. AI Mag 30(4):68–68
25. Frans K, Ho J, Chen X, Abbeel X, Schulman J (2017) Meta learning shared hierarchies.
arXiv:1710.09767
26. Fritsch FN, Carlson RE (1980) Monotone piecewise cubic interpolation. SIAM J Numer Anal
17(2):238–246
27. Gašić M, Young S (2014) Gaussian processes for POMDP-based dialogue manager optimiza-
tion. IEEE Trans Audio Speech Lang Process 22(1):28–40
28. Gebhardt C, Hecox B, van Opheusden B, Wigdor D, Hillis J, Hilliges O, Benko H (2019)
Learning cooperative personalized policies from gaze data. In: Proceedings of the 32nd annual
ACM symposium on user interface software and technology, UIST ’19, New York, NY, US.
ACM
29. Gebhardt C, Hepp B, Naegeli T, Stevsic S, Hilliges O (2016) Airways: optimization-based planning of quadrotor trajectories according to high-level user goals. In: ACM SIGCHI conference on human factors in computing systems, CHI '16, New York, NY, USA. ACM
30. Gebhardt C, Hilliges O (2018) WYFIWYG: investigating effective user support in aerial
videography. arXiv:1801.05972
31. Gebhardt C, Hilliges O (2020) Optimizing for cinematographic quadrotor camera target framing. In: Submission to ACM SIGCHI
32. Gebhardt C, Oulasvirta A, Hilliges O (2020) Hierarchical Reinforcement Learning as a Model
of Human Task Interleaving. arXiv:2001.02122
33. Gebhardt C, Stevsic S, Hilliges O (2018) Optimizing for aesthetically pleasing quadrotor cam-
era motion. ACM Trans Graph (Proc ACM SIGGRAPH) 37(4):90:1–90:11:8
34. Ali G, Judith B, Atsuto M, Danica K, Mårten B (2016) A sensorimotor reinforcement learning
framework for physical human-robot interaction. pp 2682–2688
35. Dorota G, Tuukka R, Ksenia K, Kumaripaba A, Samuel K, Giulio J (2013) Directing exploratory
search: Reinforcement learning from user interactions with keywords. pp 117–128
36. Görges D (2017) Relations between model predictive control and reinforcement learning.
IFAC-PapersOnLine 50(1):4920–4928
37. Grieder P, Borrelli F, Torrisi F, Morari M (2004) Computation of the constrained infinite time
linear quadratic regulator. Automatica 40(4):701–708
38. Hadfield-Menell D, Russell SJ, Abbeel P, Dragan A (2016) Cooperative inverse reinforcement
learning. In: Advances in neural information processing systems, pp 3909–3917
39. Hennessy J (2015) 13 powerful tips to improve your aerial cinematography
40. Ho B-J, Balaji B, Koseoglu M, Sandha S, Pei S, Srivastava M (2020) Quick question: Inter-
rupting users for microtasks with reinforcement learning. arXiv:2007.09515
41. Hogan N (1984) Adaptive control of mechanical impedance by coactivation of antagonist
muscles. IEEE Trans Autom Control 29(8):681–690
42. Horvitz EJ, Breese JS, Heckerman D, Hovel D, Rommelse K (2013) The lumiere project:
Bayesian user modeling for inferring the goals and needs of software users. arXiv:1301.7385
43. Howes A, Chen X, Acharya A, Lewis RL (2018) Interaction as an emergent property of a
partially observable markov decision process. Computational interaction design. pp 287–310
44. Zehong H, Liang Y, Zhang J, Li Z, Liu Y (2018) Inference aided reinforcement learning for
incentive mechanism design in crowdsourcing. In: Advances in Neural Information Processing
Systems. NIPS ’18:5508–5518
45. Hwangbo J, Lee J, Dosovitskiy A, Bellicoso D, Tsounis V, Koltun V, Hutter M (2019) Learning
agile and dynamic motor skills for legged robots. Sci Robot 4(26)
46. Anthony J, Krzysztof GZ (2012) Systems that adapt to their users. The Human-Computer
interaction handbook: fundamentals, evolving technologies and emerging applications. CRC
Press, Boca Raton, FL
47. Johansen TA (2004) Approximate explicit receding horizon control of constrained nonlinear
systems. Automatica 40(2):293–300
48. Jorgensen SJ, Campbell O, Llado T, Kim D, Ahn J, Sentis L (2017) Exploring model predictive
control to generate optimal control policies for hri dynamical systems. arXiv:1701.03839
49. Joubert N, Roberts M, Truong A, Berthouzoz F, Hanrahan P (2015) An interactive tool for
designing quadrotor camera shots. vol 34. ACM, New York, NY, USA, pp 238:1–238:11
50. Julier S, Lanzagorta M, Baillot Y, Rosenblum L, Feiner S, Hollerer T, Sestito S (2000) Infor-
mation filtering for mobile augmented reality. In: Proceedings IEEE and ACM international
symposium on augmented reality (ISAR 2000). IEEE, pp 3–11
51. Kartoun U, Stern H, Edan Y (2010) A human-robot collaborative reinforcement learning algo-
rithm. J Intell Robot Syst 60(2):217–239
52. Kirches C (2011) Fast numerical methods for mixed-integer nonlinear model-predictive control.
Springer
53. Krishnan S, Garg A, Liaw R, Miller L, Pokorny FT, Goldberg K (2016) Hirl: hierarchical inverse
reinforcement learning for long-horizon tasks with delayed rewards. arXiv:1604.06508
54. Kostadin K, Jason P, Elizabeth WD (2016) “Silence your phones” smartphone notifications
increase inattention and hyperactivity symptoms. pp 1011–1020
55. Lam D, Manzie C, Good MC (2013) Multi-axis model predictive contouring control. Int J
Control 86(8):1410–1424
56. Langerak T, Zarate J, Vechev V, Lindlbauer D, Panozzo D, Hilliges O (2020) Optimal control for electromagnetic haptic guidance systems
57. Lee SJ, Popović Z (2010) Learning behavior styles with inverse reinforcement learning. In:
ACM transactions on graphics (TOG), vol 29. ACM, p 122
58. Lee Y, Wampler K, Bernstein G, Popović J, Popović Z (2010) Motion fields for interactive
character locomotion. In: ACM transactions on graphics (TOG), vol 29. ACM, p 138
59. Liebman E, Saar-Tsechansky M, Stone P (2015) Dj-mc: a reinforcement-learning agent for
music playlist recommendation. In: Proceedings of the 2015 international conference on
autonomous agents and multiagent systems, AAMAS ’15, pp 591–599
60. Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, Wierstra D (eds) (2015)
Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971
61. Liniger A, Domahidi A, Morari M (2015) Optimization-based autonomous racing of 1: 43
scale rc cars. Opt Control Appl Methods 36(5):628–647
62. Liu F, Tang R, Li X, Zhang W, Ye Y, Chen H, Guo H, Zhang Y (2018) Deep reinforcement learn-
ing based recommendation with explicit user-item interactions modeling. arXiv:1810.12027
63. Lo W-Y, Zwicker M (2008) Real-time planning for parameterized human motion. In: Proceed-
ings of the 2008 ACM SIGGRAPH/eurographics symposium on computer animation, SCA
’08, pp 29–38
64. Justin M, Wei L, Tovi G, George F (2009) Communitycommands: command recommendations
for software applications. pp 193–202
65. McCann J, Pollard N (2007) Responsive characters from motion fragments. In: ACM transac-
tions on graphics (TOG), vol 26. ACM, p 6
66. McRuer DT, Jex HR (1967) A review of quasi-linear pilot models
67. Michalska H, Mayne DQ (1993) Robust receding horizon control of constrained nonlinear
systems. IEEE Trans Autom Control 38(11):1623–1633, 11
68. Bastian M, Andreas K (2010) User model for predictive calibration control on interactive
screens. pp 32–37
69. Mitsunaga N, Smith C, Kanda T, Ishiguro H, Hagita N (2006) Robot behavior adaptation for
human-robot interaction based on policy gradient reinforcement learning. J Robot Soc Jpn
24(7):820–829
70. Modares H, Ranatunga I, Lewis FL, Popa DO (2015) Optimized assistive human-robot inter-
action using reinforcement learning. IEEE Trans Cybernet 46(3):655–667
71. Müller J, Oulasvirta A, Murray-Smith R (2017) Control theoretic models of pointing. ACM
Trans Comput-Hum Interact (TOCHI) 24(4):1–36
72. Murray-Smith R (2018) Control theory, dynamics and continuous interaction
73. Nägeli T, Alonso-Mora J, Domahidi A, Rus D, Hilliges O (2017) Real-time motion planning
for aerial videography with dynamic obstacle avoidance and viewpoint optimization. IEEE
Robot Autom Lett PP(99):1–1
74. Nägeli T, Meier L, Domahidi A, Alonso-Mora J, Hilliges O (2017) Real-time planning for
automated multi-view drone cinematography. ACM Trans Graph 36(4):132:1–132:10
75. Thomas N, Ying-Yin H, Andreas K (2014) Planning redirection techniques for optimal free
walking experience using model predictive control. pp 111–118
76. Ng AY, Russell SJ (2000) Algorithms for inverse reinforcement learning. In: Proceedings of
the seventeenth international conference on machine learning, ICML ’00, pp 663–670
77. Oliff H, Liu Y, Kumar M, Williams M, Ryan M (2020) Reinforcement learning for facilitating
human-robot-interaction in manufacturing. J Manuf Syst 56:326–340
78. Park S, Gebhardt C, Rädle R, Feit A, Vrzakova H, Dayama N, Yeo H-S, Klokmose C, Quigley
A, Oulasvirta A, Hilliges O (2018) AdaM: adapting multi-user interfaces for collaborative envi-
ronments in real-time. In: ACM SIGCHI conference on human factors in computing systems,
cHI ’18, New York, NY, USA. ACM
79. Bin Peng X, Abbeel P, Levine S, van de Panne M (2018) Deepmimic: example-guided deep
reinforcement learning of physics-based character skills. ACM Trans Graph 37(4):8
80. Bin Peng X, Kanazawa A, Malik J, Abbeel P, Levine S (2018) Sfv: Reinforcement learning of
physical skills from videos. ACM Trans Graph, 37
81. Purves D, Fitzpatrick D, Katz LC, Lamantia AS, McNamara JO, Williams SM, Augustine GJ
(2000) Neuroscience. Sinauer Associates
82. Richalet J, Rault A, Testud JL, Papon J (1978) Model predictive heuristic control: application to an industrial process. Automatica 14(5):413–428
83. Rahman SM, Sadrfaridpour B, Wang Y (2015) Trust-based optimal subtask allocation and model predictive control for human-robot collaborative assembly in manufacturing, vol 57250. American Society of Mechanical Engineers, p V002T32A004
84. Rajeswaran A, Lowrey K, Todorov EV, Kakade SM (2017) Towards generalization and sim-
plicity in continuous control. In Advances in Neural Information Processing Systems. NIPS
’17:6550–6561
85. Roberts M, Hanrahan P (2016) Generating dynamically feasible trajectories for quadrotor cameras. ACM Trans Graph 35(4):61:1–61:11
86. Safavi A, Zadeh MH (2017) Teaching the user by learning from the user: personalizing move-
ment control in physical human-robot interaction. IEEE/CAA J Autom Sinica 4(4):704–713
87. Sheridan TB, Ferrell WR (1974) Man-machine systems; Information, control, and decision
models of human performance. The MIT press
88. Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L,
Lai M, Bolton A et al (2017) Mastering the game of go without human knowledge. Nature
550(7676):354–359
89. Su P-H, Budzianowski P, Ultes S, Gasic M, Young S (2017) Sample-efficient actor-critic
reinforcement learning with supervised data for dialogue management. arXiv:1707.00130
90. Sutton RS, Barto AG, Williams RJ (1992) Reinforcement learning is direct adaptive optimal
control. IEEE Control Syst Mag 12(2):19–22
91. Rowan S, Kieran F, Owen C (2019) A reinforcement learning and synthetic data approach to
mobile notification management. pp 155–164
92. Teramae T, Noda T, Morimoto J (2018) Emg-based model predictive control for physical
human-robot interaction: application for assist-as-needed control. IEEE Robot Autom Lett
3(1):210–217
93. Tjomsland J, Shafti A, Aldo Faisal A (2019) Human-robot collaboration via deep reinforcement
learning of real-world interactions. arXiv:1912.01715
94. Treuille A, Lee Y, Popović Z (2007) Near-optimal character animation with continuous control.
ACM Trans Graph 26(3):7
95. Watkins CJCH (1989) Learning from delayed rewards. PhD thesis, King's College, Cambridge
96. Wiener N (2019) Cybernetics or Control and Communication in the Animal and the Machine.
MIT press
Modeling Mobile Interface Tappability
Using Crowdsourcing and Deep Learning
1 Introduction
Tapping is arguably the most important gesture on mobile interfaces. Yet, it is still
difficult for people to distinguish tappable and not tappable elements in a mobile
interface. In traditional desktop GUIs, the style of clickable elements (e.g., buttons)
A. Swearngin (B)
University of Washington, Seattle, WA, USA
e-mail: amaswea@cs.washington.edu
Work conducted during an internship at Google Research, Mountain View, USA
Y. Li
Google Research, Mountain View, CA, USA
e-mail: liyang@google.com
Fig. 1 Our deep model learns from a large-scale dataset of mobile tappability collected via crowd-
sourcing. It predicts tappability of interface elements and identifies mismatches between designer
intention and user perception, and is served in the TapShoe tool that can help designers and devel-
opers to uncover potential usability issues about their mobile interfaces
are often conventionally defined. However, with the diverse styles of mobile inter-
faces, tappability has become a crucial usability issue. Poor tappability can lead to a lack of discoverability [27] and false affordances [12], resulting in user frustration, uncertainty, and errors [3, 10].
Signifiers [27] can indicate to a user how to interact with an interface element.
Designers can use visual properties (e.g., color or depth) to signify an element’s
“clickability” [3] or “tappability” in mobile interfaces. Perhaps the most ubiquitous
signifiers in today’s interfaces are the blue color and underline of a link, and the
design of a button that both strongly signify to the user that they should be clicked.
These common signifiers have been learned over time and are well understood to
indicate clickability [26]. To design for tappability, designers can apply existing
design guidelines for clickability [3]. These are important and can cover typical cases; however, it is not always clear when to apply them in each specific design setting. Frequently, mobile app developers are not equipped with such knowledge. Despite the existence of simple guidelines, we found a significant amount of tappability misperception in real mobile interfaces, as shown in the dataset that we discuss later.
Additionally, modern platforms for mobile apps frequently introduce new design
patterns and interface elements. Designing these to include appropriate signifiers
for tappability is challenging. Moreover, mobile interfaces cannot apply useful clickability cues available in web and desktop interfaces (e.g., hover states). With the flat design trend, traditional signifiers have been altered, which potentially causes uncertainty and mistakes [10]. More data may be needed to confirm these results; however, we argue that more data and automated methods are needed to fully understand users' perceptions of tappability as design trends evolve over time.
One way that interface designers can understand tappability in their interfaces is
through conducting a tappability study or a visual affordance test [34]. However, it is
time-consuming to conduct such studies. In addition, the findings from these studies
are often limited to a specific app or interface design. We aim to understand signifiers
at a large scale across a diverse array of interface designs and to diagnose tappability
problems in new apps automatically without conducting user studies.
In this chapter, we present an approach for modeling interface tappability at scale
through crowdsourcing and deep learning. We first collected and analyzed a large
dataset of 20,000 tappability labels from more than 3,000 mobile app screens. Our analysis of this dataset demonstrates that there are many false and missing signifiers, potentially causing people to frequently misidentify tappable and not-tappable interface elements. Through computational signifier analysis, we identified a set of findings on factors impacting mobile interface tappability. Using this dataset, we trained a deep learning model that achieved reasonable accuracy, with a mean precision of 90.2% and recall of 87.0%, on identifying tappable elements as perceived by humans. To showcase a potential use of the model, we built TapShoe (Fig. 1), a web
interface that diagnoses mismatches between the human perception of the tappabil-
ity of an interface element and its actual state in the interface code. We conducted
informal interviews with 7 professional interface designers who were positive about
the TapShoe interface, and could envision intriguing uses of the tappability model
in realistic design situations. Based on our work on tappability modeling [31], we
discuss the following contributions in this chapter, and generalize them towards other
scenarios:
While this chapter describes a method to model and predict tappability which can
enable automatic usability analysis, tappability is only one aspect of usability. There
are potentially other such aspects of usability that can be modeled. We discuss some
of these aspects and potential areas to apply deep learning in Sect. 9.
2 Background
The concepts of signifiers and affordances are integral to our work. We aim to capture
them in a systematic way to construct a predictive model and to understand their use
in a large set of real mobile interfaces. Affordances were originally described by [13]
as the actionable properties between the world and actor (i.e., person). References
[25, 26] popularized the idea of affordances of everyday objects, such as a door
which has an affordance of opening. A “signifier” indicates the affordance of an
object [26]. For example, a door handle can signify the direction a door will open.
Norman later related the concept of signifiers to user interfaces [26]. Gaver [12] described the use of graphical techniques to aid human perception (e.g., shadows or
rounded corners), and showed how designers can use signifiers to convey an interface
element’s perceived affordances. These early works form the core of our current
understanding of what makes a person know what is interactive. By collecting a
large dataset of tappability examples, we hope to expand our understanding of which
signifiers are having an impact at scale.
3 Related Work
While there are many methods to assess the design quality and usability of interfaces
(e.g., cognitive walkthrough, heuristic evaluation), there are few methods to auto-
matically assess an interface for its usability. Creating such automated methods can
help designers avoid the time and cost required for manual usability testing, and can
help them discover patterns that they may not have been able to discover without
computational support. In this chapter, we present a method to automatically model
a specific aspect of usability—tappability; however, we review work that applies
large-scale data collection and modeling to assess some aspect of interface design or
usability. This work falls into two main categories: (1) large-scale data collection to
assess interface design and usability, and (2) machine learning methods to assess interface
design and usability.
There have only been a few small-scale studies on the factors influencing clickability
in web interfaces [3, 10]. Usability testing methods have also adopted the idea of
visual affordance testing [34] to diagnose clickability issues. However, these studies
have been conducted at a small scale and are typically limited to the single app
being tested. We are not aware of any large-scale data collection and analysis across
app interfaces to enable diagnosis of tappability issues, nor any machine learning
approaches that learn from this data to automatically predict the elements that users
will perceive as tappable or not tappable.
To identify tappability issues automatically, we need to collect data on a large
scale to allow us to use a machine learning approach for this problem. Recently,
data-driven approaches have been used to identify usability issues [7], and collect
mobile app design data at scale [6, 8]. Perhaps most closely related to our work is Zipt
[7], which enables comparative user performance testing at scale. Zipt uses crowd
workers to construct user flow visualizations through apps that can help designers
visualize the paths users will take through their app for specific tasks. However,
with this approach, designers must still manually diagnose the usability issues by
examining the visualizations. In this chapter, we focus on identifying an important
aspect of usability, tappability, automatically.
Deep learning [19] is an effective approach to learn from large-scale datasets, and
recent work has begun to explore applying deep learning to assess various aspects
of usability. In our work, we trained a deep feedforward network, which uses con-
volutional layers for image processing and embedding for categorical data such as
words and types, to automatically predict human tappability perception.
Recent work has used deep learning approaches to predict human performance
on mobile apps for tasks such as grid selection [29], menu selection [20], and visual
search time [38]. Deep learning models have also been built to identify salient ele-
ments in graphic designs and interfaces to help designers know where their users
will first focus their eyes [4, 5, 39], and to predict user engagement with animation
in mobile apps to help designers examine user engagement issues [36]. Beyond a
single app design and layout, others have built deep learning models to automati-
cally explore design solutions [40] and generate optimized layouts [9] for mobile
app designs. No previous work has yet applied deep learning to predict the tappabil-
ity of interface elements. Deep learning allowed us to leverage a rich set of features
involving the semantic, spatial, and visual properties of an element without extensive
feature engineering, and to address the problem in an end-to-end fashion, which is
easily scalable to complex problems.
However, there is a lack of a dataset and a deep understanding of interface tappability across
diverse mobile apps. Having such a dataset and knowledge is required for us to
create automated techniques to help designers diagnose tappability issues in their
interfaces.
Fig. 2 The interface that workers used to label the tappability of UI elements via crowdsourcing. It
displays a mobile interface screen with interactive hotspots that can be clicked to label an element
as either tappable or not tappable
In the labeling task, we asked crowd workers to judge whether they perceived that
an element would respond to a tapping event. For each screen, we selected up to five
unique clickable and non-clickable elements. When selecting clickable
elements, starting from a leaf element, we select the top-most clickable element
in the hierarchy for labeling. When a clickable element contains a sub-tree of ele-
ments, these elements are typically presented as a single interface element to the
user, which is more appropriate for the worker to label as a whole. When a clickable
container (e.g., ViewGroup) is selected, we do not select any of its child elements,
thus preventing any duplicate counting or labeling. We did not select elements in the
status bar or navigation bar as they are standard across most screens in the dataset.
To perform a labeling task, a crowd worker hovers their mouse over the interface
screenshot, and our web interface displays gray hotspots over the interface elements
pre-selected based on the above process. Workers click on each hotspot to toggle the
label as either tappable or not tappable, which are colored in green and red, respec-
tively. We asked each worker to label around six elements for each screen. Depending
on the screen complexity, the number of elements could vary. We randomized the
elements, as well as the order in which they were labeled, across workers.
4.2 Results
We collected 20,174 unique interface elements from 3,470 app screens. These ele-
ments were labeled by 743 unique workers in two rounds where each round involved
different sets of workers (see Table 1). Each worker could complete up to 8 tasks. On
average, each worker completed 4.67 tasks. Of these elements, 12,661 of them are
indeed tappable, i.e., the view hierarchy attribute clickable=True, and 7,513 of
them are not.
How well can human users perceive the actual clickable state of an element
as specified by developers or designers? To answer this question, we treat the
clickable value of an element in the view hierarchy as the actual value and
human labels as the predicted value for a precision and recall analysis.

Table 1 The number of elements labeled by the crowd workers in two rounds, along with the precision and recall of human workers in perceiving the actual clickable state of an element as specified in the view hierarchy metadata

Round  Positive class    #Elements  Precision (%)  Recall (%)
R1     clickable=True    6,101      79.81          89.07
R1     clickable=False   3,631      78.56          61.75
R2     clickable=True    6,560      79.55          90.02
R2     clickable=False   3,882      78.30          60.90
All    clickable=True    12,661     79.67          89.99
All    clickable=False   7,513      78.43          61.31

In this dataset of real mobile app screens, there were still many false signifiers for tappability,
potentially causing workers to misidentify tappable and not tappable elements (see
Table 1). The workers labeled non-clickable elements as tappable 39% of the time. While
the workers were significantly more precise in labeling clickable elements, they
still marked clickable elements as not tappable 10% of the time. The results were
quite consistent across two rounds of data collection involving different workers and
interface screens. These results further confirmed that tappability is an important
usability issue worth investigation.
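As a rough illustration of the kind of precision and recall analysis summarized in Table 1 (not the actual analysis code used in this work), the computation can be sketched in Python with scikit-learn, assuming the labels are available as simple integer lists:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical example data: the view-hierarchy clickable attribute is treated
# as the "actual" value and the crowd worker label as the "predicted" value.
actual_clickable = [1, 1, 0, 0, 1, 0, 1]   # clickable=True -> 1
worker_labels    = [1, 0, 1, 0, 1, 0, 1]   # perceived tappable -> 1

# Precision/recall with clickable=True as the positive class.
precision = precision_score(actual_clickable, worker_labels, pos_label=1)
recall = recall_score(actual_clickable, worker_labels, pos_label=1)

# The same analysis with clickable=False as the positive class
# corresponds to the second row of each round in Table 1.
precision_neg = precision_score(actual_clickable, worker_labels, pos_label=0)
recall_neg = recall_score(actual_clickable, worker_labels, pos_label=0)

print(f"clickable=True  precision={precision:.2%} recall={recall:.2%}")
print(f"clickable=False precision={precision_neg:.2%} recall={recall_neg:.2%}")
```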
4.3.1 Element Type

Several element types have conventions for visual appearance; thus, users tend to
consistently perceive them as tappable (e.g., buttons) [25]. We examined how accurately
workers label each interface element type from a subset of Android class types in the
Rico dataset [6]. Figure 3 shows the distribution of tappable and not tappable ele-
ments by type labeled by human workers. Common tappable interface elements like
Button and Checkbox did appear more frequently in the set of tappable elements. For
each element type, we computed the accuracy by comparing the worker labels to the
view hierarchy clickable values. For tappable elements, the workers achieved
high accuracy for most types. For not tappable elements, the two most common
types, TextView and ImageView, had low accuracy percentages of only 67 and 45%,
respectively. These interface types allow more flexibility in design than standard ele-
ment types (e.g., RadioButton). Unconventional styles may make an element more
prone to ambiguity in tappability.

Fig. 3 The number of tappable and not tappable elements in several type categories, with the bars colored by the relative amounts of correct and incorrect labels
4.3.2 Location
We hypothesized that an element’s location on the screen may have influenced the
accuracy of workers in labeling its tappability. Figure 4 displays a heatmap of the
accuracy of the workers' labels by location. We created the heatmap by computing
the accuracy per pixel, using the clickable attribute as the actual value, across the
20,174 elements we collected, based on the bounding box of each element. Warm
colors represent higher accuracy values. For tappable elements, workers were more accurate towards the
bottom of the screen than the center top area. Placing a not tappable element in these
areas might confuse people. For tappable elements, there are also two spots of high
accuracy in the top region. We speculate that this is because these spots are where apps
tend to place their Back and Forward buttons. For not tappable elements, the workers
were less accurate towards the screen bottom and highly accurate in the app header
bar area, with a corresponding area of low accuracy for tappable elements. This area
is not tappable in many apps, so people may not realize that an element placed there is
tappable.

Fig. 4 Heatmaps displaying the accuracy of tappable and not tappable elements by location, where warmer colors represent areas of higher accuracy. Workers labeled not tappable elements more accurately towards the upper center of the interface, and tappable elements towards the bottom center of the interface
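A minimal NumPy sketch of how such a per-pixel accuracy heatmap could be accumulated from element bounding boxes is shown below; the element tuples and screen resolution are hypothetical, and this is not the exact procedure used in the study:

```python
import numpy as np

# Hypothetical elements: (x1, y1, x2, y2, correctly_labeled) in downscaled screen coordinates.
elements = [
    (10, 20, 100, 60, True),
    (30, 500, 300, 560, False),
]

HEIGHT, WIDTH = 640, 360
correct = np.zeros((HEIGHT, WIDTH), dtype=np.float64)
total = np.zeros((HEIGHT, WIDTH), dtype=np.float64)

for x1, y1, x2, y2, is_correct in elements:
    # Accumulate counts over every pixel covered by the element's bounding box.
    total[y1:y2, x1:x2] += 1
    if is_correct:
        correct[y1:y2, x1:x2] += 1

# Per-pixel accuracy; pixels covered by no element are left as NaN.
with np.errstate(invalid="ignore", divide="ignore"):
    accuracy = np.where(total > 0, correct / total, np.nan)
```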
4.3.3 Size
There was only a small difference in average size between labeled tappable and not
tappable elements. However, tappable elements labeled as not tappable were 1.9
times larger than tappable elements labeled as tappable, indicating that elements with
large sizes were more often seen as not tappable. Examining specific element types
can reveal possible insights into why the workers may have labeled larger elements
as not tappable. TextView elements tend to display labels but can also be tappable
elements. From design recommendations, tappable elements should be labeled with
short, actionable phrases [32]. The text labels of not tappable TextView elements
have average and median sizes 1.48 and 1.55 times larger, respectively, than
those of tappable TextView elements. This gives us a hint that TextView elements
may be following these recommendations. For ImageView elements, the average
and median size for not tappable elements were 2.39 and 3.58 times larger than
for tappable elements. People may believe larger ImageView elements, typically
displaying images, to be less likely tappable than smaller ImageView elements.
4.3.4 Color
Based on design recommendations [3], color can also be used to signify tappability.
Figure 5 displays the top 10 dominant colors in each class of labeled tappable and
not tappable elements, which are computed using K-Means clustering. The dominant
colors for each class do not necessarily denote the same set. The brighter colors such
as blue and red have more presence, i.e., wider bars, in the pixel clusters for tappable
elements than those for not tappable ones. In contrast, not tappable elements have
more gray and white colors. We computed these clusters across the image pixels for
12 thousand tappable and 7 thousand not tappable elements and scaled them by the
proportion of elements in each set. These differences indicate that color is likely a
useful distinguishing factor.

Fig. 5 The aggregated RGB pixel colors of tappable and not tappable elements clustered into the 10 most prominent colors using K-Means clustering
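The dominant-color analysis can be sketched as follows with scikit-learn's K-Means implementation; the pixel array here is synthetic and stands in for the aggregated pixels of one class of elements:

```python
import numpy as np
from sklearn.cluster import KMeans

def dominant_colors(pixels, k=10):
    """Cluster an (N, 3) array of RGB pixels into k dominant colors.

    Returns the cluster centers and the fraction of pixels in each cluster,
    mirroring the kind of summary shown in Fig. 5.
    """
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    counts = np.bincount(kmeans.labels_, minlength=k)
    return kmeans.cluster_centers_, counts / counts.sum()

# Hypothetical input: random pixels standing in for all pixels of one class of elements.
rng = np.random.default_rng(0)
tappable_pixels = rng.integers(0, 256, size=(5000, 3)).astype(float)
centers, proportions = dominant_colors(tappable_pixels)
```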
4.3.5 Words
As not tappable textual elements are often used to convey information, the number
of words in these elements tends to be large. The mean number of words per element,
based on the log-transformed word count in each element, was 1.84 times greater
for not tappable elements (Mean: 2.62, Median: 2) than tappable ones (Mean: 1.42,
Median: 1). Additionally, the semantic content of an element’s label may be a distin-
guishing factor based on design recommendations [32]. We hypothesized that tap-
pable elements would contain keywords indicating tappability, e.g., “Login”. To test
this, we examined the top five keywords of tappable and not tappable elements using
TF-IDF analysis, with the set of words in all the tappable and not tappable elements as
two individual documents. The top 2 keywords extracted for tappable elements were
“submit” and “close”, which are common signifiers of actions. However, the remain-
ing keywords for tappable elements, i.e., “brown”, “grace” and “beauty”, and the
top five keywords for not tappable elements, i.e., “wall”, “accordance”, “recently”,
“computer”, and “trying”, do not appear to be actionable signifiers.
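A minimal sketch of this TF-IDF keyword extraction, treating all words from tappable elements and all words from not tappable elements as two documents, might look as follows (the toy strings below are only placeholders for the real word sets):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Two "documents": all words from tappable elements and all words from
# not tappable elements (hypothetical toy strings).
docs = [
    "submit close login brown grace beauty",      # tappable elements
    "wall accordance recently computer trying",   # not tappable elements
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs).toarray()
vocab = np.array(vectorizer.get_feature_names_out())

for name, row in zip(["tappable", "not tappable"], tfidf):
    top = vocab[np.argsort(row)[::-1][:5]]   # five highest-weighted keywords
    print(name, top)
```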
Informed by these findings, we took a deep learning approach to address the problem. Overall, our model is a feedforward neural
network with a deep architecture (multiple hidden layers). It takes a concatenation
of a range of features about the element and its screen and outputs a probability of
how likely a human user would perceive an interface element as tappable.
Our model takes as input several features collected from the view hierarchy metadata
and the screenshot pixel data of an interface. For each element under examination,
our features include (1) semantics and functionality of the element, (2) the visual
appearance of the element and the screen, and (3) the spatial context of the element
on the screen.
The length and the semantics of an element's text content are both potential tappability
signifiers. For each element, we scan the text using OCR. To represent the
semantics of the text, we use word embeddings, a standard way of mapping word
tokens into continuous dense vectors that can be fed into a deep learning model.
We encode each word token in an element as a 50-dimensional vector representation
that is pre-learned from a Wikipedia corpus [28]. When an element contains multiple
words, we treat them as a bag of words and apply max pooling to their embedding
vectors to acquire a single 50-dimensional vector as the semantic representation of
the element. We also encode the number of word tokens each element contains as a
scalar value normalized by an exponential function.
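A rough sketch of this text encoding is shown below, assuming the 50-dimensional embeddings are already loaded; the exact exponential normalization of the word count is not specified in the text, so the squashing function used here is only one plausible choice:

```python
import numpy as np

# Hypothetical pre-trained 50-dimensional word embeddings (e.g., GloVe vectors).
embeddings = {"submit": np.random.randn(50), "order": np.random.randn(50)}
UNK = np.zeros(50)

def text_features(words):
    # Bag of words: look up each token and max-pool the embedding vectors.
    vectors = [embeddings.get(w, UNK) for w in words] or [UNK]
    pooled = np.max(np.stack(vectors), axis=0)        # 50-dim semantic feature
    # Word count squashed into (0, 1) by an exponential; the exact function
    # used in the chapter is not specified, this is just one plausible choice.
    count_feature = 1.0 - np.exp(-len(words))
    return np.concatenate([pooled, [count_feature]])  # 51-dim feature

features = text_features(["submit", "order"])
```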
There are many standard element types that users have learned over time (e.g., buttons
and checkboxes) [25]. However, new element types are frequently introduced (e.g.,
floating action button). In our model, we include an element type feature as an
indicator of the element’s semantics. This feature allows the model to potentially
account for these learned conventions, as a user's background plays an important
role in their decision. To encode the Type feature, we include a set of the 22 most
common interface element types, e.g., TextView or Button. We represent the Type in
the model as a 22-dimensional categorical feature, and collapse it into a 6-dimensional
embedding vector for training, which provides better performance than sparse input.
Each type comes with a built-in or specified clickable attribute that is encoded as
either 0 or 1.
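A minimal Keras-style sketch of this categorical encoding is shown below; the input names are illustrative rather than the actual feature names used in the model:

```python
import tensorflow as tf

NUM_TYPES = 22   # the 22 most common interface element types

# Integer id of the element type plus its declared clickable attribute (0 or 1).
type_id = tf.keras.Input(shape=(1,), dtype=tf.int32, name="type_id")
clickable = tf.keras.Input(shape=(1,), dtype=tf.float32, name="clickable")

# Collapse the 22-way categorical feature into a dense 6-dimensional embedding.
type_embedding = tf.keras.layers.Embedding(input_dim=NUM_TYPES, output_dim=6)(type_id)
type_embedding = tf.keras.layers.Flatten()(type_embedding)

type_features = tf.keras.layers.Concatenate()([type_embedding, clickable])
```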
As previously discussed, visual design signifiers such as color distribution can help
distinguish an element's tappability. It is difficult to articulate the aspects of visual
perception that might come into play and to realize them as executable rules. As a result,
we feed the raw pixel values of an element, and of the screen to which the element belongs,
into the network through convolutional layers, a popular method for image processing. We resize the
pixels of each element and format them as a 3D matrix in the shape of 32 × 32 × 3,
where the height and width are 32, and 3 is the number of RGB channels. Contextual
factors on the screen may affect a user's perception of tappability. To capture
the context, we resize and format the entire screen as another visual feature. This
manifests as a 3D matrix in the shape of 300 × 168 × 3 and preserves the original
aspect ratio. As we discuss later, a screen contains useful information for predicting
an element’s tappability even though such information is not easy to articulate.
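The pixel preprocessing described above can be sketched as follows with Pillow and NumPy; the normalization to [0, 1] and the synthetic screenshot are assumptions for illustration:

```python
import numpy as np
from PIL import Image

def pixel_features(screenshot, bounds):
    """Crop and resize the element and screen pixels as described above.

    screenshot: PIL.Image of the full screen; bounds: (left, top, right, bottom)
    of the element. Returns float arrays of shape (32, 32, 3) and (300, 168, 3).
    """
    element = screenshot.crop(bounds).resize((32, 32))
    # Resize the whole screen to 168x300 (width x height in PIL terms),
    # roughly preserving a portrait phone aspect ratio.
    screen = screenshot.resize((168, 300))
    to_array = lambda img: np.asarray(img.convert("RGB"), dtype=np.float32) / 255.0
    return to_array(element), to_array(screen)

# Hypothetical usage with a synthetic screenshot.
fake_screen = Image.new("RGB", (1080, 1920), color=(240, 240, 240))
element_px, screen_px = pixel_features(fake_screen, (100, 200, 300, 320))
```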
Fig. 6 The model architecture. The bag-of-words embedding and word count, the element type and intended clickability, the element pixels, and the screen pixels are each processed (the two pixel inputs through convolutional layers) and concatenated into fully connected layers
Figure 6 illustrates our model architecture. To process the element and screenshot
pixels, our network has three convolutional layers with ReLU [23] activation. Each
convolutional layer applies a series of 8 3 × 3 filters to the image to help the model
progressively create a feature map. Each convolutional layer is followed by a 2 × 2
max pooling layer to reduce the dimensionality of the image data for processing.
Finally, the output of the image layers is concatenated with the rest of the features
and fed into a series of two fully connected 100-dimensional dense layers using ReLU [23]
as the activation function. The output layer produces a binary classification of an
element’s tappability using a sigmoid activation function to transform the output
into probabilities from zero to one. The probability indicates how likely the user
would perceive the element as tappable. We trained the model by minimizing the
sigmoid cross-entropy loss between the predicted values and the binary human labels
on tappability of each element in the training data. For loss minimization, we used
the Ada adaptive gradient descent optimizer with a learning rate of 0.01 and a batch
size of 64. To avoid model overfitting, we applied a dropout ratio of 40% to each
fully connected layer to regularize the learning. We built our model using Tensorflow
[1] in Python and trained it on a Tesla V100 GPU.
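A minimal Keras sketch of a model in the spirit of this architecture is shown below. The layer sizes follow the description above, but the input feature dimensions, the use of Adagrad as the adaptive gradient optimizer, and other details are assumptions rather than the exact configuration used in this work:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_stack(inputs):
    # Three convolutional layers of 8 3x3 filters, each followed by 2x2 max pooling.
    x = inputs
    for _ in range(3):
        x = layers.Conv2D(8, 3, activation="relu", padding="same")(x)
        x = layers.MaxPooling2D(2)(x)
    return layers.Flatten()(x)

element_px = tf.keras.Input(shape=(32, 32, 3), name="element_pixels")
screen_px = tf.keras.Input(shape=(300, 168, 3), name="screen_pixels")
text_feat = tf.keras.Input(shape=(51,), name="text_features")       # assumed: BOW embedding + word count
type_feat = tf.keras.Input(shape=(7,), name="type_and_clickable")   # assumed: 6-dim type embedding + flag

x = layers.Concatenate()([conv_stack(element_px), conv_stack(screen_px), text_feat, type_feat])
for _ in range(2):
    x = layers.Dense(100, activation="relu")(x)
    x = layers.Dropout(0.4)(x)   # 40% dropout to regularize the fully connected layers
output = layers.Dense(1, activation="sigmoid", name="p_tappable")(x)

model = tf.keras.Model([element_px, screen_px, text_feat, type_feat], output)
# The chapter reports an adaptive gradient optimizer with learning rate 0.01 and
# sigmoid cross-entropy loss; Adagrad is used here as a stand-in.
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.01),
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])
```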
We evaluated our model using tenfold cross validation with the crowdsourced dataset.
In each fold, we used 90% of the data for training and 10% for validation, and trained
our model for 100,000 iterations. Similar to an information retrieval task, we exam-
ine how well our model can correctly retrieve elements that users would perceive
as tappable. We select an optimal threshold based on Precision-Recall AUC. Our
model achieved a mean precision and recall, across the 10 folds of the experiment,
of 90.2% (SD: 0.3%) and 87.0% (SD: 1.6%). To understand what these numbers
imply, we analyzed how well the clickable attribute in the view hierarchy pre-
dicts user tappability perception: precision 89.9% (SD: 0.6%) and recall 79.6% (SD:
0.8%). While our model offers a minor improvement in precision, it outperforms the
clickable attribute on recall considerably, by over 7 percentage points.
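The threshold selection from the precision-recall curve can be sketched as follows; the chapter does not specify the exact criterion applied to the curve, so maximizing F1 over candidate thresholds is used here only as one plausible choice, with hypothetical validation outputs:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

# Hypothetical validation-fold outputs: human labels and model probabilities.
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.8, 0.4, 0.7, 0.35, 0.2, 0.6, 0.55])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
pr_auc = auc(recall, precision)

# One simple way to pick an operating threshold from the PR curve:
# maximize the F1 score over the candidate thresholds.
f1 = 2 * precision[:-1] * recall[:-1] / np.maximum(precision[:-1] + recall[:-1], 1e-9)
best = np.argmax(f1)
print(f"PR-AUC={pr_auc:.3f}, threshold={thresholds[best]:.2f}, "
      f"precision={precision[best]:.2%}, recall={recall[best]:.2%}")
```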
Although identifying not tappable elements is less important in real scenarios,
to better understand the model, we report the performance concerning not tappable
elements as the target class. Our model achieved a mean precision of 70% (SD: 2%)
and recall of 78% (SD: 3%), which improves precision by 9%, with a similar recall,
over the clickable attribute (precision 61%, SD: 1% and recall 78%, SD: 2%).
One potential reason that not tappable elements have a relatively low accuracy is that
they tend to be more diverse, leading to more variance in the data.
In addition, our original dataset had an uneven number of tappable and not tap-
pable elements (14,301 versus 5,871), likely causing our model to achieve higher
precision and recall for tappable elements than not tappable ones. Therefore, we
created a balanced dataset by upsampling the minority class (i.e., not tappable). On
the balanced dataset, our model achieved a mean precision and recall of 82 and 84%
for identifying tappable elements, and a mean precision and recall of 81 and 86% for
not tappable elements. Table 2 shows the confusion matrix for the balanced dataset.

Table 2 Confusion matrix for the balanced dataset, averaged across the 10 cross-validation experiments

                       Predicted tappable   Predicted not tappable
Actual tappable        1195                 260
Actual not tappable    235                  1170
Compared to using the view hierarchy clickable attribute alone, which achieved
mean precision of 79% and recall of 80% for predicting tappable elements, and 79
and 78% for not tappable ones, our model is consistently more accurate across all the
metrics. These performance improvements show that our model can effectively help
developers or designers identify tappability misperceptions in their mobile interfaces.
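Upsampling the minority class to balance the dataset can be sketched as follows; this is a generic random-duplication scheme, not necessarily the exact procedure used here:

```python
import numpy as np

def upsample_minority(features, labels, seed=0):
    """Randomly duplicate minority-class examples until both classes are equal in size."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    minority = classes[np.argmin(counts)]
    minority_idx = np.flatnonzero(labels == minority)
    extra = rng.choice(minority_idx, size=counts.max() - counts.min(), replace=True)
    keep = np.concatenate([np.arange(len(labels)), extra])
    rng.shuffle(keep)
    return features[keep], labels[keep]

# Hypothetical usage with a toy feature matrix.
X = np.arange(20).reshape(10, 2)
y = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
X_bal, y_bal = upsample_minority(X, y)
```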
We speculate that our model did not achieve even higher accuracy because human
perception of tappability can be inherently inconsistent as people have their own
experience in using and learning different sets of mobile apps. This can make it
challenging for the model to achieve perfect accuracy. To examine our hypothesis,
we collected another dataset via crowdsourcing using the same interface as shown
in Fig. 2. We selected 334 screens from the Rico dataset, which were not used in our
previous rounds of data collection. We recruited 290 workers to perform the same
task of marking each selected element as either tappable or not tappable. However,
each element was labeled by 5 different workers to enable us to see how much these
workers agree on the tappability of an element. In total, there were 2,000 unique
interface elements, and each was labeled 5 times. Of these, 1,163 elements (58%)
were labeled entirely consistently by all 5 workers, including both tappable and not
tappable elements. We report two metrics to analyze the consistency of the data
statistically. The first is in terms of an agreement score [35] that is computed using
the following formula:
$$A = \frac{\sum_{e \in E} \sum_{r \in R} \left( \frac{|R_i|}{|R_e|} \right)^2}{|E|} \times 100\% \qquad (1)$$
In Eq. 1, e is an element in the set of all interface elements E that were rated by the
workers, R_e is the set of ratings for an interface element e, and R_i is the set of ratings
in a single category (0: not tappable, 1: tappable). We also report the consistency of
the data using Fleiss’ Kappa [11], a standard inter-rater reliability measure for the
agreement between a fixed number of raters assigning categorical ratings to items.
This measure is useful because it computes the degree of agreement over what would
be expected by chance. As there are only two categories, the agreement by chance is
high. The overall agreement score across all the elements using Eq. 1 is 83.43%. The
number of raters is 5 for each element on a screen, and across 334 screens, resulting
in an overall Fleiss’ Kappa value of 0.520 (SD = 0.597, 95% CI [0.575,0.618], P =
0). This corresponds to a “Moderate” level of agreement according to [18]. What these
results demonstrate is that, while there is a significant amount of consistency in the
data, there still exists a certain level of disagreement on what elements are tappable
versus not tappable. Particularly, consistency varies across element Type categories.
For example, View and ImageView elements were labeled far less consistently (0.52,
0.63) than commonplace tappable element types such as Button (94%), Toolbar
(100%), and CheckBox (95%). View and ImageView elements have more flexibility
in design, which may lead to more disagreement.
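A direct transcription of Eq. 1 into Python might look as follows, with a small hypothetical set of ratings:

```python
from collections import Counter

def agreement_score(ratings_per_element):
    """Agreement score of Eq. 1: for each element, sum the squared fraction of
    ratings falling in each category, then average over elements (as a percentage)."""
    total = 0.0
    for ratings in ratings_per_element:          # e.g. [1, 1, 1, 0, 1]
        counts = Counter(ratings)
        total += sum((n / len(ratings)) ** 2 for n in counts.values())
    return 100.0 * total / len(ratings_per_element)

# Hypothetical example: three elements, each rated by five workers.
print(agreement_score([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0], [0, 0, 0, 0, 1]]))
```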
To understand how our model predicts elements with ambiguous tappability, we
test our previously trained model on this new dataset. Our model matches the uncer-
tainty in human perception of tappability surprisingly well (see Fig. 7). When workers
are consistent on an element’s tappability (two ends on the X axis), our model tends
to give a more definite answer—a probability close to 1 for tappable and close to 0
for not tappable. When workers are less consistent on an element (towards the middle
of the X axis), our model predicts a probability closer to 0.5.

Fig. 7 The scatterplot of the tappability probability output by the model (the Y axis) versus the consistency in the human worker labels (the X axis) for each element in the consistency dataset
One motivation to use deep learning is to alleviate the need for extensive feature
engineering. Recall that we feed the entire screenshot of an interface to the model to
capture contextual factors affecting the user’s decision that cannot be easily articu-
lated. Without the entire screenshot as input, there is a noticeable drop in precision
and recall for tappable elements of 3 and 1%, and for not tappable elements, an 8% drop in precision but
no change in recall. This indicates that there is useful contextual information in the
screenshot affecting the users' decisions on tappability. We also examined removing
the Type feature from the model, and found a slight drop in precision of about 1% but
no change in recall for identifying tappable elements. The performance change is
similar for the not tappable case, with a 1.8% drop in precision and no drop in recall.
We speculate that removing the Type feature caused only a minor impact because
our model has already captured some of the element type information through its pixels.
Fig. 8 The TapShoe interface. An app designer drags and drops a UI screen on the left. TapShoe
highlights interface elements whose predicted tappability is different from its actual tappable state
as specified in its view hierarchy
6 TapShoe Interface
We created a web interface for our tappability model called TapShoe (see Fig. 8).
The interface is a proof of concept tool to help app designers and developers examine
their UI’s tappability. We describe the TapShoe interface from the perspective of an
app designer, Zoey, who is designing an app for deal shopping, shown in the right
hand side of Fig. 8. Zoey has redesigned some icons to be more colorful on the
home page links for “Coupons”, “Store Locator”, and “Shopping”. Zoey wants to
understand how the changes she has made would affect the users’ perception of which
elements in her app are tappable. First, Zoey uploads a screenshot image along its
view hierarchy for her app by dragging and dropping them into the left hand side of
the TapShoe interface. Once Zoey drops her screenshot and view hierarchy, TapShoe
analyzes her interface elements, and returns a tappable or not tappable prediction
for each element. The TapShoe interface highlights the interface elements with a
tappable state, as specified by Zoey in the view hierarchy, that does not match up
with user perception as predicted by the model.
Zoey sees that the TapShoe interface highlighted the three colorful icons she
redesigned. These icons were not tappable in her app but TapShoe predicted that the
users would perceive them as tappable. She examines the probability scores for each
element by clicking on the green hotspots on the screenshot to see informational
tooltips. She adjusts the sensitivity slider to change the threshold for the model’s
prediction. Now, she sees that the “Coupons” and “Store Locator” icon are not high-
lighted and that the arrow icon has the highest probability of being perceived as
tappable. She decides to make all three colorful icon elements interactive and extend
the tappable area next to “Coupons”, “Store Locator”, and “Website”. These fixes
spare her users the frustration of tapping on these elements and getting no response.
We implemented the TapShoe interface as a web application (JavaScript) with a
Python web server. The web client accepts an image and a JSON view hierarchy
to locate interface elements. The web server queries a trained model, hosted via a
Docker container with the Tensorflow model serving API, to retrieve the predictions
for each element.
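A minimal sketch of how such a client might query a TensorFlow Serving REST endpoint is shown below; the model name, port, and feature layout are assumptions, and the actual payload must match the serving signature of the deployed model:

```python
import requests

# Hypothetical endpoint: a TensorFlow Serving container exposing the REST API,
# with a model named "tappability" (name and payload layout are assumptions).
SERVING_URL = "http://localhost:8501/v1/models/tappability:predict"

def predict_tappability(element_features):
    """Send one batch of per-element feature dicts and return probabilities."""
    response = requests.post(SERVING_URL, json={"instances": element_features})
    response.raise_for_status()
    # Assumes each prediction is a single-element list holding the sigmoid output.
    return [pred[0] for pred in response.json()["predictions"]]

# Each instance must match the serving signature of the deployed model, e.g.:
# probabilities = predict_tappability([{"element_pixels": ..., "screen_pixels": ...}])
```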
To understand how the TapShoe interface and tappability model would be useful in a
real design context, we conducted informal design walkthroughs with 7 professional
interface designers at a large technology company. The designers worked on design
teams for three different products. We demonstrated TapShoe to them and collected
informal feedback on the idea of getting predictions from the tappability model, and
on the TapShoe interface for helping app designers identify tappability mismatches.
We also asked them to envision new ways they could use the tappability prediction
model beyond the functionality of the TapShoe interface. The designers responded
positively to the use of the tappability model and TapShoe interface, and gave several
directions to improve the tool. In particular, the following themes emerged.
The designers saw high potential in being able to get a tappability probability score for
their interface elements. Currently, the TapShoe interface displays only probabilities
for elements with a mismatch based on the threshold set by the sensitivity slider.
However, several of the designers mentioned that they would want to see the scores for
all the elements. This could give them a quick glance at the tappability of their designs
as a whole. Presenting this information in a heatmap that adjusts the colors based on
the tappability scores could help them compare the relative level of tappability of each
element. This would allow them to deeply examine and compare interface elements
for which tappability signifiers are having an impact. The designers also mentioned
that sometimes they do not necessarily aim for tappability to be completely binary.
Tappability could be targeted to be higher or lower along a continuous scale depending
on an element’s importance. In an interface with a primary action and a secondary
action, they would be more concerned that people perceive the primary action as
tappable than the secondary action.
The designers also pointed out the potential of the tappability model for helping
them systematically explore variations. TapShoe’s interface only allows a designer
to upload a single screen. However, the designers envisioned an interface to allow
them to upload and compare multiple versions of their designs to systematically
change signifiers and observe how they impact the model’s prediction. This could
help them discover new design principles to make interface elements look more or
less tappable. It could also help them compare more granular changes at an element
level, such as different versions of a button design. As context within a design can
also affect an element’s tappability, they would want to move elements around and
change contextual design attributes to have a more thorough understanding of how
context affects tappability. Currently, the only way for them to have this information
is to conduct a large tappability study, which limits them to trying out only a few
design changes at a time. Having the tappability model output could greatly expand
their current capabilities for exploring design changes that may affect tappability.
Several designers wondered whether the model could extend to other platforms. For
example, their designs for desktop or web interfaces could benefit from this type of
model as well. Additionally, they had already collected data that our model could
use for training. We believe our model could help them in this case, as it would be
simple to extend to other platforms or to use existing tappability data for training.
We also asked the designers how they felt about the accuracy of our model.
The designers believed the model could be useful in its current state even for helping
them understand the relative tappability of different elements. Providing a confidence
interval for the prediction could aid in giving them more trust in the prediction.
8 Discussion
Our model achieves good accuracy at predicting tappable and not tappable interface
elements and the TapShoe tool and model are well received by designers. Here we
discuss the limitations and directions for future work.
One limitation is that our TapShoe interface, as a proof of concept, demonstrates
one of many potential uses for the tappability model. We intend to build a more
complete design analytics tool based on designers’ suggestions, and conduct further
studies of the tool by following its use in a real design project. In particular, we will
update the TapShoe interface to take early-stage mockups in addition to UI screens that
are equipped with a view hierarchy. This is possible because designers can mark up
elements to be examined in a mockup without having to implement it.
Our tappability model is also only trained on Android interfaces, and therefore,
the results may not generalize well to other platforms. However, our model relies on
general features available in many UI platforms (e.g., element bounding boxes and
types). It would be entirely feasible to collect a similar dataset for different platforms
to train our model and the cost for crowdsourcing labeling is relatively small. In fact,
we can apply a similar approach to new UI styles that involve drastically different
design concepts, e.g., emerging UI styles in AR/VR. This is one of the benefits
to use a data-driven, deep learning approach, as shown in this work, which can be
easily scalable across new interaction styles or UI platforms without extensive feature
engineering and model redesign.
9 Future Work
One direction for future work is to extend the annotation task beyond a simple binary rating of tappable versus not tappable to a
rating incorporating uncertainty, e.g., adding a “Not sure” option or a confidence
scale in labels.
The tappability model that we developed is a first step towards modeling tappa-
bility. There also may be other features that could add predictive power to the model.
As we begin to understand more of the features that people use to determine which
elements are tappable and not tappable, we can incorporate these new features into
a deep learning model as long as they are manifested in the data. For example, we
used the Type feature as a way to account for learned conventions, i.e., the behavior
that users have learned over time. As users do not make tappability decisions solely
based on the visual properties of the current screen, we intend to investigate more
high-level features that can capture user background, which has a profound effect on
user perception.
Additionally, identifying the reasons behind tappable or not tappable perception
could potentially enable us to offer designers recommendations for improving the
users' tappability perception. This also requires us to communicate these reasons to
the designer in a human-understandable fashion. There are two approaches to pursue
this. One is to analyze how the model relies on each feature, although understanding
the behavior of a deep learning model is challenging and remains an active research
area in the deep learning field. The other approach is to train the model to recognize
the human reasoning behind each selection. Progress in this direction will allow a tool to provide
designers a more complete and useful tappability report.
Finally, we believe there are potentially many other aspects of usability that can
be assessed in the way that our work assesses tappability. Some work has already
explored predicting some new usability aspects like visual search time [38] and
user engagement with animations [36]. However, such works have been limited
to predicting usability issues for a single screen. Some aspects of usability (e.g.,
difficulty or predictability of a task, user task completion) that span across multiple
screens have not yet been modeled. Automating such analyses will require creating
methods to collect and label relevant data at a large scale and operationalize them in
models for prediction. Deep learning is again well equipped to do so, as techniques for
sequence modeling have advanced substantially [15, 33]. We already see models
such as LSTMs [15] being successfully applied in interaction sequence modeling for
interaction tasks such as menu selection [20].
10 Conclusion
As a proof of concept, we introduced TapShoe, a tool that uses the deep model to
examine interface tappability, which received positive feedback from 7 professional
interaction designers who saw its potential as a useful tool for their real design
projects. This work serves as an example of how we can apply state-of-the-art deep
learning methods such as Convolutional Neural Nets to assess a crucial aspect of
interface usability—tappability, and showcase the nuts and bolts of conducting research
at the intersection of interaction design and deep learning. We hope that this work can
serve as an inspiration to researchers to further investigate methods for data collection
and automated assessment of usability using deep learning methods, so that designers
can leverage AI to more quickly and thoroughly evaluate their interfaces.
References
13. Gibson JJ (1978) The ecological approach to the visual perception of pictures. Leonardo
11(3):227–235
14. Greenberg MD, Easterday MW, Gerber EM (2015) Critiki: a scaffolded approach to gathering design feedback from paid crowd workers. In: Proceedings of the 2015 ACM SIGCHI conference on creativity and cognition, C&C'15. ACM, New York, NY, USA, pp 235–244. https://doi.org/10.1145/2757226.2757249
15. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
16. Kittur A, Chi EH, Suh B (2008) Crowdsourcing user studies with mechanical turk. In: Pro-
ceedings of the SIGCHI conference on human factors in computing systems, CHI’08. ACM,
New York, NY, USA, pp 453–456. https://doi.org/10.1145/1357054.1357127
17. Komarov S, Reinecke K, Gajos KZ (2013) Crowdsourcing performance evaluations of user
interfaces. In: Proceedings of the SIGCHI conference on human factors in computing systems,
CHI’13. ACM, New York, NY, USA, pp 207–216. https://doi.org/10.1145/2470654.2470684
18. Landis JR, Koch GG (1977) An application of hierarchical kappa-type statistics in the assess-
ment of majority agreement among multiple observers. Biometrics 363–374
19. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436
20. Li Y, Bengio S, Bailly G (2018) Predicting human performance in vertical menu selection using
deep learning. In: Proceedings of the 2018 CHI conference on human factors in computing
systems, pp 1–7
21. Luther K, Tolentino J-L, Wu W, Pavel A, Bailey BP, Agrawala M, Hartmann B, Dow SP
(2015) Structuring, aggregating, and evaluating crowdsourced design critique. In: Proceed-
ings of the 18th ACM conference on computer supported cooperative work & social comput-
ing, CSCW’15. ACM, New York, NY, USA, pp 473–485. https://doi.org/10.1145/2675133.
2675283
22. Material Design Guidelines (2018). https://material.io/design/
23. Nair V, Hinton GE (2010) Rectified linear units improve restricted boltzmann machines. In:
Proceedings of the 27th international conference on machine learning (ICML), pp 807–814
24. Nebeling M, Speicher M, Norrie MC (2013) Crowdstudy: general toolkit for crowdsourced
evaluation of web interfaces. In: Proceedings of the SIGCHI symposium on engineering inter-
active computing systems. ACM, pp 255–264. https://doi.org/10.1145/2494603.2480303
25. Norman D (2013) The design of everyday things: revised and expanded edition. Constellation
26. Norman DA (1999) Affordance, conventions, and design. Interactions 6(3):38–43. https://doi.org/10.1145/301153.301168
27. Norman DA (2008) The way I see it: signifiers, not affordances. Interactions 15(6):18–19. https://doi.org/10.1145/1409040.1409044
28. Pennington J, Socher R, Manning C (2014) GloVe: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing, EMNLP'14, pp 1532–1543
29. Pfeuffer K, Li Y (2018) Analysis and modeling of grid performance on touchscreen mobile
devices. In: Proceedings of the SIGCHI conference on human factors in computing sys-
tems, CHI’18. ACM, New York, NY, USA, pp 288:1–288:12. https://doi.org/10.1145/3173574.
3173862
30. Schneider H, Frison K, Wagner J, Butz A (2016) Crowdux: a case for using widespread and
lightweight tools in the quest for ux. In: Proceedings of the 2016 ACM conference on designing
interactive systems, DIS’16. ACM, New York, NY, USA, pp 415–426. https://doi.org/10.1145/
2901790.2901814
31. Swearngin A, Li Y (2019) Modeling mobile interface tappability using crowdsourcing and deep
learning. In: Proceedings of the 2019 CHI conference on human factors in computing systems,
CHI ’19. Association for Computing Machinery, New York, NY, USA, pp 1–11. https://doi.
org/10.1145/3290605.3300305
32. Tidwell J (2010) Designing interfaces: patterns for effective interaction design. O’Reilly Media,
Inc
33. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems 30, pp 5998–6008
Abstract The human eye gaze is an important non-verbal cue that can unobtrusively
provide information about the intention and attention of a user to enable intelligent
interactive systems. Eye gaze can also be taken as input to systems as a replacement
of the conventional mouse and keyboard, and can also be indicative of the cognitive
state of the user. However, estimating and applying gaze in real-world applications
poses significant challenges. In this chapter, we first review the development of gaze
estimation methods in recent years. We especially focus on learning-based gaze
estimation methods which benefit from large-scale data and deep learning methods
that recently became available. Second, we discuss the challenges of using gaze
estimation for real-world applications and our efforts toward making these methods
easily usable for the Human-Computer Interaction community. At last, we provide
two application examples, demonstrating the use of eye gaze to enable attentive and
adaptive interfaces.
1 Introduction
The human eye has the potential to serve as a fast, pervasive, and unobtrusive way
to interact with the computer. Reliably detecting where a user is gazing allows
the eyes to be used as an explicit input method. Such a new way of interaction
has been shown to outperform traditional input devices such as the mouse due to
the ballistic movement of eye gaze [42, 52]. Moreover, it allows interaction under
circumstances where no external input device is available or operable by the user [27].

Fig. 1 The standard setting for remote camera-based gaze estimation. A camera captures images of the user's face region. The problem of gaze tracking is to infer the 3D gaze direction in the camera coordinate system or the 2D point-of-gaze (PoG) on the screen from the image recorded by the camera
Beyond explicit input, the movement patterns of a user’s eyes reveal information
about the cognitive processes, level of attention, and interests or abilities [6]. This
offers exciting opportunities to develop novel intelligent and interactive systems that
truly understand the user.
In this chapter, we focus on remote camera-based gaze estimation and its appli-
cations. This is typically done in a setting such as that depicted in Fig. 1 where a
camera is positioned at a certain distance from and facing the user’s eyes. The prob-
lem these methods aim to solve is to infer the 3D gaze direction or the 2D on-screen
point-of-gaze (PoG) from images recorded by the camera. The 3D gaze origin is
often defined to be at the center of the eye or face. Note that these gaze estimation
approaches can also be adapted to head-mounted devices such as those used in AR
and VR settings, though we do not discuss them in this chapter [23].
Estimating the gaze position of a user is a challenging task. Subtle movements of
the eyeball can change the gaze direction dramatically and the difficulty of the task
varies greatly across people. Reliably determining where a user is looking on a screen
or inside a room has been an active research topic for several decades. Classic gaze
estimation methods often use high-resolution machine vision cameras and corneal
reflections from infra-red illuminators to determine the gaze direction [16]. These
methods can provide reasonable gaze estimation accuracy of around one degree after
personal calibration in well-controlled environments. However, dedicated hardware
is essential for their performance, which limits their use in real-world applications.
The rise of AI methods, such as deep learning approaches, has advanced the use
of learning-based gaze estimation methods. In contrast to classic methods, learning-
based methods are based on purposefully designed machine learning models, for
example, neural networks, for the gaze estimation task. These learning-based meth-
ods either estimate the gaze position directly from an image of the user’s eye or
face [24, 56, 61], or derive intermediate eye features for gaze direction regres-
sion [33, 47]. This group of methods often assumes an unmodified environment.
That is, no additional infra-red illumination is available to provide reflections on the
surface of the cornea. Hence, learning-based gaze estimation methods can work with
a single off-the-shelf webcam [33, 34, 36, 55, 56]. This makes these approaches
more widely and more easily applicable for human-computer interaction (HCI) in
everyday settings [54, 57].
Still, many challenges persist in making gaze tracking practicable for computer
interaction. For example, personal calibration plays a major role in gaze estimation
and also has an impact on user experience. The calibration procedure often requires
users to focus on designated points for a period of time. This can be cumbersome or
in some cases even impossible and disturbs the user experience. Nevertheless, per-
sonal calibration is crucial for many gaze estimation methods to perform accurately.
Thus, recently researchers have built on AI-based techniques for gaze redirection to
generate additional eye images for personalization and thus reduce the number of
calibration samples [51]. Other researchers have worked on providing easy-to-use
software toolkits for making learning-based gaze estimation methods accessible to
HCI researchers and developers [59].
Designing useful and usable gaze-aware interfaces is another major challenge. In
practice, tracking accuracy and precision vary largely depending on factors such as
the tracking environment, user characteristics, and others [7]. In comparison to mouse
or touch input, eye tracking might yield a highly noisy signal with poor accuracy.
Still, information about eye gaze, even from noisy data, can enable novel and useful
interactions. However, design guidelines developed for traditional interfaces cannot
be applied here. Instead, we need new design approaches making efficient use of the
noisy gaze signal.
In this chapter, we first provide some background on the problem of gaze tracking.
We then offer an overview of recent approaches toward improving performance on the
gaze estimation task with the power of AI. We then discuss the practical challenges
when applying gaze estimation methods for computer interaction and designing gaze-
aware interfaces, offering concrete design guidelines and actionable insights for the
HCI community. Finally, we describe two application examples: (1) gaze-aware
interaction with real-life objects and (2) automatic interface adaptation by assessing
information relevance from users’ eye movements. These examples showcase the
exciting opportunities gaze tracking offers for Human-Computer Interaction.
2 Background
In the following, we start with a brief introduction to the human eye, its movements,
and the relation to human attention. We then discuss different categories of gaze
estimation methods and introduce learning-based methods. Lastly, we briefly discuss
the need for the personal calibration of gaze estimators and how this has been done
in existing works.
The human visual field spans about 114° [20], of which we can see sharply only within an
area of about 1° [2] during so-called fixations, when the gaze is focused on a fixed position
in the environment. To perceive information from a larger area, the eyes perform
saccades, fast ballistic movements that allow us to move between fixation points to
integrate information from other areas. See, for example, [40] for further introduction
into the working principles of the human eye gaze. The duration and frequency of
such fixations and saccades can provide information about a user’s attention. It can
be used by interactive systems in combination with their awareness of the visual
stimulus or interface to enable explicit gaze input or make further inferences about
a user’s cognitive state.
However, a person does not always consciously control their eye gaze. Often, it is
stimulus-driven and attracted by visual features, or “idles” in uninteresting regions.
Thus, there is a difference between the eyes focusing on a point and a person’s
covert attention (i.e. their mental focus). Even when focusing on a certain point,
people can shift their conscious attention within the larger field of view similar to a
spotlight and to some extent independent of the gaze position. This allows them to
not just passively perceive information but visually process and encode it for further
cognitive processing [39]. A major challenge for using gaze for HCI is to isolate and
analyze the underlying cognitive processes from such noisy gaze behavior where
overt and covert attention are mixed. In the later part of this chapter, we describe
some applications that aim to make sense of noisy gaze behavior [7, 54].
The gaze estimation methods considered here try to infer information about where
a person is looking from an image of the user's eyes or face. They can be
categorized into three groups: model-based, feature-based, and appearance-based
methods [16]. In both model- and feature-based methods, key landmarks are often
required to be detected, such as the pupil center, eye corner, and iris contour. Generally
speaking, model-based methods fit a pre-defined 3D eyeball model to the detected
eye landmarks and take the direction from eyeball center to the pupil center as
the gaze direction [48, 49]. The eyeball model can optionally incorporate an offset
parameter which can be determined with personal calibration data [46]. Feature-
based methods take eye-region landmarks as features for the direct regression of
gaze direction [41]. Since the input feature dimension is limited by the number of
determined key points, these methods often cannot handle complex changes such as
large head movements. Both model-based and feature-based methods conventionally
demand accurate eye landmark detection, often necessitating complex or expensive
hardware setups. For example, multiple high-resolution infrared-light cameras along
with optimal infrared-light sources are the standard hardware configuration for most
of these methods. Appearance-based methods directly learn the mapping from the
eye or face image to the gaze direction [44]. Since there is no need for explicit eye
landmark detection (and corresponding training data annotation in the real world),
appearance-based methods can work with a single webcam without any additional
light source. However, these methods can be sensitive to illumination condition
changes or unfamiliar facial appearances due to the scarcity of training data.
Recent developments in deep learning have given rise to a large array of promising
learning-based gaze estimation methods. We refer to these methods as being learning-
based, in order to encompass hybrid methods [34, 50] as well as appearance-based
methods that benefit from large amounts of training data and highly complex neu-
ral network architectures [24, 53]. In particular, appearance-based gaze estimation
methods work with just a single webcam under challenging lighting conditions even
over long operating distances of up to 2 m [10, 59]. This is because deep convolu-
tional neural networks—when given large and varied amounts of training data—are
effective at defining useful image-based features, and thus often outperform hand-
defined features. Importantly, this allows for the new task of person-independent gaze
estimation. That is, a generic learning-based gaze estimation model can be directly
applied to a previously unseen user and achieve 4◦ –6◦ of mean angular error even in
very challenging conditions.
Integrating known priors such as the 3D structure of the eyeball or eyelids into
neural networks is a promising direction of research. A hierarchical generative model
has been proposed for improving gaze estimation by understanding how to control
and generate eye shape [47]. A so-called gazemaps representation has been used to
implicitly encode a 3D eyeball model and then taken as an intermediate output for
gaze direction regression [33]. Applying deep learning-based landmark localization
architectures for eye-region landmark detection has also been shown to be more
effective than traditional edge- or contour-based methods [11, 34].
A primary reason for this performance gap is the so-called “angle kappa”, the
angular difference between the line-of-sight of a user (the actual axis along which an
eye “sees”) and the optical axis of their eyeball (defined by the geometry of the
head and eye). For a more principled definition of angle kappa, please refer to [29].
This difference varies greatly across people with typical differences being two–three
degrees [29]. Importantly, the line-of-sight cannot be measured by a camera alone
as it is defined by the position of the fovea, which cannot be observed. The optical
axis, on the other hand, can be reasonably estimated from the appearance of the eye
region.
The classic literature tackles this issue by explicitly defining the kappa angle as
a parameter to an optimization problem. In all gaze estimator calibration methods, a
user is asked to gaze at specified points on a screen or in space. An optimization-based
scheme is then often applied to determine the user-specific parameters of the model.
An important consideration in these schemes is in requiring minimal “calibration
samples” from the end-user such as to make the experience less cumbersome and also
to enable spontaneously interactive applications in everyday scenarios. Conventional
approaches are quite effective in clean and controlled laboratory settings where the
position and shape of the iris and eyeball can be reasonably measured. In-the-wild
settings and unconstrained head movements of the user, however, pose significant
challenges that learning-based methods can easily address. However, learned models
can be tricky to adapt as user-specific parameters are usually not explicitly defined.
Several feasible calibration strategies have been suggested recently for learning-
based gaze estimation, either via optimization of user-specific parameters defined
at specific parts of the network or via eye-region image synthesis for personalized
training data generation. The more direct and effective approaches define parameters
which can be adapted based on a few labeled samples from the target user. Approaches
have been proposed to apply these parameters at the input [17, 25] and output [4] of
the neural network. As the primary factor in the difference between users is the angle
kappa, such low-dimensional definitions of user parameters are surprisingly effective.
Yet other approaches have been proposed for learning a light-weight regression model
from penultimate layer activations [24, 34], or gradient-based meta-learning as a
method for effective few-shot neural network adaptation [35]. A unique approach
suggests correcting an initially estimated gaze direction based on changes in the
appearance of the presented visual stimuli [36]. Importantly, this approach does not
require any explicit calibration but instead relies on the model having been trained
on paired eye gaze and visual stimulus data.
An alternative area of research is “gaze redirection”, where the objective is accurate, high-quality eye image generation with control over the gaze direction. While
earlier learning-based methods in this area focused only on the image synthesis aspect
[13], later works have shown that generating person-specific eye images with varying
gaze directions can allow for an alternative method of personal calibration. That is,
given a few samples from the target user, gaze redirection methods can be used to
create a training dataset tailored to the target user [51]. Though not directly related
to personal calibration, later works [18, 62] have further improved the accuracy
and quality of generated images and have shown that limited gaze datasets can be
augmented via such synthesis schemes.
The gaze estimation method we proposed in [53] was the first work to use a convo-
lutional neural network architecture for gaze estimation. Our later works extended
the architecture to much deeper networks such as VGG-16 and ResNet-50 [56, 60].
These works introduce a basic pipeline for image-based gaze estimation. That is,
given an input image taken from a single webcam, we learn a direct mapping to the
gaze direction (see Fig. 2). The first step in this pipeline is face detection and facial
landmark localization. Then, we fit a pre-defined 3D face model to the detected facial
landmarks to estimate the rotation and translation of the head. With this head pose
information, we perform a procedure known as “data normalization” to cancel the
rotation around the roll-axis and crop the eye or face image to a consistently defined
size [43]. This data normalization procedure was later optimized further to increase
its effectiveness toward improving gaze estimation performance [58]. Finally, the
cropped image, together with the head pose, is fed into a convolutional neural network to regress the final gaze direction in the camera coordinate system. The gaze direction can be represented as a three-dimensional vector in the Cartesian coordinate system. We choose to convert it to a two-dimensional vector in the spherical coordinate system, representing the polar and azimuthal angles. In this way, we remove one redundant degree of freedom from the gaze direction vector and center the output values around zero, which eases regression.
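As a concrete illustration of this output encoding, the following sketch converts between a 3D gaze vector and the two spherical angles (pitch and yaw); the sign and axis conventions are an assumption here and vary between implementations.

import numpy as np

def vector_to_pitchyaw(g):
    """Convert a 3D gaze vector to (pitch, yaw) angles in radians.

    Sign and axis conventions are assumed for illustration only.
    """
    g = g / np.linalg.norm(g)          # unit-length gaze direction
    pitch = np.arcsin(-g[1])           # vertical (polar) angle
    yaw = np.arctan2(-g[0], -g[2])     # horizontal (azimuthal) angle
    return np.array([pitch, yaw])

def pitchyaw_to_vector(p):
    """Inverse mapping, useful e.g. for computing angular errors of predictions."""
    pitch, yaw = p
    return np.array([
        -np.cos(pitch) * np.sin(yaw),
        -np.sin(pitch),
        -np.cos(pitch) * np.cos(yaw),
    ])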
Fig. 2 The gaze estimation pipeline proposed in [56] describes a pre-processing procedure for
extracting eye patches which are then input to a convolutional neural network that directly predicts
the gaze direction of the imaged eye
The final output of the pipeline shown in Fig. 2 is a 3D gaze direction value. Alterna-
tively, a different group of methods directly output the 2D gaze location on a screen
that the user is assumed to be gazing at. See [55] for an example of experiments
directly comparing between 2D and 3D gaze estimation. Apart from a change in the
dimensionality of the output value, 3D and 2D gaze estimation differ in practice in
terms of how the head position is integrated into the estimation pipeline. 3D gaze
estimation methods typically insert the 3D head orientation value (often referred to
as head pose) directly into the network as input to one of the last fully connected
layers. The task of 2D screen-space point-of-regard regression (2D gaze estimation),
however, theoretically requires more complex information such as the definition of
the pose, scale, and bounds of the screen plane as well as a reliable estimation of the
translation of the head in relation to the screen. This can be approximated by pro-
viding a binary “face grid” where the number of black pixels (as opposed to white
pixels) indicates the size and position of the user’s face [24]. While this alternative
2D problem formulation tackles the gaze estimation task more directly, its main
drawback is that the trained model is specific to the device used in the training data.
Hence, 2D models are not robust to changes such as the camera hardware, screen
size and pose, and other factors pertaining to the camera-screen relationship. 3D
gaze direction estimation is thus a more generic approach that can consolidate data
samples from different devices both at training time and test time.
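To make the face-grid idea concrete, the sketch below builds such a binary grid from a detected face bounding box; the grid resolution and the convention that marked cells denote the face region are illustrative assumptions rather than the exact choices of [24].

import numpy as np

def make_face_grid(face_box, frame_size, grid_size=25):
    """Binary grid encoding the size and position of the face in the camera frame.

    face_box:   (x, y, w, h) of the detected face in pixels.
    frame_size: (W, H) of the full camera frame in pixels.
    """
    x, y, w, h = face_box
    W, H = frame_size
    grid = np.zeros((grid_size, grid_size), dtype=np.float32)
    x0 = max(int(np.floor(x / W * grid_size)), 0)
    y0 = max(int(np.floor(y / H * grid_size)), 0)
    x1 = min(int(np.ceil((x + w) / W * grid_size)), grid_size)
    y1 = min(int(np.ceil((y + h) / H * grid_size)), grid_size)
    grid[y0:y1, x0:x1] = 1.0   # marked cells cover the face region
    return grid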
Early works in gaze estimation only take a single eye image as input since it is
often deemed to be sufficient for inferring gaze direction [43, 53]. However, learning-
based methods can be surprisingly effective in extracting information from seemingly
redundant image regions and thus regions beyond the single eye could be helpful for
training neural networks. Taking a face image along with both eye images [24] or
simply the two eye patches [5, 10] as input for gaze estimation can achieve better
performance than a model taking single eye input. We were the first to use a single
full-face image as input for the gaze estimation task [55] showing that this achieved
the best results compared to other kinds of input regions. Furthermore, to fully use
the information of the full-face, we proposed a soft attention mechanism [55] and
a hard attention mechanism [61] to efficiently learn information from the full-face
patch. In [55], we allow the neural network to self-predict varying weights for dif-
ferent regions of the input face in order to make model training efficient. However,
in contrast to object classification tasks where the scale of activation values of each
feature map is correlated to the importance of a template or object class, gaze estima-
tion as a regression task can benefit from an attention mechanism that goes beyond
activation value modulation. Our later work proposes a hard attention mechanism to
force the model to focus on the sub-regions of the face [61]. Taking a full-face patch
as input, our method first crops sub-regions with multiple region selection networks.
These sub-regions are then passed as input to the gaze regression network which
predicts gaze direction. Since each sub-region is resized to be the same as the origi-
nal input face image, the receptive field is enlarged, thus, the gaze regression model
can extract large and informative feature maps from the sub-regions. This method
successfully picks the most appropriate eye region for gaze estimation depending on different input image conditions such as occlusion and lighting. However, the model itself can be difficult to train and takes a long time to converge. How to efficiently learn information from the full-face patch remains an ongoing research topic.
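The following PyTorch sketch illustrates the general idea of a learned soft spatial weighting applied to intermediate face features; layer sizes and the exact placement in the network are assumptions and will differ from the architecture in [55].

import torch.nn as nn

class SpatialWeights(nn.Module):
    """Predicts a single-channel weight map and re-weights face feature maps."""

    def __init__(self, in_channels):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Conv2d(in_channels, in_channels // 2, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels // 2, 1, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, feats):              # feats: (B, C, H, W)
        weights = self.weight_net(feats)   # (B, 1, H, W) spatial importance map
        return feats * weights             # broadcast over the channel dimension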
In addition to studying various methods of input region selection for gaze estima-
tion, we also suggest various approaches to learning unique gaze-specific representa-
tions in neural networks. Such representations can be explicitly defined or implicitly
learned. The first representation as proposed in [34] is explicitly defined as being
eye-region landmark coordinates. The fully convolutional network proposed in this
work is able to detect eye-region landmarks from images captured with a single
webcam, even under challenging lighting conditions. Compared to the classic edge-
based eye landmark detection method [16, 48], the convolutional network provides
more robust landmark prediction. These detected landmarks are then used for model-
based or feature-based gaze estimation. However, since it still requires eye landmark
detection, this method can only work in settings with a close distance between the
user and camera such as the laptop and desktop setting [59] and relies on high-
quality synthetic training data. We further improve our method by first predicting
a novel pictorial representation that we call a “gazemap”, and then using it as input for a
light-weight gaze regression network [33]. In this work, the proposed method lever-
ages the power of hourglass networks to extract this image-based “gazemap” feature
which is composed of silhouettes of the eyeball and the iris. It is an abstract, pictorial representation of eyeball structure that captures the minimal information essential for the gaze estimation task. The gazemap representation is not explicitly
correlated with key landmarks in the input eye image and can be generated from the
3D gaze direction labels. Hence, the latter approach can be applied to models that
need to be trained directly on real-world data. The alternative is to train on synthetic
data, which can result in a model that does not perform sufficiently well due to the
domain gap between synthetic and real data domains.
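To illustrate how such a representation can be rendered purely from a gaze label, the sketch below draws crude eyeball and iris silhouettes for a given pitch and yaw; the output size, radii, and the circular (rather than foreshortened) iris are simplifying assumptions and not the exact construction used in [33].

import numpy as np

def make_gazemap(pitch, yaw, height=36, width=60, iris_ratio=0.4):
    """Render a two-channel gazemap (eyeball and iris silhouettes) from gaze angles."""
    r = height / 2.0                              # eyeball radius in pixels
    cy, cx = height / 2.0, width / 2.0            # eyeball centre
    ys, xs = np.mgrid[0:height, 0:width]
    eyeball = (xs - cx) ** 2 + (ys - cy) ** 2 <= r ** 2
    # The iris centre moves over the eyeball according to the gaze direction.
    ix = cx + r * np.cos(pitch) * np.sin(yaw)
    iy = cy - r * np.sin(pitch)
    iris = (xs - ix) ** 2 + (ys - iy) ** 2 <= (iris_ratio * r) ** 2
    return np.stack([eyeball, iris]).astype(np.float32)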
To train a generic gaze estimator that can be applied to a large variety of conditions
and devices, it is critical that learning-based gaze estimation methods are trained with
datasets that have good coverage of real-world conditions. Unless the model has had
a chance to encounter data with large variations, it could suffer due to over-fitting to
the more limited training data and perform in unexpected ways outside of the original
data regime. Essentially, we should not expect learned models to handle samples that
are out-of-distribution. Specifically, for assessing a dataset for the gaze estimation
task, there are several factors that should be considered, such as the range of gaze
direction, range of head poses, diversity of lighting conditions, variety of personal
appearances, and input image resolution.
Early datasets mainly focus on the head pose and gaze direction coverage under
controlled lighting conditions such as UT-Multiview [43] and EYEDIAP [12]. Our
MPIIGaze dataset, the first of its kind, brought the task of gaze estimation out of the conventional, controlled laboratory setting and into real-world settings covering different lighting conditions [53, 56]. This was done by installing data recording software on the laptop computers of 15 participants and prompting each participant every 10 min to provide 20 gaze data samples. Participants were asked to look at dots as they appeared on the screen and to press the space bar to confirm that they were looking at each dot. In this way, the position of each on-screen dot was stored along with an image of the participant's face taken with the built-in camera of the laptop. Since the data samples were collected without restriction on
location and time, we were able to collect samples under many different lighting
conditions with natural head movement. However, since the MPIIGaze dataset was
collected with laptop devices, the head pose and gaze direction ranges are limited
to the size of typical laptop screens. Therefore, models trained only on MPIIGaze
data may not apply well to settings with larger displays and viewing distances, for
example, participants gazing at a TV in a public space.
Such limitations imposed by the capture device appear in many existing datasets. Similar to our MPIIGaze, the GazeCapture dataset is limited to small ranges of head poses and gaze directions because mobile phones and tablets were used for data collection [24]. The EYEDIAP dataset is designed specifically for the head poses and gaze directions of the desktop setting [12].
Fig. 3 Head pose (top row) and gaze direction (bottom row) distributions of different datasets. The head pose of Gaze360 is not shown here since it is not provided by the dataset. The figure is adapted from [60]
The RT-GENE dataset tried to use a
head-mounted eye tracker to provide accurate gaze direction ground-truth and large
spatial coverage of head poses and gaze directions [10]. The recent Gaze360 dataset
used a moving camera to simulate different head poses [21]. However, the image
and ground-truth quality were not guaranteed with these datasets, and the coverage
of head poses and gaze directions was not properly designed.
We provide the ETH-XGaze dataset consisting of over one million high-resolution
images of varying gaze directions under extreme head poses [60]. This dataset was
collected with a custom setup of devices including a screen to show visual content
from a projector, four lighting boxes to simulate different lighting conditions, and
18 digital SLR cameras which can capture high-resolution (6000 × 4000 pixels)
images. The cameras were arranged so as to cover different perspectives of the participant's face, effectively making each camera position correspond to one “head
orientation” in the final processed dataset. Since the participant was placed close
to the screen, a large range of gaze directions was captured during each recording
session. A comparison of head pose and gaze direction ranges is made between our
ETH-XGaze dataset and other datasets in Fig. 3. From the figure, we can see that
our ETH-XGaze dataset provides the largest range of head poses and gaze directions
compared to previous datasets. ETH-XGaze is a milestone toward providing full
robustness to extreme head orientations and gaze directions and should enable the
development of interesting novel methods that better incorporate understandings of
the geometry of the human head and the eyeball within.
In addition to exploring the spatial dimension with the 18-camera ETH-XGaze
dataset, we chose to explore the temporal dimension of gaze tracking in an end-
to-end fashion. That is, we aimed to go beyond the static face images provided by
most gaze estimation datasets by providing video data. In addition, we observed
that when humans gaze at objects or other visual stimuli, their eye movements are
often correlated with particular changes or movements in the stimuli. Yet, no large-
scale video-based dataset exists to relate changes in the appearance of the human directly to a video of the visual stimulus. To fill this gap, we proposed another novel dataset, called EVE, which provides temporal information on both the human face and the corresponding visual stimulus for the temporal gaze estimation task [36].
Fig. 4 EVE data collection setup and example of (undistorted) frames collected from the 4 camera views with example eye patches shown as insets [36]
The EVE dataset was recorded with four video cameras facing the participants, with
various visual contents shown on the screen. The custom multi-view data capture
setup and example frames are shown in Fig. 4. The custom setup synchronized
information from three webcams running at 30 Hz, one machine-vision camera running at 60 Hz, and one Tobii Pro Spectrum eye tracker running at 150 Hz. A large variety of visual stimuli were presented to our participants, including images, videos, and Wikipedia web pages. We ensured that each participant observed 60 image stimuli (for 3 s each), at least 12 min of video stimuli, and 6 min of Wikipedia stimulus (three 2-min sessions). To our knowledge, EVE is the first dataset to provide continuous
video recordings of both the user and the visual stimuli while the user is free-viewing
the presented visual stimuli. Alongside the dataset, we propose a method which shows
that when a video of the user and screen content are taken as input, it is possible to
correct for biases in a pre-trained gaze estimator by relating changes in the screen
content with eye movement. Effectively, this allows for calibration-free performance
improvements, finally yielding 2.5◦ of mean angular error.
Learning-based gaze estimation methods have developed rapidly and now begin
to challenge classical methods. However, an accurate comparison of different gaze
estimation methods is not a trivial task since they have different requirements in terms
of capture hardware and lighting conditions. In [59], we compared three typical gaze
estimation methods including two of our webcam-based methods [34, 55] and the
commercial Tobii EyeX eye tracker on data collected from 20 participants.
Fig. 5 Gaze estimation errors of different methods in degrees, across distances between the user and camera (left) and across the number of samples used for personal calibration. Dots are results averaged across all 20 participants and connected by lines [59]
Our three approaches only need a few personal calibration samples to reach reasonable
accuracy. However, the number of necessary calibration samples may increase for
real-world applications compared to this simple setting.
In principle, there are two main challenges for learning-based gaze estimation meth-
ods caused by personal differences. The first one is the kappa angle which varies by
around two–three degrees on average across people [29]. The second challenge is
personal eye appearance differences such as the shape of the eye and color of the iris.
Eye appearance is also affected by changes in gaze direction and head pose, which in turn depend on the image capture setup or the personal computing device. For
example, a gaze estimator trained on images captured on a smartphone device that is
held closer to the user may not perform well when directly applied to a large public
display such as an advertisement board in a shopping mall. This could be caused by
loss of image resolution and quality and unfamiliar head poses during operation. Due
to these challenges, learning-based methods may benefit from further adaptation in
challenging conditions that are not covered by the training data.
A basic experimental observation is that increasing the number of dataset partici-
pants results in improved general gaze estimation performance [24]. That is, learning
from more people's data allows for a method that generalizes better to previously
unknown users. However, as introduced in Sect. 2.4, there still exists a large perfor-
mance gap that can be recovered when using just a few samples from the final target
user to adapt learned models.
Nevertheless, collecting personal calibration data is still an effective way to achieve good gaze estimation performance. In our work [57], we proposed to use multiple types of devices to collect samples for specific users and then aggregate all of these samples to train a joint model for that specific user across devices. The intuition behind this work is that personal appearance should be the same across devices, which we can learn with the shared layers in the middle of our model. Our approach can benefit
applications that are expected to be used by a user over a long period of time and
across multiple devices, with personal calibration data being collected occasionally.
An alternative and promising method of increasing the amount of training data
for specific persons is generative modeling. Given a few labeled samples, a high-
quality generative model would be able to create tailored training data from which
a robust yet personalized gaze estimation model could be learned. Our first work in
this direction used an architecture based on CycleGAN [63] for realistic eye image
generation, where gaze direction is provided as an input condition to the network,
and training is supervised via perceptual and gaze direction losses [18]. Although this
method is successful at generating photo-realistic images of the eye, it is not aware
of head orientation and cannot easily be trained with noisy real-world images. We
later proposed a transforming encoder-decoder architecture to tackle these issues,
where features pertaining to gaze direction, head orientation, and other appearance-
related factors are explicitly defined at the bottleneck of the autoencoder [62]. To
truly enable training on in-the-wild datasets, we additionally allow for implicitly defined “extraneous” factors at the bottleneck. The reconstruction objectives let these extraneous factors encode information that is task-irrelevant yet necessary for satisfying the image-to-image translation objective. This approach, in particular,
was shown to improve performance in the person-independent cross-dataset setting,
but with further development, it should be possible to demonstrate improvements in the personalization of gaze trackers. While personalized data collection is a promising and active direction of research, much work is still needed for it to be effective.
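As a rough illustration of how such generators are supervised, the snippet below combines a perceptual loss with a gaze consistency loss computed by a frozen gaze estimator; the particular feature extractor, loss types, and weights are assumptions and not the exact objective of [18].

import torch.nn.functional as F

def redirection_loss(generated, target_img, target_gaze,
                     feature_net, gaze_net, w_percep=1.0, w_gaze=1.0):
    """Combined training objective for a gaze-redirection generator (sketch).

    feature_net: frozen network providing perceptual features.
    gaze_net:    frozen gaze estimator predicting (pitch, yaw).
    """
    # Perceptual loss: generated image should match the target image in feature space.
    percep = F.l1_loss(feature_net(generated), feature_net(target_img))
    # Gaze consistency: the generated eye should exhibit the requested gaze direction.
    gaze = F.l1_loss(gaze_net(generated), target_gaze)
    return w_percep * percep + w_gaze * gaze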
Alternatively, our other research works show that learning-based gaze estima-
tor calibration is definitely possible with tens of samples using simple regression
schemes and with as few as one to three samples when using a more advanced meta-
learning scheme. By defining input features using eye-region landmarks detected
by a fully convolutional neural network, we show that a support vector regression
model is capable of improving performance significantly with as few as 10 calibra-
tion samples. An appearance-based gaze estimator taking full-face input images was
shown to be effective in tandem with a simple polynomial regression scheme taking
point-of-regard as input, resulting in less than 4◦ of error with just 4 calibration sam-
ples [59], albeit in controlled experimental settings. When training on real-world
data, a transforming encoder-decoder architecture coupled with a gradient-based
meta-learning scheme was shown to be highly effective, with as few as one to three
calibration samples yielding close to 3◦ of error on challenging in-the-wild datasets
[35]. The code for the latter two systems is open-source and thus contributes toward real improvements in the applicability of learning-based gaze estimation methods to HCI applications.
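A minimal sketch of the first of these calibration schemes is given below, assuming landmark-based features have already been extracted for a handful of labeled calibration samples; the feature definition and hyper-parameters are illustrative and not those of [34, 59].

from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

def fit_personal_gaze_regressor(calib_features, calib_gaze):
    """Fit a person-specific regressor from eye-region features to gaze angles.

    calib_features: (N, D) features from roughly 10 calibration samples.
    calib_gaze:     (N, 2) ground-truth (pitch, yaw) in radians.
    """
    model = MultiOutputRegressor(SVR(kernel="rbf", C=1.0, epsilon=0.01))
    model.fit(calib_features, calib_gaze)
    return model

# Usage: predictions = fit_personal_gaze_regressor(X_calib, y_calib).predict(X_test)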
Fig. 6 Accuracy and precision of gaze tracking vary largely across users. Values increase steeply
for different percentiles of users both in x- and y-directions
The estimated gaze data can be highly noisy and inaccurate. Nevertheless, it can
be used for computer input, to improve user experience or otherwise enable new
interaction if potential noise is taken into account during the design of gaze-aware
interfaces. To this end, we have studied tracking performance in practical setups to
derive design guidelines and actionable insights for the design of robust gaze-aware
interfaces.
In [7], we collected eye-tracking data of 80 people in a calibration-style task, where
participants were asked to fixate randomly positioned targets on the screen for 2 s. We
used two different eye trackers (Tobii EyeX and SMI REDn scientific, both 60 Hz)
under two lighting conditions (closed room with artificial lighting, room with large
windows facing the tracker) in a controlled but practical setup. In contrast to many
lab studies, we did not exclude any participant due to insufficient tracking quality.
Instead, we were interested in learning about the possible variations in tracking
accuracy (the offset from the true gaze point) and precision (the spread of tracked
gaze points). These could be due to the independent variables of our study (lighting,
tracker, screen regions), as well as due to external factors that we did not control but
that are typical for real-life setups (participants wearing glasses or mascara, varying
eye physiology, etc.).
The collected data reveals large variations of tracking quality in such a practical
setup. Figure 6 shows the average accuracy and precision across all focused targets
for different percentiles of participants. Very accurate fixations (25th percentile) are
only 0.15 cm in the x-direction and 0.2 cm in the y-direction offset from the target. On
the other hand, inaccurate fixations (90th percentile) can be as far offset as 0.93 cm
in the x-direction and 1.19 cm in the y-direction—a more than six-fold difference—
similar to the spread of the gaze points. Additionally, we found the precision of the
estimated gaze points to be worse toward the right and bottom edges of the screen, as
shown in Fig. 7. The ellipses represent the covariance matrix computed over all gaze
points from all participants. In contrast, we found no significant variation across the
screen for accuracy.
With data from such a calibration-style task, we can derive appropriate sizes for gaze targets, i.e. the regions in a UI that recognize whether the user's gaze falls inside their borders. Given the gaze points belonging to a fixation, we can assume they are normally distributed in the x- and y-directions independently, with an offset Ox/y (accuracy) from the center of the fixated target and a standard deviation σx/y (precision). From these, we can compute the necessary width and height for a gaze-aware element to be usable under such tracking conditions with the following equation:

width = 2 × (Ox + 2σx),   height = 2 × (Oy + 2σy)

Multiplying σx/y by 2 results in about 95% of gaze points falling inside the target,
according to the properties of a normal distribution. While this seems conservative,
an error rate of more than 5% (every 20th gaze point falling outside the target area)
might slow down performance and lead to errors that can be hard to recover from.
Figure 8 visualizes the size computation and shows two example cases with good
and bad tracking quality. In [7], we give explicit target sizes for different percentiles
of users. They vary from 0.94 × 1.24 cm for users that track well (25th percentile)
up to 5.96 × 6.24 cm if we want to allow robust interaction for nearly all users in
the dataset (90th percentile).
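The target-size rule above is simple enough to compute directly; in the snippet below, the 25th-percentile offsets are taken from the study, while the precision values are assumed so as to reproduce the reported raw target size.

def target_size(offset_x, offset_y, sigma_x, sigma_y):
    """Width and height (cm) of a gaze target usable at the given accuracy/precision."""
    return 2 * (offset_x + 2 * sigma_x), 2 * (offset_y + 2 * sigma_y)

# 25th-percentile users: offsets of 0.15/0.2 cm and spreads of roughly 0.16/0.21 cm
# yield a target of about 0.94 x 1.24 cm, matching the raw values reported above.
print(target_size(0.15, 0.2, 0.16, 0.21))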
Target sizes can be significantly reduced if the gaze data is first filtered to remove
noise artifacts and reduce signal dispersion. However, in contrast to laboratory stud-
ies, interactive applications cannot post-process the gaze data but must filter it in
real time. This makes the recognition of outliers and artifacts difficult, since it can introduce delays of several frames. Gaze filters must also preserve the quick and sudden changes between saccades and fixations, whereas eye tremors, microsaccades, and noise should be filtered out in order to stabilize the signal and improve precision. This makes commonly used methods, such as the moving average, Kalman filter, or Savitzky-Golay filter, less useful [45].
The choice of the filter and its parameters can be seen as a trade-off between the
target size required for robust interaction and the signal delay in following a saccade.
In [7], we proposed a method to optimize the parameters for any filter given gaze
data from a calibration-style task as described earlier.
Fig. 8 Using the accuracy (Ox/y) and precision (σx/y) of the estimated gaze points belonging to a fixation, we can compute target sizes that would allow for robust interaction with an interface. The plot shows examples from fixations of two different users, one with good and one with poor tracking quality [7]
In a grid search, it instantiates a filter with each possible parameter setting, computes the resulting target size after filtering
the data, and simulates saccades between such targets to determine any signal delay.
The result is a Pareto front of parameter combinations that yield the minimum target
size for a specific delay.
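The Pareto-front selection at the end of this procedure can be sketched as follows, assuming each evaluated parameter setting has already been scored with a signal delay and a resulting target size (the scoring itself follows the simulation described above and is not shown).

def pareto_front(candidates):
    """Return parameter settings that are not dominated in (delay, target size).

    candidates: list of (delay, target_size, params) tuples.
    """
    front = []
    for delay, size, params in sorted(candidates, key=lambda c: (c[0], c[1])):
        # Keep a candidate only if it improves on the best target size seen so far.
        if not front or size < front[-1][1]:
            front.append((delay, size, params))
    return front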
Using this method, we compare five commonly used gaze filters: the Stampe filter, the 1€ filter, a set of weighted average filters with linear, triangular, and Gaussian kernel functions, an extension of the weighted average filter with saccade detection, and one with additional outlier correction. See [7] for a description of each filter. The filters differ in the trade-offs they achieve between target size and signal delay.
Generally, we found that a weighted average filter with a saccade detection performs
best in terms of target size when the signal delay should be short (up to one frame, or 32 ms with a 30 Hz tracker). The best performance is achieved with additional outlier
correction at the cost of 2–2.5 frames delay.
The use of a filter with optimized parameters can reduce the target sizes by up to
42% (see Table 1). However, the filter can only improve the precision of the data,
not its accuracy. Simulation based on real data yields important insights into the
effect of filters on the signal. Filters that by design should introduce no or only a short signal delay can, in practice, introduce much larger delays to the gaze signal. For example, depending on the noise and the chosen parameters, a filter may wrongly detect saccades as outliers or as part of a fixation and either remove them or heavily smooth the signal.
In such cases, an additional delay occurs before the filtered signal follows a saccade
to a new fixation point. See [7] for an in-depth discussion of the tested filters.
We can summarize our analysis in a set of concrete design guidelines for gaze-
enabled applications:
• Target sizes of at least 1.9 × 2.35 cm allow for reliable interaction for at least
75% of users if optimal filtering is used.
Table 1 Recommended target sizes for robust interaction by eye gaze. The values for raw and filtered show the improvement that can be achieved by filtering the gaze data. The percentiles show how much target sizes can vary for different levels of tracking quality [7]

                 Width (cm)                    Height (cm)
                 Raw     Filtered  Improv      Raw     Filtered  Improv
  Overall        3.0     2.02      33%         3.14    2.19      30%
  Percentile
    25%          0.94    0.58      38%         1.24    0.8       35%
    50%          1.8     1.12      38%         2.26    1.48      35%
    75%          3.28    1.9       42%         3.78    2.35      38%
    90%          5.96    3.9       35%         6.24    4.24      32%
• Target dimensions should take into account the larger spread of gaze points in
the y-direction we observed. Thus, the height should be somewhat larger than the
width.
• The visual representation of an element can be smaller, in which case the element should have a transparent margin that is also reactive to the user's gaze.
• Placement of targets should avoid the bottom or right edge of the screen, for which
accuracy and precision were found to be significantly worse.
• Filter gaze points using a weighted average filter (over 36/40 frames in x/y direc-
tion) with a Gaussian or Triangular kernel and saccade detection (threshold of
1.45/1.65 cm in x/y direction). Additional outlier correction can further improve
precision but at the cost of a two-sample delay.
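A minimal, single-axis sketch of such a filter (a weighted average with a Gaussian kernel and a simple saccade detector that resets the averaging window) is given below; the window length and threshold follow the guideline above, the kernel width is assumed, and the exact formulation in [7] may differ. One instance is run per axis with the per-axis thresholds listed above.

from collections import deque
import numpy as np

class WeightedAverageGazeFilter:
    """Online weighted-average filter with saccade detection for one gaze axis."""

    def __init__(self, window=36, saccade_threshold_cm=1.45, kernel_sigma=None):
        self.threshold = saccade_threshold_cm
        self.sigma = kernel_sigma if kernel_sigma is not None else window / 3.0
        self.samples = deque(maxlen=window)

    def update(self, value_cm):
        # Saccade detection: a jump larger than the threshold starts a new fixation,
        # so the window is cleared instead of smoothing across the saccade.
        if self.samples and abs(value_cm - self.samples[-1]) > self.threshold:
            self.samples.clear()
        self.samples.append(value_cm)
        # Gaussian weights: the most recent samples contribute the most.
        ages = np.arange(len(self.samples) - 1, -1, -1)
        weights = np.exp(-(ages ** 2) / (2 * self.sigma ** 2))
        return float(np.dot(weights, np.array(self.samples)) / weights.sum())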
1 http://www.opengaze.org.
5 Applications
Eye tracking provides information on where a user is looking, the dynamics of the
gaze behavior, or the simple presence of the eyes on an object or screen. Such informa-
tion offers a range of opportunities for computer interaction (see, for example, [28]).
On the one hand, explicit eye input allows controlling an interface by fixating the
corresponding UI elements or executing a prescribed series of fixations, saccades,
or smooth pursuits. This requires users to consciously control their eyes, which can be difficult, but it is useful when other input modalities are not available or are impractical.
Explicit gaze input is used, for example, in virtual or augmented reality applica-
tions [19, 22] or to enable interaction for people with motor impairments [27]. On
the other hand, attentive interfaces use information about the natural gaze behavior of
users often without them noticing. They can obtain insights on the user’s experience
with an interface, their cognitive processes, their skills or struggles, and their inten-
tions or preferences [14, 26]. In this section, we focus on such attentive interfaces
that make implicit use of the gaze information. We present two applications that use
this data in different ways: (1) as a way to establish a user’s intention to interact
with a device by tracking the location of their natural gaze, and (2) for adapting the
interface to make the displayed information more relevant to a user by observing
their gaze behavior over time.
2 https://github.com/swook/GazeML.
Fig. 9 Overview of our method in [54]. Taking images from the camera as input, our method first
detects the face and facial landmarks (a). It then estimates the gaze directions p and extracts CNN
features f using a full-face appearance-based gaze estimation method (b). During training, the gaze
estimates are clustered (c) and samples in the cluster closest to the camera get a positive label while
all others get a negative label (d). These labeled samples are used to train a two-class SVM for eye
contact detection (e). During testing (f), the learned features f are fed into the two-class SVM to
predict eye contact on the desired target object or face (g)
Our method works under the assumption that the target object is the one closest to the camera; thus, it only requires a single off-the-shelf RGB camera placed close to the target object. Once the camera is placed, the approach does not require any personal or camera-object calibration. As illustrated in Fig. 9, the input to our method
is the video sequence from the camera over a period of time. During the training, our
method runs the gaze estimation pipeline introduced in our work [55] to obtain the
estimated gaze direction. Assuming dummy camera parameters, the estimated gaze
direction vector g is projected to the camera image plane and converted to on-plane
gaze locations p. While the gaze estimation results are used for sample clustering,
we extract a 4096-dimensional face feature vector f from the first fully connected
layer of the neural networks.
As we stated in [55], the estimated gaze direction g is not accurate enough without
personal calibration, and it cannot be mapped directly to the physical space without
the camera-object relationship parameter. However, it indicates the relative gaze
direction of the user from the camera position. Hence, these estimated gaze directions
can be grouped into multiple clusters corresponding to several objects in front of the
user. Given that our method assumes that the target object is the one closest to the
camera, the sample cluster of the target object is identified as the cluster closest
to the origin point of the camera coordinate system. Other clusters are assumed to
correspond to other objects, and samples from these clusters are used as negative
samples.
Labeled samples obtained from the previous step are used to train the eye contact
classifier. This is a two-class classifier that determines if the user is looking at the
target object or not in the current input frame. We use a high-dimensional feature
vector f extracted from the gaze estimation network to leverage richer information
instead of only the gaze locations. Furthermore, we apply principal component analysis (PCA) to the training data and reduce the dimensionality of the feature vector f such that the subspace retains 95% of the variance.
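The training procedure can be summarized in a short sketch; the clustering algorithm, the number of clusters, and the SVM kernel below are illustrative assumptions rather than the exact choices made in [54].

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def train_eye_contact_detector(gaze_points, face_features, n_clusters=4):
    """Train an eye contact classifier without manual labels.

    gaze_points:   (N, 2) on-plane gaze locations p projected from estimated gaze.
    face_features: (N, D) features f from the gaze estimation network.
    """
    # 1. Cluster the noisy gaze locations; each cluster roughly corresponds to one object.
    clustering = KMeans(n_clusters=n_clusters, n_init=10).fit(gaze_points)
    # 2. The positive cluster is the one closest to the camera, i.e. the coordinate origin.
    distances = np.linalg.norm(clustering.cluster_centers_, axis=1)
    positive_cluster = int(np.argmin(distances))
    labels = (clustering.labels_ == positive_cluster).astype(int)
    # 3. Reduce feature dimensionality so that 95% of the variance is retained.
    pca = PCA(n_components=0.95).fit(face_features)
    # 4. Train the two-class SVM on the automatically labeled samples.
    svm = SVC(kernel="linear").fit(pca.transform(face_features), labels)
    return pca, svm

# At test time: svm.predict(pca.transform(f_new)) yields the eye contact label.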
Fig. 10 Examples of gaze location distribution for the object-mounted (tablet, display, and clock)
and head-mounted settings from [54]. The first row shows the recording setting with marked target
objects (green), camera (red), and other distraction objects (blue). The second row shows the gaze
location clustering results with the target cluster in green and the negative cluster in blue. The third
row shows the ground-truth gaze locations from a subset of 5,000 manually annotated images with
positive (green) and negative (blue) samples
During testing, input images are fed into the same pre-processing pipeline with
the face and facial landmark detection, and feature f is extracted from the same gaze
estimation neural networks. It is then projected to the PCA subspace, and the SVM
classifier is applied to output eye contact labels. Note that during both the training
and test phases, we neither need to label the input frame sample nor calibrate the
camera-object relationship.
To evaluate our method for eye contact detection, we collected two datasets for
two challenging real-world scenarios: an office scenario and an interaction scenario. Examples of the two scenarios are shown in Fig. 10. In the office scenario, the camera is object-mounted, i.e. it was mounted on or placed very near to the target object, and we aimed to detect a single user's eye contact with these target objects during everyday work at their workplace. We recorded 14 participants in total (five females)
and each of them recorded four videos for different target objects: one for the clock,
one for the tablet, and two for the display with two different camera positions. The
recording duration for each participant ranged from three to seven hours.
In the interaction scenario (see far right of Fig. 10), a user was wearing a head-
mounted camera while being engaged in everyday social interactions. This scenario
was complementary to the office scenario in that the face of the user became the
target and we aimed to detect eye contact of the second person who talked with the
user. We recruited three users (all male) and recorded them while they interviewed
multiple people on the street.
Examples of the gaze location distributions for the two scenarios are shown in Fig. 10. In the first row, we show the recording settings for the different target objects. We
mark the target object (green rectangle), camera (red rectangle) positions, and other
distraction objects (blue rectangle) in the figure. The second row of Fig. 10 shows
sample clustering results where we mark the target cluster with green dots while
all other negative sample clusters are marked with blue dots. Noise samples are
marked as black and the big red dot is the camera position as the origin of the camera
coordinate system. The third row shows the corresponding ground-truth annotated
by two annotators.
From the second row of Fig. 10, we can see that the grouped sample clusters can
be associated with objects in front of the camera, especially for the office scenarios
as object layout is fixed. For the interaction scenario, we can observe one centered
cluster and other random distributed samples. This is due to the fact that there is
no fixed attractive object next to the user’s face. Our sample clustering method can
achieve good clustering results and successfully pick the cluster that belongs to the
target object. It can also be easily extended to include objects that are newly added
to the scene by updating the clusters. However, our method requires sufficient data for good clustering, usually a few hours of recording. In addition, the target object should attract enough of the user's attention, and it has to be isolated from other objects. Nonetheless, our method provides a way of detecting eye contact with a single RGB camera that requires neither tedious personal calibration nor complex camera-object relationship calibration.
The user’s gaze behavior can reveal whether the content displayed to a user is useful
and relevant to their current task. In particular when making a decision, showing the
right information to the user is crucial for the decision quality [31]. For example, a
user might look at the details of a product for deciding whether to buy it or check
the weather forecast to decide whether to go for a hike. If important information is missing from an interface, a user might be negatively affected and make a wrong decision.
On the other hand, displaying all available information might not be effective due
to device constraints (e.g. on a small screen of a mobile phone) or because it might
lead to information overload and a bad user experience. What makes the design of
such interfaces challenging is also that users perceive the relevance of information
differently [30], an aspect that cannot be foreseen at design time but must be detected
and accounted for at run-time. However, the challenge is how to infer the relevance
of the displayed information online, without having to interrupt users in their task.
Eye gaze has proven to be an unobtrusive and objective measure for a person’s
attention [37]. In this section, we show how we can analyze this data during the
decision process of a user to obtain insights on the relevance of the displayed infor-
mation [8]. This requires no explicit user input but analyzes the natural gaze behavior
of the user while they focus on their decision-making task. In contrast to simpler,
visual search tasks, the challenge is that the gaze behavior varies drastically during
Table 2 For recognizing information relevance from gaze behavior, we combine six well-
established gaze metrics which we can associate with the three cognitive stages of decision-
making [8]
Orientation
  TFF (Time to first fixation): The time elapsed between the presentation of a stimulus and the first time that gaze enters a given AOI. A low TFF value indicates high relevance
  FPG (First pass gaze): The sum of durations of fixations on an AOI during the first pass, i.e. when the gaze first enters and leaves the AOI. A high FPG value indicates high relevance
Evaluation
  SPG (Second pass gaze): The sum of durations of fixations on an AOI during the second pass. A high SPG value indicates high relevance
  RFX (Refixations count): The number of times an AOI is revisited after it is first looked at. A high RFX value indicates high relevance
Verification
  SFX (Sum of fixations): The total number of fixations within an AOI. A high SFX value indicates high relevance
  ODT (Overall dwell time): The total time spent looking at an AOI, including fixations and saccades. A high ODT value indicates high relevance
Fig. 11 Two alternative ways to adapt a room search interface. Left: relevant content is highlighted
through color boxes. Right: irrelevant information is suppressed by graying it out [8]
Intuitively, this means that for that element, the gaze metric deviates from the average across all elements, indicating a different gaze behavior of the user. To establish the
relevance of an element, we count the number of votes cast by the 6 metrics and
compare it to a threshold. Requiring a higher number of votes yields a lower number
of elements being detected as relevant. This is further discussed below. In any case,
this approach does not assume a fixed number of relevant elements a priori. Also, it
is training-free and requires no ground-truth data.
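A compact sketch of this voting scheme is given below; the simple comparison against the mean is an assumed stand-in for the deviation test used in [8], and the metric directions follow Table 2.

import numpy as np

# Direction of relevance for each metric (Table 2): only TFF is "lower is more relevant".
HIGH_IS_RELEVANT = {"TFF": False, "FPG": True, "SPG": True,
                    "RFX": True, "SFX": True, "ODT": True}

def relevant_elements(metrics, min_votes=2):
    """Return indices of UI elements (AOIs) judged relevant by the vote count.

    metrics: dict mapping metric name -> array of per-element values.
    """
    n_elements = len(next(iter(metrics.values())))
    votes = np.zeros(n_elements, dtype=int)
    for name, values in metrics.items():
        values = np.asarray(values, dtype=float)
        if HIGH_IS_RELEVANT[name]:
            votes += (values > values.mean()).astype(int)
        else:
            votes += (values < values.mean()).astype(int)
    return np.flatnonzero(votes >= min_votes)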
Once we know whether the displayed information is relevant for the user’s deci-
sion, we can adapt the interface to facilitate the decision process. Broadly speaking,
many of the adaptation techniques proposed in the literature (see e.g. [9]) can be
divided into two types: (1) emphasizing relevant content (e.g. coloring, rearrang-
ing or replicating elements) and (2) suppressing irrelevant information (e.g. graying
out, removing, and moving elements to less prominent positions). See Fig. 11 for an
example application. To obtain a benefit from adaptation, it is critical to minimize the
risk of usability issues due to wrong adaptations. For example, wrongly highlighting
seldom used elements in a menu can induce a performance cost that exceeds the
benefit of adaptation [9]. On the other hand, failing to highlight an important menu
item might not bring any benefit to the user but induces no cognitive cost either.
Thus, when emphasizing content, a successful relevance detector should identify
the subset of relevant UI elements (i.e. true positives) while minimizing the risk to
detect irrelevant ones as important (i.e. false positives). When suppressing content,
on the other hand, we are interested in recognizing the non-relevant elements (i.e.
true negatives) while avoiding suppressing any relevant ones (i.e. false negatives)
which might induce a high cognitive cost.
We can easily tune the different recognition rates of the relevance detector
(true/false positive/negative) by varying the number of votes required to recognize
an element as being relevant. Different voting schemes are possible. We can require
a minimum of 1–6 gaze metrics to cast a vote, or we can be more selective and con-
sider votes from metrics of the same stage (see Table 2) as redundant. In this case, we
might require votes from at least 2 different stages, or from all 3 stages. Figure 12 shows the
resulting trade-off between the true positive (relevant elements correctly detected)
and the false positive rates (irrelevant elements detected as relevant) depending on
the voting scheme. The shaded areas indicate rates that seem acceptable for empha-
sizing or suppressing information. Data comes from an empirical study capturing the
gaze behavior of 12 participants during interaction with a financial trading interface showing information about a specific stock. Participants had to decide whether or not to invest in the stock. Details are given in [8].
The figure shows that the true/false positive rate of the recognizer can easily
be adjusted in a predictable manner to account for the requirements of different
adaptation schemes. We recommend the following vote thresholds:
• For emphasizing relevant information, we recommend a minimum of 3 votes each
from a different stage (Vote3Stages in Fig. 12). This yields a low false posi-
tive rate, reducing the risk of inducing any cognitive dissonance by emphasizing
irrelevant information. At the same time, it ensures that only the most relevant
information is emphasized.
• For suppressing irrelevant information, we recommend a minimum of any 2 votes
(Vote2Metrics in Fig. 12). This yields a high true positive rate, ensuring that relevant information is not suppressed, which could otherwise lead to higher interaction costs. A high false positive rate is acceptable in this case; it merely means that some less relevant content is not suppressed.
The improvements in AI have given rise to a new class of learning-based gaze estima-
tion methods which make eye tracking more practicable and more widely applicable
in everyday computer interaction. In contrast to traditional gaze estimation meth-
ods, the recently developed learning-based approaches do not require specialized
hardware and can operate with just a single webcam and at a much larger operation
distance. As these methods further improve, they will allow HCI applications to use gaze outside the lab and in everyday interaction with computers.
Two major challenges remain open in enabling out-of-the-box learning-based gaze
estimation solutions. One is in improving the generalization of models to previously
unseen users, environments, eyeglasses, makeup, camera specifications, and other
confounding factors. This can be tackled by the non-trivial task of collecting datasets
with high-quality ground-truth annotations from a large number of people [36, 60]
and designing novel neural network architectures for better generalization [33, 55]—
both directions which we have extensively studied. The other challenge is due to
person-specific biases, which must be accounted for when higher performance is
required by the interactive application. This challenge exists not only because of the
kappa angle but also the variations in the appearance of the eye and face regions
in the real world. While we have explored several methods to this end in terms of
few-shot adaptation [35, 59], further research must be conducted to efficiently collect
data from the end-user without compromising user experience, such as via so-called
implicit calibration [57].
A problem in developing gaze-based interfaces is that the accuracy and precision
of the tracked gaze vary largely depending on many factors, such as the tracking
method, the environment, human features, and others. The application receiving the
gaze information must process a series of noisy data points. We have shown how,
to some extent, a signal can be stabilized by filtering data. However, this does not
account for its inaccuracy. For that, we have made recommendations for designing
gaze-aware applications in a robust way such that they are usable under most con-
ditions [7]. However, such a conservative approach might unnecessarily slow down
or complicate interaction in cases where the gaze is tracked well. An alternative
approach is to develop error-aware applications that recognize the uncertainty in the
signal and adapt to it [1, 7]. As tracking quality decreases, a gaze-aware UI element
could be enlarged, replaced by a more robust alternative, or deactivated entirely to
avoid errors that might be hard to recover from. For such an approach to be useful,
it is crucial to optimize for the time-point of UI adaptation. To this end, future work
is needed that investigates how to trade-off potential gains through adaptation with
the cost for the user to get used to a new interface. For taking into account personal
preferences, such adaptations could even be done after explicitly querying the user.
We have seen that data about where a user is looking can not only be used for
explicit interaction but also to make predictions about the user’s cognitive processes,
abilities, or intentions. Such attentive applications do not require the user to con-
sciously control their gaze which can be cumbersome. Instead, they process the
natural gaze behavior of the user with the goal of facilitating interaction. However,
approaches to interpreting the eye gaze are often tailored to specific application
cases and general solutions are rare. The voting scheme presented in this chapter
(Sect. 5.2) is a first attempt to develop a more general approach for estimating the
relevance of displayed information to a specific user and was shown to work across
different decision-making tasks [8]. More work is needed though to develop general
methods for inferring a user’s intent, difficulties, or preferences from their gaze data
and thus facilitate the design of intelligent user interfaces.
Once we can reliably derive information about the user’s attention and intention
from the estimated gaze, it is important to consider how to make effective use of this
data in practice. In a user study conducted in [8], the large majority of participants
confirmed that the tested application could correctly detect content relevant for their
decision-making. Many also preferred the adapted version of the interface. However,
the specific highlighting and suppression adaptations (see Fig. 11) did not lead to
measurable improvements in terms of task execution time, users’ perceived informa-
tion load, or their confidence in their decision. Future work needs to develop better
approaches to utilize such relevant information and develop UI adaptation schemes
that facilitate the decision-making process for the user [14, 26]. Such work should
also consider how adaptive interfaces can build trust and resolve users' concerns about being manipulated by the interface [8, 32].
7 Conclusion
References
1. Barz M, Daiber F, Sonntag D, Bulling A (2018) Error-aware gaze-based interfaces for robust
mobile gaze interaction. In: Proceedings of the 2018 ACM symposium on eye tracking research
& applications, association for computing machinery, New York, NY, USA, ETRA’18. https://
doi.org/10.1145/3204493.3204536
2. Blignaut P (2009) Fixation identification: the optimum threshold for a dispersion algorithm.
Atten Percept Psychophys 71(4):881–895
3. Bulling A (2016) Pervasive attentive user interfaces. IEEE Comput 49(1):94–98
4. Chen Z, Shi B (2020) Offset calibration for appearance-based gaze estimation via gaze decom-
position. In: Proceedings of the IEEE/CVF winter conference on applications of computer
vision (WACV)
5. Cheng Y, Zhang X, Lu F, Sato Y (2020) Gaze estimation by exploring two-eye asymmetry.
IEEE Trans Image Process 29:5259–5272
6. Eckstein MK, Guerra-Carrillo B, Singley ATM, Bunge SA (2017) Beyond eye gaze: what
else can eye tracking reveal about cognition and cognitive development? Dev Cogn Neurosci
25:69–91
7. Feit AM, Williams S, Toledo A, Paradiso A, Kulkarni H, Kane S, Morris MR (2017) Toward
everyday gaze input: accuracy and precision of eye tracking and implications for design. In:
Proceedings of the 2017 CHI conference on human factors in computing systems, association
for computing machinery, New York, NY, USA, CHI’17, pp 1118–1130. https://doi.org/10.
1145/3025453.3025599
8. Feit AM, Vordemann L, Park S, Berube C, Hilliges O (2020) Detecting relevance during
decision-making from eye movements for ui adaptation. In: ACM symposium on eye track-
ing research and applications, association for computing machinery, New York, NY, USA,
ETRA’20 Full Papers. https://doi.org/10.1145/3379155.3391321
9. Findlater L, Gajos KZ (2009) Design space and evaluation challenges of adaptive graphical
user interfaces. AI Mag 30(4):68–73. https://doi.org/10.1609/aimag.v30i4.2268
10. Fischer T, Jin Chang H, Demiris Y (2018) RT-GENE: real-time eye gaze estimation in natural
environments. In: Proceedings of the European conference on computer vision (ECCV), pp
334–352
11. Fuhl W, Santini T, Kasneci G, Rosenstiel W, Kasneci E (2017) PupilNet v2.0: convolutional neural networks for CPU based real time robust pupil detection. arXiv:1711.00112
12. Funes Mora KA, Monay F, Odobez JM (2014) EYEDIAP: a database for the development and
evaluation of gaze estimation algorithms from RGB and RGB-D cameras. In: Proceedings of
the symposium on eye tracking research and applications, pp 255–258
13. Ganin Y, Kononenko D, Sungatullina D, Lempitsky V (2016) Deepwarp: Photorealistic image
resynthesis for gaze manipulation. In: European conference on computer vision. Springer, pp
311–326
14. Gebhardt C, Hecox B, van Opheusden B, Wigdor D, Hillis J, Hilliges O, Benko H (2019) Learn-
ing cooperative personalized policies from gaze data. In: Proceedings of the 32nd annual ACM
symposium on user interface software and technology, association for computing machinery,
New York, NY, USA, UIST’19, pp 197–208. https://doi.org/10.1145/3332165.3347933
15. Gidlöf K, Wallin A, Dewhurst R, Holmqvist K (2013) Using eye tracking to trace a cognitive
process: gaze behaviour during decision making in a natural environment. J Eye Mov Res 6(1).
https://doi.org/10.16910/jemr.6.1.3, https://bop.unibe.ch/index.php/JEMR/article/view/2351
16. Hansen DW, Ji Q (2009) In the eye of the beholder: a survey of models for eyes and gaze. IEEE
Trans Pattern Anal Mach Intell 32(3):478–500
17. He J, Pham K, Valliappan N, Xu P, Roberts C, Lagun D, Navalpakkam V (2019a) On-device few-
shot personalization for real-time gaze estimation. In: Proceedings of the IEEE international
conference on computer vision workshops, pp 0–0
18. He Z, Spurr A, Zhang X, Hilliges O (2019b) Photo-realistic monocular gaze redirection using
generative adversarial networks. In: Proceedings of the IEEE international conference on com-
puter vision, pp 6932–6941
19. Hirzle T, Gugenheimer J, Geiselhart F, Bulling A, Rukzio E (2019) A design space for gaze
interaction on head-mounted displays. In: Proceedings of the 2019 CHI conference on human
factors in computing systems, association for computing machinery, New York, NY, USA,
CHI’19, pp 1–12. https://doi.org/10.1145/3290605.3300855
20. Howard IP, Rogers BJ et al (1995) Binocular vision and stereopsis. Oxford University Press,
USA
21. Kellnhofer P, Recasens A, Stent S, Matusik W, Torralba A (2019) Gaze360: physically uncon-
strained gaze estimation in the wild. In: Proceedings of the IEEE international conference on
computer vision, pp 6912–6921
22. Khamis M, Oechsner C, Alt F, Bulling A (2018) Vrpursuits: interaction in virtual reality
using smooth pursuit eye movements. In: Proceedings of the 2018 international conference
on advanced visual interfaces, association for computing machinery, New York, NY, USA,
AVI’18. https://doi.org/10.1145/3206505.3206522
23. Kim J, Stengel M, Majercik A, De Mello S, Dunn D, Laine S, McGuire M, Luebke D (2019)
Nvgaze: an anatomically-informed dataset for low-latency, near-eye gaze estimation. In: Pro-
ceedings of the 2019 CHI conference on human factors in computing systems, pp 1–12
24. Krafka K, Khosla A, Kellnhofer P, Kannan H, Bhandarkar S, Matusik W, Torralba A (2016)
Eye tracking for everyone. In: Proceedings of the IEEE conference on computer vision and
pattern recognition, pp 2176–2184
25. Lindén E, Sjostrand J, Proutiere A (2019) Learning to personalize in appearance-based gaze
tracking. In: Proceedings of the IEEE international conference on computer vision workshops,
pp 0–0
26. Lindlbauer D, Feit AM, Hilliges O (2019) Context-aware online adaptation of mixed reality
interfaces. In: Proceedings of the 32nd annual ACM symposium on user interface software and
technology, pp 147–160
27. Majaranta P (2011) Gaze interaction and applications of eye tracking: advances in assistive
technologies. IGI Global
28. Majaranta P, Bulling A (2014) Eye tracking and eye-based human–computer interaction.
Springer, London, pp 39–65. https://doi.org/10.1007/978-1-4471-6392-3_3
29. Moshirfar M, Hoggan RN, Muthappan V (2013) Angle kappa and its importance in refractive
surgery. Oman J Ophthalmol 6(3):151
30. Orquin JL, Loose SM (2013) Attention and choice: a review on eye movements in decision
making. ACTPSY 144:190–206. https://doi.org/10.1016/j.actpsy.2013.06.003
31. Papismedov D, Fink L (2019) Do consumers make less accurate decisions when they use
mobiles? In: International conference on information systems, Munich
32. Park S, Gebhardt C, Rädle R, Feit A, Vrzakova H, Dayama N, Yeo HS, Klokmose C, Quigley
A, Oulasvirta A, Hilliges O (2018a) AdaM: adapting multi-user interfaces for collaborative
environments in real-time. In: SIGCHI conference on human factors in computing systems.
ACM, New York, NY, USA, CHI’18
33. Park S, Spurr A, Hilliges O (2018b) Deep pictorial gaze estimation. In: Proceedings of the
European conference on computer vision (ECCV), pp 721–738
34. Park S, Zhang X, Bulling A, Hilliges O (2018c) Learning to find eye region landmarks for
remote gaze estimation in unconstrained settings. In: Proceedings of the 2018 ACM symposium
on eye tracking research & applications, pp 1–10
35. Park S, Mello SD, Molchanov P, Iqbal U, Hilliges O, Kautz J (2019) Few-shot adaptive gaze
estimation. In: Proceedings of the IEEE international conference on computer vision, pp 9368–
9377
36. Park S, Aksan E, Zhang X, Hilliges O (2020) Towards end-to-end video-based eye-tracking.
In: European conference on computer vision. Springer, pp 747–763
37. Qvarfordt P, Zhai S (2005) Conversing with the user based on eye-gaze patterns. In: Proceedings
of the SIGCHI conference on human factors in computing systems, association for comput-
ing machinery, New York, NY, USA, CHI’05, pp 221–230. https://doi.org/10.1145/1054972.
1055004
38. Russo JE, Leclerc F (1994) An eye-fixation analysis of choice processes for consumer non-
durables. J Cons Res 21(2):274–290. https://doi.org/10.1086/209397, https://academic.oup.
com/jcr/article-pdf/21/2/274/5093700/21-2-274.pdf
39. Salvucci DD (2001) An integrated model of eye movements and visual encoding. J Cogn Syst
Res 1:201–220. www.elsevier.com/locate/cogsys
40. Salvucci DD, Goldberg JH (2000) Identifying fixations and saccades in eye-tracking protocols.
In: Proceedings of the 2000 symposium on Eye tracking research & applications, pp 71–78
41. Sesma L, Villanueva A, Cabeza R (2012) Evaluation of pupil center-eye corner vector for gaze
estimation using a web cam. In: Proceedings of the symposium on eye tracking research and
applications, pp 217–220
42. Sibert LE, Jacob RJ (2000) Evaluation of eye gaze interaction. In: Proceedings of the SIGCHI
conference on Human Factors in Computing Systems, pp 281–288
43. Sugano Y, Matsushita Y, Sato Y (2014) Learning-by-synthesis for appearance-based 3D gaze
estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition,
pp 1821–1828
44. Tan KH, Kriegman DJ, Ahuja N (2002) Appearance-based eye gaze estimation. In: Proceedings
of the sixth IEEE workshop on applications of computer vision, 2002. (WACV 2002). IEEE,
pp 191–195
45. Špakov O (2012) Comparison of eye movement filters used in HCI. In: Proceedings of the
symposium on eye tracking research and applications, association for computing machinery,
New York, NY, USA, ETRA ’12, pp 281–284. https://doi.org/10.1145/2168556.2168616
46. Wang K, Ji Q (2017) Real time eye gaze tracking with 3d deformable eye-face model. In:
Proceedings of the IEEE international conference on computer vision (ICCV)
47. Wang K, Zhao R, Ji Q (2018) A hierarchical generative model for eye image synthesis and
eye gaze estimation. In: Proceedings of the IEEE conference on computer vision and pattern
recognition, pp 440–448
130 X. Zhang et al.
48. Wood E, Bulling A (2014) Eyetab: Model-based gaze estimation on unmodified tablet comput-
ers. In: Proceedings of the symposium on eye tracking research and applications, pp 207–210
49. Wood E, Baltrusaitis T, Zhang X, Sugano Y, Robinson P, Bulling A (2015) Rendering of
eyes for eye-shape registration and gaze estimation. In: Proceedings of the IEEE international
conference on computer vision (ICCV)
50. Yu Y, Liu G, Odobez JM (2018) Deep multitask gaze estimation with a constrained landmark-
gaze model. In: Proceedings of the European conference on computer vision (ECCV), pp 0–0
51. Yu Y, Liu G, Odobez JM (2019) Improving few-shot user-specific gaze adaptation via gaze
redirection synthesis. In: Proceedings of the IEEE conference on computer vision and pattern
recognition, pp 11937–11946
52. Zhai S, Morimoto C, Ihde S (1999) Manual and gaze input cascaded (magic) pointing. In:
Proceedings of the SIGCHI conference on human factors in computing systems, association
for computing machinery, New York, NY, USA, CHI’99, pp 246–253. https://doi.org/10.1145/
302979.303053
53. Zhang X, Sugano Y, Fritz M, Bulling A (2015) Appearance-based gaze estimation in the wild.
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4511–
4520
54. Zhang X, Sugano Y, Bulling A (2017a) Everyday eye contact detection using unsupervised
gaze target discovery. In: Proceedings of the 30th annual ACM symposium on user interface
software and technology, pp 193–203
55. Zhang X, Sugano Y, Fritz M, Bulling A (2017b) It’s written all over your face: full-face
appearance-based gaze estimation. In: Proceedings of the IEEE conference on computer vision
and pattern recognition workshops, pp 51–60
56. Zhang X, Sugano Y, Fritz M, Bulling A (2017c) Mpiigaze: real-world dataset and deep
appearance-based gaze estimation. IEEE Trans Pattern Anal Mach Intell 41(1):162–175
57. Zhang X, Huang MX, Sugano Y, Bulling A (2018a) Training person-specific gaze estimators
from user interactions with multiple devices. In: Proceedings of the 2018 CHI conference on
human factors in computing systems, pp 1–12
58. Zhang X, Sugano Y, Bulling A (2018b) Revisiting data normalization for appearance-based
gaze estimation. In: Proceedings of the 2018 ACM symposium on eye tracking research &
applications, pp 1–9
59. Zhang X, Sugano Y, Bulling A (2019) Evaluation of appearance-based methods and impli-
cations for gaze-based applications. In: Proceedings of the 2019 CHI conference on human
factors in computing systems, pp 1–13
60. Zhang X, Park S, Beeler T, Bradley D, Tang S, Hilliges O (2020a) Eth-xgaze: a large scale
dataset for gaze estimation under extreme head pose and gaze variation. In: European confer-
ence on computer vision. Springer, pp 365–381
61. Zhang X, Sugano Y, Bulling A, Hilliges O (2020b) Learning-based region selection for end-
to-end gaze estimation. In: British machine vision virtual conference (BMVC)
62. Zheng Y, Park S, Zhang X, De Mello S, Hilliges O (2020) Self-learning transformations for
improving gaze and head redirection. Adv Neural Inf Process Syst 33
63. Zhu JY, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-
consistent adversarial networks. In: 2017 IEEE international conference on computer vision
(ICCV)
AI-Driven Intelligent Text Correction
Techniques for Mobile Text Entry
Mingrui Ray Zhang, He Wen, Wenzhe Cui, Suwen Zhu, H. Andrew Schwartz,
Xiaojun Bi, and Jacob O. Wobbrock
1 Introduction
Text entry techniques on touch-based mobile devices today are generally well devel-
oped. Ranging from tap-based keyboard typing to swipe-based gesture typing [64],
today’s mobile text entry methods employ a range of sophisticated algorithms
designed to maximize speed and accuracy. Although the results reported from var-
ious papers [46, 55] show that mobile text entry can reach reasonably high speeds,
some even as fast as desktop keyboards [55], the daily experience of mobile text
composition is still often lacking. One bottleneck lies in the text correction process.
On mobile touch-based devices, text correction often involves repetitive backspac-
ing and moving the text cursor with repeated taps and drags over very small targets
Portions of this chapter are reproduced with permission of the ACM from the following previously
published papers [10, 66].
Fig. 1 The three correction interactions of Type, Then Correct: a Drag-n-Drop lets the user drag
the last word typed and drop it on an erroneous word or gap between words; b Drag-n-Throw lets
the user drag a word from the suggestion list and flick it into the general area of the erroneous
word; c Magic Key highlights each possible error word after the user types a correction. Directional
dragging from atop the magic key navigates among error words, and tapping the magic key applies
the correction
(i.e., the characters and spaces between them). Owing to the fat finger problem [56],
this process can be slow and tedious indeed. In this chapter, we will introduce two
projects that apply techniques in Natural Language Processing (NLP) to improve the
text correction interaction for touch screen text entry.
Correcting text is a consistent and vital activity during text entry. A study by
MacKenzie and Soukoreff showed that backspace was the second most common
keystroke during text entry (pp. 164–165) [36]. Dhakal et al. [12] found that during
typing, people made 2.29 error corrections per sentence, and that slow typists actually
made and corrected more mistakes than the fast typists.
For immediate error corrections, i.e., when an error is noticed right after it is made,
the user can press backspace to delete the error [53]. However, for overlooked error
corrections, the current cursor movement-based text correction process on smart-
phones is laborious: one must navigate the cursor to the error position, delete the
error text, re-enter the correct text, and finally navigate the cursor back. There are
three ways to position the cursor: (1) by repeatedly pressing the backspace key [53];
(2) by pressing arrow keys on some keyboards or making gestures such as swipe-left;
and (3) by using direct touch to move the cursor. The first two solutions are more
precise than the last one, which suffers from the fat finger problem [56], but they
require repetitive actions. The third option is error-prone when positioning the cursor
amid small characters, which increases the possibility of cascading errors [3]; it also
increases the cognitive load of the task and takes on average 4.5 s to perform the
tedious position-edit-reposition sequence [18].
The two projects in this chapter are based on the same premise: What if we can
skip positioning the cursor and deleting errors? Given that the de facto method of
correcting errors relies heavily on these actions, this question is quietly radical.
What if we just type the correction text, and apply it to the error? The first project,
“Type, Then Correct” (TTC) contains three interactions (Fig. 1): (1) Drag-n-Drop is a
simple baseline technique that allows users to drag the last-typed word as a correction
and drop it on the erroneous text to correct substitution and omission errors [59].
(2) Drag-n-Throw is the “intelligent” version of Drag-n-Drop: it allows the user to
flick a word from the keyboard’s suggestion list toward the approximate area of the
erroneous text. The deep learning algorithm finds the most likely error within the
general target area and automatically corrects it. (3) Magic Key does not require direct
interaction with the text input area at all. After typing a correction, the user simply
presses a dedicated key on the keyboard, and the deep learning algorithm highlights
possible errors according to the typed correction. The user can then drag atop
the key to navigate through the error candidates and tap the key again to apply the
correction. All three of our interaction techniques require no movement of the text
cursor and no use of backspace.
The second project, JustCorrect, is the evolution of the TTC project. It simpli-
fies the concept even further by reducing the need to specify the error position. To
substitute an incorrect word or insert a missing word in the sentence, the user sim-
ply types the correction at the end, and JustCorrect will automatically commit the
correction without the user’s intervention. Additional options are also provided for
better correction coverage. In this way, JustCorrect makes post hoc text correction
on the recently entered sentence as straightforward as text entry.
We evaluated the two text correction projects with multiple text entry experiments
and compared their performances. The results revealed that both TTC and JustCorrect
resulted in faster correction times, and were preferred over the de facto technique.
2 Related Work
In the following subsections, we first review research related to text entry correction
behaviors on touch screens. We then present current text correction techniques for
mobile text entry and multi-modal text input techniques. Finally, we provide a short
introduction to natural language processing (NLP) algorithms for text correction.
Many researchers have found that typing errors are common on touch-based keyboards
and that current correction techniques are left wanting in many ways. For example,
sending error-ridden messages, with typos and errors arising from auto-correction [27],
is a top concern among older adults. Moreover,
Komninos et al. [28] observed and recorded in-the-wild text entry behaviors on
Android phones, and found that users made around two word-level errors per typing
session, which slowed text entry considerably. Also, participants “predominantly
employed backspacing as an error correction strategy.” Based on their observations,
Komninos et al. recommended that future research needed to “develop better ways
for managing correction,” which is the very focus of this chapter.
In most character-level text entry schemes, there are three types of text entry errors
[35, 59]: substitutions, where the user enters different characters than intended; omis-
sions, where the user fails to enter characters; and insertions, where the user injects
erroneous characters. Substitutions were found to be the most frequent error among
these types. In a smart watch-based text entry study [30], out of 888 phrases, partic-
ipants made 179 substitution errors, 31 omission errors, and 15 insertion errors. In a
big data study of keyboarding [12], substitution errors (1.65%) were observed more
frequently than omission (0.80%) and insertion (0.67%) errors. Our correction tech-
niques address substitution and omission errors; we do not address insertion errors
because users can just delete insertions without typing any corrections. Moreover,
insertion errors are relatively rare.
While much previous work focused on user behaviors during mobile text entry, there
have been a few projects that improved upon the text correction process. Such work
often adopted a cursor-based editing approach. For example, prior research
proposed controlling the cursor by using a magnifying lens [2], pressing hard on the
keyboard to turn it into a touchpad [2], or adding arrow keys [58]. Gestural operations
have also been proposed to facilitate positioning the cursor. Examples included using
left and right gestures [18], sliding left or right from the space key [22] to move the
cursor, or using a “scroll ring” gesture along with swipes in four directions [65].
The smart-restorable backspace [4] project had the most similar goal to that of
this chapter: to improve text correction without extensive backspacing and cursor
positioning. The technique allowed users to perform a swipe gesture on the backspace
key to delete the text back to the position of an error, and restore that text by swiping
again on the backspace key after correcting the error. To determine error positions,
the technique compares the edit distance between each typed word and words in a dictionary.
The error detection algorithm is the main limitation of that work: it only detects
misspellings. It cannot detect grammar errors or word misuse. By contrast, the two
projects in this chapter could detect a wide range of errors based on deep learning
techniques.
Commercial products exhibit a variety of text correction techniques. Gboard [34]
allows a user to touch a word and replace it by tapping another word in a suggestion
list. However, this technique is limited to misspellings. Some keyboards,
such as the Apple iOS 9 keyboard, support indirect cursor control by treating the
keyboard as a trackpad. Unfortunately, prior research [48] showed that this design
brought no time or accuracy benefits compared to direct pointing. The Grammarly
keyboard [23] keeps track of the input text and provides corrections in the suggestion
list. Grammarly uses NLP algorithms to generate correction suggestions, and it is able
to detect both spelling and grammar errors. The user simply taps the suggestion to apply it.
Many soft keyboards (e.g., Gboard [34]) support entering text via different modalities,
such as tap typing, gesture typing, and voice input. Previous research has explored
fusing information from multiple modalities to reduce text entry ambiguity, such
as combining speech and gesture typing [41, 51], using finger touch to specify the
word boundaries in speech recognition [50], or using unistrokes together with key
landings [25] to improve input efficiency.
JustCorrect also investigated how different input modalities affect correction
performance. It was particularly inspired by ReType [52], which used eye-gaze input to
estimate the text editing location. We advanced this idea by inferring the editing intention
from the entered word alone, making the technique suitable for mobile devices,
which typically are not equipped with eye-tracking capabilities.
The projects in this chapter use deep learning algorithms from natural language
processing (NLP) to find possible errors based on typed corrections. We therefore
provide a brief introduction to related techniques.
Traditional error correction algorithms utilize N-grams and edit distances to pro-
vide correction suggestions. For example, Islam and Inkpen [24] presented an algo-
rithm that uses the Google 1T 3-gram dataset and a string-matching algorithm to
detect and correct spelling errors. For each word in the original string, they first
search for candidates in the dictionary, and assign each possible candidate a score
derived from their frequency in the N-gram dataset and the string-matching algo-
rithm. The candidate with the highest score above a threshold is suggested as a
correction.
Recently, deep learning has gained popularity in NLP research because of its
generalizability and significantly better performance than traditional algorithms. For
NLP tasks, convolutional neural networks (CNN) and recurrent neural networks
(RNN) are extensively used. They often follow a structure called encoder–decoder,
where part of the model encodes the input text into a feature vector, then decodes the
vector into the result. In TTC, we utilize an RNN in this encoder–decoder pattern.
A thorough explanation of these methods is beyond the current scope. Interested
readers are directed to prior work [6, 54, 61].
Most researchers treat the error correction task as a language translation task in
deep learning because their input and output are both sentences—for error correction,
the input is a sentence with errors and the output is an error-free sentence; for language
translation, the input is a sentence in one language and the output is a sentence in
another language. For example, Xie et al. [60] presented an encoder–decoder RNN
correction model that operates on input and output at the character level. Their model
was built upon a sequence-to-sequence model for translation [6], which was also
used in the algorithm of the TTC project for error detection.
We present the design and implementation of the three interaction techniques of Type,
Then Correct (TTC). The common features of these interactions are: (1) the first step
is always to type the correction text at the current cursor position, usually the end
of the current input stream; (2) all correction interactions can be undone by tapping
the undo key on the keyboard (Fig. 2, top right); (3) after a correction is applied, the
text cursor remains at the last character of the text input stream, allowing the user
to continue typing without having to move the cursor. A current, but not theoretical,
limitation is that we only allow the correction text to be contiguous alphanumeric
text without special characters or spaces.
Fig. 3 The three interaction techniques. Drag-n-Drop: a.1 Type a word and then touch it to ini-
tiate correction; a.2 Drag the correction to the error position. Touched words are highlighted and
magnified, and the correction shows above the magnifier; a.3 Drop the correction on the error to
finish. Drag-n-Throw: b.1 Dwell on a word from the suggestion list to initiate correction. The word
will display above the finger; b.2 Flick the finger toward the area of the error: here, the flick ended
on “the,” not the error text “technical”; b.3 The algorithm determines the error successfully, and
confirming animation appears. Magic Key: c.1 Tap the magic key (the circular button) to trigger
correction. Here, “error” is shown as the nearest potential error. c.2 Drag left from atop the magic
key to highlight the next possible error in that direction. Now, “magical” is highlighted. c.3 Tap the
magic key again to commit the correction “magic”
3.1 Drag-n-Drop
Fig. 4 Perceived input point: a the user views the top of the fingernail as the input point [21]; b
but today’s hardware regards the center of the contact area as the touch input point, which is not the
same. Figure adapted from [56]
Similar to Shift by Vogel and Baudisch [56], we adjusted the input point to 30
pixels above the actual contact point, to reflect the user’s perceived input point [21].
Vogel and Baudisch suggested that “users perceived the selection point of the finger
as being located near the top of the finger tip” [7, 56], while the actual touch point
was roughly at the center of the finger contact area [49], as shown in Fig. 4. After
the correction is dropped on a space (for insertion) or on a word (for substitution),
there is an animated color change from orange to black, confirming the successful
application of the correction text.
3.2 Drag-n-Throw
Similar to Drag-n-Drop, Drag-n-Throw also requires the user to drag the correction.
But unlike Drag-n-Drop, with Drag-n-Throw, the user flicks the correction from the
word suggestion list atop the keyboard, not from the text area, allowing the user’s
fingers to stay near the keyboard area. As before, the correction text shows above the
touch point as a reminder (Fig. 3b.1). Instead of dropping the correction on the error
position, the user throws (i.e., flicks) the correction to the general area of the text to
be corrected. Once the correction is thrown, our deep learning algorithm determines
the error position, and corrects the error either by substituting the correction for a
word, or by inserting the correction. Color animation is displayed to confirm the
correction. The procedure is shown in Fig. 3b.1–3.
We enable the user to drag the correction from the suggestion list because it is
quicker and more accurate than directly interacting with the text, which has smaller
targets. Moreover, our approach provides more options and saves time because of
the word-completion function. For example, if the user wants to type “dictionary,”
she can just type “dic” and “dictionary” appears in the list. Or, if the user misspells
“dictonary,” omitting an “i,” the correct word still appears in the list because of the
keyboard’s decoding algorithm.
Drag-n-Drop required interaction within the text input area; Drag-n-Throw kept the
fingers closer to the keyboard but still required some interaction in the text input
area. With Magic Key, the progression “inward” toward the keyboard is fulfilled, as
the fingers do not interact with the text input area at all, never leaving the keyboard.
Thus, round trips [15] between the keyboard and text input area are eliminated.
With Magic Key, after typing the correction, the user taps the magic key on
the keyboard (Fig. 3c.1), and the possible error text is highlighted. If a space is
highlighted, an insertion is suggested; if a word is highlighted, a substitution is
suggested. The nearest possible error to the just-typed correction is highlighted
first; if it is not the desired correction, the user can drag left from atop the magic key
to show the next possible error. The user can drag left or right from atop the
magic key to rapidly navigate among different error candidates. Finally, the user can
tap the magic key to commit the correction. The procedure is shown in Fig. 3c.1–3.
To cancel the operation, the user can simply tap any key (other than undo or the
magic key itself).
In this section, we present the deep learning algorithm for text correction and its natu-
ral language processing (NLP) model, the data collection and processing procedures,
and the training process and validation results.
Inspired by Xie et al. [60], we applied a recurrent neural network (RNN) encoder–
decoder model similar to the translation task for text corrections. The encoder
contains a character-level convolutional neural network (CNN) [26] and two bi-
directional gated recurrent unit (GRU) layers [9]. The decoder contains a word-
embedding layer and two GRU layers. The overall flow of the model is shown in
Fig. 5, and the encoder–decoder structure is shown in Fig. 6.
Traditional recurrent neural networks (RNN) cannot output positional informa-
tion. Our key insight is that instead of outputting the whole error-free sentence, we
make the decoder only output five words around the proposed correction position,
i.e., the correction word and its four neighboring words (two before, two after). If
there are not enough words, the decoder will output the flags <bos> or <eos> instead
for beginning-of-sentence and end-of-sentence, respectively. To locate the correction
position, we compare the output with the input sentence word-by-word, and choose
the position that aligns with most words. For the example in Fig. 5, we first tokenize
the input and add two <bos> and two <eos> to the start and end of the tokens. Then
we compare the output with the input:
Input: <bos> <bos> thanks the reply <eos> <eos>
CS: <bos> thanks for the reply
CI: <bos> thanks the reply
Above, “CS” means compare for substitution, which finds the best alignment for
substitution (it tries to align all five output words with the input); “CI” means compare
for insertion, which finds the best alignment for insertion (it aligns only the first and
last two words of the output, as the center word is the insertion correction). In the
example, CI has the best alignment with four tokens (<bos>, thanks, the, reply), so
“for” is inserted between “thanks” and “the.” If the number of aligned tokens is the
same in both comparisons, our implementation uses insertion.
Fig. 5 The encoder–decoder model for text correction. The model outputs five words in which the
middle word is the correction. In this way, we get the correction’s location
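To make the comparison concrete, the following Python sketch implements the CS/CI alignment described above. It assumes the decoder's five-word output window and the <bos>/<eos> padding from the example; the function and variable names are ours and only illustrate the idea, not our exact implementation.

    def locate_correction(input_tokens, output_tokens):
        """Sketch of the CS/CI comparison described above (names are illustrative).
        output_tokens is the decoder's five-word window whose middle word is the
        correction; returns (kind, index, correction) where index is the position
        in input_tokens to substitute at (CS) or to insert before (CI)."""
        padded = ["<bos>", "<bos>"] + list(input_tokens) + ["<eos>", "<eos>"]
        correction = output_tokens[2]

        def matches(offset, kind):
            count = 0
            for k in range(5):
                if kind == "insert" and k == 2:
                    continue  # the middle word is new text with no input counterpart
                j = offset + k - (1 if kind == "insert" and k > 2 else 0)
                if 0 <= j < len(padded) and padded[j] == output_tokens[k]:
                    count += 1
            return count

        best_sub = max(range(len(padded)), key=lambda o: matches(o, "substitute"))
        best_ins = max(range(len(padded)), key=lambda o: matches(o, "insert"))

        # Ties between the two comparisons favour insertion, as in our implementation.
        if matches(best_ins, "insert") >= matches(best_sub, "substitute"):
            return "insert", best_ins, correction      # insert before input_tokens[best_ins]
        return "substitute", best_sub, correction      # replace input_tokens[best_sub]

For the example above, locate_correction(["thanks", "the", "reply"], ["<bos>", "thanks", "for", "the", "reply"]) returns ("insert", 1, "for"), i.e., insert “for” before “the.”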
We now explain the details of the encoder and the decoder (Fig. 6). For the encoder,
because there might be typos and rare words in the input, operating on the character
level is more robust and generalizable than operating on the word level. We first
apply the character-level CNN [26] composed of Character Embedding, Multiple
Conv. Layers and Max-over-time Pool layers (Fig. 6, left). Our character-level CNN
generates an embedding for each word at the character level. The character embed-
ding layer converts the characters of a word into a vector of L × E_c dimensions.
We set E_c to 15 and fixed L to 18 in our implementation, which means the longest
word can contain 18 characters (longer words are discarded). Words with fewer than
18 characters are padded with zeroes in the input vector. We then apply multiple
convolution layers to this vector. After convolution, we apply max-pooling to obtain
a fixed-dimensional (E_w) representation of the word. In our implementation, we used
convolution filters of widths [1, 2, 3, 4, 5] with sizes [15, 30, 50, 50, 55], yielding a
fixed vector of size 200. E_w was likewise set to 200 in the decoder.
We also needed to provide the correction information to the encoder. We achieved
this by feeding the correction into the same character-level CNN and concatenating
the correction embedding with the embedding of the current word. This yielded a
vector of size 2E_w, which was then fed into two bi-directional GRU layers. The
hidden size H of the GRU was set to 300 in both the encoder and the decoder.
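As an illustration, the following PyTorch sketch shows a character-level CNN word encoder with the dimensions stated above (L = 18, E_c = 15, filter widths 1–5 with 15/30/50/50/55 filters, summing to E_w = 200). The class name, padding index, and ReLU nonlinearity are our assumptions, and the bi-directional GRU layers, the correction concatenation, and the decoder are omitted for brevity.

    import torch
    import torch.nn as nn

    class CharCNNWordEncoder(nn.Module):
        """Character-level CNN that maps a word (a zero-padded sequence of L=18
        character ids) to a fixed 200-dimensional word embedding, using the
        dimensions stated in the text: E_c=15 and filter widths 1-5 with
        15/30/50/50/55 filters (15+30+50+50+55 = 200 = E_w)."""

        def __init__(self, num_chars, char_emb_size=15):
            super().__init__()
            self.char_embedding = nn.Embedding(num_chars, char_emb_size, padding_idx=0)
            widths = [1, 2, 3, 4, 5]
            num_filters = [15, 30, 50, 50, 55]
            self.convs = nn.ModuleList(
                [nn.Conv1d(char_emb_size, n, kernel_size=w)
                 for w, n in zip(widths, num_filters)]
            )

        def forward(self, char_ids):
            # char_ids: (batch, 18) integer character indices, zero-padded to length 18.
            x = self.char_embedding(char_ids)          # (batch, 18, E_c)
            x = x.transpose(1, 2)                      # (batch, E_c, 18), as Conv1d expects
            # Max-over-time pooling per filter bank; the ReLU is our assumption.
            pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
            return torch.cat(pooled, dim=1)            # (batch, 200) word embedding E_w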
Fig. 6 Illustration of the encoder and decoder, corresponding to each vertical blue box in Fig. 5. L is the
number of characters in a word; E_c is the character embedding size; H is the hidden size; E_w is the
word embedding size; N_w is the word dictionary size
The decoder first embedded the word in a vector of size E_w, which was set to
200. This embedding was then concatenated with the attention output. We used the same attention
mechanism as Bahdanau et al. [6]. Two GRU layers and a log-softmax layer then
followed to output the predicted word.
We used the CoNLL 2014 Shared Task [40] and its extension dataset [6] as a part
of the training data. The data contained sentences from essays written by English
learners with correction and error annotations. We extracted the errors that were
either insertion or substitution errors. In all, we gathered over 33,000 sentences for
training.
To gather even more training data, we perturbed several large datasets containing
normal text. We used the Yelp reviews (containing two million samples) and part
of the Amazon reviews dataset (containing 130,000 samples) generated by Zhang
et al. [67]. We treated these review data as if they were error-free texts, and applied
artificial perturbation to them. Specifically, we applied four perturbation methods:
1. Typo simulation. To simulate real typos, we applied a simulation method
similar to that of Fowler et al. [16]. The simulation treated the touch point dis-
ping of [−10, +10], and a teacher forcing ratio of 0.5. We also used dropout with
probability 0.2 in all GRU layers. For the word embedding layer in the decoder, we
labeled words with frequencies less than 2 in the training set as <unk> (unknown).
4.5 Results
Table 1 shows the evaluation results on the two testing datasets. The recall is 1 because
all of our testing data contained errors. We regarded a prediction as correct if the
predicted error position matched the actual error position under the comparison algorithm described above.
Table 1 The performance of our correction model on the two testing datasets
Dataset Accuracy (%)
CoNLL 2013 75.68
Wikipedia revisions 81.88
2 https://android.googlesource.com/platform/packages/inputmethods/LatinIME/.
3 https://github.com/farmerbb/Notepad.
To prevent the correction from being applied too far away from the throwing endpoint, we constrained
the x-coordinate of the correction to be within 250 pixels of the finger-lift endpoint.
For the Magic Key technique, the keyboard notified the notebook when the magic
key was pressed or dragged. The notebook would treat the last word typed as the
correction and send the last 1,000 characters to the server. The server then split the
text into groups of 60 characters with overlaps of 30 characters, and predicted a
correction for each group. When the notebook received the prediction results, it first
highlighted the nearest option, and then switched to further error options when the
key was dragged left. For substitution corrections, it would highlight the whole word
to be substituted; for insertion corrections, it would highlight the space where the
correction was to be inserted.
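The chunking itself is straightforward. The sketch below combines the notebook-side truncation to the last 1,000 characters with the server-side splitting into overlapping 60-character groups; the function and parameter names are illustrative, not our exact implementation.

    def split_into_groups(text, group_len=60, overlap=30):
        """Split the tail of the typed text into overlapping groups, as described
        above (last 1,000 characters, 60-character groups, 30-character overlap)."""
        text = text[-1000:]
        if not text:
            return []
        step = group_len - overlap                     # 30-character stride
        return [text[i:i + group_len]
                for i in range(0, max(len(text) - overlap, 1), step)]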
The server running the correction model handled correction requests over HTTP.
To increase the accuracy of the model for typos, we first calculated a matching
score between each token of the input text and the correction using the Levenshtein
algorithm [31]. The score equaled the number of matching characters divided by the total
number of characters in the two words. If the score of a word in the sentence was above
0.75, we treated that word as the error to be corrected. Otherwise, we fed the text and
correction into the aforementioned neural network model.
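A sketch of this pre-check is shown below. It approximates the matching score with Python's difflib ratio (2 × matches divided by the total number of characters of the two words); the 0.75 threshold follows the text, while the exact scoring function and the case folding are our assumptions.

    from difflib import SequenceMatcher

    def find_typo_target(tokens, correction, threshold=0.75):
        """Cheap pre-check before calling the neural model: if some word in the
        sentence closely matches the typed correction, treat that word as the error."""
        best_index, best_score = None, 0.0
        for i, token in enumerate(tokens):
            score = SequenceMatcher(None, token.lower(), correction.lower()).ratio()
            if score > best_score:
                best_index, best_score = i, score
        if best_score > threshold:
            return best_index          # index of the likely misspelled word
        return None                    # fall back to the neural correction model

If the function returns an index, the server substitutes the correction for that word directly; otherwise the request falls through to the encoder–decoder model.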
We evaluated three aspects of our correction techniques: (1) timing and efficiency; (2)
success rate of Drag-n-Throw and Magic Key; and (3) users’ subjective preferences.
We conducted an experiment containing two tasks: a correction task and a compo-
sition task. The correction task purely evaluated the efficiency of the interactions,
and the composition task evaluated the usability and success rate of the intelligent
techniques in more realistic scenarios.
5.1 Participants
5.2 Apparatus
A Google Pixel 2 XL was used for the study. The phone had a 6.0″ screen with a
1440 × 2880 resolution. We added logging functions to the notebook application
to record correction time. The server running the correction model had a single GTX
1080.
Both tasks utilized a within-subjects study design. For the correction task, we chose
30 phrases from the test dataset on which the correction model had been 100% correct
because we wanted purely to evaluate the performance of the interaction technique,
not of the predictive model. We split the phrases evenly into three categories: typos,
word changes, and insertions. Typos required replacement of a few characters in a
word; word changes required replacing a whole word in a phrase; and insertions
required inserting a correction. For each category, we had five near-error phrases
where the error positions were within the last three words; and five far-error phrases
where the error positions were farther away. The reason was to see whether error
positions would affect correction efficiency. Examples of phrases in each category
are provided in the appendix.
5.4 Procedure
Fig. 7 a The notebook application showing the test phrase. b The intended correction displayed
on the computer screen. c After each correction, a dialog box appeared
When participants were correcting errors with Drag-n-Throw and Magic Key, the
experimenter recorded whether any failure happened in order to calculate the error
rate.
When the two tasks ended, participants filled out a NASA-TLX survey [47] and
a usability survey adapted from the SUS questionnaire [8] for each interaction.
For the correction task, 2,400 phrases were collected in total. In analyzing the correction task,
we focus on task completion times; for the composition task, we focus on the success
rate of the two intelligent interaction techniques and users’ preferences.
Figure 8 shows correction times for the four techniques. In addition to overall times,
the correction times for near-error and far-error phrases are also shown. We log-
transformed correction times to comply with the assumption of conditional nor-
mality, as is often done with time measures [32]. We used linear mixed model
analyses of variance [17, 33], finding that there was no order effect on correc-
tion time (F(3, 57) = 1.48, n.s.), confirming that our counter-balancing worked.
Furthermore, technique had a significant effect on correction time for all phrases
(F(3, 57) = 26.49, p < 0.01), near-error phrases (F(3, 57) = 29.02, p < 0.01),
and far-error phrases (F(3, 57) = 17.04, p < 0.01), permitting us to investigate
post hoc pairwise comparisons.
We performed six post hoc paired-sample t-tests with Holm’s sequential Bonfer-
roni procedure [20] to correct for Type I error rates, finding that for all phrases,
Fig. 8 Average correction times in seconds for different interaction techniques (lower is better).
Drag-n-Throw was the fastest for all phrases and far-error phrases, while Magic Key was the fastest
for near-error phrases. Error bars are +1 SD
Fig. 9 Average correction times in seconds for different correction types (lower is better). Drag-
n-Throw was the fastest for all three types. Error bars are +1 SD
0.01); also, the de facto cursor-based method was significantly slower than Drag-n-
Throw (t(19) = 5.45, p < 0.01) and Magic Key (t(19) = 5.20, p < 0.01).
In the text composition task, we recorded errors when participants were using Drag-
n-Throw and Magic Key. With Drag-n-Throw, participants made 108 errors in all,
and 95 of them were successfully corrected, a success rate of 87.9%. Among the
successfully corrected errors, nine were attempted more than once because the cor-
rections were not applied to expected error positions. With Magic Key, participants
made 101 errors in all, and 98 of them were successfully corrected, a success rate of
97.0%.
The composite scores of the SUS usability [8] and TLX [47] surveys for different
interaction techniques are shown in Fig. 10. Participants generally enjoyed using
Magic Key and Drag-n-Throw more than the de facto cursor-based method and
Drag-n-Drop. Also, the two deep learning techniques were perceived to have lower
workload than the other two.
Fig. 10 Composite usability (higher is better) and NASA-TLX (lower is better) scores for different
techniques. Magic Key was rated as the most usable and having the lowest workload
The key to JustCorrect lies in successfully inferring a user’s editing intention based
on the entered word and the prior context.
Fig. 11 This figure shows how JustCorrect works. 1. The user enters a sentence with an error jimo
using tap typing; 2. To correct jimo to jumps, they can either tap-type jumps and press the editing
button (2a), or switch to gesture type jumps (2b). 3. JustCorrect then substitutes jimo with jumps.
Two alternative correction options are also presented. The editing procedure involves no manual
operations except entering the correct text
Table 2 An example of eight substitution candidates. They are generated by replacing a word in
the sentence “a quick fox jimo over a lazy dog” with “jumps.” S_i means that the ith word w_i in the
sentence is replaced by w*. SubScore_i is the substitution score used for ranking substitution
candidates. SS_i, ES_i, and WS_i are the scores from the sentence, edit distance, and word embedding
channels, respectively
Substitution candidate                        SubScore_i  SS_i  ES_i  WS_i
S_1: jumps quick fox jimo over a lazy dog     0.56        0     0     0.56
S_2: a jumps fox jimo over a lazy dog         0.89        0.2   0.2   0.48
S_3: a quick jumps jimo over a lazy dog       0.42        0.42  0     0
S_4: a quick fox jumps over a lazy dog        1.71        1     0.6   0.11
S_5: a quick fox jimo jumps a lazy dog        0.75        0.18  0     0.57
S_6: a quick fox jimo over jumps lazy dog     0.56        0     0     0.56
S_7: a quick fox jimo over a jumps dog        1.11        0.11  0     1
S_8: a quick fox jimo over a lazy jumps       0.48        0.18  0     0.31
The post hoc correction algorithm takes the currently entered sentence S and an editing
word w* as input, and revises S by either substituting a word w_i in S with w*, or
inserting w* at an appropriate position. The algorithm offers three correction
suggestions, with the top suggestion automatically adopted by default and the others
easily selected with one additional tap.
Take the sentence S = “a quick fox jimo over a lazy dog.” The user inputs jumps as
the editing word w*. Because the sentence has eight words, there are eight substitution
and nine insertion possibilities: _a_quick_fox_jimo_over_a_lazy_dog_. The
nine possible insertion positions are indicated by the underscores. The post hoc
correction algorithm then generates eight substitution candidates (S_1–S_8), as shown
in Table 2, and nine insertion candidates (I_1–I_9), as shown in Table 3.
The algorithm then ranks the substitution candidates according to the substitution
scores, and ranks the insertion candidates according to the insertion scores. These
scores are later compared to generate the final correction suggestions.
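Candidate generation itself is simple enumeration; a minimal sketch (function and variable names are illustrative) is:

    def generate_candidates(sentence, editing_word):
        """Enumerate every substitution and insertion candidate for the editing word."""
        words = sentence.split()
        substitutions = [
            " ".join(words[:i] + [editing_word] + words[i + 1:])   # S_1 ... S_N
            for i in range(len(words))
        ]
        insertions = [
            " ".join(words[:i] + [editing_word] + words[i:])       # I_1 ... I_{N+1}
            for i in range(len(words) + 1)
        ]
        return substitutions, insertions

Calling generate_candidates("a quick fox jimo over a lazy dog", "jumps") yields the eight substitution candidates of Table 2 and the nine insertion candidates of Table 3.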
Table 3 An example of nine insertion candidates. They are generated by inserting “jumps” before
or after every word in the sentence “a quick fox jimo over a lazy dog.” I_i means w* is inserted at
the ith insertion location. InserScore_i is the insertion score used for ranking insertion candidates
Insertion candidate                             InserScore_i
I_1: jumps a quick fox jimo over a lazy dog     0.06
I_2: a jumps quick fox jimo over a lazy dog     0.04
I_3: a quick jumps fox jimo over a lazy dog     0.52
I_4: a quick fox jumps jimo over a lazy dog     1
I_5: a quick fox jimo jumps over a lazy dog     0.91
I_6: a quick fox jimo over jumps a lazy dog     0.24
I_7: a quick fox jimo over a jumps lazy dog     0
I_8: a quick fox jimo over a lazy jumps dog     0
I_9: a quick fox jimo over a lazy dog jumps     0.5
The substitution score reflects how likely it is that a substitution candidate represents the user’s
actual editing intention. We look for robust evidence of the substituted word along
three dimensions: orthographic (i.e., character) distance, syntactosemantic (i.e.,
meaning) distance, and sequential coherence (i.e., making sense in context). More
specifically, for the ith substitution candidate S_i, its substitution score SubScore_i is
defined as
SubScore_i = ES_i + WS_i + SS_i,    (1)
The edit distance channel calculates the editing similarity for each substitution candidate.
The Levenshtein edit distance [31] between two strings is the minimum number
of single-character edits (deletions, insertions, or substitutions) needed to
transform one string into the other. The editing similarity ES_i is defined as
ES_i = L(w*, w_i) / max(|w*|, |w_i|),    (2)
where w* is the correction, w_i is the ith word in the previous text, and max(|w*|, |w_i|)
denotes the length of the longer of w* and w_i.
The word embedding channel estimates the semantic and syntactic similarity WS_i
between the editing word w* and the substituted word w_i in S_i. In this channel,
words from the vocabulary are mapped to vectors derived from statistics on the co-
occurrence of words within documents [13]. The distance between two vectors can
then be used as a measure of syntactic and semantic difference [1].
We trained our word embedding model over the “Text8” dataset [37] using the
Word2Vec skip-gram approach [38]. The cosine similarity between the vectors of w* and
w_i is then used as WS_i [1]; WS_i is normalized to the range [0, 1].
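A sketch of this channel using gensim is shown below. The skip-gram training on Text8 follows the text; the vector size, window, minimum count, the use of gensim's downloader for the corpus, and clipping negative cosine values to zero as the normalization are all our assumptions.

    from gensim.models import Word2Vec
    import gensim.downloader as api

    # Train a skip-gram Word2Vec model on the Text8 corpus.
    sentences = api.load("text8")                  # iterable of tokenized chunks
    w2v = Word2Vec(sentences, sg=1, vector_size=100, window=5, min_count=5)

    def embedding_similarity(editing_word, word):
        """WS_i: cosine similarity between the two word vectors, clipped to [0, 1]."""
        if editing_word not in w2v.wv or word not in w2v.wv:
            return 0.0                             # out-of-vocabulary words contribute nothing
        return max(0.0, float(w2v.wv.similarity(editing_word, word)))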
The sentence channel estimates the normalized sentence score of S_i using a language
model, i.e., a model that estimates the probability of a given sequence of words.
To compute the language model probability for a given sentence, we trained a 3-gram
language model using the KenLM Language Model Toolkit [19]. The language
model takes each substitution candidate sentence S_i as input and outputs its
estimated log probability P(S_i). By normalizing P(S_i) to the range of 0 to 1, we get
the normalized sentence score SS_i:
SS_i = (P(S_i) − min(P(S_j))) / (max(P(S_j)) − min(P(S_j))),    j = 1, 2, . . . , N,    (3)
where min(P(S_j)) and max(P(S_j)) are the minimum and maximum sentence channel
scores among all the N substitution possibilities, assuming the sentence S has
N words. The language model itself was trained over the Corpus of Contemporary
American English (COCA) [11] (2012 to 2017), which contains over 500 million
words.
For insertion candidates, we only use the sentence channel for insertion scores, as
there are no word-to-word comparisons for insertion candidates. Assuming S has N
words and therefore N + 1 candidates for insertion, the insertion score InserScore_i is
defined as
InserScore_i = (P(I_i) − min(P(I_j))) / (max(P(I_j)) − min(P(I_j))),    j = 1, 2, . . . , N + 1,    (4)
where min(P(I_j)) and max(P(I_j)) are the minimum and maximum sentence channel
scores among all the N + 1 insertion possibilities (I_1, I_2, . . . , I_N+1). InserScore_i
is also normalized to [0, 1].
The post hoc correction algorithm combines the substitution and insertion candidates
to generate correction suggestions by computing each candidate’s score. We compare
substitution and insertion candidates by their sentence channel scores because the
sentence channel is the component common to both score calculations. The three
candidates with the highest scores are shown on the interface. If all three candidates
are of the same kind (substitution or insertion), we replace the last candidate with the
top candidate of the other kind to ensure the diversity of the suggestions. The top
suggestion is automatically committed to the text (see Sect. 7, Part 3).
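One possible way to merge the two candidate lists, following the description above (cross-kind comparison by sentence channel score, top three shown, diversity enforced), is sketched below; the within-kind ranking by SubScore_i and InserScore_i is omitted for brevity, and all names are illustrative.

    def top_suggestions(sub_candidates, sub_ss, ins_candidates, ins_ss, k=3):
        # sub_ss / ins_ss: sentence-channel scores (SS_i) for each candidate, used
        # here as the common yardstick for comparing the two kinds of candidates.
        ranked = sorted(
            [("substitute", c, s) for c, s in zip(sub_candidates, sub_ss)]
            + [("insert", c, s) for c, s in zip(ins_candidates, ins_ss)],
            key=lambda item: item[2],
            reverse=True,
        )
        top = ranked[:k]
        if len({kind for kind, _, _ in top}) == 1:      # all of the same kind
            other = next((r for r in ranked[k:] if r[0] != top[0][0]), None)
            if other is not None:
                top[-1] = other                         # keep at least one of each kind
        return top                                      # top[0] is auto-committed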
8 JustCorrect: Experiment
8.1 Participants
8.2 Apparatus
A Google Nexus 5X device (Qualcomm Snapdragon 808 processor, 1.8 GHz hexa-core
64-bit; Adreno 418 GPU; 2 GB LPDDR3 RAM; 16 GB internal storage) with
a 5.2-inch screen (1920 × 1080 LCD at 423 ppi) was used for the experiment.
8.3 Design
The study was a within-subjects design. The sole independent variable was the text
correction method with four levels:
• Cursor-based Correction. This was identical to the existing de facto cursor-based
text correction method on the stock Android keyboard.
• JustCorrect-Tap. After entering a word with tap typing, the user presses the editing
button to invoke the post-hoc correction algorithm (see Fig. 11).
Taking the sentence “a quick fox jimo over a lazy dog,” for example, if the user
wants to replace “jimo” with “jumps,” she tap types the editing word “jumps” and
then presses the editing button (see Fig. 11). The post-hoc correction algorithm
takes “jumps” as the editing word and outputs “a quick fox jumps over a lazy
dog.”
• JustCorrect-Gesture. A user performed JustCorrect with gesture typing [29, 63,
64]. After the user entered the correction word w* with a gesture and lifted the
finger, the system applied the post-hoc correction algorithm to correct the existing
phrase with that word. The other interactions were the same as those in JustCorrect-
Tap. The only difference is that JustCorrect-Tap used a button to trigger JustCorrect,
because tap typing requires a signal to indicate the end of inputting a word; this step
is omitted in JustCorrect-Gesture because gesture typing naturally indicates the end
of entering a word when the finger lifts.
• JustCorrect-Voice. A user performed JustCorrect with voice input. The user first
pressed the voice input button on the keyboard and spoke the editing word. The
post-hoc correction algorithm took the recognized word from a speech-to-text
recognition engine as the editing word w* to edit the phrase. We used the Microsoft
Azure speech-to-text engine [5] for speech recognition. The remaining interactions
were identical to the previous two conditions.
8.4 Procedure
Each participant was instructed to correct errors in the same set of 60 phrases in
each condition, and the order of the sentences was randomized. We randomly
chose 60 phrases with omission and substitution errors from the mobile typing dataset
of Palin et al. [42]. This dataset included actual input errors from 37,370 users when
Table 4 Examples of phrases in the experiment. The first three sentences contained substitution
errors. The last sentence contained an omission error
Sentences with errors Target sentences
1. Tjank for sending this Thanks for sending this
2. Should systematic manage the migration Should systems manage the migration
3. Try ir again and let me know Try it again and let me know
4. Kind like silent fireworks Kind of like silent fireworks
typing with smartphones and their target sentences. We focused on omission and
substitution errors since the post-hoc correction algorithm was designed to handle
these two types of errors. We also filtered out sentences with punctuation or number
errors because our focus was on word correction. Among the 60 phrases, 8 contained
omission errors and the rest contained substitution errors. The average (SD) edit
distance between the sentences with errors and the target sentences was 1.9 (1.2). Each
phrase contained an average (SD) of 1.1 (0.3) errors. The average length of a target
phrase in this experiment was 37 ± 14 characters. The longest phrase was
68 characters, and the shortest was 16 characters. Table 4 shows four examples of
phrases from the experiment.
In each trial, participants were instructed to correct errors in the “input phrase”
so that it matched the “target phrase” using the designated editing method. Both the
input phrase and the target phrase were displayed on the screen. The errors in the
input phrase were underlined to minimize the cognitive effort required to identify
errors across conditions, as shown in Fig. 12. The participants were required to correct
errors in their current trial before advancing to their next trial.
Should a participant fail to correct the errors in the current trial, they could use
the undo button to revert the correction and redo it, or use the de facto cursor-
based editing method. We kept the cursor-based method as a fallback in each editing
condition because our JustCorrect techniques were proposed to augment rather than
replace it. We recorded the number of trials corrected by this fallback mechanism in
order to measure the effectiveness of each JustCorrect technique.
Prior to each condition, each participant completed a warm-up session to famil-
iarize themselves with each method. The sentences in the warm-up session were
different from those in the formal test. After the completion of each condition, par-
ticipants took a 3-minute break. The order of the four conditions was counterbalanced
using a balanced Latin Square.
In total, the experiment included: 16 participants × 4 conditions × 60 trials =
3,840 trials.
Fig. 12 A user editing a sentence using JustCorrect-Gesture. The target sentence is displayed at
the top of the screen, and the sentence with errors is displayed below. The underlines show two
errors in the phrase: this → that, making → working. The user is shown gesture typing the word
that to correct the first error
Fig. 13 Mean (95% CI) text correction times for each method for successful trials
8.5 Results
We defined the “text correction time” as the duration from when a sentence was
displayed on the screen to when it was submitted and completely revised. Thus, this
metric conveys the efficiency of each JustCorrect text correction technique.
Figure 13 shows text correction time for trials that were successfully corrected
using the designated editing method in each condition (unsuccessful trials are
Fig. 14 Mean (95% CI) text correction times for the tasks successfully completed on the first
attempt
described below in the next subsection). The mean ± 95% CI of text correc-
tion time was 6.21 ± 0.59 s for the de facto cursor-based technique, 6.05 ± 0.83 s
for JustCorrect-Gesture, 5.62 ± 0.70 s for JustCorrect-Tap, and 10.22 ± 1.14 s for
JustCorrect-Voice. A repeated measures ANOVA showed that the text correction
technique had a significant main effect on overall trial time (F(3, 45) = 71.96, p <
0.001). Pairwise comparisons with Bonferroni correction showed that differences
were statistically significant between all pairs ( p < 0.001) except for JustCorrect-Tap
versus JustCorrect-Gesture ( p = 0.17) and JustCorrect-Gesture versus the cursor-
based technique ( p = 0.67).
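For readers who wish to run a similar analysis, the sketch below shows a repeated measures ANOVA with Bonferroni-corrected pairwise t-tests using statsmodels and scipy. It is not our analysis script; the data frame layout and column names are assumptions.

    from itertools import combinations

    import pandas as pd
    from scipy.stats import ttest_rel
    from statsmodels.stats.anova import AnovaRM
    from statsmodels.stats.multitest import multipletests

    def analyze(df: pd.DataFrame):
        # df: one row per participant x technique with that participant's mean
        # correction time; column names are illustrative assumptions.
        print(AnovaRM(df, depvar="time", subject="participant", within=["technique"]).fit())

        wide = df.pivot(index="participant", columns="technique", values="time")
        pairs = list(combinations(wide.columns, 2))
        pvals = [ttest_rel(wide[a], wide[b]).pvalue for a, b in pairs]
        # Bonferroni correction across all pairwise comparisons.
        reject, p_adj, _, _ = multipletests(pvals, method="bonferroni")
        for (a, b), p, sig in zip(pairs, p_adj, reject):
            print(f"{a} vs {b}: adjusted p = {p:.4f}, significant = {sig}")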
To understand the effectiveness of the algorithm under different conditions, we
analyzed cases that were successfully edited in the first editing attempt. In total,
there were 3,328 such trials among the 3,840 total trials. We grouped these trials by edit
distance between the target sentence and the incorrect sentence. The average text
correction times on different methods are shown in Fig. 14. When the edit distance
was 1, the correction times in the de facto cursor-based technique were close to those in
the gesture-based and tap-based techniques. When the edit distance was 2, 3, or 4,
the gesture- and tap-based techniques were faster than the de facto baseline.
We define the success rate as the percentage of correct trials out of all trials for
a given correction technique. Figure 15 shows success rates across conditions. The
mean ± 95% CI for success rate for each input technique was: 100.0 ± 0% for the de
facto cursor-based technique, 96.2 ± 2.2% for JustCorrect-Gesture, 97.1 ± 0.03%
for JustCorrect-Tap, and 95.1 ± 0.03% for JustCorrect-Voice. A repeated measures
Fig. 15 Success rate by input technique. While not 100%, the three JustCorrect interactions
achieved success rates close to that of the cursor-based interaction
ANOVA showed that text editing technique had a significant effect on the overall
success rate (F(3, 45) = 14.31, p < 0.001). Pairwise comparisons with Bonferroni
correction showed that the differences were significant for JustCorrect-Tap versus
cursor-based, JustCorrect-Gesture versus cursor-based, and JustCorrect-Voice versus
cursor-based (p < 0.01). All other pairwise comparisons were not statistically significant.
At the end of the study, we asked the participants to rate each method on a scale of 1 to
5 (1: dislike, 5: like). As shown in Fig. 16, the median ratings for cursor-based editing,
JustCorrect-Gesture, JustCorrect-Tap, and JustCorrect-Voice were 3.0, 4.0, 5.0, and
2.5, respectively. A non-parametric Friedman test of differences among repeated
measures was carried out to compare the ratings across the four conditions. There was a
significant difference between the methods (χ_r²(3) = 17.29, p < 0.001).
Participants were also asked which method(s) they would like to use during text
entry on their phones. Twelve participants mentioned they would use JustCorrect-Tap
and eight would also like to use JustCorrect-Gesture. Six participants also considered
the de facto cursor-based method useful, especially for revising short words
or character-level errors. Only two participants wanted to use JustCorrect-Voice for
text editing; most participants had privacy concerns about using it in public.
8.6 Discussion
We introduced two projects in this chapter: Type, Then Correct (TTC), and JustCor-
rect. We first discuss the user study results of TTC, followed by the discussion of
JustCorrect.
In TTC, Drag-n-Throw performed the fastest across correction types. Moreover,
its performance was not affected by how far away the error was
(Fig. 8). Magic Key also achieved reasonable speeds across different correction types.
For near-errors within the last three words, it even surpassed Drag-n-Throw, because
the errors would be highlighted and corrected with just two taps. For far-errors,
participants had to drag atop the Magic Key a few times to highlight the desired
error, leading to longer correction times. Drag-n-Drop performed the slowest over all
phrases, which was mainly caused by the insertion corrections. As shown in Fig. 9,
it was faster than the de facto cursor-based method for typos and word changes,
but significantly slower than other interactions for insertions. To insert a correction
between two words, a user had to highlight the narrow space between those words.
Many participants spent a lot of time adjusting their fingers in order to highlight
the desired space. They also had to redo the correction if they accidentally made a
substitution instead of an insertion. Our undo key proved to be vital in such cases. To
evaluate the performance of our algorithm in more realistic scenarios, we analyzed
the results from text composition tasks. Drag-n-Throw achieved a success rate of
87.9%. A typical failure occurred when two possible error candidates were too close to each
other. For example, if the user wanted to insert “the” in the phrase “I left keys in
room,” there were two possible positions (before keys and before room), but only
one of them would be corrected. Magic Key achieved a higher success rate of 97.0%,
as it searched every possible error in the text.
As for participants’ subjective preferences, 12 of 20 participants liked Magic
Key the most. The major reason was convenience: all the actions were done on the
keyboard. P1 commented, “Just one button handles everything. I don’t need to touch
the text anymore. It was also super intelligent. I am lazy, and that’s why I enjoyed it
so much.” Another reason was that Magic Key provided feedback (highlights) before
committing the correction, making the user confident about the target of their actions.
As P4 pointed out, “It provides multiple choices, and the uncertain feeling is gone.”
The main critique of Magic Key was about the dragging interaction required to move
among error candidates. P5 commented: “If the text is too long and the error is far
away, I have to drag a lot to highlight the error. Also, the button is kinda small, and
hard to drag.”
Interestingly, we found that all three participants above age 40 had positive feed-
back about the two intelligent correction techniques, and negative feedback about the
de facto cursor-based method. P14, aged 52, commented, “I dislike the cursor-based
method most. I have a big finger, and it is hard to tap the text precisely. Throw is
easy and works great. I also like Magic Key, because I don’t need to interact with the
text.” Older adults are known to perform touch screen interactions more slowly and
with less precision than younger adults [14], and the intelligent correction techniques
might benefit them by removing the requirement of precise touch. Moreover, people
walking on the street or holding the phone with one hand might also benefit from the
interactions, because touching precisely is difficult in such situations.
In JustCorrect, our investigation led to the following findings. First, both JustCorrect-Gesture and JustCorrect-Tap showed good potential as correction methods: each successfully corrected more than 95% of the input phrases and reduced average correction time compared with the de facto cursor-based method. These two methods were especially beneficial for correcting sentences with a large editing distance relative to the target sentence. As
shown in Fig. 14, for sentences with an editing distance of 4, JustCorrect-Gesture
and JustCorrect-Tap reduced correction time by nearly 30% over the cursor-based
method.
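To make the "editing distance" above concrete, here is a minimal sketch of a word-level Levenshtein edit distance between an entered phrase and its target. Whether JustCorrect counts edits at the word or character level is not stated in this excerpt, so the word-level choice (and the function name) is an illustrative assumption.

```python
def edit_distance(entered: str, target: str) -> int:
    """Word-level Levenshtein distance: the minimum number of word
    insertions, deletions, and substitutions to turn `entered` into `target`."""
    a, b = entered.split(), target.split()
    # dp[i][j] = edits needed to turn the first i words of a into the first j words of b
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete a word
                           dp[i][j - 1] + 1,         # insert a word
                           dp[i - 1][j - 1] + cost)  # substitute a word
    return dp[-1][-1]

# Example: two word insertions are needed, so the edit distance is 2.
assert edit_distance("I left keys in room", "I left my keys in the room") == 2
```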
Second, JustCorrect-Gesture and JustCorrect-Tap exhibited their own pros and
cons. Participants had differing preferences: users who were familiar with gesture
typing liked JustCorrect-Gesture because it did not require pressing the editing but-
ton, while other participants preferred JustCorrect-Tap because they mostly used tap
typing for text entry. JustCorrect-Gesture saved the editing button-tap compared to
JustCorrect-Tap because gesture typing naturally signals the end of entering a word
by lifting the finger. On the other hand, JustCorrect-Gesture reserves gesture typing for correction only, which limits its scope of use.
Third, in contrast to the promising performance of JustCorrect-Gesture and JustCorrect-Tap, JustCorrect-Voice under-performed. JustCorrect requires the user to first enter the editing word, but the existing speech-to-text recognition engine often performed poorly when recognizing a single word in isolation, especially a short one. We found that common words such as for, to, and are were challenging to enter by voice, which made it difficult to correct phrases with errors on these words.
An exciting aspect of both projects is the use of machine learning to automate text correction and enable interactions that were not possible before. One advantage of deep learning over traditional n-gram-based methods is that the language models can incorporate longer context, which enables them to “understand” the user's intention on a deeper level.
On the basis of our work here, we propose four possible future directions. (1) Punctuation handling: neither TTC nor JustCorrect currently handles punctuation, so errors like “lets” (let’s) cannot yet be corrected. (2) Better feedback mechanics to reduce uncertainty about the outcome: although the interactions were intelligent and usually did the right thing, they were not transparent to the user, and the outcome of an interaction was not always obvious. For example, participants felt unconfident when using Drag-n-Throw because there was no feedback about where the correction would be applied; highlighting the text surrounding the throw position could provide such cues. (3) Multilingual correction support: the two interaction techniques could be applied to other languages as well, such as Chinese. (4) Interactions beyond keyboard correction: the concept can also be applied to other correction scenarios, such as voice input and handwriting.
9 Conclusion
In this chapter, we demonstrated how artificial intelligence could be applied to the text
correction interaction on touch screens. The first project, Type, Then Correct (TTC),
includes three novel interaction techniques with one common concept: to type the
correction and apply it to the error, without needing to reposition the text cursor or use
backspace, both of which break the typing flow and slow touch-based text entry. The second project, JustCorrect, took the concept further by removing the need to manually specify the error position. Both projects utilized machine learning algorithms from the NLP field to help identify the likely error text. The user studies showed that both
TTC and JustCorrect were significantly faster than de facto cursor-based correction
methods and garnered more positive user feedback. They provide examples of how,
by breaking from the desktop paradigm of arrow keys, backspacing, and mouse-based
cursor positioning, we can rethink text entry on mobile touch devices and develop
novel methods better suited to this paradigm.
References
1. Agirre E, Cer D, Diab M, Gonzalez-Agirre A, Guo W (2013) *SEM 2013 shared task: semantic
textual similarity. In: Second joint conference on lexical and computational semantics (*SEM),
Volume 1: Proceedings of the main conference and the shared task: semantic textual similarity,
Association for Computational Linguistics, Atlanta, Georgia, USA, pp 32–43. https://www.
aclweb.org/anthology/S13-1004
2. Apple (2018) About the keyboards settings on your iphone, ipad, and ipod touch. https://
support.apple.com/en-us/HT202178. Accessed 22 Aug 2019
3. Arif AS, Stuerzlinger W (2013) Pseudo-pressure detection and its use in predictive text entry on
touchscreens. In: Proceedings of the 25th australian computer-human interaction conference:
augmentation, application, innovation, collaboration, Association for Computing Machinery,
New York, NY, USA, OzCHI ’13, p 383–392. https://doi.org/10.1145/2541016.2541024
4. Arif AS, Kim S, Stuerzlinger W, Lee G, Mazalek A (2016) Evaluation of a smart-restorable
backspace technique to facilitate text entry error correction. In: Proceedings of the 2016 CHI
conference on human factors in computing systems, Association for Computing Machinery,
New York, NY, USA, CHI ’16, pp 5151–5162. https://doi.org/10.1145/2858036.2858407
5. Microsoft Azure (2019) Text to speech API. https://azure.microsoft.com/en-us/services/cognitive-
services/text-to-speech/. Accessed 25 Aug 2019
6. Bahdanau D, Cho K, Bengio Y (2016) Neural machine translation by jointly learning to align
and translate. arXiv:1409.0473
7. Benko H, Wilson AD, Baudisch P (2006) Precise selection techniques for multi-touch screens.
In: Proceedings of the SIGCHI conference on human factors in computing systems, Association
for Computing Machinery, New York, NY, USA, CHI ’06, pp 1263–1272. https://doi.org/10.
1145/1124772.1124963
8. Brooke J (2013) SUS: a retrospective. J Usability Studies 8(2):29–40
9. Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y
(2014) Learning phrase representations using RNN encoder–decoder for statistical machine
translation. In: Proceedings of the 2014 conference on empirical methods in natural language
processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, pp 1724–1734.
https://doi.org/10.3115/v1/D14-1179. https://www.aclweb.org/anthology/D14-1179
10. Cui W, Zhu S, Zhang MR, Schwartz A, Wobbrock JO, Bi X (2020) Justcorrect: Intelligent
post hoc text correction techniques on smartphones. In: Proceedings of the 33rd annual ACM
symposium on user interface software and technology, Association for Computing Machinery,
New York, NY, USA, UIST ’20, pp 487–499. https://doi.org/10.1145/3379337.3415857
11. Davies M (2018) The corpus of contemporary American English: 1990–present
12. Dhakal V, Feit AM, Kristensson PO, Oulasvirta A (2018) Observations on Typing from 136
Million Keystrokes, Association for Computing Machinery, New York, NY, USA, pp 1–12.
https://doi.org/10.1145/3173574.3174220
13. Erk K (2012) Vector space models of word meaning and phrase meaning: a survey. Lang Ling
Compass 6(10):635–653
14. Findlater L, Froehlich JE, Fattal K, Wobbrock JO, Dastyar T (2013) Age-related differences
in performance with touchscreens compared to traditional mouse input. In: Proceedings of
the SIGCHI conference on human factors in computing systems, Association for Computing
Machinery, New York, NY, USA, CHI ’13, pp 343–346. https://doi.org/10.1145/2470654.
2470703
15. Fitzmaurice G, Khan A, Pieké R, Buxton B, Kurtenbach G (2003) Tracking menus. In: Pro-
ceedings of the 16th Annual ACM symposium on user interface software and technology,
Association for Computing Machinery, New York, NY, USA, UIST ’03, pp. 71–79. https://doi.
org/10.1145/964696.964704
16. Fowler A, Partridge K, Chelba C, Bi X, Ouyang T, Zhai S (2015) Effects of language mod-
eling and its personalization on touchscreen typing performance. In: Proceedings of the 33rd
Annual ACM conference on human factors in computing systems, Association for Computing
Machinery, New York, NY, USA, CHI ’15, pp. 649–658. https://doi.org/10.1145/2702123.
2702503
17. Frederick BN (1999) Fixed-, random-, and mixed-effects anova models: a user-friendly guide
for increasing the generalizability of anova results. Advances in social science methodology,
Stamford. JAI Press, CT, pp 111–122
18. Fuccella V, Isokoski P, Martin B (2013) Gestures and widgets: performance in text editing
on multi-touch capable mobile devices. In: Proceedings of the SIGCHI conference on human
factors in computing systems, ACM, New York, NY, USA, CHI ’13, pp 2785–2794. https://
doi.org/10.1145/2470654.2481385, http://doi.acm.org/10.1145/2470654.2481385
19. Heafield K (2011) KenLM: faster and smaller language model queries. In: Proceedings of the
EMNLP 2011 sixth workshop on statistical machine translation, Edinburgh, Scotland, United
Kingdom, pp 187–197. https://kheafield.com/papers/avenue/kenlm.pdf
20. Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6(2):65–70.
http://www.jstor.org/stable/4615733
21. Holz C, Baudisch P (2011) Understanding touch. In: Proceedings of the SIGCHI conference
on human factors in computing systems, Association for Computing Machinery, New York,
NY, USA, CHI ’11, pp 2501–2510. https://doi.org/10.1145/1978942.1979308
22. Exideas Inc (2018) MessagEase - the smartest touch screen keyboard. https://www.exideas.com/ME/
index.php. Accessed 22 Aug 2019
23. Grammarly Inc (2020) Grammarly keyboard. https://en.wikipedia.org/wiki/Grammarly. Accessed May
2020
24. Islam A, Inkpen D (2009) Real-word spelling correction using google web it 3-grams. In:
Proceedings of the 2009 conference on empirical methods in natural language processing:
Volume 3 - Volume 3, Association for Computational Linguistics, USA, EMNLP ’09, pp
1241–1249
25. Isokoski P, Martin B, Gandouly P, Stephanov T (2010) Motor efficiency of text entry in a
combination of a soft keyboard and unistrokes. In: Proceedings of the 6th Nordic confer-
ence on human-computer interaction: extending boundaries, ACM, New York, NY, USA,
NordiCHI ’10, pp 683–686. https://doi.org/10.1145/1868914.1869004. http://doi.acm.org/10.
1145/1868914.1869004
26. Kim Y, Jernite Y, Sontag D, Rush AM (2016) Character-aware neural language models. In:
Proceedings of the Thirtieth AAAI conference on artificial intelligence, AAAI Press, AAAI’16,
pp 2741–2749
27. Komninos A, Nicol E, Dunlop MD (2015) Designed with older adults to support better error correction in smartphone text entry: the MaxieKeyboard. In: Proceedings of the 17th interna-
tional conference on human-computer interaction with mobile devices and services adjunct,
Association for Computing Machinery, New York, NY, USA, MobileHCI ’15, pp 797–802.
https://doi.org/10.1145/2786567.2793703
28. Komninos A, Dunlop M, Katsaris K, Garofalakis J (2018) A glimpse of mobile text entry errors
and corrective behaviour in the wild. In: Proceedings of the 20th international conference
on human-computer interaction with mobile devices and services adjunct, Association for
Computing Machinery, New York, NY, USA, MobileHCI ’18, pp 221–228. https://doi.org/10.
1145/3236112.3236143
29. Kristensson PO, Zhai S (2004) Shark2: a large vocabulary shorthand writing system for pen-
based computers. In: Proceedings of the 17th annual ACM symposium on user interface soft-
ware and technology, ACM, New York, NY, USA, UIST ’04, pp 43–52. https://doi.org/10.
1145/1029632.1029640. http://doi.acm.org/10.1145/1029632.1029640
30. Leiva LA, Sahami A, Catala A, Henze N, Schmidt A (2015) Text entry on tiny qwerty soft
keyboards. In: Proceedings of the 33rd annual ACM conference on human factors in computing
systems, Association for Computing Machinery, New York, NY, USA, CHI ’15, pp 669–678.
https://doi.org/10.1145/2702123.2702388
31. Levenshtein VI (1965) Binary codes capable of correcting deletions, insertions, and reversals.
Soviet Phys Doklady 10:707–710
32. Limpert E, Stahel WA, Abbt M (2001) Log-normal distributions across the sciences: keys
and clues: on the charms of statistics, and how mechanical models resembling gambling
machines offer a link to a handy way to characterize log-normal distributions, which can pro-
vide deeper insight into variability and probability–normal or log-normal: that is the question.
BioScience 51(5):341–352. https://doi.org/10.1641/0006-3568(2001)051[0341:LNDATS]2.
0.CO;2. https://academic.oup.com/bioscience/article-pdf/51/5/341/26891292/51-5-341.pdf
33. Littell R, Henry P, Ammerman C (1998) Statistical analysis of repeated measures data using
sas procedures. J Animal Sci 76(4):1216–1231. https://doi.org/10.2527/1998.7641216x
34. Google LLC (2020) Gboard. https://en.wikipedia.org/wiki/Gboard. Accessed May 2020
35. MacKenzie IS, Soukoreff RW (2002) A character-level error analysis technique for evaluat-
ing text entry methods. In: Proceedings of the second nordic conference on human-computer
interaction, Association for Computing Machinery, New York, NY, USA, NordiCHI ’02, pp
243–246. https://doi.org/10.1145/572020.572056
36. MacKenzie IS, Soukoreff RW (2002) Text entry for mobile computing: Models
and methods, theory and practice. Hum-Comput Interact 17(2-3):147–198. https://doi.org/
10.1080/07370024.2002.9667313. https://www.tandfonline.com/doi/abs/10.1080/07370024.
2002.9667313
37. Mahoney M (2011) About text8 file. http://mattmahoney.net/dc/textdata.html. Accessed May
2020
38. Mikolov T, Chen K, Corrado GS, Dean J (2013) Efficient estimation of word representations
in vector space. arXiv:1301.3781
39. Ng HT, Wu SM, Wu Y, Hadiwinoto C, Tetreault J (2013) The CoNLL-2013 shared task on
grammatical error correction. In: Proceedings of the seventeenth conference on computational
natural language learning: shared task, Association for Computational Linguistics, Sofia, Bul-
garia, pp 1–12. https://www.aclweb.org/anthology/W13-3601
40. Ng HT, Wu SM, Briscoe T, Hadiwinoto C, Susanto RH, Bryant C (2014) The CoNLL-2014
shared task on grammatical error correction. In: Proceedings of the eighteenth conference
on computational natural language learning: shared task, Association for Computational Lin-
guistics, Baltimore, Maryland, pp 1–14. https://doi.org/10.3115/v1/W14-1701. https://www.
aclweb.org/anthology/W14-1701
41. Kristensson PO, Vertanen K (2011) Asynchronous multimodal text entry using speech and
gesture keyboards. In: Proceedings of the international conference on spoken language pro-
cessing, pp 581–584
42. Palin K, Feit A, Kim S, Kristensson PO, Oulasvirta A (2019) How do people type on mobile
devices? Observations from a study with 37,000 volunteers. In: Proceedings of 21st interna-
tional conference on human-computer interaction with mobile devices and services (Mobile-
HCI’19), ACM
43. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein
N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy
S, Steiner B, Fang L, Bai J, Chintala S (2019) Pytorch: an imperative style, high-performance
deep learning library. In: Wallach H, Larochelle H, Beygelzimer A, d’ Alché-Buc F, Fox
E, Garnett R (eds) Advances in neural information processing systems 32, Curran Asso-
ciates, Inc., pp 8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-
high-performance-deep-learning-library.pdf
44. Pennington J, Socher R, Manning C (2014) GloVe: global vectors for word representation.
In: Proceedings of the 2014 conference on empirical methods in natural language processing
(EMNLP), Association for Computational Linguistics, Doha, Qatar, pp 1532–1543. https://
doi.org/10.3115/v1/D14-1162. https://www.aclweb.org/anthology/D14-1162
45. Řehůřek R, Sojka P (2010) Software framework for topic modelling with large corpora. In:
Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks, ELRA,
Valletta, Malta, pp 45–50. http://is.muni.cz/publication/884893/en
46. Ruan S, Wobbrock JO, Liou K, Ng A, Landay JA (2018) Comparing speech and keyboard text
entry for short messages in two languages on touchscreen phones. Proc ACM Interact Mob
Wearable Ubiquitous Technol 1(4). https://doi.org/10.1145/3161187
47. Rubio S, Díaz EM, Martín J, Puente J (2004) Evaluation of subjective mental workload: a
comparison of swat, nasa-tlx, and workload profile methods. Appl Psychol 53:61–86
48. Schmidt D, Block F, Gellersen H (2009) A comparison of direct and indirect multi-touch
input for large surfaces. In: Gross T, Gulliksen J, Kotzé P, Oestreicher L, Palanque P, Prates
RO, Winckler M (eds) Human-computer interaction - INTERACT 2009. Springer, Berlin, pp
582–594
49. Sears A, Shneiderman B (1991) High precision touchscreens: design strategies and comparisons
with a mouse. Int J Man Mach Stud 34:593–613
50. Sim KC (2010) Haptic voice recognition: Augmenting speech modality with touch events for
efficient speech recognition. In: 2010 IEEE spoken language technology workshop, pp 73–78.
https://doi.org/10.1109/SLT.2010.5700825
51. Sim KC (2012) Speak-as-you-swipe (says): A multimodal interface combining speech and ges-
ture keyboard synchronously for continuous mobile text entry. In: Proceedings of the 14th ACM
international conference on multimodal interaction, ACM, New York, NY, USA, ICMI ’12,
pp 555–560. https://doi.org/10.1145/2388676.2388793. http://doi.acm.org/10.1145/2388676.
2388793
52. Sindhwani S, Lutteroth C, Weber G (2019) Retype: Quick text editing with keyboard and gaze.
In: Proceedings of the 2019 CHI conference on human factors in computing systems, ACM,
New York, NY, USA, CHI ’19, pp 203:1–203:13. https://doi.org/10.1145/3290605.3300433.
http://doi.acm.org/10.1145/3290605.3300433
53. Soukoreff RW, MacKenzie IS (2004) Recent developments in text-entry error rate measure-
ment. In: CHI ’04 extended abstracts on human factors in computing systems, Association for
Computing Machinery, New York, NY, USA, CHI EA ’04, pp 1425–1428. https://doi.org/10.
1145/985921.986081
54. Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks.
In: Proceedings of the 27th international conference on neural information processing systems
- Volume 2, MIT Press, Cambridge, MA, USA, NIPS’14, pp 3104–3112
55. Vertanen K, Memmi H, Emge J, Reyal S, Kristensson PO (2015) Velocitap: Investigating fast
mobile text entry using sentence-based decoding of touchscreen keyboard input. In: Proceed-
ings of the 33rd annual ACM conference on human factors in computing systems, Association
for Computing Machinery, New York, NY, USA, CHI ’15, pp 659–668. https://doi.org/10.
1145/2702123.2702135
56. Vogel D, Baudisch P (2007) Shift: a technique for operating pen-based interfaces using touch.
In: Proceedings of the SIGCHI conference on human factors in computing systems, Association
for Computing Machinery, New York, NY, USA, CHI ’07, pp 657–666. https://doi.org/10.1145/
1240624.1240727
57. Wagner RA, Fischer MJ (1974) The string-to-string correction problem. J ACM (JACM)
21(1):168–173
58. Weidner K (2018) Hackers keyboard. http://code.google.com/p/hackerskeyboard/. Accessed
22 Aug 2019
59. Wobbrock JO, Myers BA (2006) Analyzing the input stream for character-level errors in
unconstrained text entry evaluations. ACM Trans Comput-Hum Interact 13(4):458–489. https://
doi.org/10.1145/1188816.1188819
60. Xie Z, Avati A, Arivazhagan N, Jurafsky D, Ng A (2016) Neural language correction with
character-based attention. arXiv:1603.09727
61. Young T, Hazarika D, Poria S, Cambria E (2018) Recent trends in deep learning based natural
language processing [review article]. IEEE Comput Intell Mag 13:55–75
62. Zesch T (2012) Measuring contextual fitness using error contexts extracted from the Wikipedia
revision history. In: Proceedings of the 13th conference of the European chapter of the asso-
ciation for computational linguistics, Association for Computational Linguistics, Avignon,
France, pp 529–538. https://www.aclweb.org/anthology/E12-1054
63. Zhai S, Kristensson PO (2003) Shorthand writing on stylus keyboard. In: Proceedings of
the SIGCHI conference on human factors in computing systems, Association for Computing
Machinery, New York, NY, USA, CHI ’03, pp 97–104. https://doi.org/10.1145/642611.642630
64. Zhai S, Kristensson PO (2012) The word-gesture keyboard: reimagining keyboard interaction.
Commun ACM 55(9):91–101. https://doi.org/10.1145/2330667.2330689
65. Zhang MR, Wobbrock JO (2020) Gedit: keyboard gestures for mobile text editing. In: Proceed-
ings of graphics interface (GI ’20), Canadian information processing society, Toronto, Ontario,
GI ’20, pp 97–104
66. Zhang MR, Wen H, Wobbrock JO (2019) Type, then correct: intelligent text correction tech-
niques for mobile text entry using neural networks. In: Proceedings of the 32nd annual ACM
symposium on user interface software and technology, Association for Computing Machinery,
New York, NY, USA, UIST ’19, pp 843–855. https://doi.org/10.1145/3332165.3347924
67. Zhang X, Zhao J, LeCun Y (2015) Character-level convolutional networks for text classifi-
cation. In: Proceedings of the 28th international conference on neural information processing
systems - Volume 1, MIT Press, Cambridge, MA, USA, NIPS’15, pp 649–657
68. Zhu S, Luo T, Bi X, Zhai S (2018) Typing on an invisible keyboard. In: Proceedings of the
2018 CHI Conference on human factors in computing systems, Association for Computing
Machinery, New York, NY, USA, CHI ’18, pp 1–13. https://doi.org/10.1145/3173574.3174013
Deep Touch: Sensing Press Gestures from
Touch Image Sequences
1 Introduction
Fig. 1 An illustration of a press gesture: a user’s finger contacts a touch sensor and deforms as the
force behind it increases (top); this deformation is observed as a unilateral expansion on the sensor
image (bottom)
Touch interaction on current devices is commonly reduced to three basic gestures: tapping, long pressing (touch and hold), and scrolling (panning, dragging, flicking, and surface-stroke
gestures, etc.), which are modelled using a set of three heuristics: (1) if the distance
from the initial contact location exceeds a hysteresis threshold, the gesture is a scroll;
(2) if the duration since the initial contact exceeds a time threshold, the gesture is a
long press; (3) otherwise, the gesture is a tap when the contact is lifted. Although this
model has nurtured a broad and successful design space for interaction, it belies the
rich signal that touch sensors produce. In particular, capacitive touch sensors capture
an ‘image’ of the finger’s contact shape that can reveal the evolution of a finger’s
posture during its contact (Fig. 1).
As long press relies on a latency threshold, it is the least direct, discover-
able, usable, or expressive of the three gestures described. Force sensing has been
explored—both academically and commercially—as a method for rectifying these
problems by creating a force press gesture that is directly connected to an active
parameter of the user’s input: their finger’s force. However, force sensing requires
additional hardware that suffers from practical challenges in its cost and integration.
Observations and analyses of the human finger’s biomechanics and the underlying
touch sensor technology (capacitive sensing) suggest a complementary approach to
the latency-based long press. In many cases, a press gesture is manifest in the subtle
signals that are captured by the image sequence from a capacitive touch sensor: as a
user increases their finger’s force on a screen, its contact mass increases unilaterally
as the strain on the most distal finger joint increases and the finger rolls downward
(Figs. 1 and 6).
These raw images are difficult to analyse heuristically due to the high dimension-
ality of the data, temporal dependencies in the gesture, sensor noise under different
environmental conditions, and the range of finger sizes and postures that may be
used. However, modern neural learning techniques present an opportunity to analyse
touch sensor images with a data-driven approach to classification that is robust to
these variances.
We call this approach deep touch: a neural model for sensing touch gestures
based on the biomechanical interaction of a user’s finger with a touch sensor. To
differentiate between gestures we use a neural network to combine complex spatial
(convolutional) features from individual images with temporal (recurrent) features
across a sequence of touch sensor images. We identify a set of biomechanical pat-
terns to shape the learned features and minimise the number of parameters so that
the resultant model can be used in real time without impairing system latency or
responsiveness. Although this does not allow force sensing per se, it can recognise a
user’s intention to press as a discrete gesture.
In this chapter, we first present an overview of touch sensing hardware, finger-
surface interaction, and the touch input design space. The overview highlights a
weakness of the current touch interaction system: the lack of a direct pressing gesture.
We then describe the deep touch model: the biomechanical patterns, neural model
design, data collection methodology, and training procedure. Finally, we describe
how this model was integrated into the Android gesture classification algorithm as
part of Google Pixel 4 and 5 devices without incurring additional input latency.
Of the techniques for detecting the presence of a human finger near an object [22, 55,
63], the most common for small-medium size mobile devices is Projected Capacitive
Touch (PCT or ‘p-cap’). PCT is based on the principle of capacitive coupling [7, 55]:
when two conductive objects (electrodes) are brought close together, they can hold a
charge between them—their capacitance—which becomes disrupted when another
conductive object encroaches. The capacitance C of two such electrodes, separated
by a dielectric material (usually glass or plastic), is given by
$$C = k \varepsilon_0 \frac{A}{d}, \qquad (1)$$
where k is the relative permittivity of the dielectric, ε₀ is the permittivity of free space, A is the overlapping area of the electrodes, and d is the distance between them.
Fig. 2 A projected
capacitive touch (PCT)
sensor: two electrodes
separated by a dielectric. A
capacitive coupling is created
with a field projected from a
drive electrode and measured
on a sense electrode. This
field is disrupted when
another conductive object
(e.g. a finger) comes close
1 Sensing C is known as sensing the mutual capacitance between electrodes. In a related (but
more limited) technique, the self capacitance C of each electrode is measured individually [7, 55].
Fig. 3 A touch sensor’s response to a 25 mm metal coin (left) and with a 0.5 mm plastic shim under
its base (right). Each cell is 4.5 mm²; brighter cells indicate higher C values
Touch sensors are protected by a covering glass and are tuned to maximise their
sensitivity to objects touching this glass—but they do not require an object to be in
direct contact to produce a signal. Figure 3 (right) shows the sensor’s response with a
plastic shim lifting the coin at a small angle. The signal produced is a smooth gradient
as the distance between the touch sensor and the coin increases. Some sensor designs
can amplify this effect for sensing objects that are up to 30 cm from the sensor (e.g.
[30]).
PCT does not inherently detect the force applied to a sensor as changes in force do
not normally alter the capacitive properties of an object. That is, orthogonal forces
applied to the finger in Fig. 2 will not change C.2 Rather, the force on a touch
surface is typically measured using a layer of Force Sensing Resistors (FSRs) or strain
gauges that change their electrical resistance with forces upon them that deform the
surface [45, 61].
2 If the electrodes are allowed to ‘float’ with respect to each other, then changes in the distance
between them from external forces can be detected by Eq. 1.
Serina and colleagues [48, 49] measured and modelled the vertical compression
and contact area responses of a fingertip pressing against a flat surface at different
angles and forces. They found the fingertip pulp was very responsive to changes at
force levels under 1 N, and quickly saturated at higher levels (e.g. 69% of the contact
area at 5.2 N of force was achieved by 1 N). The effects were robust across angles,
and were invariant to subjects’ age and sex. In similar experiments, Birznieks et al.
[10] reported that most of the changes to the finger’s structure occur in the fingertip
pulp, and not between the fingernail and the bone.
Sakai and Shimawaki [47] examined a finger’s contact area at acute angles and
found that contact length (along the axis of the finger) increased non-linearly at force
levels under 3 N, and saturated thereafter. The change in contact length between force
levels was more pronounced as the angle of contact became more acute—caused by
a difference in dorsal and proximal pulp compliance.
Goguey et al. [19] characterised finger pitch, roll, and yaw in a series of funda-
mental touch operations (e.g. tapping, dragging, and flicking—see Sect. 2.3) with
each finger and the thumb. They found touch operations generally occurred at a low
pitch (less than 45°), but with significant effects for the digit used and the orientation
of the touch surface. Finger posture has also been studied in specific task contexts—
for example, Azenkot and Zhai [5] reported systematic shifts in touch distribution
patterns with different typing patterns, which were later used to improve text entry
performance [18, 62].
Even when force is not a parameter to an action, touch gestures necessarily convey
a certain force level and profile—particularly for gestures that involve extended
motion. Taher et al. [53] analysed these inherent force profiles (but explicitly not
the force level) for common interactions: tapping, typing, zooming, rotating, and
panning. They found typing and panning were generally characterised by a sharp
increase and decrease in force, with a slightly extended force plateau at the peak for
tapping actions (hypothesised to be a confirmation phase). For gestures that involved
extended interaction, force varied with the distance between the fingers (e.g. when
zooming), and with substantial variation in the profiles that included use of the thumb.
Presuming that the force of a contact can be reliably measured (either with dedi-
cated force sensors or a synthesised approximation using other available sensors),
researchers have experimented with interaction use cases such as continuous input
controls for scrolling [3, 6, 36], zooming [36, 52], selecting between modes of
operation [13], context menus [20, 25, 60], and gesture operations [43, 44].
Researchers have also studied the benefits and limitations of ‘pseudo-force’ indi-
cated by an overt ‘rocking’ or ‘rolling’ gesture [4, 8, 12, 16, 26].
Boceck et al. [11] used a neural network on individual touch sensor images to
estimate static touch force. However, despite limiting their model to index-finger
data at a fixed posture (with the device resting on a flat surface), their model suffered
from substantial variance.
The touch and hold gesture is the weakest of the common touch gestures due to its
indirectness: it is difficult for users to discover or perform if the time threshold is too
long, or prone to misclassification as a tap gesture if the threshold is too short. These
problems are particularly acute on mobile devices where there is strong demand
for providing a wide range of interactive functionality within the limited physical
bounds of the display and input space. Force sensing offers a possible mechanism for
creating a direct press gesture, but has been challenged by the practical difficulties
of providing it in commercial hardware.
The biomechanical interaction between the finger and a touch sensor reviewed in
Sect. 2.2 suggests that there is an opportunity to use the dynamic properties of the
finger to sense a more natural, direct pressing gesture. As touch sensor data does
not inherently contain force information (Eq. 1), it is not possible to quantify the
force at which a user is pressing. However, the temporal changes in the touch sensor
data due to the biomechanical effects should reveal whether the force is qualitatively
changing. The remainder of this chapter describes the design and development of a
deep learning approach to sensing this change to provide a force-based direct press
gesture.
Fig. 5 A tap gesture observed by a touch sensor: the signal values (cell brightness) change sym-
metrically around the centre of mass. The top row shows the raw frames; the bottom row shows
each frame’s difference from the preceding frame
We identify a set of biomechanical patterns in the touch sensor signals to shape the learned features and minimise the size of the model so that it is possible to use the network for real-time inference
within the constraints of a production environment (see Sect. 4).
In the remainder of this section, we detail these biomechanical patterns of touch
sensor signals and the design of the model. We then describe a data collection pro-
cedure for gathering labelled data to train the model and the results of an offline
evaluation.
Although we aim to sense a force-based press gesture with our model, it must be
able to reliably discriminate this gesture from other touch gestures, namely, tapping
and scrolling. We therefore describe analyses for these three gestures in terms of the
features that can be used to discriminate them.
A tap is conceptually the simplest touch interaction: a finger comes into symmetri-
cal contact with a touch sensor, reaches a stable saturation point, and then disengages
(lifts) from it. From the perspective of a touch sensor, the finger’s contact expands
symmetrically around its centre of mass (Fig. 5). There is little modulation of force
after it saturates [53], and therefore the contact size or area will not further increase
after the first few frames.
As a user evolves a contact into a press by applying more force behind their finger,
the biomechanical literature informs us this will be conducive to an asymmetric
contact expansion along the axis of the finger [10, 39, 47–49, 51]. This will be
prominent since touch interactions generally occur at a low pitch [19]. The touch
sensor will therefore observe an expansion of the contact mass in one direction, while
remaining ‘anchored’ at one edge (Figs. 1 and 6).
Fig. 6 A press gesture observed by a touch sensor: the signal values (cell brightness) change
asymmetrically around the centre of mass. The top row shows the raw frames; the bottom row
shows each frame’s difference from the preceding frame
It is important to note this assumes that either tapping gestures will be performed
with a force of less than 1 or 2 N (i.e. before the contact area saturates—[47–49]), or
the speed of the force onset will be significantly different during a tap. However, an
advantage of this design is its invariance to the finger used to make the contact—that
is, although a thumb and little finger will have substantially different contact areas,
the relative changes as force is applied will be similar.
Scrolling interactions—both dragging and flicking [42]—are primarily charac-
terised by their contact displacement. This is facilitated in current systems by a
‘touch slop’ or hysteresis threshold to engage a scrolling mode (and exclude the
possibility of tapping or pressing). Such a threshold is required because a contact
point will rarely be stationary during tap and press interactions: jitter from the user’s
muscle tremor and from the unfolding contact area will retard the centroid location
[56, 57].
This displacement is conveyed in the touch sensor image through changes at
the fringes of a touch. If the image is held at the calculated centroid, then motion
is conveyed through a consistent decrease in signal at one edge, with a matched
increase in signal at the other.
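As a rough illustration of these cues—and only an illustration, not the features the deep touch network actually learns—the numpy sketch below takes two consecutive 7 × 7 sensor frames and reports how much the contact mass grew, where that growth sits relative to the contact centroid (the asymmetry cue for a press), and how far the centroid moved (the displacement cue for a scroll). The function names and frame size are assumptions.

```python
import numpy as np

def centroid(img: np.ndarray) -> np.ndarray:
    """Signal-weighted centre of mass of a touch image, as (row, column)."""
    total = img.sum()
    if total <= 0:
        return np.array([(img.shape[0] - 1) / 2, (img.shape[1] - 1) / 2])
    rows, cols = np.indices(img.shape)
    return np.array([(rows * img).sum() / total, (cols * img).sum() / total])

def frame_cues(prev: np.ndarray, curr: np.ndarray):
    """Illustrative per-frame cues from two consecutive capacitive frames."""
    diff = curr.astype(float) - prev.astype(float)
    growth = np.clip(diff, 0, None)     # cells where the contact mass expanded
    mass_change = growth.sum()          # taps saturate after a few frames; presses keep growing
    # Press cue: growth concentrated on one side of the contact (asymmetric expansion).
    asymmetry = centroid(growth) - centroid(curr.astype(float)) if mass_change > 0 else np.zeros(2)
    # Scroll cue: consistent movement of the contact centroid itself.
    displacement = centroid(curr.astype(float)) - centroid(prev.astype(float))
    return mass_change, asymmetry, displacement
```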
Classification of touch gestures must occur online, in real time, from continuous,
variable-length time series data (i.e. without waiting for the finger to lift from the
touch sensor). That is, the identification of a touch gesture should be made as soon
as the user’s intent is sufficiently expressed—without further perceptible delay in
time (in the case of tapping or pressing) or in space (in the case of scrolling). The
identification also needs to be incremental, and not based on the entire gesture after its
completion. Such a task lends itself to classification with a recurrent neural network
(RNN) (e.g. [21]): touch sensor images are input to the network as they are received,
with the network’s state preserved between each image. The output probability of
each gesture class is updated and compared against a threshold after each image is processed.
3 All convolutional filters have a depth of 16, with ReLU activation between each operation [17].
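As a rough sketch of the incremental classification just described—convolutional features per frame feeding a recurrent cell whose state persists between frames—the PyTorch snippet below shows how classification can proceed frame by frame and stop as soon as one class becomes sufficiently probable. The layer shapes, the GRU cell, and the threshold value are placeholders of ours and do not reproduce the production deep touch model.

```python
import torch
import torch.nn as nn

class TouchGestureRNN(nn.Module):
    """Illustrative conv + recurrent classifier over a stream of 7x7 touch images."""
    def __init__(self, num_classes: int = 5, hidden: int = 32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.rnn = nn.GRUCell(16 * 7 * 7, hidden)   # state carries temporal context between frames
        self.head = nn.Linear(hidden, num_classes)

    def step(self, frame: torch.Tensor, state):
        # frame: (1, 1, 7, 7); state: (1, hidden) or None on the first frame
        state = self.rnn(self.features(frame), state)
        return torch.softmax(self.head(state), dim=-1), state

# Incremental inference: one forward step per incoming sensor image.
model, state, decision = TouchGestureRNN(), None, None
frames = [torch.rand(1, 1, 7, 7) for _ in range(15)]   # stand-in for a sensor stream
for frame in frames:
    probs, state = model.step(frame, state)
    if probs.max().item() > 0.75:                      # classify as soon as intent is clear
        decision = int(probs.argmax())
        break
```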
We collected a data set of labelled capacitive touch sensor image sequences that
were representative of tap, press, and scroll operations on mobile devices in three
tasks: (1) a target selection task, (2) a dragging task, and (3) a search and select task.
The first two tasks asked for mechanical performances of common interactions, while
the third was indicative of actual interaction, and interleaved sequences of scrolling
with tapping.
We collected data for both long press and deep press tasks separately: long press
was defined using the system’s standard time-based threshold of 400 ms, while deep
press was a user-defined force gesture. Similarly, we divided scroll tasks into flicking
and dragging tasks: flicking tasks were scrolling actions of medium-long distances,
while dragging tasks were micro-scrolling movements. The data for each of these
tasks were labelled with their respective categories: tap, deep press, long press, flick,
and drag. The reason for these divisions was to separate different finger motions (e.g.
drag and flick) that generate the same touch gesture (e.g. scroll), and ensure that the
principal features of the underlying motions can be identified by the model.
Nineteen volunteers (11 male; 8 female) with an age range of 18–60 participated in
the experiment and received a gift voucher for doing so. The experiment was run on
a Google Pixel 4 device with a 144 × 67 mm display with a resolution of 3040 ×
1440 px. The touch sensor had a resolution of 36 × 17 cells, and reported a 7 × 7
cell image centred on the cell that contained the calculated touch centroid, at 120 Hz.
selection. The dock was placed at varying distances (35, 53 or 70 px) and directions
(up, down, left, right) from the target. Targets were initially blue, and turned orange
with haptic feedback when they neared the dock—indicating that the task could be
completed by lifting the finger. Data from this task were labelled as drag samples.
The search-and-select task (Fig. 8c) included tapping and flicking actions. Each
task required participants to scroll and locate a circular target (using a combination of
dragging and flicking), and perform a tap on the target. The 12 targets were created
and displayed by randomly sampling their radius (140, 175 or 210 px), horizontal
location (20, 50 or 80% of the screen width), and vertical location (25, 38 or 63% of
the screen height between targets) to ensure interactions were distributed across the
display. Each target had a randomly assigned label between 1 and 12, and the task
proceeded sequentially through them (cued to participants in the top-right corner of
the display). Data from this task were labelled as requested.
3.3.3 Procedure
Participants were encouraged to perform a deep press using their preferred force and
duration (potentially shorter than the current long-press duration), and to perform
the tap, long press, drag, and flick operations as usual.
Each task (one target selection task for each gesture, one dragging task, and one
search-and-select task) was performed in a counter-balanced order, and repeated as
three blocks. The tasks were repeated in each of four postures (counter-balanced):
(1) one-handed using a thumb to interact, (2) two-handed using either thumb to
interact, (3) one-handed using the opposing index finger to interact, and (4) in a
landscape orientation with both hands, using either thumb to interact. In all postures
the phone was hand-held. Participants could rest between blocks.
This procedure was repeated twice: once with a rubber case on the device, and once
without a case. This ensured that touch sensor data were collected in both electrically
grounded and ungrounded conditions—which have substantially different signal and
noise characteristics.
3.4 Training
To train the model described above, the collected data were randomly divided into
training (15 participants) and evaluation (4 participants) sets. No participant con-
tributed samples to both sets. The model’s output was configured to estimate proba-
bilities for five classes: tap, deep press, long press, flick, and drag.
To isolate the portion of each sample that contained the gesture performance, the
trailing 25% of each sequence was discarded. This effectively removed the portion
where the participant’s finger lifted from the touch sensor (i.e. after the gesture had
been performed).
Each training sample was also extended to a minimum duration of 48 ms (6 frames)
by linear interpolation, and truncated to a maximum duration of 120 ms (15 frames).
This prevented certain touch gestures from being discriminable purely by their dura-
tion (in practice we expect to observe more variance in duration than captured in the
laboratory). This processing was not applied to the samples used for evaluation.
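A minimal numpy sketch of this preprocessing, assuming one sample is a (T, 7, 7) array of frames and that the lift-off trim, the interpolation to at least 6 frames, and the truncation to at most 15 frames compose in that order (the exact order is our assumption):

```python
import numpy as np

def preprocess_sample(frames: np.ndarray, min_len: int = 6, max_len: int = 15) -> np.ndarray:
    """Prepare one training sample of shape (T, 7, 7) as described above."""
    frames = frames[: max(1, int(round(len(frames) * 0.75)))]   # discard the trailing 25% (finger lift)
    if len(frames) < min_len:
        # Stretch short samples to min_len frames by linear interpolation over time.
        old_t = np.linspace(0.0, 1.0, num=len(frames))
        new_t = np.linspace(0.0, 1.0, num=min_len)
        flat = frames.reshape(len(frames), -1)
        stretched = np.stack([np.interp(new_t, old_t, flat[:, i]) for i in range(flat.shape[1])], axis=1)
        frames = stretched.reshape(min_len, *frames.shape[1:])
    return frames[:max_len]                                     # truncate long samples

sample = np.random.rand(30, 7, 7)
assert preprocess_sample(sample).shape == (15, 7, 7)
```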
We used the summed cross-entropy across each sequence as the loss function
to minimise, with a linear temporal weight. That is, given a sequence of frames t ∈
[1, T] with a true class distribution at each frame p_t, and a predicted class distribution at each frame q_t, the loss over the classes X was

$$L(p, q, T) = -\sum_{t=1}^{T} \frac{t}{T(1+T)/2} \sum_{x \in X} p_t(x)\,\log q_t(x).$$
As with other weighted cross-entropy methods, this encourages the model to produce
classifications with an increasing probability for the correct class as input is received
[2, 35]. However, in our formulation the weights always summed to 1 in order to
make the total loss invariant to the length of the sequence, and avoid a potential bias
in the model towards classes with shorter sequences.
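The loss above is straightforward to state in code; this numpy sketch is illustrative (the chapter does not show training code) and assumes p and q are (T, |X|) arrays of per-frame true and predicted class distributions:

```python
import numpy as np

def temporal_cross_entropy(p: np.ndarray, q: np.ndarray) -> float:
    """Cross-entropy summed over a sequence with linearly increasing frame weights.

    Frame t (1-based) is weighted by t / (T * (T + 1) / 2), so later frames count
    more and the weights always sum to 1, regardless of sequence length."""
    T = p.shape[0]
    weights = np.arange(1, T + 1) / (T * (T + 1) / 2)
    per_frame = -(p * np.log(q + 1e-12)).sum(axis=1)   # cross-entropy at each frame
    return float((weights * per_frame).sum())
```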
To reflect the temporal ambiguity in the sequence, the true class distribution was
defined at each frame with a logistic function. As the first few frames of a sequence
for all classes are likely to be substantially similar (i.e. at the moments a finger first
contacts the touch sensor), it is unreasonable to claim there is a high likelihood in
the sequence’s ultimate classification for such frames (i.e. with a one-hot encoded
probability distribution). Similarly, it is unreasonable to penalise the model with
a high cross-entropy if it does not produce a confident prediction at these early
frames. Therefore, the distribution pt was defined to start at 1/|X| for all classes,
and transition towards either 1 or 0, depending on the true label X for the sequence
⎧
⎪
⎪ 1
⎨ x = X.
(|X| − 1)e−t+1 + 1
pt (x) = −t+1 (2)
⎪
⎪ e
⎩ x = X.
(|X| − 1)e−t+1 + 1
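And a small sketch of Eq. 2, with checks that the distribution is uniform at t = 1 and converges to the true class; the 1-based frame index and the function name are our conventions:

```python
import numpy as np

def target_distribution(t: int, num_classes: int, true_class: int) -> np.ndarray:
    """Per-frame soft label from Eq. 2 (t is the 1-based frame index)."""
    decay = np.exp(-(t - 1))
    denom = (num_classes - 1) * decay + 1
    p = np.full(num_classes, decay / denom)   # non-true classes share the decaying mass
    p[true_class] = 1.0 / denom               # the true class absorbs the remainder
    return p

assert np.allclose(target_distribution(1, 5, 2), np.full(5, 0.2))   # uniform at t = 1
assert target_distribution(10, 5, 2)[2] > 0.99                      # nearly one-hot later
```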
Table 1 Overall model accuracy and deep press precision/recall for the model component ablations

                               Accuracy (%)   Deep press precision (%)   Deep press recall (%)
Complete model                 83             88                         78
Without row/column filters     73             69                         48
Without temporal labels        76             66                         67
Defining the true class distribution in this manner also helps calibrate the output
probabilities and avoid spurious values in the first few frames during inference.
3.5 Results
To verify our patterns of axial change in sensor images and demonstrate the impor-
tance of using temporal weights in the loss function, we conducted ablation studies
with three model variations: (1) the complete model, as described above; (2) the
model trained without the ‘row’ and ‘column’ convolutional filters; and (3) the model
trained without the temporal labels (Eq. 2). Table 1 shows the overall accuracy and
deep press precision/recall for these three models: the removal of the row/column
filters or the temporal weights has a substantial negative effect on the model’s per-
formance.
Table 2 shows the confusion matrix for the evaluation data set, with an overall
accuracy of 83%. When considering deep press as a binary class (i.e. deep press vs.
not-deep-press), the overall accuracy is 95% with a precision of 89% and a recall of
75%.
In general, there is good separation between the classes, with the primary areas
of confusion being between deep press and the long press/drag classes. However,
a significant caveat with the reported deep press accuracy is the lack of feedback
given to participants during the data collection procedure. The collected dataset was
deliberately harder to classify than gestures with feedback will be in practice. That
is, any action on the deep press targets was accepted and labelled as such, without constraint or validation. There are likely to be poor samples in the data from either
accidental touches or postures that do not produce a distinct ‘press’ (e.g. fingers
approaching orthogonal to the display).
Creating a feedback loop would give users the opportunity to learn the distin-
guishing characteristics of the gesture, and drive them towards discriminating their
own actions [31]. This issue is addressed in the following section.
The prior section demonstrated that temporal changes in touch sensor images convey
distinct signals that can be used to discern a force-based press gesture. However, the
deep touch model does not eliminate the need for heuristic classification of touch
gestures as not all touch intentions involve predictable finger-based interactions (e.g.
a conductive stylus or certain finger postures would not exhibit the biomechanical
properties described). Rather, the model provides a method for accelerating the
recognition of a user’s intentions when they are clear from the contact posture.
In this section we describe incorporating the deep touch model into the Android
input system to enable its practical use [41]. This involves combining the neural
deep touch model with the existing heuristic classification algorithm—allowing the
neural model’s signals to accelerate classification when they become manifest in the
touch sensor data, but falling back to traditional classification for unusual postures
or when there is ambiguity in the signal. We then describe a user study to examine
the practical classification performance of this algorithm.
The Android input system provides signals about touch gestures to applications in the
three categories discussed: tap, press, and scroll.4 Applications typically map press
signals to secondary invocation functions (e.g. context menus or word selection), and
so we decided to supplement this signal with our neural model classification—that
is, allowing a press gesture to be triggered by either a long press or a deep press. This
supports our goal of providing a direct touch gesture, but does not require existing
applications to modify how they handle touch gestures in order to benefit from it.
Combining a probabilistic model with a heuristic algorithm also offers two benefits
to users: (1) lower latency for interactions when the intention is clear from the touch
expression, and (2) the certainty and reliability of a baseline in the presence of input
4 https://developer.android.com/reference/android/view/GestureDetector.
Fig. 9 An overview of the inference algorithm: the neural model is integrated into the heuristic
classification pipeline to provide an acceleration for press gestures when the model’s output indicates
high confidence
ambiguity. We therefore prefaced the existing gesture detection algorithm (steps
2–4 below) with a decision point for the neural model’s classification (Fig. 9):
1. If the neural model indicates the sequence is a ‘deep press’ with a probability
greater than 75%, the gesture is classified as a press. (If the neural model indicates
similar confidence in another gesture classification, then only the heuristics are
used for the remainder of the sequence.)
2. If the duration since the initial contact exceeds a time threshold of 400 ms, the
gesture is a press.
3. If the distance from the initial contact location exceeds a hysteresis threshold, the
gesture is a scroll.
4. Otherwise, the gesture is a tap when the contact is lifted.
Once a sequence has been classified, it is never reconsidered.
The hysteresis threshold for scroll classification was dynamically set based on the
output of the model: 56 px while the neural model’s output was below the probability
threshold of 75% for any of the gesture classes, and 28 px thereafter. That is, the
threshold was doubled while the neural model expressed that the sequence was
ambiguous—as it is common for the touch centroid to shift erroneously while a finger’s contact area expands into the touch sensor during a press.
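The combined decision flow can be summarised in a short sketch. The 75% probability threshold, the 400 ms long-press timeout, and the 56/28 px hysteresis values are taken from the text; the function and variable names, the per-frame probability dictionary, and the state object are illustrative assumptions, not the Android implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class GestureState:
    classified: Optional[str] = None   # once set, the sequence is never reconsidered
    heuristics_only: bool = False      # the model was confident in a non-press class

def update(state: GestureState, probs: Dict[str, float], elapsed_ms: float,
           displacement_px: float, lifted: bool,
           prob_threshold: float = 0.75, long_press_ms: float = 400.0,
           slop_px: float = 56.0) -> Optional[str]:
    """One classification step per touch sensor frame."""
    if state.classified is not None:
        return state.classified

    # Step 1: let the neural model accelerate a press when it is confident.
    if not state.heuristics_only:
        if probs.get("deep press", 0.0) > prob_threshold:
            state.classified = "press"
            return state.classified
        if max(probs.values()) > prob_threshold:
            state.heuristics_only = True   # confident in another class: heuristics from here on

    # Scroll hysteresis is 56 px while the model is still ambiguous, 28 px once it is not.
    hysteresis = slop_px if not state.heuristics_only else slop_px / 2

    if elapsed_ms > long_press_ms:         # step 2: time-based long press
        state.classified = "press"
    elif displacement_px > hysteresis:     # step 3: movement-based scroll
        state.classified = "scroll"
    elif lifted:                           # step 4: tap on finger lift
        state.classified = "tap"
    return state.classified
```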
This algorithm allows the heuristic criteria to identify tap and long-press gestures
when the model is unable to confirm an interaction as a press. However, the model’s
training on five classes allows it to learn the discriminating characteristics of all
possible interactions.
The neural model was implemented using TensorFlow Lite with an on-disk size
of 167 kB, and a runtime memory load of less than 1 MB. When executing the model
on a Google Pixel 4 device, inference time averages 50 µs per input frame. This
allows the model to execute for each image received from the touch sensor (i.e.
at 120 Hz) and report its results to applications without impacting touch latency or
system performance.
4.2 Evaluation
5 Discussion
The deep touch neural model uses biomechanical signals captured by a touch sensor
to identify force-based press gestures from users without dedicated force-sensing
hardware. By extracting spatial and temporal features in the touch image sequence,
the deep touch neural network can enhance the modern touchscreen gesture expe-
rience beyond what conventional heuristics-based gesture classification algorithms
could do alone. The model can be executed in a production environment (delivered
with Google Pixel 4 and Pixel 5 devices) without increasing touch input latency or
impairing system performance.
Instead of creating a new interaction modality, we focussed on improving the user
experience of long press interactions by accelerating them with force-induced deep
press in a unified press gesture. A press gesture has the same outcome as a long press
gesture, whose time threshold remains effective, but provides a stronger connection
between the outcome and the user’s action when force is used. This allowed us to
create a more natural and direct gesture to supplement the conventional, indirect
touch and hold gesture.
Combining a neural model with the existing heuristic method of gesture detection
allows biomechanical information to be identified and utilised when it is present,
but without harming the usability of touch input for other finger postures. However,
this means that the relationship between the heuristic criteria and the probabilistic
output of a neural model needs to be carefully considered. Specifically, in cases of
ambiguity the system may want to err towards the least costly or most consistent
classification for the user, rather than the most accurate.
This is most visible in the confusion between a press and a scroll, where the
expanding contact area of the press gesture erroneously induces a change in the touch
centroid that triggers a heuristic scroll classification. There are further opportunities
here to either tune the scroll hysteresis threshold, or to leverage the neural model to
aid in classification of a scroll gesture as well.
While data curation and training are key to any successful neural network devel-
opment, they are particularly important and challenging in solving low-level HCI
problems with neural networks where the human actions and their effects and feed-
back are linked in a tight interaction loop. Lacking naturally existing datasets that
can be labelled offline, we took a data-elicitation approach in developing the deep
touch model by asking human participants to perform touch gestures intuitively, as they expected them to work, against a set of tasks. However, this data collection procedure for
training samples lacked haptic feedback for the deep press gesture, which might
have affected its offline classification performance (Table 2). Potentially, the training
datasets and the network’s performance can be further enhanced by a closed-loop
data collection with haptic feedback for all touch gestures, with the feedback driven
by the current deep touch model.
Neural models are also well-suited for touch interactions beyond those studied
here—and the human–computer interaction literature has many examples. For exam-
ple, finger rolling [46], ‘pushing’ and ‘pulling’ shear forces [23, 24, 26, 32], and
‘positive’ and ‘negative’ force gestures [43] might be supported with similar biome-
chanical patterns. This style of analysis may also provide insight into perceived input
location issues [28] and improved touch contact location algorithms by capturing
more information about the contact mass.
6 Conclusion
This work demonstrates that combining capacitive touch sensing with modern neural
network algorithms is a practical direction to improve the usability and expressivity
of touch-based user interfaces. The work was motivated by a deep touch hypothesis
that (1) the human finger performs richer expressions on a touch surface than simple
pointing; (2) such expressions are manifested in touch sensor image sequences due
to finger-surface biomechanics; and (3) modern neural networks are capable of dis-
criminating touch gestures using these sequences. In particular, a deep press gesture,
accelerated from long press based on an increase in a finger’s force, could be sensed
by a neural model in real time without additional hardware, and reliably discrimi-
nated from tap and scroll gestures. The press classification has a precision of 97%
and a recall of 88%, with an average time reduced to 235 ms from the conventional
400–500 ms long press.
More broadly, input sensors often capture rich streams of high-dimensionality
data that are typically summarised to a few key metrics to simplify the development
of heuristic analyses and classifications. Neural methods permit the analysis of the
raw data stream to find more complex relationships than can be feasibly expressed
with heuristics, and computational advances have made it feasible to operationalise
these models in real time. This chapter has described a practical instance of this—
deep touch—where a neural model has enhanced existing heuristic methods, and
been deployed widely to enable a richer user experience.
Acknowledgements We thank many Google and Android colleagues in engineering, design, and
product management for their direct and indirect contributions to the project.
References
1. Albinsson PA, Zhai S (2003) High precision touch screen interaction. In: Proceedings of
the SIGCHI conference on human factors in computing systems, ACM, New York, NY, CHI ’03, pp 105–112. https://doi.org/10.1145/642611.642631
2. Aliakbarian MS, Saleh FS, Salzmann M, Fernando B, Petersson L, Andersson L (2017) Encour-
aging LSTMs to anticipate actions very early. In: 2017 IEEE international conference on com-
puter vision (ICCV), pp 280–289. https://doi.org/10.1109/ICCV.2017.39
3. Antoine A, Malacria S, Casiez G (2017) ForceEdge: Controlling autoscroll on both desktop
and mobile computers using the force. In: Proceedings of the 2017 CHI conference on human
factors in computing systems, ACM, New York, NY, CHI ’17, pp 3281–3292. https://doi.org/
10.1145/3025453.3025605
4. Arif AS, Stuerzlinger W (2013) Pseudo-pressure detection and its use in predictive text entry on
touchscreens. In: Proceedings of the 25th australian computer-human interaction conference:
augmentation, application, innovation, collaboration, ACM, New York, NY, OzCHI ’13, pp
383–392. https://doi.org/10.1145/2541016.2541024
5. Azenkot S, Zhai S (2012) Touch behavior with different postures on soft smartphone keyboards.
In: Proceedings of the 14th international conference on Human-computer interaction with
mobile devices and services, ACM, New York, NY, MobileHCI ’12, pp 251–260. https://doi.
org/10.1145/2371574.2371612
6. Baglioni M, Malacria S, Lecolinet E, Guiard Y (2011) Flick-and-Brake: Finger control over
inertial/sustained scroll motion. In: CHI ’11 Extended abstracts on human factors in computing
systems, ACM, New York, NY, CHI EA ’11, pp 2281–2286. https://doi.org/10.1145/1979742.
1979853
7. Barrett G, Omote R (2010) Projected-capacitive touch technology. Inf Display 26(3):16–21
8. Benko H, Wilson AD, Baudisch P (2006) Precise selection techniques for multi-touch screens.
In: CHI ’06, ACM, New York, NY, pp 1263–1272. https://doi.org/10.1145/1124772.1124963
9. Bi X, Li Y, Zhai S (2013) FFitts law: Modeling finger touch with Fitts’ law. In: Proceedings of
the SIGCHI conference on human factors in computing systems, ACM, New York, NY, CHI
’13, pp 1363–1372. https://doi.org/10.1145/2470654.2466180
10. Birznieks I, Jenmalm P, Goodwin AW, Johansson RS (2001) Encoding of direction of finger-
tip forces by human tactile afferents. J Neurosci 21(20):8222–8237. https://doi.org/10.1523/
jneurosci.21-20-08222.2001
11. Boceck T, Le HV, Sprott S, Mayer S (2019) Force touch detection on capacitive sensors using
deep neural networks. In: Proceedings of the 21st international conference on human-computer
interaction with mobile devices and services. https://doi.org/10.1145/3338286.3344389
12. Boring S, Ledo D, Chen XA, Marquardt N, Tang A, Greenberg S (2012) The fat thumb: using
the thumb’s contact size for single-handed mobile interaction. In: Proceedings of the 14th
international conference on human-computer interaction with mobile devices and services,
ACM, New York, NY, MobileHCI ’12, pp 39–48. https://doi.org/10.1145/2371574.2371582
13. Brewster SA, Hughes M (2009) Pressure-based text entry for mobile devices. In: Proceedings
of the 11th international conference on human-computer interaction with mobile devices and
services, ACM, New York, NY, MobileHCI ’09, pp 9:1–9:4. https://doi.org/10.1145/1613858.
1613870
14. Buxton W (1995) Touch, gesture, and marking. In: Baecker RM, Grudin J, Buxton W, Greenberg
S (eds) Human-computer interaction: toward the year 2000, Morgan Kaufmann Publishers, San
Francisco, CA, chap 7, pp 469–482
15. Buxton W, Hill R, Rowley P (1985) Issues and techniques in touch-sensitive tablet input. In:
Proceedings of the 12th annual conference on Computer graphics and interactive techniques,
ACM, New York, NY, SIGGRAPH ’85, pp 215–224. https://doi.org/10.1145/325334.325239
16. Forlines C, Wigdor D, Shen C, Balakrishnan R (2007) Direct-touch vs. mouse input for tabletop
displays. In: Proceedings of the SIGCHI conference on human factors in computing systems,
ACM, New York, NY, CHI ’07, pp 647–656. https://doi.org/
10.1145/1240624.1240726
17. Glorot X, Bordes A, Bengio Y (2011) Deep sparse rectifier neural networks. In: Gordon G,
Dunson D, Dudík M (eds) Proceedings of the fourteenth international conference on artifi-
cial intelligence and statistics, PMLR, Fort Lauderdale, FL, Proceedings of machine learning
research, vol 15, pp 315–323
18. Goel M, Jansen A, Mandel T, Patel SN, Wobbrock JO (2013) ContextType: using hand posture
information to improve mobile touch screen text entry. In: Proceedings of the SIGCHI conference
on human factors in computing systems, ACM, New York, NY, CHI ’13, pp 2795–2798. https://
doi.org/10.1145/2470654.2481386
19. Goguey A, Casiez G, Vogel D, Gutwin C (2018) Characterizing finger pitch and roll orientation
during atomic touch actions. In: Proceedings of the 2018 CHI conference on human factors
in computing systems, ACM, New York, NY, CHI ’18, pp 589:1–589:12. https://doi.org/10.
1145/3173574.3174163
20. Goguey A, Malacria S, Gutwin C (2018) Improving discoverability and expert performance
in force-sensitive text selection for touch devices with mode gauges. In: Proceedings of the
2018 CHI conference on human factors in computing systems, ACM, New York, NY, CHI ’18.
https://doi.org/10.1145/3173574.3174051
21. Graves A (2012) Supervised sequence labelling with recurrent neural networks. Springer,
Berlin. https://doi.org/10.1007/978-3-642-24797-2
22. Grosse-Puppendahl T, Holz C, Cohn G, Wimmer R, Bechtold O, Hodges S, Reynolds MS, Smith
JR (2017) Finding common ground: A survey of capacitive sensing in human-computer inter-
action. In: Proceedings of the 2017 CHI conference on human factors in computing systems,
ACM, New York, NY, CHI ’17, pp 3293–3315. https://doi.org/10.1145/3025453.3025808
23. Harrison C, Hudson S (2012) Using shear as a supplemental two-dimensional input channel
for rich touchscreen interaction. In: Proceedings of the sigchi conference on human factors in
computing systems, ACM, New York, NY, CHI ’12, pp 3149–3152. https://doi.org/10.1145/
2207676.2208730
24. Heo S, Lee G (2011) Force gestures: augmenting touch screen gestures with normal and tangen-
tial forces. In: Proceedings of the 24th annual ACM symposium on User interface software and
technology, ACM, New York, NY, UIST ’11, pp 621–626. https://doi.org/10.1145/2047196.
2047278
25. Heo S, Lee G (2011) ForceTap: extending the input vocabulary of mobile touch screens by
adding tap gestures. In: Proceedings of the 13th international conference on human computer
interaction with mobile devices and services, ACM, New York, NY, MobileHCI ’11, pp 113–
122. https://doi.org/10.1145/2037373.2037393
26. Heo S, Lee G (2013) Indirect shear force estimation for multi-point shear force operations. In:
Proceedings of the SIGCHI Conference on human factors in computing systems, ACM, New
York, NY, CHI ’13, pp 281–284. https://doi.org/10.1145/2470654.2470693
27. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–
1780. https://doi.org/10.1162/neco.1997.9.8.1735
28. Holz C, Baudisch P (2010) The generalized perceived input point model and how to double
touch accuracy by extracting fingerprints. In: Proceedings of the 28th international conference
on Human factors in computing systems, ACM, New York, NY, CHI ’10, pp 581–590. https://
doi.org/10.1145/1753326.1753413
29. Holz C, Baudisch P (2011) Understanding touch. In: Proceedings of the 2011 annual conference
on Human factors in computing systems, ACM, New York, NY, CHI ’11, pp 2501–2510. https://
doi.org/10.1145/1978942.1979308
30. Hu Y, Huang L, Rieutort-Louis W, Sanz-Robinson J, Wagner S, Sturm JC, Verma N (2014) 3D
gesture-sensing system for interactive displays based on extended-range capacitive sensing.
In: 2014 IEEE international solid-state circuits conference digest of technical papers, ISSCC,
pp 212–213. https://doi.org/10.1109/ISSCC.2014.6757404
31. Kaaresoja T, Brewster S, Lantz V (2014) Towards the temporally perfect virtual button: Touch-
feedback simultaneity and perceived quality in mobile touchscreen press interactions. ACM
Trans Appl Percep 11(2):9:1–9:25, https://doi.org/10.1145/2611387
32. Lee B, Lee H, Lim SC, Lee H, Han S, Park J (2012) Evaluation of human tangential force
input performance. In: Proceedings of the SIGCHI conference on human factors in computing
systems, ACM, New York, NY, CHI ’12, pp 3121–3130. https://doi.org/10.1145/2207676.
2208727
33. Lee J, Cole MT, Lai JCS, Nathan A (2014) An analysis of electrode patterns in capacitive touch
screen panels. J Display Technol 10(5):362–366. https://doi.org/10.1109/JDT.2014.2303980
34. Lee S, Buxton W, Smith KC (1985) A multi-touch three dimensional touch-sensitive tablet. In:
Proceedings of the SIGCHI conference on human factors in computing systems, ACM, New
York, NY, CHI ’85, pp 21–25. https://doi.org/10.1145/317456.317461
35. Ma S, Sigal L, Sclaroff S (2016) Learning activity progression in LSTMs for activity detection
and early detection. In: 2016 IEEE conference on computer vision and pattern recognition
(CVPR), pp 1942–1950. https://doi.org/10.1109/CVPR.2016.214
36. Miyaki T, Rekimoto J (2009) GraspZoom: Zooming and scrolling control model for single-
handed mobile interaction. In: Proceedings of the 11th international conference on human-
computer interaction with mobile devices and services, ACM, New York, NY, MobileHCI ’09,
pp 11:1–11:4. https://doi.org/10.1145/1613858.1613872
37. Miyata N, Yamaguchi K, Maeda Y (2007) Measuring and modeling active maximum fingertip
forces of a human index finger. In: 2007 IEEE/RSJ international conference on intelligent
robots and systems, pp 2156–2161. https://doi.org/10.1109/IROS.2007.4399243
38. O’Connor T (2010) mTouch projected capacitive touch screen sensing theory of operation.
Technical Report, TB3064, Microchip Technology Inc
39. Pawluk DTV, Howe RD (1999) Dynamic contact of the human fingerpad against a flat surface.
J Biomech Eng 121(6):605–611. https://doi.org/10.1115/1.2800860
40. Potter RL, Weldon LJ, Shneiderman B (1988) Improving the accuracy of touch screens: an
experimental evaluation of three strategies. In: CHI ’88, ACM, New York, NY, pp 27–32.
https://doi.org/10.1145/57167.57171
41. Quinn P, Feng W (2020) Sensing force-based gestures on the Pixel 4. Google AI Blog, https://
ai.googleblog.com/2020/06/sensing-force-based-gestures-on-pixel-4.html
42. Quinn P, Malacria S, Cockburn A (2013) Touch scrolling transfer functions. In: Proceedings
of the 26th annual ACM symposium on user interface software and technology, ACM, New
York, NY, UIST ’13, pp 61–70, https://doi.org/10.1145/2501988.2501995
43. Rekimoto J, Schwesig C (2006) PreSenseII: Bi-directional touch and pressure sensing inter-
actions with tactile feedback. In: CHI ’06 extended abstracts on human factors in computing
systems, ACM, New York, NY, CHI EA ’06, pp 1253–1258. https://doi.org/10.1145/1125451.
1125685
44. Rendl C, Greindl P, Probst K, Behrens M, Haller M (2014) Presstures: exploring pressure-
sensitive multi-touch gestures on trackpads. In: Proceedings of the SIGCHI conference on
human factors in computing systems, ACM, New York, NY, CHI ’14, pp 431–434. https://doi.
org/10.1145/2556288.2557146
45. Rosenberg I, Perlin K (2009) The unmousepad: an interpolating multi-touch force-sensing
input pad. ACM Trans Graph 28(3):65:1–65:9. https://doi.org/10.1145/1531326.1531371
46. Roudaut A, Lecolinet E, Guiard Y (2009) Microrolls: expanding touch-screen input vocabulary
by distinguishing rolls vs. slides of the thumb. In: Proceedings of the SIGCHI conference on
human factors in computing systems, ACM, New York, NY, CHI ’09, pp 927–936. https://doi.
org/10.1145/1518701.1518843
47. Sakai N, Shimawaki S (2006) Mechanical responses and physical factors of the fingertip pulp.
Appl Bionics Biomech 3(4):273–278. https://doi.org/10.1533/abbi.2006.0046
48. Serina ER, Mote CD Jr, Rempel D (1997) Force response of the fingertip pulp to repeated
compression–effects of loading rate, loading angle and anthropometry. J Biomech 30(10):1035–
1040. https://doi.org/10.1016/S0021-9290(97)00065-1
49. Serina ER, Mockensturm E, Mote CD Jr, Rempel D (1998) A structural model of the forced
compression of the fingertip pulp. J Biomech 31(7):639–646. https://doi.org/10.1016/S0021-
9290(98)00067-0
50. Srinivasan MA, LaMotte RH (1995) Tactual discrimination of softness. J Neurophysiol
73(1):88–101. https://doi.org/10.1152/jn.1995.73.1.88
51. Srinivasan MA, Gulati RJ, Dandekar K (1992) In vivo compressibility of the human fingertip.
Adv Bioeng 22:573–576
52. Suzuki K, Sakamoto R, Sakamoto D, Ono T (2018) Pressure-sensitive zooming-out interfaces
for one-handed mobile interaction. In: Proceedings of the 20th international conference on
human-computer interaction with mobile devices and services, ACM, New York, NY, Mobile-
HCI ’18, pp 30:1–30:8. https://doi.org/10.1145/3229434.3229446
53. Taher F, Alexander J, Hardy J, Velloso E (2014) An empirical characterization of touch-gesture
input-force on mobile devices. In: Proceedings of the ninth ACM international conference on
interactive tabletops and surfaces, ACM, New York, NY, ITS ’14, pp 195–204. https://doi.org/
10.1145/2669485.2669515
54. Vogel D, Baudisch P (2007) Shift: a technique for operating pen-based interfaces using touch.
In: Proceedings of the SIGCHI conference on human factors in computing systems, ACM,
New York, NY, CHI ’07, pp 657–666. https://doi.org/10.1145/1240624.1240727
55. Walker G (2012) A review of technologies for sensing contact location on the surface of a
display. J Soc Inf Display 20(8):413–440. https://doi.org/10.1002/jsid.100
56. Wang F, Ren X (2009) Empirical evaluation for finger input properties in multi-touch inter-
action. In: Proceedings of the SIGCHI conference on human factors in computing systems,
ACM, New York, NY, CHI ’09, pp 1063–1072. https://doi.org/10.1145/1518701.1518864
57. Wang F, Cao X, Ren X, Irani P (2009) Detecting and leveraging finger orientation for interac-
tion with direct-touch surfaces. In: Proceedings of the 22nd annual ACM symposium on user
interface software and technology, ACM, New York, NY, UIST ’09, pp 23–32. https://doi.org/
10.1145/1622176.1622182
58. Wang T, Blankenship T (2011) Projected-capacitive touch systems from the controller point
of view. Inf Display 27(3):8–11
59. Westerman W (1999) Hand tracking, finger identification, and chordic manipulation on a multi-
touch surface. PhD thesis, University of Delaware
60. Wilson G, Stewart C, Brewster SA (2010) Pressure-based menu selection for mobile devices. In:
Proceedings of the 12th international conference on human computer interaction with mobile
devices and services, ACM, New York, NY, MobileHCI ’10, pp 181–190. https://doi.org/10.
1145/1851600.1851631
61. Yaniger SI (1991) Force sensing resistors: a review of the technology. In: Electro international,
pp 666–668. https://doi.org/10.1109/ELECTR.1991.718294
62. Yin Y, Ouyang TY, Partridge K, Zhai S (2013) Making touchscreen keyboards adaptive to keys,
hand postures, and individuals: a hierarchical spatial backoff model approach. In: Proceedings
of the SIGCHI conference on human factors in computing systems, ACM, New York, NY, CHI
’13, pp 2775–2784. https://doi.org/10.1145/2470654.2481384
63. Zimmerman TG, Smith JR, Paradiso JA, Allport D, Gershenfeld N (1995) Applying electric
field sensing to human-computer interfaces. In: Proceedings of the SIGCHI conference on
human factors in computing systems, ACM Press/Addison-Wesley Publishing Co., New York,
NY, CHI ’95, pp 280–287. https://doi.org/10.1145/223904.223940
Deep Learning-Based Hand Posture Recognition for Pen Interaction Enhancement
Abstract This chapter examines how digital pen interaction can be expanded by
detecting different hand postures formed primarily by the hand while it grips the pen.
Three systems using different types of sensors are considered: an EMG armband,
the raw capacitive image of the touchscreen, and a pen-top fisheye camera. In each
case, deep neural networks are used to perform classification or regression to detect
hand postures and gestures. Additional analyses are provided to demonstrate the
benefit of deep learning over conventional machine-learning methods, as well as
explore the impact on model accuracy resulting from the number of postures to
be recognised, user-dependent versus user-independent models, and the amount of
training data. Examples of posture-based pen interaction in applications are discussed
and a number of usability aspects resulting from user evaluations are identified. The
chapter concludes with perspectives on the recognition and design of posture-based
pen interaction for future systems.
1 Introduction
Digital pens and styli are popular tools to write, sketch, and design on tablet-like
devices. In many ways, the interaction experience can be like using a pen on paper,
where touching the surface with the nib makes a “mark” on the digital canvas. How-
ever, pen-based systems are capable of much more than pen-on-paper marking, but
using these capabilities requires different kinds of input. In a standard setting with
a conventional graphical user interface (GUI), a pen can work like a mouse, where
tapping or dragging on widgets can change input modes, adjust application param-
eters, or execute commands. If the pen device supports hover detection, then it is
F. Matulic (B)
Preferred Networks Inc., Tokyo, Japan
e-mail: fmatulic@preferred.jp
D. Vogel
Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada
e-mail: dvogel@uwaterloo.ca
machine-learning recognisers for HCI techniques: the benefit of deep learning com-
pared to more traditional machine-learning techniques and how recognition accuracy
is impacted by the vocabulary size (number of postures to be classified) and data
quantity (fewer or more people contributing more or less data).
We show how posture-based interaction in these cases can be leveraged in pen-
driven applications and, informed by user evaluations, we discuss various usability
aspects and limitations of such techniques. An outlook on future enhancements to
hand-posture driven pen interaction concludes our discussion.
2 Background
Research on pen input and interaction is abundant. We focus on sensing methods for
enhanced pen interaction and hand pose estimation techniques that could be used for
pen input.
The pen itself offers numerous possibilities for extending input channels. This can
be done using active pens with additional sensors or external tracking. Common
approaches in commercial devices include nib pressure [54] and barrel or tip buttons
[38]. The 3D orientation of the pen has been used for rolling and tilt-based input for
menu operation while in contact with the screen [6, 24, 67] as well as in mid-air
[26, 31]. A related approach is FlexStylus, a deformable pen that uses bend input
to increase artistic expressivity [16]. In these cases, mappings between the angles
and the associated interface action are direct, for instance, the rolling or bend angle
determines which item of a marking menu is selected. The different pen grips or
hand postures used to manipulate the pen are not detected by the system.
Since pen input occurs on touch screens, one of the most straightforward methods to
expand the vocabulary of pen interaction is to combine it with touch input on the same
device. Differentiating pen and touch input enables hybrid interaction patterns that
can be effectively leveraged with both hands. In this configuration, the non-dominant
hand uses touch to support the task of the dominant hand holding the pen, for instance,
to set pen modes. These bimanual “pen+touch” scenarios have been extensively
explored [7, 17, 22, 28, 44]. In these works, touch input mostly consists of classic
single and two-finger touch operations, but a few utilise more complex postures
such as multi-finger chords [22, 45] and contact-shape-based touch patterns to set
modes and invoke menus [47]. In the latter example, seven postures are detected with
84% accuracy using Hu moments applied to templates from other users, a form of
“user-independent” model.
In Sect. 4, we describe a method using raw capacitive data from the tablet surface
to detect hand postures performed by the pen-gripping hand with a deep neural
network.
Computer vision is perhaps the most common way to determine relative hand and
finger positions. Markerless hand pose estimation using RGB and depth data (RGB-
D) has improved considerably with deep learning [9], and current techniques are
able to recover the full 3D shape of a hand using a single monocular RGB image [18,
62]. The majority of the proposed algorithms deal only with bare hands, but some
approaches also consider hands manipulating objects such as cans and bottles [25,
64, 66]. None of these works are specifically designed or tested for a hand holding a
pen or pen-like object. Capturing hands with external cameras can introduce mobil-
ity constraints since hands can only be detected within the camera viewing range,
and detection accuracy can suffer from changing backgrounds and potential occlu-
sions. Datasets and models using egocentric views have been proposed to enable
more mobility [3], but they typically require head-worn cameras, like smartglasses
or a headset, which may not be practical for everyday pen activities. Similarly, sys-
tems using wrist-worn cameras such as Digits [36] and Back-Hand-Pose [75], while
supporting mobility, require the user’s hand to be instrumented and thus can be an
impediment for pen tasks.
With regard to using the other hand to complement pen input, mid-air gestures
detected by a depth camera have been proposed by Aslan et al. [2]. The technique
only recognises gestures and no advanced posture recognition is performed.
In Sect. 5, we describe a method to use an RGB camera directly mounted on a
pen, which supports mobile vision-based capturing without any wearable sensor.
Wearable motion sensors provide another option, most commonly in smartwatches. These can be used to recognise gestures like hand and finger motions
[73, 77], but do not produce sufficient information to recognise a large number
of static postures. SmartGe detects correct and incorrect pen grips for handwriting
using a CNN [5] and a classification accuracy of 98% for nine pen grips is reported.
However, users have to write slowly for the system to work, and the model was
trained on a fully random dataset split (without distinguishing between users and
time occurrence) so the evaluation is not ecologically valid.
Several types of wrist or forearm biometric signal sensors have been proposed
to detect postures and microgestures using machine learning. Detection techniques
include force sensitive resistors [12], electrical impedance tomography [79], ultra-
sound [32, 49], intradermal infrared sensing [48], electromyography (EMG) [23,
33, 78], skin stretching [39], and thermal imaging [30]. These sensors do not suffer
from external visual disturbances, but the models used for classification are typi-
cally very sensitive to sensor position on the arm and environmental factors like
humidity (such as sweat [56]), so they are considerably user- and session-dependent.
Combining different types of sensors increases recognition robustness [50], but also
adds complexity, as well as weight and power demands, for the wearable sensing
device. In Sect. 3, we describe a method using a commercial EMG armband to detect
a moderate vocabulary of pen-holding postures.
In summary, pen input enhancement techniques have not used deep learning, so
they are limited to direct mappings between pen input channels and UI actions (tilt,
roll, bend); for posture-based approaches, detection is limited to a few poses. Deep
learning has been used to recognise bare hand gestures and postures and the models
could potentially be fine-tuned to detect pen-holding hands, but sensors considered
so far are either fixed in the environment (external cameras), insufficiently precise
(smartwatches) or use DIY hardware (biometric sensors), so these approaches have
limited applicability in mobile pen tasks. The sensing environments in the three pen
posture detection settings that we consider are based on commercial devices (touch
screen and EMG armband) or require just a 3D-printed mount (pen-top camera)
without any custom electronics. We apply deep learning techniques for the specific
purpose of triggering application events based on detected postures, and therefore,
we do not aim to fully reconstruct accurate 3D models of the hand.
with finger extensions (Extended Index finger and Extended Pinkie), which are just
slight variations of a normal pen grip (Fig. 2).
EMG data for each of these postures is collected from 30 participants (10 female,
20 male, including 4 left-handed people) who perform a set of simple tracing, tapping,
sketching, and writing tasks on a Wacom Cintiq 22HD Touch tablet while wearing
the armband (Fig. 2). Please see [8, 46] for further detail about the tasks and data-
gathering procedure.
Fig. 3 Five pen grips considered for detection with the EMG armband: (a) Dynamic Tripod, (b) Squeeze Pen, (c) Ring+Pinkie on Palm, (d) Extended Index, (e) Extended Pinkie
The different pen grips are intended as triggers for mode switches in applications
where users form the desired posture just before touching the tablet with the pen.
Since the contact of the pen with the tablet may affect the EMG response, we need to
validate our posture recognition approach with windows of data captured around pen-
down events. We choose a window of 1060 ms, with 1000 ms before pen-down event
and 60 ms after the event to reflect possible system reaction time within acceptable
latency. Since the Myo has a sampling rate of 250 Hz and 8 electrodes, this gives us
2120 raw sensor values per window. While these windows around pen-down events
need to be used for validation to better reflect when postures are formed for mode
switching, there is no such restriction regarding training data. Any data that may
contribute to increasing model accuracy can be used. Since data was acquired while
continuously maintaining postures, we can use as much of the collected sensor data as
possible for model training. For consistency with validation and testing, we similarly
sample 1060 ms windows of data, but we use sliding windows with 75% overlap over
the entire data sequence for each posture.
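A minimal sketch of this windowing, assuming the raw EMG stream is an array of shape (num_samples, 8) at 250 Hz and that pen-down events are given as sample indices (the function names are ours, not part of the original implementation):

```python
import numpy as np

FS = 250                      # Myo sampling rate (Hz)
N_CHANNELS = 8
WIN = int(1.060 * FS)         # 1060 ms window = 265 samples (2120 values over 8 channels)
PRE = int(1.000 * FS)         # 1000 ms before the pen-down event
STRIDE = WIN // 4             # 75% overlap between consecutive training windows

def validation_windows(emg: np.ndarray, pen_down_samples: list[int]) -> np.ndarray:
    """Cut one window per pen-down event: 1000 ms before to 60 ms after."""
    wins = [emg[t - PRE:t - PRE + WIN] for t in pen_down_samples
            if t - PRE >= 0 and t - PRE + WIN <= len(emg)]
    return np.stack(wins) if wins else np.empty((0, WIN, N_CHANNELS))

def training_windows(emg: np.ndarray) -> np.ndarray:
    """Slide a 1060 ms window with 75% overlap over a whole posture recording."""
    starts = range(0, len(emg) - WIN + 1, STRIDE)
    return np.stack([emg[s:s + WIN] for s in starts])
```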
We consider three machine-learning approaches to classify these postures. The
first technique is based on a Convolutional Neural Network (CNN) applied to the
data transformed into spectrograms; the other two use hand-crafted features of the
data with Support Vector Machines (SVM) and Random Forests (RF). The latter
two techniques represent baselines for “classic machine learning” used for similar
scenarios in HCI [57, 58].
We briefly describe the CNN approach. For full details, please refer to [46].
EMG data is best converted into the frequency domain to use for classification,
with spectrograms, in particular, exhibiting useful features for convolutional neural
networks [11]. We convert our segmented raw EMG data into spectrograms using a
Fast Fourier Transform (FFT) size of 64 samples (corresponding to 256 ms) and a
hop distance of 8 samples (32 ms), which results in 33 frequency bins and 26 time
slices. We further normalise the values using min-max scaling over all participants’
data to ensure all features have equal range.
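A sketch of this conversion with SciPy, using the parameter values given above (the exact windowing details of the original implementation may differ):

```python
import numpy as np
from scipy.signal import spectrogram

FS = 250          # Hz
NFFT = 64         # 64-sample FFT (256 ms) -> 33 frequency bins
HOP = 8           # 8-sample hop (32 ms)   -> 26 time slices for a 265-sample window

def emg_to_spectrograms(window: np.ndarray) -> np.ndarray:
    """Convert one (265, 8) EMG window into an (8, 33, 26) stack of spectrograms."""
    specs = []
    for ch in range(window.shape[1]):
        _, _, sxx = spectrogram(window[:, ch], fs=FS,
                                nperseg=NFFT, noverlap=NFFT - HOP)
        specs.append(sxx)
    return np.stack(specs)

def minmax_normalise(specs: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Min-max scaling, with lo/hi computed over all participants' data."""
    return (specs - lo) / (hi - lo + 1e-12)
```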
3.5 Results
Table 1 shows classification accuracy for a variety of combinations of the five postures
using the three classifiers.
Unsurprisingly, the CNN gives the best results overall with roughly a 19% higher
accuracy compared to SVM and a 10% gain over RF on average. Interestingly, RF
achieves significantly higher accuracy than SVM for the within-user analysis, but
performs worse than SVM for between-user comparisons. This pattern was con-
sistent even after trying to optimise model hyperparameters for both techniques.
The between-user values are very low compared to the within-user results. For the full
five-posture set, classification accuracy is just slightly above 32% between users versus
73% within users. This confirms the high user-dependence of posture detection
using the Myo and reflects why the commercial Myo gesture recogniser also requires
users to calibrate before use. A three-posture set consisting of Dynamic Tripod,
the pressure-based Ring+Pinkie on Palm, and the finger-extension posture Ext.
Pinkie can be recognised with almost 86% accuracy with a user-dependent model.
This potentially makes this set practical for some applications. Recognition can likely
be made more robust by introducing temporal consistency between data frames and
using mitigation strategies such as transfer learning and normalisation [11].
Table 1 Posture recognition accuracy for different posture sets using within- and between-user datasets with CNN, SVM, and RF classifiers
The second hand-posture sensing system that we describe uses the tablet itself:
specifically, the raw capacitive sensors that detect touch input, read out as
a full image of the contact imprint of the hand against the tablet display (Fig. 5)
[8]. Unlike the EMG armband, which could detect mid-air finger movements and
pressure on the pen, the tablet can only detect direct touch. This means the pen grip
posture variations in this context are limited to contact patterns, i.e. how the hand,
fingers, and pen touch the display. We consider pen interaction enhancements in the
form of unimanual pen and touch interaction as a complement to the bimanual pen and
touch input design space explored in the HCI literature [7, 22, 28, 45].
Fig. 5 Explicit “pinkie out” pen-gripping posture recognised by the pen and hand contact pattern on the screen. Raw image shown on the left
To be able to easily distinguish postures based on their contact with the tablet display,
we could choose postures that form clearly identifiable shapes or patterns in the raw
capacitive images, for instance, postures distinguished by the number of distinct contact
areas (i.e. “blobs”) created by the palm and fingers [45]. This would, however, limit the number of possible
postures, as fingers have reduced mobility when simultaneously holding a pen. Deep
learning, which excels at image classification, allows us to consider a broader range
of postures that can also, to some extent, accommodate dexterity constraints and user
preferences.
For our analysis, we consider postures gathered from 18 study participants per-
forming tracing and sketching tasks on a Wacom Cintiq 22HD Touch tablet. The
device and tasks are the same as those used for the EMG-based postures described
above and illustrated in Fig. 2.
The potential design space of our unimanual postures is defined by how the palm
and specific fingers touch the screen. The palm can contact at the heel, the side, or
be “floating” with no touch. Fingers can contact outside the hand, such as when the
pinkie is stretched out, contact inside the hand, such as when the pinkie is curled
in, or contact near the nib, such as sliding the index down to touch the surface right
beside the nib. Theoretically, there are many permutations of these palm and finger
states, but manual dexterity and comfort limit the viable set to 34 potential postures.
From these 34 different postures, we narrow down to the 12 postures shown in Fig. 6.
Fig. 6 Pen postures used for classification (shown in decreasing order of preference)
These were rated the most comfortable in our study and did not significantly overlap
with a normal handwriting posture. More detail about the different postures and
their subjective ratings is available in the original article [8].
4.2 Classification
Since we use posture changes mainly for pen mode switches, we again only consider
touch data around pen-down events. Specifically, we select data 100 ms before and
after pen down, giving a 200 ms capture window. From this data, we only take frames
that include detected pen input (cursor coordinates), while the pen is in contact with
the screen or hovering within 2 cm, as detected by the Wacom tablet.
The touch frames yield 120 × 77 single-channel (greyscale) raw images and
pen data including x, y coordinates labelled with a hovering or touching state. We
combine this information in a 3-channel (RGB) image. The first channel contains
the raw touch image. A blob is drawn at the pen position either in the second channel
if hovering or in the third if touching. Examples of resulting images are shown in
Fig. 7.
Fig. 7 Examples of images with combined pen and touch data for four postures. From left to
right: Normal, FloatRinInPinIn, SideRinOutPinOut, Heel (Images cropped and colour
saturation enhanced for better visibility)
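A minimal sketch of this encoding with NumPy, assuming a 120 × 77 capacitance frame and pen coordinates already mapped to that frame's pixel grid (the blob radius is an illustrative value, not the one used in the original system):

```python
import numpy as np

H, W = 77, 120          # raw capacitive frame: 120 x 77 sensor cells
PEN_BLOB_RADIUS = 2     # illustrative radius for the drawn pen marker

def encode_frame(touch_image: np.ndarray, pen_x: int, pen_y: int,
                 hovering: bool) -> np.ndarray:
    """Pack the touch image and pen state into a 3-channel (H, W, 3) image.

    Channel 0: raw capacitive touch image.
    Channel 1: blob at the pen position if the pen is hovering.
    Channel 2: blob at the pen position if the pen is touching.
    """
    rgb = np.zeros((H, W, 3), dtype=np.float32)
    rgb[:, :, 0] = touch_image
    yy, xx = np.ogrid[:H, :W]
    blob = (yy - pen_y) ** 2 + (xx - pen_x) ** 2 <= PEN_BLOB_RADIUS ** 2
    rgb[:, :, 1 if hovering else 2][blob] = 1.0
    return rgb
```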
We train our VGG network with the RGB images combining pen and touch infor-
mation in a pre-processing step and two training stages. First, images are resized to
224 × 224 squares, which is the standard input size for VGG. Since the network has
pre-trained weights, we replace its final layer to match our desired output and train
that layer while freezing the others for a single epoch. In a second training stage, the
full network is trained for 10 epochs using discriminative learning rates (lower learn-
ing rates for first layers). A batch size of 64 is used in both stages. Training uses cross
entropy as the loss function and recognition performance is measured using the error
rate. For implementation, we use the fastai deep learning framework [29], which
provides a direct function (fine_tune) with optimised default hyperparameters
for this process.
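The corresponding fastai calls look roughly like the sketch below; the directory name is hypothetical, and for brevity a random split is used here instead of the leave-3-out participant split described next.

```python
from fastai.vision.all import *

# Hypothetical layout: one sub-folder per posture class containing the encoded frames.
dls = ImageDataLoaders.from_folder(
    "posture_frames",
    valid_pct=0.2,                 # illustrative random split (see leave-3-out below)
    item_tfms=Resize(224),         # VGG's standard 224 x 224 input size
    batch_tfms=aug_transforms(),   # random affine augmentation (translate/scale/rotate)
    bs=64,
)
learn = vision_learner(dls, vgg16_bn,
                       loss_func=CrossEntropyLossFlat(),
                       metrics=error_rate)
# fine_tune: one epoch training only the replaced final layer (body frozen),
# then 10 epochs over the full network with discriminative learning rates.
learn.fine_tune(10, freeze_epochs=1)
```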
For validation, we consider only between-user evaluations, since we anticipate that
pen and touch traces are sufficiently similar between users and sufficiently different
between postures to build robust general models. We use the data of 15 participants
for training and the remaining 3 participants as the validation set (i.e. a “leave-3-out”
scheme). There are three left-handed people among our participants, so we mirror
their images for analysis. We use data augmentation techniques on the training set to
artificially increase the amount of data, in the form of random affine transformations
such as translations, scale, and rotations. This yields roughly 22,000 training and
2,400 validation images for each posture.
To evaluate the robustness of our VGG model with an increasingly large posture
vocabulary (i.e. more classes), we perform successive training iterations. We start
with the top two postures of our 12-posture set, then add the next preferred posture to
train the next model, and continue until we reach the full 12-posture set for the final
model. For each training iteration, we randomly subsample our dataset following our
leave-3-out scheme to create 5 training and validation set splits for cross-validation.
We use the same seeds for each posture set to keep the same partitions between
iterations. We average the error rates in each of these 5 rounds to obtain a measure
of classification accuracy for the given posture set.
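A rough sketch of this evaluation loop follows; the participant IDs, posture names, and the train_and_eval callback are placeholders, and the actual data handling in the original study may differ.

```python
import numpy as np

PARTICIPANTS = list(range(18))                             # placeholder participant IDs
POSTURES_BY_PREFERENCE = [f"P{i:02d}" for i in range(12)]  # placeholder posture names

def leave_3_out_splits(seed: int, n_splits: int = 5):
    """Yield (train_ids, valid_ids) pairs, each holding out 3 participants."""
    rng = np.random.default_rng(seed)
    for _ in range(n_splits):
        valid = rng.choice(PARTICIPANTS, size=3, replace=False).tolist()
        yield [p for p in PARTICIPANTS if p not in valid], valid

def vocabulary_sweep(train_and_eval, seed: int = 42):
    """Grow the posture set from 2 to 12 classes; average the error over 5 splits each."""
    results = {}
    for k in range(2, len(POSTURES_BY_PREFERENCE) + 1):
        postures = POSTURES_BY_PREFERENCE[:k]
        # Fixed seed: every posture-set iteration reuses the same participant partitions.
        errors = [train_and_eval(postures, train_ids, valid_ids)
                  for train_ids, valid_ids in leave_3_out_splits(seed)]
        results[k] = float(np.mean(errors))
    return results
```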
4.5 Results
Figure 8 shows the classification accuracy for each iteration with increasing number
of postures and Fig. 9 shows the confusion matrix of the final set with all 12 postures.
Starting at over 97% with two postures, the classification accuracy decreases
approximately linearly as postures are added to the set to reach an average accuracy
of 78.6% with the full 12-posture set. The confusion matrix shows that Normal has
only a 51% recognition accuracy. “Normal” postures may vary among participants as
each person has a different handwriting grip so lower recognition performance is not
surprising. Furthermore, we observe it is often confused with SidePinIn, a relatively
similar posture. Looking at the images that caused the highest losses, we notice that
some Heel postures exhibited touch traces that are very similar to Normal, hinting at
difficulties for some participants to maintain the Heel posture throughout the tracing
tasks. Other errors with a high loss include images of float postures with fingers
curled in mistaken for postures with fingers stretched out. The confusion matrix
confirms that postures with fingers in have a lower accuracy than those with fingers
out, which may be due to the former being more difficult to maintain and leaving a
lighter or less consistent touch trace compared to the latter. Removing FloatPinIn and
SidePinIn, two postures causing significant misclassifications, increases accuracy for
10 postures to 85%. Two postures that also enjoy relatively high recognition accuracy
are Heel and FloatIndexTouch as they differ significantly from other poses.
In practice, people may only want to use up to five different postures in applica-
tions. These would be recognised with above 92% accuracy with even higher rates
likely possible if using temporal consistency checks and training per-user models.
The performance cost of adding a few more postures is relatively low, so this can be
considered if needed.
Fig. 8 Classification accuracy for increasing number of postures
While we only consider touch postures formed by the pen-holding hand in this
analysis, it would also be possible to use a CNN classifier on the raw capacitive
image to detect contact patterns of the other hand touching the screen. In this case,
only the original grey image would be used, as added channels with pen data are not
needed since the pen moves independently from the other hand. If both hands
can touch the screen simultaneously, it may be necessary to first isolate which touch
blobs belong to which hand in the raw capacitive image. Our data was captured
with only the pen-holding hand contacting the display and we did not consider the
other hand resting on the tablet that could potentially cause interference. To support
detection when both hands can touch the tablet screen, either hand segmentation or
a completely data-driven approach covering resting hand cases would be required.
Raw capacitive images of touch screens provide reliable input data for hand-
posture detection independently of physiological factors and without any wearable
sensor, but postures are limited to hands and fingers contacting the display. In the
next section, we consider postures and gestures captured by a small downward-
facing RGB camera attached to the top of the pen and we examine how the amount
of training data affects recognition accuracy.
The third and final example of hand-posture detection that we discuss uses a
small downward-facing camera with a fisheye lens fixed to the top end of the pen [43]
(Fig. 10a). This allows the camera to capture a top-down view of the pen-holding hand
and the surrounding environment, including the other hand (Fig. 10b). Contrary to
systems with cameras fixed to the environment, a pen-top camera preserves mobility
and does not suffer from occlusions caused by objects in between the cameras and
the hand. Cameras built into the tablet also maintain mobility, but they can only see
hands placed directly in front of or behind the device. Furthermore, hands moving on
or just below the tablet plane are beyond capturing range.
While in an ideal mobile setting, the pen-top camera should be small and operate
wirelessly (similar to so-called “spy pens” with cameras integrated in the barrel),
our proof-of-concept prototype uses a tethered USB camera, which streams image
data via a cable to a server. The camera needs to be elevated above the pen to have a
sufficiently open view of the hands and the environment in order to properly capture
different postures. However, any increased length also adds weight and moves the
centre of gravity upwards, making the pen more unwieldy and uncomfortable to use.
We create a 6.2 cm high mount as a compromise between these two considerations.
The combined weight of the mount and camera without the cable is 18 g. The camera
streams images at 30 fps in 1920 × 1080 resolution over USB2.
Fig. 10 Downward-facing fisheye camera attached to a pen via a 3D-printed mount (a) that can
“see” hands and surrounding environment (b). This enables various posture-based interactions using
both hands (c)
Fig. 11 Postures used to trigger discrete actions (classification) (a–f) and two-hand sawing gesture
used for continuous parameter control (regression) (j)
For our analysis of deep learning-based posture detection models in this setting, we
consider two different posture sets corresponding to poses formed by either the “pen
hand” or the “other hand”. These are both classification problems. We further include
a two-hand “sawing” gesture consisting in placing the index finger orthogonally
against the pen barrel and rubbing the finger towards or away. The distance between
the pen (i.e. the centre of the image) and the index fingertip (blue circle in Fig. 18c)
can be mapped to a continuous parameter used by the application. This is a regression
problem.
Note that the sawing gesture’s associated mode first needs to be activated so that the
system knows when parameter mapping should occur. The trigger for mode activation
is when the user hits the pen with their index finger. To correctly identify the moment
when the mode should be engaged and disengaged, a classification network needs to
be trained. Such a neural network needs to be exposed to multiple images showing
the other hand approaching the pen just before touching it, so that it can distinguish
“touch” from “no touch” images. In this analysis, we do not consider the classification
of these two states for mode engagement and only focus on the regression problem
once the sawing gesture mode is enabled. All postures and the gesture are illustrated in
Fig. 11.
We directly feed the images captured by the fisheye camera to our deep neural
networks. As can be seen in Fig. 10b, three stripes belonging to the support blades of
the mount appear in the images captured by the fisheye camera. Depending on how
the pen is held, these stripes may overlap with the hands and fingers forming postures,
thereby causing partial occlusions. In the original article, we describe possible pen
designs to deal with this problem [43].
Image data is gathered from 11 participants seated at a table and randomly drawing
on an iPad Pro with an Apple Pencil while performing each posture and gesture,
one after the other. Participants are instructed to move their hands and continuously
rotate and tilt the pen to cover multiple positions and angles as the networks need to
recognise postures and gestures independently of these factors.
The location of the index fingertip in the images of the pen-sawing gesture is
manually coded by human annotators. The distance between the fingertip and the
centre of the image, corresponding to the location of the pen, is then computed to
form the labelled continuous data.
For both classification and regression on these natural RGB images, we use a ResNet-
50 architecture. This is a deeper network than the VGG-16 used for the system
described in the previous section, since classification is on natural images containing
more complex detection patterns. Like the previous system, we start with a network
that is pre-trained on ImageNet to bootstrap training.
Previously, we examined how gradually increasing the number of postures (i.e. num-
ber of classes) in the set affects recognition accuracy. In this setting, we instead anal-
yse the impact of the amount of training data by training networks with the data of
an increasing number of participants. Specifically, we train using data from one to
nine participants (i.e. [P1], [P1, P2], [P1, P2, P3],…, [P1, P2, …, P9]) and always
test with the last two participants [P10, P11]. P10 was left-handed so their images are
mirrored. While this fixed partitioning does not fully account for potential feature
interactions among our participants’ data, it gives us an idea of how much data and
how many people are required to achieve a certain level of recognition performance.
We create three different sets from the postures and gesture of Fig. 11:
• Pen Hand set, which includes the Pen Hand and Normal postures, i.e. classification
among 5 postures.
• Other Hand set, which includes the Other Hand and Normal postures, i.e. classi-
fication among 4 postures.
• Pen-Sawing set, which includes the Pen-Sawing gesture and its associated fingertip
distance used for regression.
Since participants contributed different amounts of data for each posture due to
the nature of the task, we need to select an equal number of images per posture
to ensure data balance for our analysis. Across all participants and postures, the
minimum number of images contributed for one posture per participant was 587, so
we select 587 images per posture per participant. For participants that contributed
more images, we take 587 equally spaced images for each posture in the capturing
sequence. For example, if the participant has 800 images for a posture, we select
images 1, 3, 4, 5, 7, …, and 800. This provides better data distribution compared to
simply selecting the first 587 images in the set. We apply the same procedure for the
sawing gesture, where the minimum number of images contributed by a participant
is 924.
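A small sketch of this equally spaced subsampling, assuming the images of one posture for one participant are held as a time-ordered list (the function name is ours):

```python
import numpy as np

def subsample_evenly(image_paths: list, n_keep: int = 587) -> list:
    """Keep n_keep images spread evenly over the capture sequence."""
    if len(image_paths) <= n_keep:
        return list(image_paths)
    # Evenly spaced (rounded) indices over the whole sequence, e.g. 587 out of 800.
    idx = np.round(np.linspace(0, len(image_paths) - 1, n_keep)).astype(int)
    return [image_paths[i] for i in idx]
```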
For regression, since we only consider a single gesture and the minimum number
of images for it is larger, in addition to gradually increasing the number of partici-
pants, we also vary the number of images contributed by each person. This allows
us to compare the impact of more data contributed by fewer participants vs more
participants contributing less data. For instance, given 3696 training images per pos-
ture, we can compare the accuracy of the ResNet when those images come from four
participants (4×924), six participants (6×616), and eight participants (8×462).
We train all networks using the same procedure and framework used in the pre-
vious section: With fastai’s fine_tune function [29], we first fine-tune the final
layer of the pre-trained network for one epoch and then unfreeze all layers to train
for a further 10 epochs with discriminative learning rates. We perform three runs for each
setting and average the results to smooth out differences due to randomness in the
training process. We again use a batch size of 64. For classification, we use cross entropy
as the loss function and the error rate as the metric; for regression, we use mean squared
error loss and root mean square error (RMSE) as the metric.
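For the regression model, the fastai setup looks roughly as follows; the CSV file and its column names are hypothetical, the random split stands in for the actual held-out participants (P10, P11), and the classification variant differs only in the blocks, loss, and metric.

```python
from fastai.vision.all import *
import pandas as pd

# Hypothetical table: one row per frame with the annotated fingertip-to-pen distance.
df = pd.read_csv("sawing_labels.csv")   # assumed columns: image_path, distance_px

dblock = DataBlock(
    blocks=(ImageBlock, RegressionBlock(n_out=1)),
    get_x=ColReader("image_path"),
    get_y=ColReader("distance_px"),
    splitter=RandomSplitter(valid_pct=0.2, seed=0),  # illustrative; real split is by participant
    item_tfms=Resize(224),
)
dls = dblock.dataloaders(df, bs=64)

learn = vision_learner(dls, resnet50,
                       loss_func=MSELossFlat(),   # mean squared error loss
                       metrics=rmse)              # reported as RMSE
learn.fine_tune(10, freeze_epochs=1)              # same two-stage schedule as before
```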
5.5 Results
We report results of experiments for varying training data size first for the two posture
classification models, then the sawing gesture regression model.
Figure 12 shows the classification performance of the ResNet-50 model for the two
posture datasets with increasing number of participants contributing training data. We
observe that both sets start with accuracy just below 60% and significantly increase
their performance with three to four participants. After four participants, classifica-
tion performance only marginally increases and stagnates after 7 participants. The
final accuracy is roughly 90% for Pen Hand and 78% for Other Hand. Pen Hand contains
more postures, but exhibits a higher recognition accuracy, likely because Pen
Hand postures occupy a larger amount of pixel space and are centrally located in the
images. The other hand is further away from the camera and appears less prominently;
therefore, small differences in hand postures are harder to distinguish.
Confusion matrices for both sets (Figs. 13 and 16) reveal once again that the
normal hand postures are the most error prone, no doubt due to the diversity of pen-
holding and pen-resting poses. Networks likely need to see a large variety of these
“normal” poses in order to robustly identify them.
Delving deeper into the data to analyse misrecognised cases, we again examine the
images that caused the highest losses. We generate Grad-CAM heatmaps [60] using
the final layer of the CNN to visualise class activations. These activation maps show
which regions of the image the network focussed on to classify it. Three examples of
misclassified postures for each of the two sets are shown in Fig. 14. As can be seen,
these wrong predictions are likely due to fingers being occluded by the support
blades (a and e), badly formed postures (b), and normal poses resembling input
postures (c, d, and f). The activation map of the (f) case further reveals that
the model mainly focussed on a spot on the forearm instead of the hand, which is
placed on the top left corner of the tablet. The latter pattern is underrepresented in
our data, which mostly consists of other hands resting at the side or below the device
without touching it. This confirms the need to include more samples of other hands in
different locations around and on the tablet in the training data to expose the network
to these cases.
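A minimal, self-contained Grad-CAM sketch in plain PyTorch is shown below; the 4-class model and the random input are placeholders for the trained posture classifier and a preprocessed camera frame, and the original analysis may have used different tooling.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(num_classes=4).eval()      # placeholder for the trained classifier
acts = {}
# Capture the feature maps of the last convolutional block during the forward pass.
model.layer4[-1].register_forward_hook(lambda mod, inp, out: acts.update(value=out))

x = torch.randn(1, 3, 224, 224)             # stand-in for a preprocessed camera image
scores = model(x)
predicted = scores[0].argmax()
# Gradient of the predicted class score w.r.t. the captured feature maps.
grads = torch.autograd.grad(scores[0, predicted], acts["value"])[0]

# Grad-CAM: channel weights = spatially averaged gradients; weighted sum of activations.
weights = grads.mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * acts["value"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
heatmap = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # [0, 1] overlay map
```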
As can be seen in Fig. 15, with one and two participants the RMSE is around 160,
regardless of the number of images per participant. With 924 images per person, the
error rate starts improving with the third participant and decreases significantly until it
reaches a plateau with 7 participants, achieving an RMSE between 15 and 20. Adding
more participant data after that does not further lead to lower rates, suggesting that a
limit has been attained. With lower amounts of images per participant, the inflexion
point comes at a later stage with the start of the error rate dip occurring from the
sixth participant. There is not enough data to reach a plateau in these cases, but we
presume that it would likely be similar to the previous case, with RMSE values just
under 20.
Fig. 12 Classification accuracy for the Pen-Hand and Other-Hand posture sets with an increasing number of participants contributing data. Each participant contributes 587 images
Fig. 14 Misclassified postures with Grad-CAM heatmaps applied to the final layer of the CNN. Posture pairs in each image caption are predicted and actual posture
Interestingly, there seems to be little effect whether a specific amount of data
comes from few or many participants with this dataset. Comparing the RMSE within
the major gains region of the curves, accuracy with four participants each contributing
924 images is roughly the same as with six participants contributing 616 images and
with eight participants providing 462 images, i.e. the source of the 3696 images does
not seem to significantly matter. This is interesting since there is some diversity in our
participants, with both male and female representation, people with darker skin tones,
and hands of different sizes, with and without body hair. This suggests that the network is able
to reliably capture general hand features with only a few different participants.
Fig. 17 Examples of unimanual pen and touch postures to perform various operations in vector
drawing and PDF annotation applications: a switching from Normal posture for inking to Heel
posture for highlighting (document annotation); b invoking a colour chooser with SideRinOutPinOut
(document annotation); c invoking an object creation menu with SidePinOut (vector drawing); d
creating a text object with handwriting recognition using SidePinIn (vector drawing); e making
stroke gestures to issue commands with FloatIndTouch (vector drawing)
The main purpose of detecting hand postures and gestures is to quickly switch pen
modes, trigger actions, and modify parameters in applications. These interactions can
function as expert shortcuts in addition to traditional UI widgets and tools, similar
to keyboard shortcuts complementing widget-based selections in a traditional GUI.
Figures 17 and 18 show a few examples of how hand postures and gestures can be
used for such kinds of interactions in pen-driven applications.
Typical examples of pen applications are drawing, sketching, and note-taking,
where the pen can take on multiple roles. For instance, when viewing a document,
the pen creates annotations in its default mode, but the mode can be switched so that
the pen can also highlight text, scroll, invoke menus, etc. These modes or commands
can be activated by forming specific hand postures. Setting continuous parameters
can be supported with hand gestures mapped to a continuous feature such as the
sawing gesture, which uses the distance between the index finger and the pen.
If the tablet is placed on a support such as a table or the user’s lap, the other hand
can also be used to perform postures and gestures. Even if the other hand is subject to
movement constraints, such as when holding the tablet, it can still be used to perform
a limited number of mode-setting postures, for instance, holding up the index finger
to switch to zooming mode in a map application (Fig. 18d). However, if that finger
is used to support the tablet, lifting it can result in reduced grip stability.
Fig. 18 Examples of interactions enabled by hand postures detected by a pen-top camera: a setting
rectangle-input mode in a sketching application by forming a vertical flat hand with the other hand;
b undoing an action by extending the index finger of the pen-holding hand; c pen “sawing” gesture
with the distance between the index finger and the pen (blue circle) to set a continuous parameter
like stroke width; d map application with raised finger of the tablet-gripping hand to enable zooming
mode
Fig. 19 Pen tucked in the hand to temporarily perform touch-drag panning. a All-direction panning
with flexed thumb. b Axis-constrained panning with extended thumb
The pen-holding hand can occasionally be used to perform finger touch interactions
in between pen input actions, for example, when the user wants to scroll the canvas
with panning finger gestures in between penning annotations. While the other hand
could be used to perform these operations, it might not always be available because
it may already be holding the device, or the user prefers to interact with the device
using only one hand. The pen itself could be used after a mode switch, but panning
with fingers is common on touchscreens. In these unimanual modality switches,
the pen is temporarily tucked in the palm so that the index finger can be used for
touch manipulations. We can support mode switching in these situations as well by
distinguishing between different pen-stowing postures, such as triggering a different
mode depending on whether the thumb is extended or not. This could be used, for
example, to constrain scrolling or panning to specific directions (Fig. 19), similar to
pressing a modifier key on the keyboard to apply constraints to a freeform operation.
Fig. 20 Index finger pointing at off-tablet content to capture with camera. a External view of
user pointing at photo near tablet while tilting the pen forward. b Camera view showing identified
fingertip and object to capture
In the case of the pen-top camera sensor, off-tablet content, such as documents and
photos around the tablet, can be captured for pick-and-drop operations [55] and to
form search queries. The user can tilt the pen forward and point at elements to capture
with their extended index finger (Fig. 20). The fingertip location in the camera image
is detected using the same type of neural network as for the pen-sawing gesture,
the output being two continuous values for the x,y coordinates (keypoint regression)
instead of one. The system can then infer which element in the image is designated. In
our proof-of-concept implementation of this feature, we use simple computer vision
techniques, a Canny edge detector and contour finder, to locate objects in the vicinity
of the detected fingertip [43]. We also support the capture and conversion of isolated
text elements using the EAST detector [80] and Google’s OCR engine. The obtained
machine-readable text can then be used in the tablet applications, for example, to
perform text searches.
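As a rough illustration of this pipeline, the following sketch locates the object contour closest to a predicted fingertip position using OpenCV's Canny edge detector and contour finder. The frame, the fingertip coordinates (assumed to come from the keypoint-regression network), and the thresholds are placeholders rather than the exact values used in our implementation.

```python
# Minimal sketch: locate the object contour nearest to a predicted fingertip
# position in a pen-top camera frame, using Canny edges and contour finding.
# `fingertip_xy` is assumed to come from a keypoint-regression network.
import cv2
import numpy as np

def find_pointed_object(frame_bgr, fingertip_xy, canny_lo=50, canny_hi=150):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, canny_lo, canny_hi)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    best, best_dist = None, float("inf")
    for contour in contours:
        if cv2.contourArea(contour) < 500:    # ignore tiny edge fragments
            continue
        # Unsigned distance from the fingertip to the contour boundary.
        d = abs(cv2.pointPolygonTest(contour,
                                     tuple(map(float, fingertip_xy)), True))
        if d < best_dist:
            best, best_dist = contour, d
    if best is None:
        return None
    x, y, w, h = cv2.boundingRect(best)
    return frame_bgr[y:y + h, x:x + w]        # cropped region to OCR or search
```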
As described above, deep learning models can also detect finger movements for the
control of continuous parameters, such as the pen-sawing gesture (Fig. 18c). The
distance between the fingertip and the pen can be mapped to a continuous variable
such as stroke thickness in a drawing application or zoom level in a map application.
Other possible gestures for continuous parameter control include finger flexion and
extension, with the flexion angle mapped to the desired variable [47]. However, these
types of gestures might not achieve high precision using the sensing techniques above
if captured from a distance (e.g. fingers of the other hand appearing too small) or
when only partially visible (e.g. partially occluded fingers of the pen hand).
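The mapping from such a detected feature to an application parameter can be as simple as a clamped linear function with temporal smoothing to absorb per-frame noise. The sketch below is illustrative only; the distance range, stroke-width bounds, and smoothing factor are assumptions, not values from our systems.

```python
# Minimal sketch: map the detected fingertip-to-pen distance (in pixels) to a
# stroke width, with clamping and exponential smoothing to reduce jitter.
def distance_to_stroke_width(dist_px, d_min=10.0, d_max=120.0,
                             w_min=1.0, w_max=24.0):
    t = (dist_px - d_min) / (d_max - d_min)
    t = min(max(t, 0.0), 1.0)                 # clamp to [0, 1]
    return w_min + t * (w_max - w_min)

class SmoothedParameter:
    """Exponentially smoothed value for noisy per-frame estimates."""
    def __init__(self, alpha=0.3):
        self.alpha, self.value = alpha, None
    def update(self, x):
        self.value = x if self.value is None else \
            self.alpha * x + (1 - self.alpha) * self.value
        return self.value
```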
For each system above, qualitative user studies were conducted to evaluate the com-
fort, suitability, and general usability of the proposed postures and gestures in exam-
ple applications [8, 43, 46]. We summarise the main findings and their implications
for the design of posture-based interactions in these contexts.
The choice of possible hand postures and interactions for a particular context depends
on the sensing constraints, user preferences, and application functions to which the
postures should be mapped. Our evaluations showed that user preferences for action-
triggering postures can vary considerably, but some general trends can be identified.
Pressure-based postures were disliked if pressure had to be constantly applied
to maintain a mode. However, brief squeezes to trigger quick actions were deemed
acceptable. Similarly, postures requiring continuous contact with the surface were
found unsuitable for maintained modes, especially when the hand needs to move, as
dragging on the display causes friction. Touch-based postures are reasonable if the
hand remains in the same position, such as a palm firmly planted while the pen is
used locally, or if the postures consist of quick finger taps to invoke menus or execute
commands.
Within contact-based postures, poses with the pinkie out, with or without the
ring finger out as well, were preferred over others. Floating palm postures were also
acceptable as long as precision was not required, since the palm is not planted on the
display to afford stability.
Regarding raised fingers, preferences were mixed. People who adopt the dynamic
quadrupod as a normal writing posture (four fingers gripping the pen) can easily
raise their index finger while maintaining pen grip stability with their remaining
three fingers. However, people who use a dynamic tripod grip (three fingers gripping
the pen) cannot easily do this since lifting the index means that only two fingers
remain to grip the pen. For these users, extending the pinkie is preferable, although
that also implies a certain amount of dexterity.
Finally, pen tucking postures were considered comfortable overall, although a few
people noted their pen grip became less stable when extending the thumb.
Participants generally found this kind of two-handed gesture very intuitive and
easy to remember, but it also demands more effort, as it requires moving the other
hand, which rests next to the tablet, towards the pen and back. Furthermore, many
participants found it was difficult to find the precise location on the pen to place the
index finger in order to obtain a specific value. This type of gesture might be more
suitable for choosing among discrete, step-wise or coarse levels rather than setting a
continuous parameter on a fine-grained spectrum.
Transitioning from one pen hand posture to another can be difficult when the two
postures are very different. For touch-based postures, the tablet surface can be an
obstacle to finger movement. For example, when trying to extend the pinkie, some-
times the tablet is in the way so the hand needs to be raised. Another issue that can
arise is when an unintentional posture is recognised midway during the action of
forming the desired posture. For instance, when both pinkie out and pinkie out + ring
out postures are included, if the user places the pinkie first to form the pinkie out +
ring out posture, the action associated with pinkie out might be inadvertently trig-
gered. Consideration for these conflicts and false activations should be given when
designing posture sets.
Switching modes using different pen grips requires only a single hand, which can
be practical in case the other hand is busy, like when holding the tablet, or if the
user does not want to interact with two hands [42, 43]. If the other hand is free and
detectable, it is usually preferred for posture-based interaction as its movements are
not constrained. In particular, it is a good option for the activation of maintained
modes or quasimodes. A minority of people, however, do not want the other hand to
participate in any action at all so they tend to prefer pen hand postures for everything.
Just as there are differences in pen-gripping styles, there is even more variability in the
ways people idly rest their hands on or near the tablet. In our experiments, we assumed
that most people naturally rest their other hand more or less flatly on the table while
engaged in a pen task (in the scenario where the tablet is placed on a table), but this is
not necessarily the case for everybody. For example, some users who naturally curl
their other hand when writing might accidentally trigger the action associated with
the fist posture. Furthermore, hand position varies considerably while using pens and
people naturally move their other hand to different locations, sometimes outside of
detection range (e.g. when resting their head on it). Of course, users can change their
habits if they are aware that their other hand is used for first-class input, but neural
networks also need to be exposed to a large number of these various idle poses to be
able to robustly distinguish them from explicit input postures.
Similar to how stroke gestures are more easily memorised and recalled than keyboard
shortcuts [1], it is helpful to try to semantically match certain types of postures to
specific function categories. For instance, when using unimanual touch postures in
a vector graphics editor, poses using the side of the palm can be assigned to creation
modes such as sketching, erasing, and drawing shapes, while floating palm pos-
tures can be associated with “macro” interactions such as selecting and transforming
shapes. Furthermore, dual actions can be assigned to two opposite positions of the
same finger, such as a posture with a finger in for undo, and a finger out for redo.
Memorisation of the mode-to-posture mapping can further be helped if the shape
of the posture roughly matches the associated input geometric shape, especially for
the other hand, which is unconstrained. For example, associating a fist with circle
input and a flat hand with rectangle input (Fig. 18a) was rated intuitive and easy to
remember by our participants.
7 Conclusion
Wacom has also recently released special digital pens for VR [40, 72]. When using
the pen in mid-air, grip changes become easier to some degree as there is no screen
underneath hindering finger extension. In this context, pens can not only be used for
3D drawing, but also as pointing instruments to select objects in 3D space [52]. Pen
grip style has been shown to affect pointing precision [4], which suggests potential
for grip detection not only to support mode switching, but also to make sure the user
adopts the most efficient posture for a particular task. With hand tracking being an
increasingly common feature of VR systems, the technical and practical hurdles for
posture-based pen interaction in these contexts seem relatively low. We are hopeful
that further novel and exciting research based on work for pen and tablet devices will
emerge in this new space.
References
1. Appert C, Zhai S (2009) Using strokes as command shortcuts: cognitive benefits and toolkit
support. In: Proceedings of the SIGCHI conference on human factors in computing systems,
pp 2289–2298
2. Aslan I, Buchwald I, Koytek P, André E (2016) Pen + Mid-Air: an exploration of mid-air
gestures to complement pen input on tablets. In: Proceedings of the 9th Nordic conference on
human-computer interaction, NordiCHI ’16, pp 1:1-1:10, New York, NY, USA. ACM
3. Bandini A, Zariffa J (2020) Analysis of the hands in egocentric vision: a survey. IEEE Trans
Pattern Anal Mach Intell
4. Batmaz AU, Mutasim AK, Stuerzlinger W (2020) Precision vs. power grip: a comparison of
pen grip styles for selection in virtual reality. In: 2020 IEEE conference on virtual reality and
3D user interfaces abstracts and workshops (VRW), pp 23–28. IEEE
5. Hongliang B, Jian Z, Yanjiao C (2020) Smartge: identifying pen-holding gesture with smart-
watch. IEEE Access 8:28820–28830
6. Bi X, Moscovich T, Ramos G, Balakrishnan R, Hinckley K (2008) An exploration of pen
rolling for pen-based interaction. In: Proceedings of the 21st annual ACM symposium on User
interface software and technology, pp 191–200
7. Brandl P, Forlines C, Wigdor D, Haller M, Shen C (2008) Combining and measuring the benefits
of bimanual pen and direct-touch interaction on horizontal interfaces. In: Proceedings of the
working conference on advanced visual interfaces, pp 154–161, Napoli, Italy. ACM
8. Cami D, Matulic F, Calland RG, Vogel B, Vogel D (2018) Unimanual Pen+Touch input using
variations of precision grip postures. In: Proceedings of the 31st annual ACM symposium on
user interface software and technology, UIST ’18, pp 825–837, New York, NY, USA. ACM
9. Theocharis C, Andreas S, Dimitrios K, Kosmas D, Petros D (2020) A comprehensive study on
deep learning-based 3d hand pose estimation methods. Appl Sci 10(19):6850
10. Weiya C, Yu C, Tu C, Zehua L, Jing T, Ou S, Fu Y, Zhidong X (2020) A survey on hand pose
estimation with wearable sensors and computer-vision-based methods. Sensors 20(4):1074
11. Côté-Allard U, Fall CL, Drouin A, Campeau-Lecours A, Gosselin C, Glette K, Laviolette F,
Gosselin B (2019) Deep learning for electromyographic hand gesture signal classification using
transfer learning. IEEE Trans Neural Syst Rehab Eng 27(4):760–771
12. Dementyev A, Paradiso JA (2014) Wristflex: low-power gesture input with wrist-worn pres-
sure sensors. In: Proceedings of the 27th annual ACM symposium on user interface software
and technology, UIST ’14, pp 161–166, New York, NY, USA. Association for Computing
Machinery
13. Drey T, Gugenheimer J, Karlbauer J, Milo M, Rukzio E (2020) Vrsketchin: exploring the
design space of pen and tablet interaction for 3d sketching in virtual reality. In: Proceedings of
the 2020 CHI conference on human factors in computing systems, pp 1–14
14. Du H, Li P, Zhou H, Gong W, Luo G, Yang P (2018) Wordrecorder: accurate acoustic-based
handwriting recognition using deep learning. In: IEEE INFOCOM 2018-IEEE conference on
computer communications, pp 1448–1456. IEEE
15. Elkin LA, Beau J-B, Casiez G, Vogel D (2020) Manipulation, learning, and recall with tangible
pen-like input. In: Proceedings of the 2020 CHI conference on human factors in computing
systems, CHI ’20, pp 1–12, New York, NY, USA. Association for Computing Machinery
16. Fellion N, Pietrzak T, Girouard A (2017) Flexstylus: leveraging bend input for pen interaction.
In: Proceedings of the 30th annual ACM symposium on user interface software and technology,
UIST ’17, pp 375–385, New York, NY, USA. ACM
17. Frisch M, Heydekorn J, Dachselt R (2009) Investigating multi-touch and pen gestures for
diagram editing on interactive surfaces. Proc ITS 2009:149–156
18. Ge L, Ren Z, Li Y, Xue Z, Wang Y, Cai J, Yuan J (2019) 3d hand shape and pose estimation from
a single rgb image. In: Proceedings of the IEEE conference on computer vision and pattern
recognition, pp 10833–10842
19. Gesslein T, Biener V, Gagel P, Schneider D, Kristensson PO, Ofek E, Pahud M, Grubert J
(2020) Pen-based interaction with spreadsheets in mobile virtual reality. arXiv:2008.04543
20. Oliver G, Wu S, Daniele P, Otmar H, Olga S-H (2019) Interactive hand pose estimation using
a stretch-sensing soft glove. ACM Trans Graph (TOG) 38(4):1–15
21. Grossman T, Hinckley K, Baudisch P, Agrawala M, Balakrishnan R (2006) Hover widgets:
using the tracking state to extend the capabilities of pen-operated devices. In: Proceedings
of the SIGCHI conference on human factors in computing systems, pp 861–870, Montréal,
Québec, Canada. ACM
22. Hamilton W, Kerne A, Robbins T (2012) High-performance pen+ touch modality interactions:
a real-time strategy game esports context. In: Proceedings of the 25th annual ACM symposium
on user interface software and technology, pp 309–318
23. Haque F, Nancel M, Vogel D (2015) Myopoint: pointing and clicking using forearm mounted
electromyography and inertial motion sensors. In: Proceedings of the 33rd annual ACM con-
ference on human factors in computing systems, CHI ’15, pp 3653–3656, New York, NY, USA.
Association for Computing Machinery
24. Hasan K, Yang X- D, Bunt A, Irani P (2012) A-coord input: coordinating auxiliary input streams
for augmenting contextual pen-based interactions. In: Proceedings of the SIGCHI conference
on human factors in computing systems, CHI ’12, pp 805–814, New York, NY, USA. ACM
25. Hasson Y, Varol G, Tzionas D, Kalevatykh I, Black MJ, Laptev I, Schmid C (2019) Learning
joint reconstruction of hands and manipulated objects. In: Proceedings of the IEEE conference
on computer vision and pattern recognition, pp 11807–11816
26. Hinckley K, ’Anthony’ Chen X, Benko H (2013) Motion and context sensing techniques for
pen computing. In: Proceedings of graphics interface 2013, GI ’13, pp 71–78, Toronto, Ont.,
Canada. Canadian Information Processing Society
27. Hinckley K, Pahud M, Benko H, Irani P, Guimbretière F, Gavriliu M, ’Anthony’ Chen X,
Matulic F, Buxton W, Wilson A (2014) Sensing techniques for tablet+stylus interaction. In:
Proceedings of the 27th annual ACM symposium on user interface software and technology,
UIST ’14, pp 605–614, New York, NY, USA. ACM
28. Hinckley K, Yatani K, Pahud M, Coddington N, Rodenhouse J, Wilson A, Benko H, Buxton
B (2010) Pen + touch = new tools. In: Proceedings of the 23rd annual ACM symposium on
User interface software and technology, pp 27–36, New York, New York, USA. ACM
29. Howard J, Gugger S (2020) Fastai: a layered api for deep learning. Information 11(2):108
48. McIntosh J, Marzo A, Fraser M (2017) Sensir: detecting hand gestures with a wearable bracelet
using infrared transmission and reflection. In: Proceedings of the 30th annual ACM symposium
on user interface software and technology, UIST ’17, pp 593–597, New York, NY, USA.
Association for Computing Machinery
49. McIntosh J, Marzo A, Fraser M, Phillips C (2017) Echoflex: hand gesture recognition using
ultrasound imaging. In: Proceedings of the 2017 CHI conference on human factors in computing
systems, CHI ’17, pp 1923–1934, New York, NY, USA. Association for Computing Machinery
50. McIntosh J, McNeill C, Fraser M, Kerber F, Löchtefeld M, Krüger A (2016) Empress: practical
hand gesture classification with wrist-mounted emg and pressure sensing. In: Proceedings of
the 2016 CHI conference on human factors in computing systems, CHI ’16, pp 2332–2342,
New York, NY, USA. Association for Computing Machinery
51. Panteleris P, Oikonomidis I, Argyros A (2018) Using a single rgb frame for real time 3d hand
pose estimation in the wild. In: 2018 IEEE winter conference on applications of computer
vision (WACV), pp 436–445. IEEE
52. Pham D-M, Stuerzlinger W (2019) Is the pen mightier than the controller? A comparison of
input devices for selection in virtual and augmented reality. In: 25th ACM symposium on virtual
reality software and technology, VRST ’19, New York, NY, USA. Association for Computing
Machinery
53. Protalinski E (2019) Ctrl-labs ceo: we’ll have neural interfaces in less than 5 years. VentureBeat
54. Ramos G, Boulos M, Balakrishnan R (2004) Pressure widgets. In: Proceedings of the SIGCHI
conference on Human factors in computing systems, pp 487–494, Vienna, Austria. ACM
55. Rekimoto J (1997) Pick-and-drop: a direct manipulation technique for multiple computer envi-
ronments. In: Proceedings of the 10th annual ACM symposium on user interface software and
technology, UIST ’97, pp 31–39, New York, NY, USA. ACM
56. Roland T, Wimberger K, Amsuess S, Russold MF, Baumgartner W (2019) An insulated flex-
ible sensor for stable electromyography detection: application to prosthesis control. Sensors
19(4):961
57. Saponas TS, Tan DS, Morris D, Balakrishnan R (2008) Demonstrating the feasibility of using
forearm electromyography for muscle-computer interfaces. In: Proceedings of the SIGCHI
conference on human factors in computing systems, CHI ’08, pp 515–524, New York, NY,
USA. Association for Computing Machinery
58. Saponas TS, Tan DS, Morris D, Turner J, Landay JA (2010) Making muscle-computer interfaces
more practical. In: Proceedings of the SIGCHI conference on human factors in computing
systems, CHI ’10, pp 851–854, New York, NY, USA. Association for Computing Machinery
59. Schrapel M, Stadler M-L, Rohs M (2018) Pentelligence: combining pen tip motion and writing
sounds for handwritten digit recognition. In: Proceedings of the 2018 CHI conference on human
factors in computing systems, pp 1–11
60. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D (2017) Grad-cam: Visual
explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE
international conference on computer vision, pp 618–626
61. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image
recognition. arXiv:1409.1556
62. Smith B, Wu C, Wen H, Peluse P, Sheikh Y, Hodgins JK, Shiratori T (2020) Constraining dense
hand surface tracking with elasticity. ACM Trans Graph (TOG), 39(6):1–14
63. Song H, Benko H, Guimbretiere F, Izadi S, Cao X, Hinckley K (2011) Grips and gestures on
a multi-touch pen. In: Proceedings of the SIGCHI conference on human factors in computing
systems, CHI ’11, pp 1323–1332, New York, NY, USA. ACM
64. Sridhar S, Mueller F, Zollhöfer M, Casas D, Oulasvirta A, Theobalt C (2016) Real-time joint
tracking of a hand manipulating an object from rgb-d input. In: European conference on com-
puter vision, pp 294–310. Springer
65. Suzuki Y, Misue K, Tanaka J (2009) Interaction technique for a pen-based interface using
finger motions. In: Jacko JA (ed) Human-computer interaction. Novel interaction methods and
techniques, pp 503–512. Springer, Berlin Heidelberg
An Early Rico Retrospective: Three Years of Uses for a Mobile App Dataset
Abstract The Rico dataset, containing design data from more than 9.7 k Android
apps spanning 27 categories, was released in 2017. It exposes visual, textual, struc-
tural, and interactive design properties of more than 72 k unique UI screens. Over the
years since its release, the original paper has been cited nearly 100 times according
to Google Scholar and the dataset has been used as the basis for numerous research
projects. In this chapter, we describe the creation of Rico using a system that com-
bined crowdsourcing and automation to scalably mine design and interaction data
from Android apps at runtime. We then describe two projects that we conducted
using the dataset: the training of an autoencoder to identify similarity between UI
designs, and an exploration of the use of Google’s Material Design within the dataset
using machine learned models. We conclude with an overview of other work that
has used Rico to understand our mobile UI world and build data-driven models that
assist users, designers, and developers.
B. Deka
McKinsey, Chicago, IL, USA
e-mail: biplab.uiuc@gmail.com
B. Doosti · C. Franzen · D. Afergan · Y. Li · T. Dong
Google, Inc., Mountain View, CA, USA
e-mail: bardiad@google.com
C. Franzen
e-mail: cfranzen@google.com
D. Afergan
e-mail: afergan@google.com
Y. Li
e-mail: liyang@google.com
T. Dong
e-mail: taodong@google.com
F. Huang
University of California, Berkeley, CA, USA
e-mail: forrest_huang@berkeley.edu
J. Hibschman
Northwestern University, Evanston, IL, USA
e-mail: jh@u.northwestern.edu
R. Kumar
University of Illinois at Urbana-Champaign, Champaign, IL, USA
e-mail: ranjitha@illinois.edu
J. Nichols
Apple, Inc., Seattle, WA, USA
e-mail: jeff@jeffreynichols.com
1 Introduction
We created the Rico1 dataset and released it publicly in 2017 [21]. We believe it is
still the largest repository of Android mobile app designs, comprising visual, textual,
structural, and interactive property data for 72,219 UIs from 9,772 apps, spanning
27 Google Play categories. For each app, Rico presents a collection of individual
user interaction traces, as well as a collection of unique UIs determined by a novel
content-agnostic similarity heuristic. In total, the dataset contains more than 72 k
unique UI screens.
To download the Rico dataset and learn more about the project, please visit http://interactionmining.org/rico.
To understand what makes Rico unique, it is helpful to consider it in the context of
other Android app datasets. Existing datasets expose different kinds of information:
Google Play Store metadata (e.g., reviews, ratings) [2, 25], software engineering and
security related information [24, 52], and design data [7, 22, 45]. Rico captures both
design data and Google Play Store metadata.
Mobile app designs comprise several different components, including user inter-
action flows (e.g., search, login), UI layouts, visual styles, and motion details. These
components can be computed by mining and combining different types of app data.
For example, combining the structural representation of UIs—Android view hierar-
chies [3]—with the visual realization of those UIs—screenshots—can help explicate
app layouts and their visual stylings. Similarly, combining user interaction details
with view hierarchies and screenshots can help identify the user flows that apps are
designed to support.
Figure 1 compares Rico with other popular datasets that expose app design infor-
mation. Design datasets created by statically mining app packages contain view
hierarchies, but cannot capture data created at runtime such as screenshots or inter-
action details [7, 45]. ERICA’s dataset, on the other hand, is created by dynamically
mining apps and captures view hierarchies, screenshots, and user interactions [22].
Like the ERICA dataset, Rico is created by mining design and interaction data
from apps at runtime. Rico’s data was collected via a combination of human-powered
and programmatic exploration, as shown in Fig. 2. Also like ERICA, Rico’s app
Fig. 2 Rico is a design dataset with 72 k UIs mined from 9.7 k free Android apps using a combination
of human and automated exploration. The dataset can power a number of design applications,
including ones that require training state-of-the-art machine learning models
designs, and an exploration of the use of Google’s Material Design within the dataset
using machine learned models. We conclude with an overview of other work that
has used Rico to understand our mobile UI world and build data-driven models that
assist users, designers, and developers.
2 Collecting Rico
To create Rico, we developed a platform that mines design data from Android apps
at runtime by combining human-powered and programmatic exploration. Humans
rely on prior knowledge and contextual information to effortlessly interact with a
diverse array of apps. Apps, however, can have hundreds of UI states, and human
exploration clusters around common use cases, achieving low coverage over UI
states for many apps [10, 22]. Automated agents, on the other hand, can be used to
exhaustively process the interactive elements on a UI screen [13, 50]; however, they
can be stymied by UIs that require complex interaction sequences or human inputs
(Fig. 3) [8].
We developed a hybrid approach for design mining mobile apps that combines the
strengths of human-powered and programmatic exploration: leveraging humans to
unlock app states that are hidden behind complex UIs and using automated agents
to exhaustively process the interactive elements on the uncovered screens to dis-
cover new states. The automated agents leverage a novel content-agnostic similarity
heuristic to efficiently explore the UI state space. Together, these approaches achieve
higher coverage over an app’s UI states than either technique alone.
Fig. 3 Automated crawlers are often stymied by UIs that require complex interaction sequences,
such as the three shown here
Fig. 4 Our crowd worker web interface. On the left, crowd workers can interact with the app
screen using their keyboard and mouse. On the right are the provided instructions and details
such as the name, location, phone number, and email address to use in the app. The interface also
allows workers to access SMS and email messages sent to the provided phone number and email
to complete app verification processes
To move beyond the set of UI states uncovered by humans, Rico employs an auto-
mated mining system. Existing automated crawlers hard-code inputs for each app
to unlock states hidden behind complex UIs [8, 33]. We achieve a similar result by
leveraging the interaction data contained within the collected user traces: when the
crawler encounters an interface requiring human input, it replays the interactions that
a crowd worker performed on that screen to advance to the next UI state.
Similar to prior work [8, 33], the automated mining system uses a depth-first
search strategy to crawl the state space of UIs in the app. For each unique UI, the
crawler requests the view hierarchy to identify the set of interactive elements. The
system programmatically interacts with these elements, creating an interaction graph
that captures the unique UIs that have been visited as nodes and the connections
between interactive elements and their resultant screens as edges. This data structure
also maintains a queue of unexplored interactions for each visited UI state. The
system programmatically crawls an app until it hits a specified time budget or has
exhaustively explored all interactions contained within the discovered UI states.
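The following sketch outlines such a crawl loop: it maintains an interaction graph and a queue of unexplored interactions per state, and falls back to replaying a recorded human trace when needed. The `device` driver and its methods are hypothetical placeholders, not Rico's actual crawler API, and `same_state` stands for the similarity heuristic described next.

```python
# Skeleton of a crawl that builds an interaction graph over UI states.
# `device` is a hypothetical driver exposing screen capture, tapping, and
# trace replay; `same_state` is the content-agnostic similarity heuristic.
import time

def crawl(device, same_state, time_budget_s=3600):
    graph = {}          # state_id -> list of (element, resulting state_id)
    frontier = {}       # state_id -> unexplored interactive elements
    states = []         # representative screen per discovered state

    def state_id_of(screen):
        for sid, rep in enumerate(states):
            if same_state(rep, screen):
                return sid
        states.append(screen)
        sid = len(states) - 1
        graph[sid] = []
        frontier[sid] = list(screen.interactive_elements())
        return sid

    start = time.time()
    current = state_id_of(device.capture_screen())
    while time.time() - start < time_budget_s:
        if not frontier[current]:
            if all(not queue for queue in frontier.values()):
                break                          # all interactions exhausted
            current = next(s for s, q in frontier.items() if q)
            device.replay_trace_to(current)    # reuse a recorded human trace
            continue
        element = frontier[current].pop()      # depth-first: explore from here
        device.tap(element)
        new = state_id_of(device.capture_screen())
        graph[current].append((element, new))
        current = new
    return graph
```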
After Rico’s crawler interacts with a UI element, it must determine whether the
interaction led to a new UI state or one that is already captured in the interaction
graph. Database-backed applications can have thousands of views that represent the
same semantic concept and differ only in their content (Fig. 5). Therefore, we employ
a content-agnostic similarity heuristic to compare UIs.
Fig. 5 Pairs of UI screens from apps that are visually distinct but have the same design. Our content-
agnostic similarity heuristic uses structural properties to identify these sorts of design collisions
This similarity heuristic compares two UIs based on their visual and structural
composition. If the screenshots of two given UIs are identical in at least α percent of their pixels, they
are treated as equivalent states. Otherwise, the crawler compares the set of element
resource-ids present on each screen. If these sets differ by more than β elements,
the two screens are treated as different states.
We evaluated the heuristic with different values of α and β on 1,044 pairs of UIs
from 12 apps. We found that α = 99.8% and β = 1 produce a false positive rate of
6% and a false negative rate of 3%. We use these parameter values for automated
crawling and computing the set of unique UIs for a given app.
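A minimal sketch of this two-step heuristic is shown below, assuming each UI object exposes its screenshot and the set of resource-ids from its view hierarchy; the exact image comparison used in Rico may differ.

```python
# Minimal sketch of the content-agnostic similarity heuristic: two UIs are the
# same state if their screenshots are almost pixel-identical, or if the sets of
# element resource-ids in their view hierarchies differ by at most BETA.
import numpy as np

ALPHA = 0.998   # fraction of identical pixels required for a visual match
BETA = 1        # max number of differing resource-ids for a structural match

def same_state(ui_a, ui_b, alpha=ALPHA, beta=BETA):
    a = np.asarray(ui_a.screenshot)            # (H, W, 3) screenshots assumed
    b = np.asarray(ui_b.screenshot)
    if a.shape == b.shape:
        identical = np.mean(np.all(a == b, axis=-1))
        if identical >= alpha:
            return True                        # visually equivalent
    ids_a, ids_b = set(ui_a.resource_ids), set(ui_b.resource_ids)
    return len(ids_a ^ ids_b) <= beta          # structurally equivalent
```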
Fig. 6 The Android apps used in our evaluation. Each had a rating higher than 4 stars (out of 5)
and more than 1 M downloads on the Google Play store
Fig. 7 The performance of our hybrid exploration system compared to human and automated
exploration alone, measured across ten diverse Android apps
The number of Android activities discovered is sometimes used to measure coverage, but developers do
not use them consistently: in practice, complex apps can have the same number of
activities as simple apps. In contrast, we use a coverage measure that correlates with
app complexity: computing coverage as the number of unique UIs discovered under
the similarity heuristic.
Figure 7 presents the coverage benefits of a hybrid system: combining human
and automated exploration increases UI coverage by an average of 40% over human
exploration alone and discovered several new Android activities for each app. For
example, on the Etsy app, our hybrid system uncovered screens from 7 additional
Activities beyond the 18 discovered by human exploration.
We also evaluated the coverage of the automated system in isolation, without
bootstrapping it with a human trace. The automated system achieved 26% lower
coverage across the tested apps than Rico’s hybrid approach. This poor performance
is largely attributable to experiences that are gated behind a login screen or paywall,
which our purely automated approach cannot handle. For instance, Todoist and WeHeartIt
hide most of their features behind a login wall.
The Rico dataset comprises 10,811 user interaction traces and 72,219 unique UIs
from 9,772 Android apps spanning 27 categories (Fig. 8). We excluded from our
crawl categories that primarily involve multimedia (such as video players and photo
Fig. 8 Summary statistics of the Rico Dataset: app distribution by a category, b average rating, and
c number of mined interactions. d The distribution of mined UIs by number of interactive elements
editors) as well as productivity and personalization apps. Apps in the Rico dataset
have an average rating of 4.1 stars and, on average, data pertaining to 26 user interactions.
To create Rico, we downloaded 9,772 free apps from the Google Play Store and
crowdsourced user traces for each app by recruiting 13 workers (10 from the US, 3
from the Philippines) on UpWork. We chose UpWork over other crowdsourcing plat-
forms because it allows managers to directly communicate with workers: a capability
that we used to resolve any technical issues that arose during crawling. We instructed
workers to use each app as it was intended based on its Play Store description for no
longer than 10 min.
In total, workers spent 2,450 h using apps on the platform over five months,
producing 10,811 user interaction traces. We paid US $19,200 in compensation,
or approximately two dollars per app, to crowdsource the usage data. To ensure
high-quality traces, we visually inspected a subset of each user’s submissions. After
collecting each user trace for an app, we ran the automated crawler on it for one hour.
For each app, Rico exposes Google Play Store metadata, a set of user interaction
traces, and a list of all the unique discovered UIs through crowdsourced and auto-
mated exploration. The Play Store metadata includes an app’s category, average
rating, number of ratings, and number of downloads. Each user trace is composed
of a sequence of UIs and user interactions that connect them. Each UI comprises a
screenshot and its corresponding view hierarchy.
Having created the Rico dataset, we then made use of it in several different projects.
Two of these projects are described here: the training of a UI layout embedding using
an autoencoder, and an investigation of the usage of Material Design. We also built
a system called Swire [27], which allowed users to investigate the Rico dataset by
sketching a full or partial user interface and using that sketch as a search query over
the dataset. Swire is described in more detail in Chapter 12 of this book.
The large size of the Rico dataset makes it difficult to browse comprehensively, so in
this work, we set out to create a method that would allow users to search the dataset.
As a starting point, we allow users to use a screenshot of a mobile UI as their query.
Since the Rico dataset is large and comprehensive enough to support deep learning
applications, we trained a deep autoencoder to learn an embedding for UI layouts
and used it to annotate each UI with a 64-dimensional vector representation encoding
visual layout. This vector representation can be used to compute structurally—and
often semantically—similar UIs, supporting example-based search over the dataset
(see figures in the original Rico paper [21]).
An autoencoder is a neural network that involves two models—an encoder and
a decoder—to support the unsupervised learning of lower-dimensional represen-
tations [12]. The encoder maps its input to a lower-dimensional vector, while the
decoder maps this lower-dimensional vector back to the input’s dimensions. Both
models are trained together with a loss function based on the differences between
inputs and their reconstructions. Once an autoencoder is trained, the encoder portion
is used to produce lower-dimensional representations of the input vectors.
Fig. 9 We train an autoencoder to learn a 64-dimensional representation for each UI in the repos-
itory, encoding structural information about its layout. This is accomplished by creating training
images that encode the positions and sizes of elements in each UI, differentiating between text and
non-text elements
To create training inputs for the autoencoder that embed layout information, we
constructed a new image for each UI encoding the bounding box regions of all leaf
elements in its view hierarchy, differentiating between text and non-text elements
(Fig. 9). Rico’s view hierarchies obviate the need for noisy image processing or OCR
techniques to create these inputs. In the future, we could incorporate more recent
work on predicting functional semantic labels [37] for elements such as search icon
or login button to train embeddings with even richer semantics.
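One way to rasterise such training inputs is sketched below: leaf-element bounding boxes are drawn into a small two-channel grid, one channel for text and one for non-text elements. The 56 × 100 × 2 grid is an assumption chosen only because it matches the 11,200-dimensional encoder input reported below, and the field names on the leaf elements are placeholders.

```python
# Minimal sketch: render a UI's leaf elements into a layout image that encodes
# element positions and sizes, with separate channels for text and non-text
# elements. Field names on `leaf` (bounds, is_text) are assumptions.
import numpy as np

def layout_image(leaves, screen_w, screen_h, out_w=56, out_h=100):
    img = np.zeros((out_h, out_w, 2), dtype=np.float32)   # [text, non-text]
    sx, sy = out_w / screen_w, out_h / screen_h
    for leaf in leaves:
        x0, y0, x1, y1 = leaf.bounds                       # pixels on screen
        ch = 0 if leaf.is_text else 1
        # Scale to the output grid; max() keeps tiny elements at least 1 cell.
        r0, r1 = int(y0 * sy), max(int(y1 * sy), int(y0 * sy) + 1)
        c0, c1 = int(x0 * sx), max(int(x1 * sx), int(x0 * sx) + 1)
        img[r0:r1, c0:c1, ch] = 1.0
    return img.reshape(-1)          # 56 * 100 * 2 = 11,200 values per UI
```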
The encoder has an input dimension of 11,200 and an output dimension of 64 and
uses two hidden layers of dimension 2,048 and 256 with ReLU non-linearities [41].
The decoder has the reverse architecture. We trained the autoencoder with 90%
of our data and used the rest as a validation set, and we found that the validation
loss stabilized after 900 epochs or approximately 5 h on an Nvidia GTX 1060 GPU.
Once the autoencoder was trained, we used the encoder to compute a 64-dimensional
representation for each UI, which we expose as part of the Rico dataset.
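The architecture can be reproduced in a few lines of Keras; the sketch below matches the reported dimensions (11,200 → 2,048 → 256 → 64 with a mirrored decoder), but the optimizer, loss, and batch size are assumptions since the chapter does not specify them.

```python
# Sketch of the layout autoencoder with the reported dimensions.
# Optimizer and loss are assumptions; only the architecture comes from the text.
import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(11200,))
h = layers.Dense(2048, activation="relu")(inputs)
h = layers.Dense(256, activation="relu")(h)
code = layers.Dense(64, name="embedding")(h)             # 64-d UI representation

h = layers.Dense(256, activation="relu")(code)
h = layers.Dense(2048, activation="relu")(h)
outputs = layers.Dense(11200, activation="sigmoid")(h)   # reconstruct the input

autoencoder = Model(inputs, outputs)
encoder = Model(inputs, code)                 # used afterwards to embed each UI
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# autoencoder.fit(x_train, x_train, validation_data=(x_val, x_val),
#                 epochs=900, batch_size=128)
# embeddings = encoder.predict(layout_images)   # one 64-d vector per UI
```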
Figure 10 shows several example query UIs and their nearest neighbors in the
learned 64-dimensional space. The results demonstrate that the learned model is
able to capture common mobile and Android UI patterns such as lists, login screens,
dialog screens, and image grids. Moreover, the diversity of the dataset allows the
model to distinguish between layout nuances, like lists composed of smaller and
larger image thumbnails.
Fig. 10 The top six results obtained from querying the repository for UIs with similar layouts to
those shown on the left, via a nearest-neighbor search in the learned 64-dimensional autoencoder
space. The returned results share a common layout and even distinguish between layout nuances
such as lists composed of smaller and larger image thumbnails (a, b)
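Once every UI has been embedded, example-based search reduces to a nearest-neighbour query in the 64-dimensional space, as in the brief sketch below (scikit-learn is used here purely for illustration).

```python
# Sketch of example-based UI search: nearest neighbours of a query UI in the
# learned 64-dimensional embedding space.
from sklearn.neighbors import NearestNeighbors

def build_index(embeddings):                   # embeddings: (n_uis, 64) array
    return NearestNeighbors(n_neighbors=6, metric="euclidean").fit(embeddings)

def query_similar(index, query_embedding):
    dists, idxs = index.kneighbors(query_embedding.reshape(1, -1))
    return list(zip(idxs[0], dists[0]))        # (UI index, distance) pairs
```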
In this work, we leverage the Rico dataset to understand how Material Design has been used
on mobile devices.
Pattern languages have been long used in Human–Computer Interaction (HCI) for
distilling and communicating design knowledge [14, 20]. According to Christopher
Alexander [6], who introduced pattern-based design to architecture, “each pattern
describes a problem which occurs over and over again in our environment, and then
describes the core of the solution to that problem, in such a way that you can use this
solution a million times over, without ever doing it the same way twice”.
HCI researchers and design practitioners have documented and introduced pattern
languages for general UI design (e.g., [42, 51]) as well as a wide variety of application
domains, such as learning management systems [9], ubiquitous computing [30],
information retrieval [53], and many more. Nonetheless, as Dearden and Finlay point
out, there have been relatively few evaluations of how useful pattern languages are in
user interface design [20]. Since Dearden and Finlay published their critical review,
more evaluations have been done on pattern languages in HCI (e.g., [18, 46, 53]).
But these evaluations are usually limited in at least one of several ways. First, the
pattern languages in those evaluations were often developed in an academic research
setting. Few have been applied to real-world applications. Second, the evaluations
were usually done in lab settings and hence lacked ecological validity. Last, those
evaluations were done at a very small scale (i.e., applying a pattern language to either
one or no more than a handful of systems). As a result of the limitations of how the
field has evaluated pattern languages, we know little about whether pattern languages
in HCI are fulfilling the promise of Alexander—providing design solutions that can
be reused “a million times over”.
The recent success of commercial UI design pattern languages offers a rare oppor-
tunity for us to evaluate the usefulness of pattern languages in HCI at scale and in the
wild. In particular, Material Design seems to have been widely adopted by developers
who build applications for Google’s Android operating system. How can we under-
stand the impact of a pattern language in one of the largest computing ecosystems
in the world? This is the first research question we seek to answer in this project.
In addition to developing a method for measuring a pattern language’s overall
impact, we also want to address questions about how and where certain patterns
should be used when they get applied to new use cases. For Material Design, few
patterns have been more controversial, yet at the same time iconic, than the Floating
Action Button (aka, FAB) and the Navigation Drawer (i.e., the hamburger menu).
Tens of thousands of words have been written about the merits and more often the
downsides of these two patterns (e.g., [11, 28, 48] for FAB and [4, 19, 43] for Ham-
burger Menu). Sometimes, the conclusions are daunting. For example, one online
critic said, “...in actual practice, widespread adoption of FABs might be detrimen-
tal to the overall UX of the app” [48]. Even when the criticisms are moderate and
well-reasoned, they are based on the writer’s examination of a limited number of
examples. It is hard to know whether these criticisms reflect the full picture, since
these patterns are likely to be used in a huge number of different apps. Thus, the
second research question driving this work is: How can we examine real-world use
of design patterns to inform debates about UI design?
There were two general stages in our data analysis. First, we detected Material Design
elements in a large number of mobile apps. We focused on six elements in Material
Design: App Bar, Floating Action Button, Bottom Navigation, Navigation Drawer,
Snack Bar, and Tab Layout. Second, we looked for relationships between usage of
Material Design elements and app popularity as well as app category.
The main challenge in our data analysis was to reliably detect Material Design
elements, which lacked standardized names in the app view hierarchies. Therefore, to
detect as many Material Design elements as possible from an app (either implemented
with official or unofficial Material Design libraries), we leveraged the pixel data from
the apps’ screenshots in the Rico dataset.
Specifically, we used computer vision techniques such as deep Convolutional
Neural Networks (CNNs) to detect Material Design elements from screenshots. To
train these models, we needed positive and negative cropped snapshots of Material
Design elements as training data. To collect the ground truth data, we turned to those
apps that implemented their UIs using the official Material Design library. These
elements were easy to find by class name in the view hierarchy JSON files. In order
to train a good classifier, negative examples of each element also need to be collected
from a relevant location and should not be cropped from a random part of the screen.
To this end, we created a heatmap for each type of element based on the apps using
the official library. With these heatmaps, we cropped the UI regions which did not
use Material Design elements. Therefore, for the screenshots which did not include a
Material Design element, we cropped the screenshot at the most probable region of
the screen (according to the heatmap) for that element to generate negative examples.
Figure 11 shows the heatmaps of each Material Design element in the Rico dataset.
Fig. 11 Heatmaps showing, for each Material Design element, how frequently it occurs at each
screen location in the Rico dataset, normalised by the maximum value
After collecting a set of images for each Material Design element, we used deep
Convolutional Neural Networks to detect Material Design elements in the apps. We
trained a separate classifier for each Material Design element to detect that element in
an app’s screenshot. We selected the AlexNet [29] architecture for our Convolutional
Neural Networks. We trained all the networks from scratch with a learning rate of
0.001 for 50,000 iterations. We split our data into 80% training, 10% validation, and
10% testing, and achieved at least 95% accuracy for each com-
ponent. We used Google’s open-source machine learning platform TensorFlow [5]
to implement all our machine learning models. For detailed information about our
methodology, please refer to [23].
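For illustration, the sketch below builds an AlexNet-style binary classifier of the kind described above, one per Material Design element. Only the learning rate and the per-element binary setup come from the text; the layer sizes, input resolution, and optimizer are assumptions.

```python
# Sketch of a per-element binary classifier in the spirit of the AlexNet-based
# detectors described above (one model is trained per Material Design element).
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

def build_element_classifier(input_shape=(227, 227, 3)):
    model = models.Sequential([
        layers.Conv2D(96, 11, strides=4, activation="relu",
                      input_shape=input_shape),
        layers.MaxPooling2D(3, strides=2),
        layers.Conv2D(256, 5, padding="same", activation="relu"),
        layers.MaxPooling2D(3, strides=2),
        layers.Conv2D(384, 3, padding="same", activation="relu"),
        layers.Conv2D(384, 3, padding="same", activation="relu"),
        layers.Conv2D(256, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(3, strides=2),
        layers.Flatten(),
        layers.Dense(4096, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(4096, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),   # element present / absent
    ])
    model.compile(optimizer=optimizers.SGD(learning_rate=0.001),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# e.g. one classifier per element: fab_model = build_element_classifier()
```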
Our data analysis led to a number of interesting findings about Material Design’s
usage in the wild. We first report the usage of specific elements such as the Floating
Action Button and the Navigation Drawer, two popular but somewhat controversial
patterns in Material Design, and how the usage of these elements relate to app ratings,
installs, and categories. We then report the usage of Material Design in general and
examine its impact on app popularity.
If the drawbacks of FABs generally outweigh their benefits, as some design critics
argued, one would assume that higher-rated apps would be less likely to use FABs
than lower-rated apps. To test this hypothesis, we split apps into two groups: a
high-rating group and a low-rating group by the median average rating of all apps in
the Play Store dataset, which was 4.16 at the time of this analysis. The two groups
of apps were balanced, with 4,673 apps in each group.
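The group comparison itself is straightforward; a sketch with hypothetical column names (`avg_rating`, `has_fab`) is shown below.

```python
# Sketch of the rating-group comparison: split apps at the median average
# rating and compare the share of apps using the FAB in each group.
# The DataFrame columns (`avg_rating`, `has_fab`) are hypothetical.
import pandas as pd

def fab_usage_by_rating_group(apps: pd.DataFrame):
    median_rating = apps["avg_rating"].median()          # 4.16 in our analysis
    high = apps[apps["avg_rating"] >= median_rating]
    low = apps[apps["avg_rating"] < median_rating]
    return {
        "high_rating_fab_pct": 100 * high["has_fab"].mean(),
        "low_rating_fab_pct": 100 * low["has_fab"].mean(),
    }
```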
Fig. 12 a The percentage of apps using the Floating Action Button (FAB) in the high-rating group
versus the low-rating group. b Box plots of the average ratings of apps using the FAB versus those
not using the FAB. c The percentage of apps using the FAB in the more-installed group versus the
less-installed group. d Box plots of the number of installs of apps using the FAB versus those not
using the FAB
As Fig. 12a shows, there was actually a higher percentage of apps using FABs
in the high-rating group than those in the low-rating group (13.4% vs. 6.6%). The
box plots in Fig. 12b further show that apps using the FAB were rated higher than
those that did not use it. In fact, 66.6% of apps that used the FAB belonged to the
high-rating group.
We also used the number of installs as another measure of app popularity. Thus,
we decided to split our apps into two groups: (1) apps with greater than or equal to 1
million installs, and (2) apps with less than 1 million installs. The two groups were
nearly balanced after the split, with 4723 in the more-installed group and 4623 in the
less-installed group. Similar to what we saw in the analysis of FAB usage and app
ratings, apps in the more-installed group appeared to be more likely to feature FABs
than those in the less-installed group (see Fig. 12c). Also, apps using the FAB had a
larger number of installs in comparison to apps without the FAB (see Fig. 12d).
The results above suggest that many developers of popular apps consider the FAB
to be a valuable design pattern. Nonetheless, it’s still possible that the FAB is a more
useful pattern in some situations than others.
To understand where the FAB might be more useful, we examined the usage of the
FAB by app category. Figure 13 shows the top 11 app categories by the percentage of
apps featuring the FAB, excluding categories for which there were too few apps in
the Rico dataset (less than 0.05% of the apps of that category in Google Play). As
the figure shows, FAB usage varied considerably among these 11 categories of apps.
The Food and Drink category had the highest percentage of FAB usage among all
the qualified categories. Figure 14 shows some of the FABs in the Food and Drink
category. Each thumbnail belongs to a different app, but FABs with similar icons recur
across the category, suggesting common uses of the FAB such as proposing recipes (the
“fork” FAB) and locating nearby restaurants (the “location” FAB). Note that some of the
thumbnails in this figure do not appear to include a FAB because it is occluded by another
UI component.
Fig. 13 The top 11 app categories ranked by the percentage of apps in each category that use the FAB
Fig. 15 a The percentage of apps using the Navigation Drawer in the high-rating group versus the
low-rating group. b Box plots of the average ratings of apps using the Navigation Drawer versus
those not using the Navigation Drawer. c The percentage of apps using the Navigation Drawer in
the more-installed group versus the less-installed group. d Box plots of the number of installs of
apps using the Navigation Drawer versus those not using the Navigation Drawer
We applied the same analysis to examine the usage of the Navigation Drawer in
Material Design. As we can see in Fig. 15a, there were more apps in the high-rating
group which had a Navigation Drawer than those in the low-rating group (7.3% vs.
3.9%). The box plots in Fig. 15b show that the average rating for apps using the
Navigation Drawer was higher than for those that did not use it. Among all the apps that
used the Navigation Drawer, 65% of them belonged to the high-rating group.
Similar to our analysis of the FAB usage, we examined the usage of the Navigation
Drawer and the number of app installs. As shown in Fig. 15c, apps in the more-
installed group were slightly more likely to feature a Navigation Drawer than those in
the less-installed group. Also, the box plots in Fig. 15d show that apps using the Navigation
Drawer had a slightly higher number of installs.
Fig. 16 Distribution of the percentage of apps using at least one of the six common Material Design
elements over percentiles in average rating (blue) and number of installs (orange)
To relate Material Design usage to app popularity, we sorted apps by their average rating, split them into one hundred equal-sized buckets by percentile, and computed Material Design usage over average rating percentiles. As we can see in Fig. 16, the
usage of Material Design was highly correlated with the average rating percentile
(with the Pearson correlation coefficient ρ = 0.99 and p-value = 3.1 × 10^-91). In
other words, as the average rating increased, the percentage of apps using Material
Design also increased.
Next, we examined the relationship of the usage of Material Design and the number
of installs, an alternate measure of app popularity. As in the previous step, we sorted
apps by their number of installs and split them into one hundred equal-sized buckets
by percentile. As shown in Fig. 16, the percentage of apps using Material Design is
also highly correlated with the number of installs (ρ = 0.94, p-value = 2.3 × 10^-47).
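A sketch of this percentile analysis is shown below, with hypothetical column names (`num_installs`, `uses_material`); the bucketing and correlation follow the procedure described above.

```python
# Sketch of the percentile analysis: sort apps by a popularity measure, split
# them into 100 equal-sized buckets, compute the share of apps using Material
# Design in each bucket, and correlate that share with the percentile.
# Column names (`num_installs`, `uses_material`) are hypothetical.
import pandas as pd
from scipy.stats import pearsonr

def material_usage_vs_percentile(apps: pd.DataFrame, measure="num_installs"):
    apps = apps.sort_values(measure).reset_index(drop=True)
    apps["bucket"] = pd.qcut(apps.index, 100, labels=False)   # percentile buckets
    usage = apps.groupby("bucket")["uses_material"].mean() * 100
    rho, p_value = pearsonr(usage.index.to_numpy(), usage.to_numpy())
    return usage, rho, p_value
```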
4.2.6 Summary
To sum up, we review the two research questions we set out to answer. Our first
research question was How can we understand the impact of a pattern language in
one of the largest computing ecosystems in the world? We developed a computational
method to measure the relationship between Material Design, the pattern language
in question, and app popularity in the Android ecosystem. We used Convolutional
Neural Networks as a data mining tool to analyze big UI data. We trained multiple
models to detect Material Design elements in apps’ screenshots.
Our second research question was How can we examine real-world use of design
patterns to inform debates about UI design? To answer this question, we examined
the usage of the Floating Action Button and the Navigation Drawer, two frequently
criticized patterns in online design discussions. While our results do not directly
rebut specific arguments against these two patterns, they clearly show that many
developers and designers find these two patterns valuable and use both patterns
frequently in their highly rated and highly popular apps. Moreover, our results have
suggested that evaluating the merit of a design pattern should consider the context it
is applied to. For example, developers used the FAB more frequently in certain app
categories such as Food and Drink and Parenting than others.
Over the last three years since its release, Rico has been used by many research teams
as the basis for their own projects. In this section, we attempt to categorize the uses
that Rico has seen to date and highlight a few projects of interest.
We found in our exploration of these use cases that they could be broadly catego-
rized into five areas: (a) mobile ecosystem explorations, (b) UI automation, (c) design
assistance, (d) understanding UI semantics, and (e) automated design. In addition,
efforts in many of these areas have also enhanced the Rico dataset itself,
which we will discuss separately. We describe the research in these areas below.
The research summarized in this section attempts to understand the Android app
ecosystem by using the Rico dataset as a sample. Our own work described in Sect. 4.2
falls under this category, where we used Rico to explore the usage of the Material
Design pattern language in the Android app ecosystem.
Micallef et al. used Rico to study the use of login functionality in Android apps
[39]. They found that 32% of apps use at least one kind of login functionality and 9%
provided at least one social login (e.g., Facebook, Google). They found no correlation
between the usage of login features and the number of app downloads or app ratings.
Ross et al. investigated the state of mobile app accessibility using a subset of
the Rico dataset [44]. They specifically focused on the accessibility of image-based
buttons and studied the prevalence of accessibility issues due to missing labels,
duplicate labels, or uninformative labels. They discovered a bi-modal distribution
of missing labels in apps with 46% of apps having less than 10% (46%) of their
image-based buttons labeled and 36% of apps having more than 90% labeled. The
correlation between accessibility, as measured by missing labels, and app ratings was
found to be weak.
5.2 UI Automation
Several works have used Rico to develop mobile UI automation tools and techniques.
One popular use of UI automation is for helping end users. Li et al. looked at task
automation in which natural language instructions must be interpreted as a sequence
of actions on a mobile touch-screen UI [34]. Toward that end, they used a subset
of the Rico dataset, enhanced it with action commands, and created the RicoSCA
dataset. This dataset, along with two other datasets, allowed them to develop models
to map multi-step instructions into automatically executable actions given the screen
information. Sereskeh et al. developed a programming-by-demonstration system for
smartphone task automation called VASTA [47]. They used the Rico dataset to train
a detector for UI elements that powered the vision capabilities of their system.
Another popular use of UI automation is for testing. Li et al. developed Humanoid,
a test input generator for Android apps [36]. They use the interactions traces from the
Rico dataset to train a deep learning-based model of how humans interact with app
screens. This model is then used to guide Humanoid’s test input generation. Zhang et
al. used Rico to train a deep learning-based model for identifying isomorphic GUIs,
which are GUIs that may have different contents but represent the same screen
semantically [55]. Although they intend to use such identification to enable robotic
testing for mobile GUIs, this feature could also be useful for crawling mobile apps.
Section 2.3 describes how we handled the identification of isomorphic GUIs during
data collection for the Rico dataset.
Another popular use of the Rico dataset is to develop data-driven tools to assist
with mobile UI design. An example of such a tool is a search engine for finding
example UIs of interest. Such search engines would enable designers to use relevant
examples early on in the design process for inspiration and to guide their design
process. Section 4.1 describes how we used Rico to train an autoencoder to learn an
embedding for UI layouts and used it to demonstrate an example-based search for
UIs which gives the user the ability to search for UIs similar to a UI of interest. Chen
et al. collected an Android UI dataset similar to Rico and used it to train a CNN
to enable searching UIs based on wireframes [15]. Huang et al. developed Swire, a
UI retrieval system that can be queried by using hand-drawn sketches of UIs [27].
This was enabled by training a deep neural network to learn a sketch-screenshot
embedding space for UIs in the Rico dataset and performing a nearest-neighbor
search in that space. Swire is also described in Chapter 12 of this book.
Another set of data-driven tools attempt to provide feedback and guide design-
ers, especially novice designers, during the design process. Lee et al. developed
GUIComp, a tool that provides real-time feedback to designers including showing
other relevant examples, predicting user attention characteristics of the design, and
showing design complexity metrics [31]. They used a subset of the Rico dataset as
a basis for their tool and trained an autoencoder to find similar UIs following an
approach similar to that described in Sect. 4.1. Wu et al. developed a tool to predict
user engagement based on the types of animations used within the app [54]. Their
approach was enabled by training a deep learning model on the animations released
as part of the Rico dataset. Finally, Swearngin et al. built upon the Rico dataset to
create a new dataset for mobile interface tappability using crowdsourcing and then
computationally investigated a variety of signals that are used by typical users to
distinguish tappable versus not-tappable elements [49].
Several recent works have also used the Rico dataset to develop approaches for
developing a taxonomy of UI elements and then building detectors for different UI
element types found in mobile UIs. Liu et al. identified 25 semantic concepts that
are commonly implemented by UI elements, such as next, create, and delete. They
then trained a Convolutional Neural Network (CNN) to detect the corresponding UI
elements in app screens and used it to annotate the elements in the Rico dataset. These
semantic annotations are now available for download as part of the Rico dataset.
Moran et al. mined a dataset of app screens similar to Rico and used the resulting
dataset to develop an automated approach for converting GUI mockups to imple-
mented code [40]. To do that, they too developed techniques to detect and classify
different UI elements found in UIs. Chen et al. used the Rico dataset to perform a
large-scale empirical study of seven representative GUI element detection methods
on over 50k GUI images [16].
Finally, Li et al. collected natural language descriptions, called captions, for ele-
ments in the Rico dataset and used it to train models that generate captions for UI
elements (useful for accessibility and language-based interactions) [35]. In this work,
they also augmented Rico with 12k newly crawled UI screens.
Another area where a UI dataset is essential is for the development of methods for
the automated generation of UIs. Lee et al. developed the Neural Design Network,
an approach to generate a graphic design layout given a set of components with
user-specified attributes and constraints [32]. Gupta et al. developed the Layout-
Transformer, a technique that leverages a self-attention-based approach to learn con-
textual relationships between layout elements and generate new layouts [26]. Both
these works use the Rico dataset to test their approaches for mobile UIs.
Several of the research projects discussed above have also enhanced the Rico dataset
with new annotations or additional screens. Liu et al. added semantic annotations
for UI elements (e.g., delete, save, search) to the Rico dataset [37]. This was
accomplished by (a) iterative open coding of 73k UI elements and 720 screens, (b)
training a convolutional neural network that distinguishes between icon classes, and
(c) using that network to compute semantic annotations for the 72k unique UIs in the
Rico dataset, assigning labels for 78% of the total visible, non-redundant elements.
Leiva et al. released Enrico, a curated dataset of 1460 mobile UIs drawn from
Rico that are tagged with one of 20 different semantic topics (e.g., gallery, chat,
etc.) [38]. This was accomplished by using human annotators to (a) systematically
identify popular UI screens that have consistent screenshots and view hierarchies in
10k randomly drawn screens from the Rico dataset, (b) create a taxonomy with 19
topics that accurately represented these popular UI screens, and (c) assign each UI
to a topic.
For crowd-based crawling of apps, new tools offer additional support to
crowd workers. For example, Chen et al. developed tools that offer guidance to
crowd explorers that can reduce redundant exploration and increase coverage [17].
6 Discussion
To our knowledge, the Rico dataset remains the largest repository of Android mobile
app designs, and it has been used by many research teams worldwide to facilitate
our understanding of the Android mobile app ecosystem and to create tools and
technologies that advance our use of mobile user interfaces. We are happy that we
chose to release the dataset publicly and are impressed with the follow-on work that
has been done as a result. We hope that others who augment Rico or create their
own new datasets will likewise make them publicly available, as this helps the entire
research community. One challenge for the future is how to aggregate these new
additions and new datasets into a single accessible place, as today those that have
been released are shared in a variety of locations with little standardization.
While Rico continues to be useful, it has weaknesses that we hope to address.
The initial idea to create the dataset was born out of a need to train machine learned
models that incorporated an understanding of the user interface. At the time, we
were interested in creating generative models that could produce full or partial user
interface designs. While we ended up not pursuing this direction, this initial use
case is reflected in the type of data collected in the Rico dataset. Our goal was to
collect a nearly complete picture of every app that we explored, including each of its
screens, dialog boxes, etc. In our collection process, we intentionally did not try to
capture data about how humans used these interfaces: we disregarded the tasks for
which the user interfaces might be used, common user behaviors with the interfaces,
and other semantic information and metadata related to the user interfaces. As a result,
Rico contains no task-related information nor any ecologically valid traces of human
interaction with its UIs. Collecting such data will require new
crowdsourcing techniques, especially at the scale needed for the data to be useful for
deep learning, but would open up the possibility of many new applications that are
not possible with the current dataset.
Another weakness in Rico is its lack of temporal data; Rico contains a snapshot
of Android UIs collected at just one period of time in early 2017. Presumably many
of the apps in the dataset have changed since they were originally collected, and
certainly, new apps have been created that are not in the dataset. Although we see
little evidence of it so far, models trained from the dataset could suffer from concept
drift if present or future UIs change sufficiently from what was recorded in the dataset.
To this end, we hope to capture an updated version of the dataset and release that
publicly at some point in the future.
Finally, Rico contains data only from Android phone UIs in US English. Collect-
ing data from other device types (e.g., tablets), other operating systems (e.g., iOS),
and other internationalization and localization settings beyond US English would
also open up new applications for the dataset. For example, being able to train design
mappings between UIs that serve the same function but use different languages might
create an opportunity to build automated or semi-automated tools for international-
izing UIs. Training mappings between phone and tablet interfaces could enable the
creation of tools and techniques for improved responsive design.
7 Conclusion
Acknowledgements We thank the reviewers for their helpful comments and suggestions and the
crowd workers who helped build the Rico dataset. This work was supported in part by a Google
Faculty Research Award.
References
24. Frank M, Dong B, Felt AP, Song D (2012) Mining permission request patterns from android
and facebook applications. In: Proceedings of the ICDM
25. Fu B, Lin J, Li L, Faloutsos C, Hong J, Sadeh N (2013) Why people hate your app: making
sense of user feedback in a mobile app store. In: Proceedings of the KDD
26. Gupta K, Achille A, Lazarow J, Davis L, Mahadevan V, Shrivastava A (2020) Layout generation
and completion with self-attention
27. Huang F, Canny JF, Nichols J (2019) Swire: sketch-based user interface retrieval. In: Proceed-
ings of the 2019 CHI conference on human factors in computing systems, CHI ’19, Association
for Computing Machinery, New York, NY, USA, pp 1–10
28. Jager T (2017) Is the floating action button bad ux design?
29. Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional
neural networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ (eds) Advances in neural
information processing systems 25. Curran Associates Inc, Stateline, NV, USA, pp 1097–1105
30. Landay JA, Borriello G (2003) Design patterns for ubiquitous computing. Computer 36(8):93–
95
31. Lee C, Kim S, Han D, Yang H, Park Y-W, Kwon BC, Ko S (2020) Guicomp: a gui design
assistant with real-time, multi-faceted feedback. In: Proceedings of the 2020 CHI conference
on human factors in computing systems, CHI ’20, Association for Computing Machinery, New
York, NY, USA, pp 1–13
32. Lee H-Y, Yang W, Jiang L, Le M, Essa I, Gong H, Yang M-H (2020) Neural design net-
work: graphic layout generation with constraints. In: Proceedings of European conference on
computer vision (ECCV)
33. Lee K, Flinn J, Giuli T, Noble B, Peplin C (2013) Amc: verifying user interface properties for
vehicular applications. In: Proceedings of the Mobisys
34. Li Y, He J, Zhou X, Zhang Y, Baldridge J (2020) Mapping natural language instructions to
mobile UI action sequences. In: Proceedings of the 58th annual meeting of the association for
computational linguistics, Association for Computational Linguistics, pp 8198–8210
35. Li Y, Li G, He L, Zheng J, Li H, Guan Z (2020) Widget captioning: generating natural language
description for mobile user interface elements. In: Proceedings of the 2020 conference on
empirical methods in natural language processing (EMNLP), Association for Computational
Linguistics, pp 5495–5510
36. Li Y, Yang Z, Guo Y, Chen X (2019) Humanoid: a deep learning-based approach to automated
black-box android app testing. In: 2019 34th IEEE/ACM international conference on automated
software engineering (ASE), pp 1070–1073
37. Liu TF, Craft M, Situ J, Yumer E, Mech R, Kumar R (2018) Learning design semantics for
mobile apps. In: Proceedings of the 31st annual ACM symposium on user interface software
and technology, UIST ’18, Association for Computing Machinery, New York, NY, USA, pp
569–579
38. Leiva LA, Hota A, Oulasvirta A (2020) Enrico: a high-quality dataset for topic modeling of mobile
ui designs. In: Proceedings of MobileHCI extended abstracts
39. Micallef N, Adi E, Misra G (2018) Investigating login features in smartphone apps. In: Pro-
ceedings of the 2018 ACM international joint conference and 2018 international symposium
on pervasive and ubiquitous computing and wearable computers, UbiComp ’18, Association
for Computing Machinery, New York, NY, USA, pp 842–851
40. Moran K, Bernal-Cárdenas C, Curcio M, Bonett R, Poshyvanyk D (2020) Machine learning-
based prototyping of graphical user interfaces for mobile apps. IEEE Trans Softw Eng
46(2):196–221
41. Nair V, Hinton GE (2010) Rectified linear units improve restricted boltzmann machines. In:
Proceedings of the ICML, pp 807–814
42. Neil T (2014) Mobile design pattern gallery: UI patterns for smartphone apps. O’Reilly Media,
Inc., Sebastopol
43. Pernice K, Budiu R (2016) Hamburger menus and hidden navigation hurt ux metrics
44. Ross AS, Zhang X, Fogarty J, Wobbrock JO (2018) Examining image-based button labeling
for accessibility in android apps through large-scale analysis. In: Proceedings of the 20th
Visual Intelligence through Human Interaction
Abstract Over the last decade, Computer Vision, the branch of Artificial Intelli-
gence aimed at understanding the visual world, has evolved from simply recognizing
objects in images to describing pictures, answering questions about images, helping
robots maneuver around physical spaces, and even generating novel visual content.
As these tasks and applications have modernized, so too has the reliance on more
data, either for model training or for evaluation. In this chapter, we demonstrate that
novel interaction strategies can enable new forms of data collection and evaluation
for Computer Vision. First, we present a crowdsourcing interface for speeding up
paid data collection by an order of magnitude, feeding the data-hungry nature of mod-
ern vision models. Second, we explore a method to increase volunteer contributions
using automated social interventions. Third, we develop a system to ensure human
evaluations of generative vision models are reliable, affordable, and grounded in psy-
chophysics theory. We conclude with future opportunities for Human–Computer
Interaction to aid Computer Vision.
1 Introduction
Today, Computer Vision applications are ubiquitous. They filter our pictures, control
our cars, aid medical experts in disease diagnosis, analyze sports games, and even
generate completely new content. This recent emergence of Computer Vision tools
has been made possible because of a shift in the underlying techniques used to train
models; this shift has moved attention away from hand-engineered features [41,
126] toward deep learning [43, 101]. With deep learning techniques, vision models
have surpassed human performance on fundamental tasks, such as object recogni-
tion [159]. Today, vision models are capable of a host of entirely new applications,
such as generating photo-realistic images [90] and 3D spaces [134]. These tasks have
made possible numerous vision-powered applications [74, 107, 204, 206, 207].
This shift is also reflective of yet another change in Computer Vision: algorithms
are becoming more generic and data has become the primary hurdle in performance.
Today’s vision models are data-hungry; they feed on large amounts of annotated train-
ing data. In some cases, data needs to be continuously annotated in order to evaluate
models; for new tasks such as image generation, model-generated images can only
be evaluated if people provide realism judgments. To support data needs, Computer
Vision has relied on a specific pipeline for data collection—one that focuses on
manual labeling using online crowdsourcing platforms such as Amazon Mechanical
Turk [43, 98].
Unfortunately, this data collection pipeline has numerous limiting factors. First,
crowdsourcing can be insufficiently scalable and it remains too expensive for use in
the production of many industry-size datasets [84]. Cost is bound to the amount of
work completed per minute of effort, and existing techniques for speeding up labeling
are not scaling as quickly as the volume of data we are now producing that must be
labeled [184]. Second, while cost issues may be mitigated by relying on volunteer
contributions, it remains unclear how best to incentivize such contributions. Even
though there has been a lot of work in Social Psychology exploring strategies to
incentivize volunteer contributions to online communities [25, 42, 95, 131, 194,
205], it remains unclear how we can employ such strategies to develop automated
mechanisms that incentivize volunteer data annotation useful for Computer Vision.
Third, existing data annotation methods are ad hoc, each executed idiosyncratically
without proof of reliability or grounding in theory, resulting in high variance in their
estimates [45, 141, 164]. While high variance in labels might be tolerable when
collecting training data, it becomes debilitating when such ad hoc methods are used
to evaluate models.
Human–Computer Interaction’s opportunity is to look to novel interaction strate-
gies to break away from this traditional data collection pipeline. In this chapter,
we showcase three projects [99, 144, 210] that have helped meet modern Computer
Vision data needs. The first two projects introduce new training data collection inter-
faces [99] and interactions [144], while the third introduces a reliable system for
evaluating vision models with humans [210]. Our contributions (1) speed up data
collection by an order of magnitude in both time and cost, (2) incentivize volunteer
contributors to provide labels through conversational interactions over social
media, and (3) enable reliable human evaluation of vision models.
In the first section, we highlight work that accelerates human interactions in micro-
task crowdsourcing, a core process through which computer vision and machine
learning (ML) datasets are predominantly curated [99]. Microtask crowdsourcing
has enabled dataset advances in social science and ML, but existing crowdsourcing
schemes are too expensive to scale up with the expanding volume of data. To scale
and widen the applicability of crowdsourcing, we present a technique that produces
extremely rapid judgments for binary and categorical labels. Rather than punishing
all errors, which causes workers to proceed slowly and deliberately, our technique
speeds up workers’ judgments to the point where errors are acceptable and even
expected. We demonstrate that it is possible to rectify these errors by randomizing
task order and modeling response latency. We evaluate our technique on a breadth of
common labeling tasks such as image verification, word similarity, sentiment anal-
ysis, and topic classification. Where prior work typically achieves a 0.25× to 1×
speedup over fixed majority vote, our approach often achieves an order of magnitude
(10×) speedup.
In the second section, we turn our attention from paid crowdsourcing to volunteer
contributions; we explore how to design social interventions to improve volunteer
contributions when curating datasets [144]. To support the massive data requirements
of modern supervised ML algorithms, crowdsourcing systems match volunteer con-
tributors to appropriate tasks. Such systems learn what types of tasks contributors
are interested in completing. In this paper, instead of focusing on what to ask, we focus
on learning how to ask: how to make relevant and interesting requests to encourage
crowdsourcing participation. We introduce a new technique that augments questions
with learning-based request strategies drawn from social psychology. We also intro-
duce a contextual bandit algorithm to select which strategy to apply for a given task
and contributor. We deploy our approach to collect volunteer data from Instagram
for the task of visual question answering, an important task in computer vision and
natural language processing that has enabled numerous human–computer interaction
applications. For example, when encountering a user’s Instagram post that contains
the ornate Trevi Fountain in Rome, our approach learns to augment its original raw
question “Where is this place?” with image-relevant compliments such as “What a
great statue!” or with travel-relevant justifications such as “I would like to visit this
place,” increasing the user’s likelihood of answering the question and thus provid-
ing a label. We deploy our agent on Instagram to ask questions about social media
images, finding that the response rate improves from 15.8% with unaugmented ques-
tions to 30.54% with baseline rule-based strategies and to 58.1% with learning-based
strategies.
Finally, in the third section, we spotlight our work on constructing a reliable
human evaluation system for generative computer vision models [210]. Generative
models often use human evaluations to measure the perceived quality of their out-
puts. Automated metrics are noisy indirect proxies, because they rely on heuristics
or pretrained embeddings. However, up until now, direct human evaluation strate-
gies have been ad hoc, neither standardized nor validated. Our work establishes a
gold standard human benchmark for generative realism. We construct Human eYe
Perceptual Evaluation (HYPE), a human benchmark that is grounded in psy-
chophysics research in perception, reliable across different sets of randomly sampled
outputs from a model, able to produce separable model performances, and efficient
in cost and time. We introduce two variants: one that measures visual perception
under adaptive time constraints to determine the threshold at which a model’s out-
puts appear real (e.g., 250 ms), and the other a less expensive variant that measures
human error rate on fake and real images sans time constraints. We test HYPE across
six state-of-the-art generative adversarial networks and two sampling techniques on
conditional and unconditional image generation using four datasets: CelebA, FFHQ,
CIFAR-10, and ImageNet. We find that HYPE can track the relative improvements
between models, and we confirm via bootstrap sampling that these measurements
are consistent and replicable.
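As a concrete illustration of the less expensive HYPE variant, the sketch below computes a human error rate over fake and real images judged without time constraints. It is a minimal sketch, not the authors' implementation; the data structure and the example counts are purely illustrative.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Judgment:
    is_fake: bool        # ground truth: the image came from the generative model
    judged_fake: bool    # the worker's response: "fake" (True) or "real" (False)

def hype_error_rate(judgments: List[Judgment]) -> float:
    """Fraction of images that workers judged incorrectly.

    Higher values mean the model's outputs are harder to tell apart from
    real photos; 50% would correspond to chance-level performance.
    """
    errors = sum(1 for j in judgments if j.is_fake != j.judged_fake)
    return errors / len(judgments)

# Illustrative tally: 100 fakes (38 mistaken for real), 100 reals (10 mistaken for fake).
judgments = ([Judgment(True, False)] * 38 + [Judgment(True, True)] * 62 +
             [Judgment(False, True)] * 10 + [Judgment(False, False)] * 90)
print(f"HYPE-style error rate: {hype_error_rate(judgments):.2%}")  # 24.00%
```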
Social science [92, 133], interactive systems [50, 104], and ML [43, 122] are becom-
ing more and more reliant on large-scale, human-annotated data. Increasingly large
annotated datasets have unlocked a string of social scientific insights [26, 56] and ML
performance improvements [58, 101, 187]. One of the main enablers of this growth
has been microtask crowdsourcing [174]. Microtask crowdsourcing marketplaces
such as Amazon Mechanical Turk offer a scale and cost that makes such annotation
feasible. As a result, companies are now using crowd work to complete hundreds of
thousands of tasks per day [130].
However, even microtask crowdsourcing can be insufficiently scalable, and it
remains too expensive for use in the production of many industry-size datasets [84].
Cost is bound to the amount of work completed per minute of effort, and existing
techniques for speeding up labeling (reducing the amount of required effort) are
not scaling as quickly as the volume of data we are now producing that must be
labeled [184]. To expand the applicability of crowdsourcing, the number of items
annotated per minute of effort needs to increase substantially.
In this paper, we focus on one of the most common classes of crowdsourcing tasks
[78]: binary annotation. These tasks are yes-or-no questions, typically identifying
whether or not an input has a specific characteristic. Examples of these types of
tasks are topic categorization (e.g., “Is this article about finance?”) [166], image
classification (e.g., “Is this a dog?”) [43, 119, 122], audio styles [167], emotion
detection [119] in songs (e.g., “Is the music calm and soothing?”), word similarity
(e.g., “Are shipment and cargo synonyms?”) [135], and sentiment analysis (e.g., “Is
this tweet positive?”) [142].
Previous methods have sped up binary classification tasks by minimizing worker
error. A central assumption behind this prior work has been that workers make errors
because they are not trying hard enough (e.g., “a lack of expertise, dedication [or]
interest” [168]). Platforms thus punish errors harshly, for example, by denying pay-
ment. Current methods calculate the minimum redundancy necessary to be confident
that errors have been removed [168, 172, 173]. These methods typically result in a
0.25× to 1× speedup beyond a fixed majority vote [87, 146, 160, 168].
We take the opposite position: that designing the task to encourage some error,
or even make errors inevitable, can produce far greater speedups. Because platforms
strongly punish errors, workers carefully examine even straightforward tasks to make
Fig. 1 (a) Images are shown to workers at 100 ms per image. Workers react whenever they see a
dog. (b) The true labels are the ground truth dog images. (c) The workers’ keypresses are slow and
occur several images after the dog images have already passed. We record these keypresses as the
observed labels. (d) Our technique models each keypress as a delayed Gaussian to predict (e) the
probability of an image containing a dog from these observed labels
sure they do not represent edge cases [80, 132]. The result is slow, deliberate work.
We suggest that there are cases where we can encourage workers to move quickly
by telling them that making some errors is acceptable. Though individual worker
accuracy decreases, we can recover from these mistakes post hoc algorithmically
(Fig. 1).
We manifest this idea via a crowdsourcing technique in which workers label a
rapidly advancing stream of inputs. Workers are given a binary question to answer,
and they observe as the stream automatically advances via a method inspired by
rapid serial visual presentation (RSVP) [51, 117]. Workers press a key whenever
the answer is “yes” for one of the stream items. Because the stream is advancing
rapidly, workers miss some items and have delayed responses. However, workers
are reassured that the requester expects them to miss a few items. To recover the
correct answers, the technique randomizes the item order for each worker and models
workers’ delays as a normal distribution whose variance depends on the stream’s
speed. For example, when labeling whether images have a “barking dog” in them, a
self-paced worker on this task takes 1.7 s per image on average. With our technique,
workers are shown a stream at 100 ms per image. The technique models the delays
experienced at different input speeds and estimates the probability of intended labels
from the key presses.
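The data-collection side of this technique can be pictured with a small simulation: each worker sees the same items in an independently shuffled order and reacts with a roughly Gaussian delay, occasionally missing items. The sketch below is illustrative only; the delay and miss-rate parameters are placeholders rather than the study's measured values.

```python
import random

def simulate_worker_stream(items, positive_ids, display_ms=100,
                           delay_mean_ms=380, delay_sd_ms=90, miss_rate=0.2):
    """Simulate one worker labeling a rapidly advancing stream.

    Returns the worker's randomized item order and the timestamps (ms)
    of their delayed keypresses. Parameter values are illustrative.
    """
    order = items[:]
    random.shuffle(order)                      # each worker gets an independent order
    keypress_times = []
    for position, item in enumerate(order):
        onset = position * display_ms          # time at which the item appears
        if item in positive_ids and random.random() > miss_rate:
            delay = max(0.0, random.gauss(delay_mean_ms, delay_sd_ms))
            keypress_times.append(onset + delay)
    return order, sorted(keypress_times)

items = list(range(100))
positives = {4, 18, 42, 77}
for worker in range(3):
    order, presses = simulate_worker_stream(items, positives)
    print(f"worker {worker}: {len(presses)} keypresses at {presses}")
```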
We evaluate our technique by comparing the total worker time necessary to achieve
the same precision on an image labeling task as a standard setup with majority vote.
The standard approach takes three workers an average of 1.7 s each for a total of
5.1 s. Our technique achieves identical precision (97%) with five workers at 100 ms
each, for a total of 500 ms of work. The result is an order of magnitude speedup of
10×.
This relative improvement is robust across both simple tasks, such as identify-
ing dogs, and complicated tasks, such as identifying “a person riding a motorcycle”
(interactions between two objects) or “people eating breakfast” (understanding rela-
tionships among many objects). We generalize our technique to other tasks such as
word similarity detection, topic classification, and sentiment analysis. Additionally,
we extend our method to categorical classification tasks through a ranked cascade of
binary classifications. Finally, we test workers’ subjective mental workload and find
no measurable increase.
Overall, we make the following contributions: (1) We introduce a rapid crowd-
sourcing technique that makes errors normal and even inevitable. We show that it
can be used to effectively label large datasets by achieving a speedup of an order
of magnitude on several binary labeling crowdsourcing tasks. (2) We demonstrate
that our technique can be generalized to multi-label categorical labeling tasks, com-
bined independently with existing optimization techniques, and deployed without
increasing worker mental workload.
The main motivation behind our work is to provide an environment where humans
can make decisions quickly. We encourage a margin of human error in the interface
that is then rectified by inferring the true labels algorithmically. In this section, we
review prior work on crowdsourcing optimization and other methods for motivating
contributions. Much of this work relies on artificial intelligence techniques: we com-
plement this literature by changing the crowdsourcing interface rather than focusing
on the underlying statistical model.
Our technique is inspired by rapid serial visual presentation (RSVP), a technique
for consuming media rapidly by aligning it within the foveal region and advancing
between items quickly [51, 117]. RSVP has already been proven to be effective at
speeding up reading rates [203]. RSVP users can react to a single-target image in a
sequence of images even at 125 ms per image with 75% accuracy [148]. However,
when trying to recognize concepts in images, RSVP only achieves an accuracy of
10% at the same speed [149]. In our work, we integrate multiple workers’ errors to
successfully extract true labels.
Many previous papers have explored ways of modeling workers to remove bias
or errors from ground truth labels [79, 146, 199, 200, 209]. For example, an unsu-
pervised method for judging worker quality can be used as a prior to remove bias
on binary verification labels [79]. Individual workers can also be modeled as projec-
tions into an open space representing their skills in labeling a particular image [200].
Workers may have unknown expertise that may in some cases prove adversarial to
the task. Such adversarial workers can be detected by jointly learning the difficulty
of labeling a particular datum along with the expertise of workers [199]. Finally, a
generative model can be used to model workers’ skills by minimizing the entropy of
the distribution over their labels and the unknown true labels [209]. We draw inspi-
ration from this literature, calibrating our model using a similar generative approach
to understand worker reaction times. We model each worker’s reaction as a delayed
Gaussian distribution.
In an effort to reduce cost, many previous papers have studied the tradeoffs
between speed (cost) and accuracy on a wide range of tasks [20, 159, 192, 193]. Some
methods estimate human time with annotation accuracy to jointly model the errors
in the annotation process [20, 192, 193]. Other methods vary both the labeling cost
and annotation accuracy to calculate a tradeoff between the two [44, 81]. Similarly,
some crowdsourcing systems optimize a budget to measure confidence in worker
annotations [86, 87]. Models can also predict the redundancy of non-expert labels
needed to match expert-level annotations [168]. Just like these methods, we show
that non-experts can use our technique and provide expert-quality annotations; we
also compare our methods to the conventional majority-voting annotation scheme.
Another perspective on rapid crowdsourcing is to return results in real time,
often by using a retainer model to recall workers quickly [7, 109, 111]. Like our
approach, real-time crowdsourcing can use algorithmic solutions to combine multi-
ple in-progress contributions [110]. These systems’ techniques could be fused with
ours to create crowds that can react to bursty requests.
One common method for optimizing crowdsourcing is active learning, which
involves learning algorithms that interactively query the user. Examples include train-
ing image recognition [175] and attribute recognition [145] with fewer examples.
Comparative models for ranking attributes have also optimized crowdsourc-
ing using active learning [120]. Similar techniques have explored optimization of
the “crowd kernel” by adaptively choosing the next questions asked of the crowd in
order to build a similarity matrix between a given set of data points [180]. Active
learning needs to decide on a new task after each new piece of data is gathered from
the crowd. Such models tend to be quite expensive to compute. Other methods have
been proposed to decide on a set of tasks instead of just one task [186]. We draw on
this literature: in our technique, after all the images have been seen by at least one
worker, we use active learning to decide the next set of tasks. We determine which
images to discard and which images to group together and send this set to another
worker to gather more information.
Finally, there is a group of techniques that attempt to optimize label collection by
reducing the number of questions that must be answered by the crowd. For example,
a hierarchy in label distribution can reduce the annotation search space [44], and
information gain can reduce the number of labels necessary to build large taxonomies
using a crowd [18, 35]. Methods have also been proposed to maximize accuracy of
object localization in images [177] and videos [191]. Previous labels can also be
used as a prior to optimize acquisition of new types of annotations [19]. One of the
benefits of our technique is that it can be used independently of these others to jointly
improve crowdsourcing schemes. We demonstrate the gains of such a combination
in our evaluation.
Fig. 2 (a) Task instructions inform workers that we expect them to make mistakes since the items
will be displayed rapidly. (b) A string of countdown images prepares them for the rate at which items
will be displayed. (c) An example image of a “dog” shown in the stream—the two images appearing
behind it are included for clarity but are not displayed to workers. (d) When the worker presses a key,
we show the last four images below the stream of images to indicate which images might have just
been labeled
A few items in each stream are known in advance to be positive items. These help us calibrate each worker’s speed and also provide us
with a mechanism to reject workers who do not react to any of the items.
Once workers start the stream (Fig. 2b), it is important to prepare them for the pace
of the task. We thus show a film-style countdown for the first few seconds that
decrements to zero at the same interval as the main task. Without these countdown
images, workers use up the first few seconds getting used to the pace and speed.
Figure 2c shows an example “dog” image that is displayed in front of the user. The
dimensions of all items (images) shown are held constant to avoid having to adjust
to larger or smaller visual ranges.
When items are displayed for less than 400 ms, workers tend to react to all positive
items with a delay. If the interface only reacts with a simple confirmation when
workers press the spacebar, many workers worry that they are too late because another
item is already on the screen. Our solution is to also briefly display the last four
items previously shown when the spacebar is pressed, so that workers see the one
they intended and also gather an intuition for how far back the model looks. For
example, in Fig. 2d, we show a worker pressing the spacebar on an image of a horse.
We anticipate that the worker was probably delayed, and we display the last four
items to acknowledge that we have recorded the keypress. We ask all workers to
first complete a qualification task in which they receive feedback on how quickly we
expect them to react. They pass the qualification task only if they achieve a recall
of 0.6 and precision of 0.9 on a stream of 200 items with 25 positives. We measure
precision as the fraction of worker reactions that were within 500 ms of a positive
cue.
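A minimal sketch of how such a qualification stream might be scored from keypress timestamps is shown below; the function and variable names are ours rather than the system's, and the 500 ms window follows the definition in the text.

```python
def qualification_scores(keypress_times, positive_onsets, window_ms=500):
    """Score a qualification stream from raw timestamps (all in ms).

    A keypress counts as a hit if it falls within `window_ms` after a
    positive item's onset; precision and recall follow the definitions
    in the text.
    """
    def near_positive(t):
        return any(0 <= t - onset <= window_ms for onset in positive_onsets)

    hits = [t for t in keypress_times if near_positive(t)]
    detected = {onset for onset in positive_onsets
                if any(0 <= t - onset <= window_ms for t in keypress_times)}
    precision = len(hits) / len(keypress_times) if keypress_times else 0.0
    recall = len(detected) / len(positive_onsets) if positive_onsets else 0.0
    return precision, recall

# A worker passes the qualification if recall >= 0.6 and precision >= 0.9.
precision, recall = qualification_scores(
    keypress_times=[430, 1290, 5620], positive_onsets=[100, 1000, 5300, 8200])
print(precision, recall)   # 1.0 0.75
```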
In Fig. 3, we show two sample outputs from our interface. Workers were shown
images for 100 ms each. They were asked to press the spacebar whenever they saw
an image of “a person riding a motorcycle.” The images with blue bars underneath
them are ground truth images of “a person riding a motorcycle.” The images with red
bars show where workers reacted. The important element is that red labels are often
delayed behind blue ground truth and occasionally missed entirely. Both Fig. 3a, b
have 100 images each with 5 correct images.
Fig. 3 Example raw worker outputs from our interface. Each image was displayed for 100 ms and
workers were asked to react whenever they saw images of “a person riding a motorcycle.” Images
are shown in the same order they appeared in for the worker. Positive images are shown with a blue
bar below them and users’ keypresses are shown as red bars below the image to which they reacted
Because of workers’ reaction delay, the data from one worker has considerable
uncertainty. We thus show the same set of items to multiple workers in different
random orders and collect independent sets of keypresses. This randomization will
produce a cleaner signal in aggregate and later allow us to estimate the images to
which each worker intended to react.
Given the speed of the images, workers are not able to detect every single positive
image. For example, the last positive image in Fig. 3a and the first positive image
in Fig. 3b are not detected. Previous work on RSVP found a phenomenon called
“attentional blink” [21], in which a worker is momentarily blind to successive positive
images. However, we find that even if two images of “a person riding a motorcycle”
occur consecutively, workers are able to detect both and react twice (Fig. 3a, b). If
workers are forced to react in intervals of less than 400 ms, though, the signal we
extract is too noisy for our model to estimate the positive items.
Multi-class Classification for Categorical Data
So far, we have described how rapid crowdsourcing can be used for binary verification
tasks. Now we extend it to handle multi-class classification. Theoretically, all multi-
class classification can be broken down into a series of binary verifications. For
example, if there are N classes, we can ask N binary questions of whether an item is
in each class. Given a list of items, we use our technique to classify them one class
at a time. After every iteration, we remove all the positively classified items for a
particular class. We use the rest of the items to detect the next class.
Assuming all the classes contain an equal number of items, the order in which we
detect classes should not matter. A simple baseline approach would choose a class at
random and attempt to detect all items for that class first. However, if the distribution
of items is not equal among classes, this method would be inefficient. Consider the
case where we are trying to classify items into ten classes, and one class has 1000
items, while all other classes have 10 items. In the worst case, if we classify the class
with 1000 examples last, those 1000 images would go through our interface ten times
(once for every class). Instead, if we had detected the large class first, we would be
able to classify those 1000 images and they would only go through our interface
once. With this intuition, we propose a class-optimized approach that classifies the
most common class of items first. We maximize the number of items we classify at
every iteration, reducing the total number of binary verifications required.
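The class-optimized cascade can be summarized in a few lines of Python. The two callables are placeholders for the real pipeline: one estimates how many items a class is likely to contain (for instance from a weak classifier), and the other stands in for running the rapid binary verification interface for one class.

```python
def class_optimized_cascade(items, classes, estimate_count, verify_class):
    """Classify items through a cascade of binary verifications, largest class first.

    `estimate_count(cls)` guesses how many items belong to `cls`;
    `verify_class(items, cls)` runs a binary verification pass and returns
    the items confirmed for `cls`. Both callables are placeholders.
    """
    remaining = list(items)
    labels = {}
    # Detect the most common class first so that the bulk of the items
    # passes through the interface as few times as possible.
    for cls in sorted(classes, key=estimate_count, reverse=True):
        confirmed = set(verify_class(remaining, cls))
        for item in confirmed:
            labels[item] = cls
        remaining = [it for it in remaining if it not in confirmed]
    return labels, remaining   # remaining items matched no class

# Illustrative dummy pipeline with skewed class sizes.
data = {f"img{i}": ("dog" if i < 60 else "cat" if i < 90 else "bird") for i in range(100)}
labels, leftover = class_optimized_cascade(
    items=data,
    classes=["dog", "cat", "bird"],
    estimate_count=lambda cls: sum(v == cls for v in data.values()),
    verify_class=lambda items, cls: {it for it in items if data[it] == cls})
print(len(labels), len(leftover))   # 100 0
```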
2.3 Model
where $P(C) = \prod_k P(C_k)$ is the probability of a particular set of keypresses. We set $P(C_k)$ to be constant,
assuming that it is equally likely that a worker might react to any item. Using Bayes’ rule,
the quantity of interest is $P(I_i \mid C) \propto P(C \mid I_i)\,P(I_i)$.
$P(I_i)$ models our estimate of item $I_i$ being positive. It can be a constant, or it can
be an estimate from a domain-specific ML algorithm [85]. For example, to calculate
$P(I_i)$, if we were trying to scale up a dataset of “dog” images, we would use a small set
of known “dog” images to train a binary classifier and use that to calculate $P(I_i)$ for
all the unknown images. With image tasks, we use a pretrained convolutional neural
network to extract image features [171] and train a linear support vector machine to
calculate $P(I_i)$.
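A sketch of this prior is given below, assuming CNN features have already been extracted for every image; the specific network, feature dimensionality, and SVM settings here are illustrative and not necessarily those used in the experiments.

```python
import numpy as np
from sklearn.svm import SVC

def train_prior_model(known_pos_feats, known_neg_feats):
    """Fit a linear SVM on CNN features of a small labeled seed set.

    Features are assumed to come from a pretrained convolutional network
    (e.g. penultimate-layer activations); extraction is omitted here.
    """
    X = np.vstack([known_pos_feats, known_neg_feats])
    y = np.concatenate([np.ones(len(known_pos_feats)), np.zeros(len(known_neg_feats))])
    clf = SVC(kernel="linear", probability=True)   # probability=True enables P(I_i)
    return clf.fit(X, y)

def prior_probabilities(clf, unknown_feats):
    """P(I_i): probability that each unlabeled item is a positive example."""
    return clf.predict_proba(unknown_feats)[:, 1]

# Illustrative usage with random 512-dimensional stand-in "CNN features".
rng = np.random.default_rng(0)
clf = train_prior_model(rng.normal(1, 1, (50, 512)), rng.normal(-1, 1, (50, 512)))
print(prior_probabilities(clf, rng.normal(0, 1, (3, 512))))
```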
We model $P(C^w \mid I_i)$ as a set of independent keypresses:
$$P(C^w \mid I_i) = P(c_1^w, \ldots, c_k^w \mid I_i) = \prod_k P(c_k^w \mid I_i). \quad (3)$$
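A heuristic sketch of how Eq. (3) and the delayed-Gaussian assumption can be combined to score items across several workers is shown below. It treats each keypress as evidence for the items shown shortly before it and ranks items by accumulated log-likelihood; it is an illustration of the idea rather than the exact estimator, and all parameter values are placeholders.

```python
import math

def gaussian_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def score_items(worker_data, display_ms=100, delay_mean=380.0, delay_sd=90.0,
                prior=0.05, background=1e-4):
    """Rank items by how well they explain the observed, delayed keypresses.

    worker_data: list of (order, keypress_times) pairs, one per worker, where
    `order` is that worker's randomized item sequence and times are in ms.
    Each keypress contributes a Gaussian delay likelihood to the items shown
    shortly before it; items never near a keypress only accumulate `background`.
    """
    scores = {}
    for order, presses in worker_data:
        for position, item in enumerate(order):
            onset = position * display_ms
            like = background + sum(gaussian_pdf(t - onset, delay_mean, delay_sd)
                                    for t in presses if 0 <= t - onset <= 1000)
            scores[item] = scores.get(item, math.log(prior)) + math.log(like)
    return sorted(scores, key=scores.get, reverse=True)   # most likely positives first

# One worker saw items [3, 0, 2, 1] and pressed the key at 480 ms.
print(score_items([([3, 0, 2, 1], [480.0])]))   # item 0 (shown at 100 ms) ranks first
```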
Our technique hypothesizes that guiding workers to work quickly and make errors
can lead to results that are faster yet with similar precision. We begin evaluating
our technique by first studying worker reaction times as we vary the length of time
for which each item is displayed. If worker reaction times have a low variance, we
can accurately model them. Existing work on RSVP estimated that humans usually react
about 400 ms after being presented with a cue [153, 197]. Similarly, the model human
processor [27] estimated that humans perceive, understand, and react at least 240 ms
after a cue. We first measure worker reaction times, then analyze how frequently
positive items can be displayed before workers are unable to react to them in time.
Method. We recruited 1,000 workers on Amazon Mechanical Turk with a 96%
approval rating and over 10,000 tasks submitted. Workers were asked to work on
one task at a time. Each task contained a stream of 100 images of polka dot pat-
terns of two different colors. Workers were asked to react by pressing the spacebar
whenever they saw an image with polka dots of one of the two colors. Tasks could
vary by two variables: the speed at which images were displayed and the percentage
of the positively colored images. For a given task, we held the display speed con-
stant. Across multiple tasks, we displayed images for 100 ms–500 ms. We studied
two variables: reaction time and recall. We measured the reaction time to the positive
color across these speeds. To study recall (the percentage of positively colored images
detected by workers), we varied the ratio of positive images from 5% to 95%. We
counted a keypress as a detection only if it occurred within 500 ms of displaying a
positively colored image.
Fig. 4 We plot the change in recall as we vary the percentage of positive items in a task. We
experiment at display speeds ranging from 100 ms to 500 ms. We find that recall is inversely
proportional to the rate of positive stimuli and not to the percentage of positive items
Results. Workers’ reaction times corresponded well with estimates from previous
studies. Workers tend to react an average of 378 ms (σ = 92 ms) after seeing a positive
image. This consistency is an important result for our model because it assumes that
workers have a consistent reaction delay.
As expected, recall is inversely proportional to the speed at which the images
are shown. A worker is more likely to miss a positive image at very fast speeds.
We also find that recall decreases as we increase the percentage of positive items
in the task. To measure the effects of positive frequency on recall, we record the
percentage threshold at which recall begins to drop significantly at different speeds
and positive frequencies. From Fig. 4, at 100 ms, we see that recall drops when the
percentage of positive images is more than 35%. As we increase the time for which
an item is displayed, however, we notice that the drop in recall occurs at a much
higher percentage. At 500 ms, the recall drops at a threshold of 85%. We thus infer
that recall is inversely proportional to the rate of positive stimuli and not to the
percentage of positive images. From these results, we conclude that, at faster speeds,
it is important to maintain a smaller percentage of positive images, while at slower
speeds, the percentage of positive images has a lesser impact on recall. Quantitatively,
to maintain a recall higher than 0.7, it is necessary to limit the frequency of positive
cues to one every 400 ms.
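This rule of thumb can be written as a small helper that bounds the fraction of positives for a given display speed; it is a rough bound only, and the empirically observed thresholds above differ somewhat from it.

```python
def max_positive_fraction(display_ms, min_gap_ms=400):
    """Rough upper bound on the fraction of positive items that keeps recall
    above ~0.7, following the rule of at most one positive cue every 400 ms."""
    return min(1.0, display_ms / min_gap_ms)

print(max_positive_fraction(100))   # 0.25 -> keep positives to roughly a quarter of the stream
print(max_positive_fraction(500))   # 1.0  -> positive frequency is no longer the bottleneck
```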
In this study, we deploy our technique on image verification tasks and measure its
speed relative to the conventional self-paced approach. Many crowdsourcing tasks
in computer vision require verifying that a particular image contains a specific class
or concept. We measure precision, recall, and cost (in seconds) for the conventional
approach and compare against our technique.
Some visual concepts are easier to detect than others. For example, detecting
an image of a “dog” is a lot easier than detecting an image of “a person riding a
motorcycle” or “eating breakfast.” While detecting a “dog” is a perceptual task, “a
person riding a motorcycle” requires understanding of the interaction between the
person and the motorcycle. Similarly, “eating breakfast” requires workers to fuse
concepts of people eating a variety of foods like eggs, cereal, or pancakes. We test
our technique on detecting three concepts: “dog” (easy concept), “a person riding a
motorcycle” (medium concept) and “eating breakfast” (hard concept). In this study,
we compare how workers fare on each of these three levels of concepts.
Method. In this study, we compare the conventional approach with our technique
on three (easy, medium, and hard) concepts. We evaluate each of these comparisons
using precision scores, recall scores, and the speedup achieved. To test each of the
three concepts, we labeled 10,000 images, where each concept had 500 examples. We
divided the 10,000 images into streams of 100 images for each task. We paid workers
$0.17 to label a stream of 100 images (resulting in a wage of $6 per hour [163]).
We hired over 1,000 workers for this study, all of whom satisfied the same qualifications as the
calibration task.
The conventional method of collecting binary labels is to present a crowd worker
with a set of items. The worker proceeds to label each item, one at a time. Most
datasets employ multiple workers to label each task because majority voting [174]
has been shown to improve the quality of crowd annotations. These datasets usually
use a redundancy of 3–5 workers [169]. In all our experiments, we used a redundancy
of three workers as our baseline.
When launching tasks using our technique, we tuned the image display speed to
100 ms. We used a redundancy of five workers when measuring precision and recall
scores. To calculate speedup, we compare the total worker time taken by all the five
workers using our technique with the total worker time taken by the three workers
using the conventional method. Additionally, we vary redundancy on all the concepts
from one to ten workers to see its effects on precision and recall.
Results. Self-paced workers take 1.70 s on average to label each image with a concept
in the conventional approach (Table 1). They are quicker at labeling the easy concept
(1.50 s per worker), while taking longer on the medium (1.70 s) and hard (1.90 s)
concepts.
Using our technique, even with a redundancy of five workers, we achieve a speedup
of 10.20× across all concepts. We achieve order of magnitude speedups of 9.00×,
10.20×, and 11.40× on the easy, medium, and hard concepts, respectively. Overall,
Table 1 We compare the conventional approach for binary verification tasks (image verification,
sentiment analysis, word similarity, and topic detection) with our technique and compute precision
and recall scores. Precision scores, recall scores, and speedups are calculated using three workers
in the conventional setting. Image verification, sentiment analysis, and word similarity used five
workers using our technique, while topic detection used only two workers. We also show the time
taken (in seconds) for one worker to do each task
Task                          Conventional approach             Our technique                    Speedup
                              Time (s)  Precision  Recall       Time (s)  Precision  Recall
Image verification: Easy       1.50      0.99       0.99         0.10      0.99       0.94       9.00×
Image verification: Medium     1.70      0.97       0.99         0.10      0.98       0.83      10.20×
Image verification: Hard       1.90      0.93       0.89         0.10      0.90       0.74      11.40×
Image verification: All        1.70      0.97       0.96         0.10      0.97       0.81      10.20×
Sentiment analysis             4.25      0.93       0.97         0.25      0.94       0.84      10.20×
Word similarity                6.23      0.89       0.94         0.60      0.88       0.86       6.23×
Topic detection               14.33      0.96       0.94         2.00      0.95       0.81      10.75×
across all concepts, the precision and recall achieved by our technique are 0.97 and
0.81, respectively. Meanwhile, the precision and recall of the conventional method are
0.97 and 0.96, respectively. We thus achieve the same precision as the conventional
method. As expected, recall is lower because workers are not able to detect every
single true positive example. As argued previously, lower recall can be an acceptable
tradeoff when it is easy to find more unlabeled images.
We next compare precision and recall scores across the three concepts; these
curves are shown in Fig. 5. Workers perform
slightly better at finding “dog” images and find it the most difficult to detect the more
challenging “eating breakfast” concept. With a redundancy of 5, the three concepts
achieve a precision of 0.99, 0.98, and 0.90, respectively, at a recall of 0.94, 0.83, and
0.74 (Table 1). The precision for these three concepts are identical to the conventional
approach, while the recall scores are slightly lower. The recall for a more difficult
cognitive concept (“eating breakfast”) is much lower, at 0.74, than for the other two
concepts. More complex concepts tend to have high contextual variance.
For example, “eating breakfast” might include a person eating a “banana,” a “bowl
of cereal,” “waffles,” or “eggs.” We find that while some workers react to one variety
of the concept (e.g., “bowl of cereal”), others react to another variety (e.g., “eggs”).
When we increase the redundancy of workers to ten (Fig. 6), our model is able to
better approximate the positive images. We see diminishing increases in both recall
and precision as redundancy increases. At a redundancy of 10, we increase recall
to the same amount as the conventional approach (0.96), while maintaining a high
precision (0.99) and still achieving a speedup of 5.1×.
Fig. 5 We study the precision (left) and recall (right) curves for detecting “dog” (top), “a person on
a motorcycle” (middle), and “eating breakfast” (bottom) images with a redundancy ranging from
1 to 5. There are 500 ground truth positive images in each experiment. We find that our technique
works for simple as well as hard concepts
Fig. 6 We study the effects of redundancy on recall by plotting precision and recall curves for
detecting “a person on a motorcycle” images with a redundancy ranging from 1 to 10. We see
diminishing increases in precision and recall as we increase redundancy. We manage to achieve the
same precision and recall scores as the conventional approach with a redundancy of 10 while still
achieving a speedup of 5×
We conclude from this study that our technique (with a redundancy of 5) can
speed up image verification with easy, medium, and hard concepts by an order of
magnitude while still maintaining high precision. We also show that recall can be
compensated for by increasing redundancy.
So far, we have shown that rapid crowdsourcing can be used to collect image verifica-
tion labels. We next test the technique on a variety of other common crowdsourcing
tasks: sentiment analysis [142], word similarity [174], and topic detection [116].
Method. In this study, we measure precision, recall, and speedup achieved by our
technique over the conventional approach. To determine the stream speed for each
task, we followed the prescribed method of running trials and speeding up the stream
until the model starts losing precision. For sentiment analysis, workers were shown a
stream of tweets and asked to react whenever they saw a positive tweet. We displayed
tweets at 250 ms with a redundancy of five workers. For word similarity, workers
were shown a word (e.g., “lad”) for which we wanted synonyms. They were then
rapidly shown other words at 600 ms and asked to react if they see a synonym (e.g.,
“boy”). Finally, for topic detection, we presented workers with a topic like “housing”
or “gas” and showed articles of an average length of 105 words at a speed of 2 s
per article. They reacted whenever they saw an article containing the topic we were
looking for. For all three of these tasks, we compare precision, recall, and speed
against the self-paced conventional approach with a redundancy of three workers.
Every task, for both the conventional approach and our technique, contained 100
items.
To measure the cognitive load on workers for labeling so many items at once, we
ran the widely used NASA Task Load Index (TLX) [37] on all tasks, including image
verification. TLX measures the perceived workload of a task. We ran the survey on
100 workers who used the conventional approach and 100 workers who used our
technique across all tasks.
Results. We present our results in Table 1 and Fig. 7. For sentiment analysis, we
find that workers in the conventional approach classify tweets in 4.25 s. So, with a
redundancy of three workers, the conventional approach would take 12.75 s with a
precision of 0.93. Using our method and a redundancy of five workers, we complete
the task in 1250 ms (250 ms per worker per item) and 0.94 precision. Therefore, our
technique achieves a speedup of 10.2×.
Likewise, for word similarity, workers take around 6.23 s to complete the con-
ventional task, while our technique succeeds at 600 ms. We manage to capture a
comparable precision of 0.88 using five workers against a precision of 0.89 in the
conventional method with three workers. Since finding synonyms is a higher level
cognitive task, workers take longer to do word similarity tasks than image verification
and sentiment analysis tasks. We manage a speedup of 6.23×.
Finally, for topic detection, workers spend significant time analyzing articles in
the conventional setting (14.33 s on average). With three workers, the conventional
approach takes 43 s. In comparison, our technique delegates 2 s for each article. With
a redundancy of only two workers, we achieve a precision of 0.95, similar to the 0.96
achieved by the conventional approach. The total worker time to label one article
using our technique is 4 s, a speedup of 10.75×.
Fig. 7 Precision (left) and recall (right) curves for sentiment analysis (top), word similarity (mid-
dle), and topic detection (bottom) with a redundancy ranging from 1 to 5. Vertical lines
indicate the number of ground truth positive examples
The mean TLX workload for the control condition was 58.5 (σ = 9.3), and
62.4 (σ = 18.5) for our technique. Unexpectedly, the difference between conditions
was not significant (t (99) = −0.53, p = 0.59). The “temporal demand” scale item
appeared to be elevated for our technique (61.1 vs. 70.0), but this difference was not
significant (t (99) = −0.76, p = 0.45). We conclude that our technique can be used
to scale crowdsourcing on a variety of tasks without statistically increasing worker
workload.
the most examples. When using our interface, we broke tasks into streams of 100
images displayed for 100 ms each. We used a redundancy of three workers for the
conventional interface and five workers for our interface. We calculated the precision
and recall scores across each of these three methods as well as the cost (in seconds)
of each method.
Results. (1) In the naive approach, we need to collect 20,000 binary labels that take
1.7 s each. With a redundancy of three workers, this takes 102,000 s ($170 at a wage rate of $6/hr) with
an average precision of 0.99 and recall of 0.95. (2) Using the baseline approach, it
takes 12,342 s ($20.57) with an average precision of 0.98 and recall of 0.83. This
shows that the baseline approach achieves a speedup of 8.26× when compared with
the naive approach. (3) Finally, the class-optimized approach is able to detect the
most common class first and hence reduces the number of times an image is sent
through our interface. It takes 11,700 s ($19.50) with an average precision of 0.98
and recall of 0.83. The class-optimized approach achieves a speedup of 8.7× when
compared to the naive approach. While the speedup between the baseline and the
class-optimized methods is small, it would be increased on a larger dataset with more
classes.
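The reported numbers are easy to verify; the short check below recomputes the naive cost from 20,000 binary labels at 1.7 s each with the redundancy of three used by the conventional baseline, then derives the two speedups.

```python
# Worked check of the reported speedups (times in seconds of total worker effort).
naive = 20_000 * 1.7 * 3           # 20,000 binary labels, 1.7 s each, 3 workers
baseline = 12_342                  # random class order with the rapid interface
class_optimized = 11_700           # most common class detected first

print(naive)                              # 102000.0
print(round(naive / baseline, 2))         # ~8.26x speedup
print(round(naive / class_optimized, 2))  # ~8.72x speedup
```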
Our method can be combined with existing techniques [13, 44, 145, 175] that opti-
mize binary verification and multi-class classification by preprocessing data or using
active learning. One such method [44] annotated ImageNet (a popular large dataset
for image classification) effectively with a useful insight: they realized that its classes
could be grouped together into higher semantic concepts. For example, “dog,” “rab-
bit,” and “cat” could be grouped into the concept “animal.” By utilizing the hierarchy
of labels that is specific to this task, they were able to preprocess and reduce the num-
ber of labels needed to classify all images. As a case study, we combine our technique
with their insight and evaluate the speedup in collecting a subset of ImageNet.
Method. We focused on a subset of the dataset with 20,000 images and classified
them into 200 classes. We conducted this case study by comparing three ways of
collecting labels: (1) The naive approach asked 200 binary questions for each image
in the subset, where each question asked if the image belonged to one of the 200
classes. We used a redundancy of 3 workers for this task. (2) The optimal labeling
method used the insight to reduce the number of labels by utilizing the hierarchy
of image classes. (3) The combined approach used our technique for multi-class
classification combined with the hierarchy insight to reduce the number of labels
collected. We used a redundancy of five workers for this technique with tasks of 100
images displayed at 250 ms.
Results. (1) Using the naive approach, this would result in asking 4 million binary
verification questions. Given that each binary label takes 1.7 s (Table 1), we estimate
that the total time to label the entire dataset would take 6.8 million seconds ($11,333
at a wage rate of $6/hr). (2) The optimal labeling method is estimated to take 1.13
million seconds ($1,888) [44]. (3) Combining the hierarchical questions with our
interface, we annotate the subset in 136,800 s ($228). We achieve a precision of 0.97
with a recall of 0.82. By combining our 8× speedup with the 6× speedup from
intelligent question selection, we achieve a 50× speedup in total.
2.9 Discussion
Consider, for example, that we are building a dataset of images with their tagged
geolocations (Fig. 8). When we encounter an image of a person wearing a black
shirt next to a beautiful scenery, existing ML systems can generate questions such as
“where is this place?”. However, prior work reports that such requests seem mechan-
ical, resulting in lower response rates [17]. In our approach, requests might be aug-
mented by content compliment strategies [156] reactive to the image content,
such as “What a great statue!” or “That’s a beautiful building!,” or by interest
matching strategies [36] reactive to the image content, such as “I love visiting
statues!” or “I love seeing old buildings!”
Augmenting requests with social strategies requires (1) defining a set of possi-
ble social strategies, (2) developing a method to generate content for each strategy
conditioned on an image, and (3) choosing the appropriate strategy to maximize
response conditioned on the user and their post. In this paper, we tackle these three
challenges. First, we adopt a set of social strategies that social psychologists have
demonstrated to be successful in human–human communication [36, 72, 108, 156,
181]. While our set is not exhaustive, it represents a diverse list of strategies—some
that augment questions conditioned on the image and others conditioned on the user’s
language. While previous work has explored the use of ML models to generate image-conditioned natural language fragments, such as captions and questions, ours is the first method that employs these techniques to generate strategies that increase worker participation.
To test the efficacy of our approach, we deploy our system on Instagram, a social
media image-sharing platform. We collect datasets and develop ML-based models
that use a convolutional neural network (CNN) to encode the image contents and a
long short-term memory network (LSTM) to generate each social strategy across a
large set of different kinds of images. We compare our ML strategies against baseline
rule-based strategies using linguistic features extracted from the user’s post [118].
Fig. 9 Our agent chooses appropriate social strategies and contextualizes questions to maximize crowdsourcing participation

We show a sample of augmented questions in Fig. 9. We find that choosing appropriate strategies and augmenting requests leads to a significant absolute participation increase of 42.36% over no strategy when using ML strategies and a 14.78% increase
when using rule-based strategies. We also find that no specific strategy is the univer-
sal best choice, implying that knowing when to use a strategy is important. While
we specifically focus on VQA and Instagram, our approach generalizes to other
crowdsourcing systems that support language-based interaction with contributors.
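As a concrete illustration of the CNN encoder and LSTM generator mentioned above, here is a minimal PyTorch sketch of an image- and strategy-conditioned phrase generator. The module structure, dimensions, and ResNet-18 backbone are illustrative assumptions rather than the authors' architecture.

```python
import torch
import torch.nn as nn
from torchvision import models

class StrategyGenerator(nn.Module):
    def __init__(self, vocab_size, num_strategies, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18(weights=None)  # any CNN backbone works; weights omitted for brevity
        cnn.fc = nn.Linear(cnn.fc.in_features, embed_dim)
        self.encoder = cnn
        self.strategy_embed = nn.Embedding(num_strategies, embed_dim)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, strategy_ids, tokens):
        # Condition the decoder on the image and the chosen social strategy by
        # prepending their embeddings to the token sequence.
        img = self.encoder(images).unsqueeze(1)                 # (B, 1, E)
        strat = self.strategy_embed(strategy_ids).unsqueeze(1)  # (B, 1, E)
        words = self.word_embed(tokens)                         # (B, T, E)
        out, _ = self.lstm(torch.cat([img, strat, words], dim=1))
        return self.proj(out)                                   # next-token logits

logits = StrategyGenerator(vocab_size=5000, num_strategies=9)(
    torch.randn(2, 3, 224, 224),
    torch.tensor([0, 3]),
    torch.randint(0, 5000, (2, 12)),
)
```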
ated by hiring professionals, as in the case of Facebook M [67], or crowd workers [14,
112]. Unlike these approaches, where people have a goal and invoke a passive conversational agent, we build active agents that reach out to people with questions that increase human participation.
Social interaction with machines. To design an agent capable of eliciting a user’s
help, we need to understand how a user views the interaction. The Media Equation
proposes that people adhere to similar social norms in their interactions with com-
puters as they do in interactions with other people [154]. It shows that agents that seem more human-like, in terms of behavior and gestures, lead users to treat them more like a person [29, 30, 139]. Consistent with these observations, prior
work has also shown that people are more likely to resolve misunderstandings with
more human-like agents [39]. This leads us to question whether a human-like conversational agent can encourage more participation from online contributors.
Prior work on interactions with machines investigates social norms that a machine
can mimic in a binary capacity—either it respects the norm correctly or violates it
with negligence [34, 165]. Instead, we project social interaction on a spectrum—
some social strategies are more successful than others in a given context—and learn
a selection strategy that maximizes participation.
Structuring requests to enhance motivation. There have been many proposed
social strategies to enhance the motivation to contribute in online communities [95].
For example, asking a specific question rather than making a statement or asking an
open-ended question increases the likelihood of getting a response [25]. Requests
succeed significantly more often when contributors are addressed by name [131].
Emergencies receive more responses than requests without time constraints [42].
Prior work has shown that factors that increase the contributor’s affinity for the
requester increase the persuasive power of the message on online crowdfunding
sites [205]. It has also been observed that different behaviors elicit different kinds of support from online support groups, with self-disclosure eliciting emotional support and questioning resulting in informational support [194]. The severity of the
outcome of responding to a request can also influence motivation [31]. Our work
incorporates some of these established social strategies and leverages language gen-
eration algorithms to build an agent that can deploy them across a wide variety of
different requests.
The goal of our system is to draw on theories of how people ask other people for help and favors, and then learn how to emulate those strategies. Drawing on prior work, we sampled a diverse set of nine social strategies. While the nine social strategies we explore are not an exhaustive set, we believe they represent a wide enough range of possible strategies to demonstrate the method and effects of teaching social strategies to machines. The social strategies we explore are those listed in Table 2: expertise compliment, content compliment, interest matching, logical justification, answer attempt, help request, valence matching, time scarcity, and random justification.
In this section, we describe our approach for augmenting requests with social strate-
gies. Our approach is divided into two components: generation and selection. A high-
level system diagram is depicted in Fig. 10. Given a social media post, we featurize
the post metadata, question, and caption, then send them to the selection component.
The main goal of the selection component is to choose an effective social strategy
to use for the given post. This strategy, along with a generated question to ask [96],
and the social media post are sent to the generation component, which augments the
question by generating a natural language phrase for the chosen social strategy. The
augmented request is then shared with the contributor. The selection module gathers feedback: positive if the contributor responds in an informative manner, and negative if the response is uninformative or absent.
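The toy sketch below shows how the selection component, the generation component, and the feedback reward fit together. Every name in it (the placeholder bandit, the canned phrases, the simulated contributor) is an illustrative assumption rather than part of the deployed system.

```python
import random

STRATEGIES = ["content_compliment", "interest_matching", "expertise_compliment"]
PHRASES = {
    "content_compliment": "What a great statue!",
    "interest_matching": "I love visiting statues!",
    "expertise_compliment": "You seem to know this city well.",
}

class RandomBandit:
    """Placeholder selection component; a real system would use a contextual bandit."""
    def select(self, features):
        return random.choice(STRATEGIES)
    def update(self, features, strategy, reward):
        pass  # a real bandit would update its reward estimates here

def augment_and_post(features, question, bandit, contributor):
    strategy = bandit.select(features)              # selection component
    phrase = PHRASES[strategy]                      # generation component (canned here)
    response = contributor(f"{phrase} {question}")  # post the augmented request
    # A real system also checks that the reply contains an answer before rewarding.
    reward = 1.0 if response else -1.0
    bandit.update(features, strategy, reward)
    return strategy, response

strategy, response = augment_and_post(
    features={"followers": 120, "hashtags": ["#travel"]},
    question="Where is this place?",
    bandit=RandomBandit(),
    contributor=lambda text: "Lisbon" if random.random() < 0.5 else None,
)
```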
Selection: Choosing a Social Strategy
Fig. 10 Given a social media post and a question we want to ask, we augment the question with a social strategy. Our system contains two components. First, a selection component featurizes the post and user and chooses a social strategy. Second, a generation component creates a natural language augmentation for the question given the image and the chosen strategy. The contributor's response or silence is used to generate a feedback reward for the selection module

We model our selection component as a contextual bandit. Contextual bandits are a common reinforcement learning technique for efficiently exploring different options and exploiting the best choices over time, generalizing from previous trials to uncommonly observed situations [118]. The component receives a feature vector and outputs
its choice of an arm (option) that it expects to result in the highest expected reward.
Each social media post is represented as a feature vector that encodes information
about the user, the post, and the caption. User features include the number of posts
the user has posted, number of followers, number of accounts the user is follow-
ing, number of other users tagged in their posts, filters, and AR effects the user
uses frequently on the platform, user’s engagement with videos, whether the user
is a verified business or an influencer, user’s privacy settings, the engagement with
Instagram features such as highlight reels and resharing, and sentiment analysis on
their biography. Post features include the number of users who like the post and the
number of users who commented on the post. User and post features are drawn from
Instagram's API and featurized as bag-of-words or one-hot vectors. Lastly, caption features consist of sentiment scores extracted using VADER [76] and hashtags extracted using regular expressions.
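A minimal sketch of the caption featurization is shown below, assuming the vaderSentiment package and a simple regular expression for hashtags; the fuller feature set described above (user and post metadata from Instagram's API) is omitted.

```python
import re
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

def caption_features(caption: str) -> dict:
    # VADER returns neg/neu/pos/compound sentiment scores for short social-media text.
    sentiment = SentimentIntensityAnalyzer().polarity_scores(caption)
    hashtags = re.findall(r"#\w+", caption.lower())
    return {"sentiment": sentiment, "hashtags": hashtags}

print(caption_features("Sunset over the old town was unreal #travel #portugal"))
```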
We train a contextual bandit model to choose a social strategy given the extracted
features, conditioned on the success of each social strategy used on similar social
media posts in the past. The arms that the contextual bandit considers represent
each of the nine social strategies that the system can use. If a chosen social strategy
receives a response, we parse and check if the response contains an answer [46]. If
so, the model receives a positive reward for choosing the social strategy. If a chosen
social strategy does not receive a response, or if the response does not contain an
answer, the model receives a negative reward.
Our implementation of contextual bandit uses the adaptive greedy algorithm for
balancing the tradeoff between exploration and exploitation. During training, the
algorithm chooses an option that the model associates with a high uncertainty of
reward. If there is no option with a high uncertainty, the algorithm chooses a random
option to explore. The threshold for uncertainty decreases as the model is exposed
to more data. During inference, the model predicts the social strategy with highest
expected reward [208].
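The sketch below is a simplified, self-contained reading of this adaptive-greedy rule, not the authors' implementation: per-arm reward means stand in for the full contextual model, and uncertainty is approximated by 1/sqrt(pulls).

```python
import math
import random

class AdaptiveGreedyBandit:
    """Per-arm means stand in for the contextual model; uncertainty ~ 1/sqrt(pulls)."""
    def __init__(self, arms, initial_threshold=1.0, decay=0.999):
        self.arms = list(arms)
        self.counts = {a: 0 for a in self.arms}
        self.means = {a: 0.0 for a in self.arms}
        self.threshold = initial_threshold
        self.decay = decay

    def _uncertainty(self, arm):
        return float("inf") if self.counts[arm] == 0 else 1.0 / math.sqrt(self.counts[arm])

    def select(self, training=True):
        if training:
            self.threshold *= self.decay                      # threshold shrinks with more data
            uncertain = [a for a in self.arms if self._uncertainty(a) > self.threshold]
            if uncertain:
                return random.choice(uncertain)               # prefer high-uncertainty strategies
            return random.choice(self.arms)                   # otherwise explore at random
        return max(self.arms, key=lambda a: self.means[a])    # exploit at inference time

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

bandit = AdaptiveGreedyBandit(["expertise_compliment", "content_compliment", "help_request"])
for _ in range(200):
    arm = bandit.select()
    # Simulated feedback: one strategy works more often than the others.
    reward = 1.0 if arm == "content_compliment" and random.random() < 0.6 else -1.0
    bandit.update(arm, reward)
print(bandit.select(training=False))
```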
2 The dataset of social media posts and social strategies for training the reinforcement learning model, as well as the trained contextual bandit model, is publicly available at http://cs.stanford.edu/people/ranjaykrishna/socialstrategies.
3.4 Experiments
training data. Once trained, we post 100 questions per strategy to Instagram, resulting
in 1100 total posts. To further study the scalability and transfer of strategies learned
in one domain and applied to another, we train augmentation models using data from
a “source” domain and test its effect on posts from “target” domains. For example,
we train models using data collected from the #travel source domain and test on the
rest as target domains.
To train the selection model, we gather 10k posts from Instagram and generate
augmentations with each of the social strategies. Each post, with all the augmented questions, is sent to AMT workers, who are asked to pick the strategies that would
be appropriate to use. We choose to train the selection model using AMT instead of
Instagram as it allows us to quickly collect large amounts of training data and negate
the impact of other confounds. Each AMT task included 10 social media posts. One
out of the ten posts contained an attention checker in the question to verify that the
workers were actually reading the questions. Workers were compensated at a rate of
$12 per hour.
Augmenting Questions with Social Strategies
Our goal in the first set of experiments is to study the effect of using social strategies
to augment questions.
Informative responses. Before we inspect the effects of social strategies, we first
report the quality of responses from Instagram users. We manually annotate all our responses and find that 93.01% of questions are both relevant and answerable. Out of the relevant questions, 95.52% of responses were informative, i.e., the
responses contained the correct answer to the question. Figure 12 visualizes a set
of example responses for different posts with different social strategies in the travel
domain. While all social strategies outperformed the baseline in receiving responses,
the quality of the responses differed across strategies.
Effect of social strategies. Table 2 reports the informative response rate across all
the social strategies. We find that, compared to the baseline case where no strategy is used, rule-based strategies improve participation by 14.78 percentage points. An unpaired t-test confirms that participation increases when appropriate rule-based social strategies are designed (t(900) = 3.05, p < 0.01). When social strategy data is collected and used to train ML strategies, performance increases by 42.36 percentage points over un-augmented questions (t(900) = 8.17, p < 0.001) and by 27.58 percentage points over rule-based strategies (t(900) = 8.96, p < 0.001), as confirmed by unpaired t-tests. Overall, we find that expertise compliment and logical justification performed strongly in the shopping domain, but weakly in the animals and food domains.
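For reference, an unpaired t-test of this kind can be computed with scipy; the response indicators below are synthetic stand-ins rather than the study's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ml_strategy = rng.binomial(1, 0.58, size=451)   # 1 = informative response, 0 = none (synthetic)
no_strategy = rng.binomial(1, 0.16, size=451)
t, p = stats.ttest_ind(ml_strategy, no_strategy)
print(f"t({len(ml_strategy) + len(no_strategy) - 2}) = {t:.2f}, p = {p:.3g}")
```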
To test the scalability of our strategies across image domains, we train models
on a source domain and deploy them on a target domain. We find that expertise
compliment drops in performance while interest matching improves. The
drop implies that ML models that depend heavily on the example data points used in the training process are not robust in new domains. Therefore, while ML strategies are the most effective, they require strategy data collected for the domain in which they are deployed.
Table 2 Response rates achieved by different strategies on posts in the source and target domains. The bottom of the table shows a comparison between average performance of ML based strategies, average performance of rule-based strategies and baseline un-augmented questions

Strategy                 Source domain (%)   Target domain (%)
Expertise compliment     72.90               29.55
Content compliment       59.11               68.96
Interest matching        45.31               85.38
Logical justification    55.17               19.7
Answer attempt           41.37               42.69
Help request             31.52               32.84
Valence matching         37.43               36.12
Time scarcity            24.63               26.27
Random justification     17.73               32.84
ML based strategies      58.12               50.89
Rule based strategies    30.54               34.15
No strategy              15.76               13.13
Fig. 13 Difference between response rate of the agent and humans for each social strategy. Green
indicates the agent is better than people and red indicates the opposite
using rule-based (p < 0.05) or humans using rule-based strategies (p < 0.05). This demonstrates that an ML model that has witnessed examples of social strategies can
outperform rule-based systems. However, there is no significant difference between
the agent using ML strategies versus humans using the same social strategies.
Learning to Select a Social Strategy
In our previous experiment we established that different domains have different
strategies that perform best. Now, we evaluate how well our selection component
performs at selecting the most effective strategy. Specifically, we test how well our
selection model performs (1) against a random strategy, (2) against the most effec-
tive strategy (expertise compliment) from the previous experiment, and (3)
against the oracle strategy generated by crowdworkers. Recall that the oracle strategy
does not constrain workers to use any particular strategy.
Since this test needs to be able to test multiple strategies on the same post, we
perform our evaluation on AMT. Workers are shown two strategies for a given post
and asked to choose which strategy is most likely to receive a response. We perform
pairwise comparisons between our selection model against a random strategy across
11k posts, against expertise compliment across 549 posts and against open-
ended human questions across 689 posts.
Effect of selection. A binomial test indicates that our selection method was cho-
sen 54.12% more often than a random strategy B(N = 11,844, p < 0.001). It was
chosen 58.28% more often than expertise compliment B(N = 549, p <
0.001). And finally, it was chosen 75.61% more often than the oracle human gen-
erated questions B(N = 689, p < 0.001). We conclude that our selection model
outperforms all existing baselines.
Fig. 14 Example strategy selection and augmentations in the travel domain. a Our system learns to focus on different aspects of the image. b The system is able to discern between very similar images and understand that the same objects can have different connotations. c, d Example failure case when objects were misclassified

Qualitative analysis. Figure 14a shows that the agent can choose to focus on different aspects of the image even when the subject of the image is roughly the same: old traditional buildings. In one, the agent compliments the statue, which is the most
salient feature of the old European building shown in the image. In the other, it
shows appreciation for the overall architecture of the old Asian building, which does
not have a single defining feature like a statue.
Figure 14b shows two images that both contain water and have similar color
composition. In one, the agent compliments the water seen on the beach as refreshing
and in the other, the fish seen underwater as cute. Referring to a fish in a beach photo
would have been incorrect as would have been describing water as refreshing in an
underwater photo.
Though social strategies are useful, they can also lead to new errors. Figure 14c, d showcases example questions where the agent fails to recognize mountains and food and generates phrases referring to beaches and flowers.
3.5 Discussion
Intended use. This work demonstrates that it is possible to train an AI agent to use social strategies found in human-to-human interaction to increase the likelihood that a person responds to a crowdsourcing request. Such responses suggest a future in which supervised ML models can be trained on authentic online data provided by willing helpers rather than by paid workers. We expect that such strategies can lead to adaptive ML systems that learn during their deployment by asking their users whenever they are uncertain about their environment. Unlike existing paid crowdsourcing techniques, whose cost grows linearly with the number of annotations, our method is a fixed-cost solution: social strategies need to be collected once for a specific domain and can then be deployed to encourage volunteers.
Negative usage. It is also important that we pause to note the potential negative
implications of computing research, and how they can be addressed. The psychol-
ogy techniques that our work relies on have been used in negotiations and marketing
campaigns for decades. Automating such techniques can also lead to influencing emo-
tions or behavior at a magnitude greater than a single human–human interaction [53,
94]. When using natural language techniques, we advocate that agents continue to
self-identify as bots for this reason. There is a need for online communities to establish a standard for acceptable use of such techniques and for how contributors should be informed about the intentions behind an agent's request.
Limitations and future work. Our social strategies are inspired by social psychology
research. Ours are by no means an exhaustive list of possible strategies. Future
research could follow a more “bottom-up” approach of directly learning to emulate
strategies by observing human–human interactions. Currently, our requests involve
exactly one dialogue turn, and we do not yet explore multi-turn conversations. This
can be important: for example, the answer attempt strategy may be more effective at
getting an answer now, but might also decrease the probability that the contributor
will want to continue cooperating in the long term. Future work can explore how to
guide conversations to enable more complex labeling schemes.
Conclusion Our work: (1) identifies social strategies that can be repurposed to
improve crowdsourcing requests for visual question answering, (2) trains and deploys
ML- and rule-based models that deploy these strategies to increase crowdsourcing
participation, and (3) demonstrates that these models significantly improve partic-
ipation on Instagram, that no single strategy is optimal, and that a selection model
can choose the appropriate strategy.
Generating realistic images is regarded as a focal task for measuring the progress of
generative models. Automated metrics are either heuristic approximations [22, 45,
89, 150, 158, 164] or intractable density estimations, which have been shown to be inaccurate on high-dimensional problems [12, 70, 182]. Human evaluations, such as those given on
Amazon Mechanical Turk [45, 158], remain ad hoc because “results change drasti-
cally” [164] based on details of the task design [92, 114, 124]. With both noisy auto-
mated and noisy human benchmarks, measuring progress over time has become akin
to hill-climbing on noise. Even widely used metrics, such as Inception Score [164]
and Fréchet Inception Distance [68], have been discredited for their application to
non-ImageNet datasets [6, 15, 151, 157]. Thus, to monitor progress, generative mod-
els need a systematic gold standard benchmark. In this paper, we introduce a gold
standard benchmark for realistic generation, demonstrating its effectiveness across
four datasets, six models, and two sampling techniques, and using it to assess the
progress of generative models over time (Fig. 15).
Realizing the constraints of available automated metrics, many generative mod-
eling tasks resort to human evaluation and visual inspection [45, 158, 164]. These
human measures are (1) ad hoc, each executed idiosyncratically without proof of reliability or grounding in theory, and (2) high variance in their estimates [45, 141, 164]. These characteristics combine to produce a lack of reliability and, downstream, (3) a lack of clear separability between models. Theoretically, given sufficiently large sample
sizes of human evaluators and model outputs, the law of large numbers would smooth
out the variance and reach eventual convergence; but this would occur at (4) a high cost and a long delay.

Fig. 15 Our human evaluation metric, HYPE, consistently distinguishes models from each other: here, we compare different generative models' performance on FFHQ. A score of 50% represents results that are indistinguishable from real, while a score above 50% represents hyper-realism
We present HYPE (Human eYe Perceptual Evaluation) to address these
criteria in turn. HYPE: (1) measures the perceptual realism of generative model
outputs via a grounded method inspired by psychophysics methods in perceptual
psychology, (2) is a reliable and consistent estimator, (3) is statistically separable
to enable a comparative ranking, and (4) ensures a cost and time efficient method
through modern crowdsourcing techniques such as training and aggregation. We
present two methods of evaluation. The first, called HYPEtime , is inspired directly
by the psychophysics literature [38, 93], and displays images using adaptive time
constraints to determine the time-limited perceptual threshold a person needs to
distinguish real from fake. The HYPEtime score is understood as the minimum time, in
milliseconds, that a person needs to see the model’s output before they can distinguish
it as real or fake. For example, a score of 500 ms on HYPEtime indicates that humans
can distinguish model outputs from real images at 500 ms exposure times or longer,
but not under 500 ms. The second method, called HYPE∞ , is derived from the first to
make it simpler, faster, and cheaper while maintaining reliability. It is interpretable as
the rate at which people mistake fake images and real images, given unlimited time
to make their decisions. A score of 50% on HYPE∞ means that people differentiate
generated results from real data at chance rate, while a score above 50% represents
hyper-realism in which generated images appear more real than real images.
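Concretely, a HYPE∞-style score can be computed directly from the raw judgments. The sketch below is an illustrative helper rather than the released HYPE code; consistent with the error rates reported in Table 4, and assuming an equal split of real and fake images, the score is the mean of the two error rates.

```python
# Each record is (is_fake_image, judged_fake_by_evaluator).
def hype_infinity(judgments):
    fakes = [judged for is_fake, judged in judgments if is_fake]
    reals = [judged for is_fake, judged in judgments if not is_fake]
    fake_error = 1 - sum(fakes) / len(fakes)   # fake images mistaken for real
    real_error = sum(reals) / len(reals)       # real images mistaken for fake
    return 100 * (fake_error + real_error) / 2

# Two fakes (one fooled the evaluator) and two reals (one mistaken for fake) -> 50.0
print(hype_infinity([(True, False), (True, True), (False, False), (False, True)]))
```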
We run two large-scale experiments. First, we demonstrate HYPE’s performance
on unconditional human face generation using four popular generative adversarial
networks (GANs) [9, 62, 88, 89] across CelebA-64 [125]. We also evaluate two
newer GANs [22, 138] on FFHQ-1024 [89]. HYPE indicates that GANs have clear,
measurable perceptual differences between them; this ranking is identical in both
HYPEtime and HYPE∞ . The best performing model, StyleGAN trained on FFHQ
and sampled with the truncation trick, only performs at 27.6% HYPE∞ , suggesting
substantial opportunity for improvement. We can reliably reproduce these results
with 95% confidence intervals using 30 human evaluators at $60 in a task that takes
10 min.
Second, we demonstrate the performance of HYPE∞ beyond faces on conditional
generation of five object classes in ImageNet [43] and unconditional generation of
CIFAR-10 [100]. Early GANs such as BEGAN are not separable in HYPE∞ when
Fig. 16 Example images sampled with the truncation trick from StyleGAN trained on FFHQ.
Images on the right exhibit the highest HYPE∞ scores, the highest human perceptual fidelity
after each image, four perceptual mask images are rapidly displayed for 30 ms each.
These noise masks are distorted to prevent retinal afterimages and further sensory
processing after the image disappears [61]. We generate masks using an existing
texture-synthesis algorithm [147]. Upon each submission, HYPEtime reveals to the
evaluator whether they were correct.
Image exposures are in the range [100 ms, 1000 ms], derived from the perception
literature [54]. All blocks begin at 500 ms and last for 150 images (50% gener-
ated, 50% real), values empirically tuned from prior work [38, 40]. Exposure times
are raised at 10 ms increments and reduced at 30 ms decrements, following the 3-
up/1-down adaptive staircase approach, which theoretically leads to a 75% accuracy
threshold that approximates the human perceptual threshold [38, 61, 115].
Every evaluator completes multiple staircases, called blocks, on different sets of
images. As a result, we observe multiple measures for the model. We employ three
blocks, to balance quality estimates against evaluators’ fatigue [65, 103, 161]. We
average the modal exposure times across blocks to calculate a final value for each
evaluator. Higher scores indicate a better model, whose outputs take longer time
exposures to discern from real.
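As a small illustration of the aggregation just described, the sketch below computes a per-evaluator HYPEtime score as the mean of per-block modal exposure times; the block data are invented for illustration.

```python
from statistics import mode, mean

def hype_time_score(blocks):
    # blocks: per-block sequences of exposure times (ms) visited by the staircase
    return mean(mode(block) for block in blocks)

blocks = [
    [500, 470, 440, 440, 450, 440, 440],
    [500, 470, 480, 450, 450, 450, 460],
    [500, 510, 480, 480, 490, 480, 480],
]
print(hype_time_score(blocks))  # average of the three modal exposures
```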
To ensure that our reported scores are consistent and reliable, we need to sample
sufficiently from the model as well as hire, qualify, and appropriately pay enough
evaluators.
Sampling sufficient model outputs. The selection of K images to evaluate from
a particular model is a critical component of a fair and useful evaluation. We must
sample a large enough number of images that fully capture a model’s generative
diversity, yet balance that against tractable costs in the evaluation. We follow existing
work on evaluating generative output by sampling K = 5000 generated images from
each model [138, 164, 195] and K = 5000 real images from the training set. From
these samples, we randomly select images to give to each evaluator.
Quality of evaluators. To obtain a high-quality pool of evaluators, each is required
to pass a qualification task. Such a pre-task filtering approach, sometimes referred to
as a person-oriented strategy, is known to outperform process-oriented strategies that
perform post-task data filtering or processing [137]. Our qualification task displays
100 images (50 real and 50 fake) with no time limits. Evaluators must correctly classify 65% of both real and fake images. This threshold should be treated as a hyperparameter and may change depending upon the GANs used in the tutorial and the desired discernment ability of the chosen evaluators. We choose 65% based on the cumulative binomial probability of 65 binary choice answers out of 100 total answers: there is only a one in one-thousand chance that an evaluator will qualify by random guessing. Unlike in the task itself, fake qualification images are drawn equally from multiple different GANs to ensure an equitable qualification across all GANs. The qualification is designed to be taken occasionally, such that a pool of evaluators can assess new models on demand.

3 We explicitly reveal this ratio to evaluators. Amazon Mechanical Turk forums would enable evaluators to discuss and learn about this distribution over time, thus altering how different evaluators would approach the task. By making this ratio explicit, evaluators would have the same prior entering the task.

4 Hyper-realism is relative to the real dataset on which a model is trained. Some datasets already look less realistic because of lower resolution and/or lower diversity of images.
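The 65% threshold can be sanity-checked with a cumulative binomial computation like the one below (an illustrative sketch using scipy, not the authors' code). The exact tail probability depends on whether the criterion is applied to the 100 images jointly or to the 50 real and 50 fake images separately.

```python
from scipy.stats import binom

joint = binom.sf(64, 100, 0.5)            # P(>= 65 correct out of 100) for a random guesser
separate = binom.sf(32, 50, 0.5) ** 2     # P(>= 33 correct on reals AND >= 33 on fakes)
print(f"joint: {joint:.4f}, separate: {separate:.5f}")
```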
Payment. Evaluators are paid a base rate of $1 for working on the qualification task.
To incentivize evaluators to remain engaged throughout the task, all further pay
after the qualification comes from a bonus of $0.02 per correctly labeled image,
typically totaling a wage of $12/hr.
Datasets. We evaluate on four datasets. (1) CelebA-64 [125] is a popular dataset for
unconditional image generation with 202k images of human faces, which we align
and crop to be 64 × 64 px. (2) FFHQ-1024 [89] is a newer face dataset with 70k
images of size 1024 × 1024 px. (3) CIFAR-10 consists of 60k images, sized 32 × 32
px, across 10 classes. (4) ImageNet-5 is a subset of 5 classes with 6.5k images at
128 × 128 px from the ImageNet dataset [43], which have been previously identified
as easy (lemon, Samoyed, library) and hard (baseball player, French horn) [22].
Architectures. We evaluate on four state-of-the-art models trained on CelebA-64 and
CIFAR-10: StyleGAN [89], ProGAN [88], BEGAN [9], and WGAN-GP [62]. We
also evaluate on two models, SN-GAN [138] and BigGAN [22] trained on ImageNet,
sampling conditionally on each class in ImageNet-5. We sample BigGAN with the truncation trick (σ = 0.5 [22]) and without it.
We also evaluate on StyleGAN [89] trained on FFHQ-1024 with truncation trick sampling (ψ = 0.7 [89]) and without it. For parity on our best models across datasets,
StyleGAN instances trained on CelebA-64 and CIFAR-10 are also sampled with the
truncation trick.
We sample noise vectors from the d-dimensional spherical Gaussian noise prior, z ∈ R^d with z ∼ N(0, I), during training and test times. We specifically opted to use the
same standard noise prior for comparison, yet are aware of other priors that optimize
for FID and IS scores [22]. We select training hyperparameters published in the
corresponding papers for each model.
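For illustration, the following sketch samples standard Gaussian latents and a BigGAN-style truncated variant in which out-of-range coordinates are resampled; treat the resampling detail as an assumption about the truncation trick rather than a description of any specific model's released code.

```python
import numpy as np

def sample_z(batch, d, truncation=None, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((batch, d))          # z ~ N(0, I)
    if truncation is not None:
        out_of_range = np.abs(z) > truncation
        while out_of_range.any():                # resample coordinates outside the threshold
            z[out_of_range] = rng.standard_normal(out_of_range.sum())
            out_of_range = np.abs(z) > truncation
    return z

z_plain = sample_z(4, 128)                   # standard prior
z_trunc = sample_z(4, 128, truncation=0.5)   # truncated sampling
```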
Evaluator recruitment. We recruit 930 evaluators from Amazon Mechanical Turk,
or 30 for each run of HYPE. To maintain a between-subjects study in this evaluation,
we recruit independent evaluators across tasks and methods.
We report results on HYPEtime and demonstrate that the results of HYPE∞ approximate those from HYPEtime at a fraction of the cost and time.
HYPEtime
CelebA-64. We find that StyleGANtrunc resulted in the highest HYPEtime score (modal
exposure time), at a mean of 439.3 ms, indicating that evaluators required nearly
a half-second of exposure to accurately classify StyleGANtrunc images (Table 3).
StyleGANtrunc is followed by ProGAN at 363.7 ms, a 17% drop in time. BEGAN
and WGAN-GP are both easily identifiable as fake, tied in last place around the
minimum available exposure time of 100 ms. Both BEGAN and WGAN-GP exhibit a bottoming out effect—reaching the minimum time exposure of 100 ms quickly and consistently.

Table 4 HYPE∞ on four GANs trained on CelebA-64. Counterintuitively, real errors increase with the errors on fake images, because evaluators become more confused and distinguishing factors between the two distributions become harder to discern

Rank  GAN            HYPE∞ (%)  Fakes error  Reals error  Std.  95% CI     KID    FID    Precision
1     StyleGANtrunc  50.7%      62.2%        39.3%        1.3   48.2–53.1  0.005  131.7  0.982
2     ProGAN         40.3%      46.2%        34.4%        0.9   38.5–42.0  0.001  2.5    0.990
3     BEGAN          10.0%      6.2%         13.8%        1.6   7.2–13.3   0.056  67.7   0.326
4     WGAN-GP        3.8%       1.7%         5.9%         0.6   3.2–5.7    0.046  43.6   0.654
To demonstrate separability between models we report results from a one-way
analysis of variance (ANOVA) test, where each model’s input is the list of modes from
each model’s 30 evaluators. The ANOVA results confirm that there is a statistically
significant omnibus difference (F(3, 29) = 83.5, p < 0.0001). Pairwise post hoc
analysis using Tukey tests confirms that all pairs of models are separable (all p <
0.05) except BEGAN and WGAN-GP (n.s.).
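A separability analysis of this kind can be run with scipy (recent versions include tukey_hsd); the sketch below uses synthetic per-evaluator modal exposure times as placeholders for the real measurements.

```python
import numpy as np
from scipy.stats import f_oneway, tukey_hsd

rng = np.random.default_rng(0)
stylegan = rng.normal(439, 40, 30)   # modal exposure times (ms), one per evaluator (synthetic)
progan   = rng.normal(364, 40, 30)
began    = rng.normal(105, 10, 30)
wgan_gp  = rng.normal(102, 10, 30)

print(f_oneway(stylegan, progan, began, wgan_gp))    # omnibus difference across models
print(tukey_hsd(stylegan, progan, began, wgan_gp))   # pairwise post hoc comparisons
```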
FFHQ-1024. We find that StyleGANtrunc resulted in a higher exposure time than
StyleGANno-trunc , at 363.2 ms and 240.7 ms, respectively (Table 3). While the 95% confidence intervals overlap slightly (by a conservative 2.7 ms), an unpaired t-test confirms that the difference between the two models is significant (t(58) = 2.3, p = 0.02).
HYPE∞
CelebA-64. Table 4 reports results for HYPE∞ on CelebA-64. We find that
StyleGANtrunc resulted in the highest HYPE∞ score, fooling evaluators 50.7% of
the time. StyleGANtrunc is followed by ProGAN at 40.3%, BEGAN at 10.0%, and
WGAN-GP at 3.8%. No confidence intervals are overlapping and an ANOVA test
is significant (F(3, 29) = 404.4, p < 0.001). Pairwise post hoc Tukey tests show
that all pairs of models are separable (all p < 0.05). Notably, HYPE∞ results in
separable results for BEGAN and WGAN-GP, unlike in HYPEtime where they were
not separable due to a bottoming-out effect.
FFHQ-1024. We observe a consistently separable difference between StyleGANtrunc
and StyleGANno-trunc and clear delineations between models (Table 5). HYPE∞ ranks
StyleGANtrunc (27.6%) above StyleGANno-trunc (19.0%) with no overlapping CIs.
Separability is confirmed by an unpaired t-test (t (58) = 8.3, p < 0.001).
Cost Tradeoffs with Accuracy and Time
One of HYPE’s goals is to be cost and time efficient. When running HYPE, there
is an inherent tradeoff between accuracy and time, as well as between accuracy and
cost. This is driven by the law of large numbers: recruiting additional evaluators in
a crowdsourcing task often produces more consistent results, but at a higher cost (as
each evaluator is paid for their work) and a longer amount of time until completion
(as more evaluators must be recruited and they must complete their work).
To manage this tradeoff, we run an experiment with HYPE∞ on StyleGANtrunc .
We completed an additional evaluation with 60 evaluators, and compute 95% boot-
strapped confidence intervals, choosing from 10 to 120 evaluators (Fig. 18). We see
that the CI begins to converge around 30 evaluators, our recommended number of
evaluators to recruit.
At 30 evaluators, the cost of running HYPEtime on one model was approximately
$360, while the cost of running HYPE∞ on the same model was approximately $60.
Payment per evaluator for both tasks was approximately $12/hr. Evaluators spent
an average of one hour each on a HYPEtime task and 10 min each on a HYPE∞
task. Thus, HYPE∞ achieves its goals of being significantly cheaper to run, while
maintaining consistency.
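The convergence analysis described above can be approximated with a simple bootstrap over evaluators, as in the sketch below; the per-evaluator scores are synthetic placeholders.

```python
import numpy as np

def bootstrap_ci(scores, n_evaluators, n_boot=10_000, seed=0):
    # Resample evaluators with replacement and recompute the mean score.
    rng = np.random.default_rng(seed)
    means = [rng.choice(scores, size=n_evaluators, replace=True).mean()
             for _ in range(n_boot)]
    return np.percentile(means, [2.5, 97.5])

scores = np.random.default_rng(1).normal(50.7, 7, size=90)  # per-evaluator HYPE-infinity scores
for n in (10, 30, 60):
    lo, hi = bootstrap_ci(scores, n)
    print(f"{n} evaluators: 95% CI [{lo:.1f}, {hi:.1f}]")
```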
Comparison to Automated Metrics
As FID [68] is one of the most frequently used evaluation methods for uncondi-
tional image generation, it is imperative to compare HYPE against FID on the
same models. We also compare to two newer automated metrics: KID [11], an
unbiased estimator independent of sample size, and F1/8 (precision) [162], which
Table 6 HYPE∞ on three models trained on ImageNet and conditionally sampled on five classes. BigGAN routinely outperforms SN-GAN. BigGANtrunc and BigGANno-trunc are not separable

       GAN              Class            HYPE∞ (%)  Fakes error  Reals error  Std.  95% CI     KID    FID     Precision
Easy   BigGANtrunc      Lemon            18.4%      21.9%        14.9%        2.3   14.2–23.1  0.043  94.22   0.784
       BigGANno-trunc   Lemon            20.2%      22.2%        18.1%        2.2   16.0–24.8  0.036  87.54   0.774
       SN-GAN           Lemon            12.0%      10.8%        13.3%        1.6   9.0–15.3   0.053  117.90  0.656
Easy   BigGANtrunc      Samoyed          19.9%      23.5%        16.2%        2.6   15.0–25.1  0.027  56.94   0.794
       BigGANno-trunc   Samoyed          19.7%      23.2%        16.1%        2.2   15.5–24.1  0.014  46.14   0.906
       SN-GAN           Samoyed          5.8%       3.4%         8.2%         0.9   4.1–7.8    0.046  88.68   0.785
Easy   BigGANtrunc      Library          17.4%      22.0%        12.8%        2.1   13.3–21.6  0.049  98.45   0.695
       BigGANno-trunc   Library          22.9%      28.1%        17.6%        2.1   18.9–27.2  0.029  78.49   0.814
       SN-GAN           Library          13.6%      15.1%        12.1%        1.9   10.0–17.5  0.043  94.89   0.814
Hard   BigGANtrunc      French horn      7.3%       9.0%         5.5%         1.8   4.0–11.2   0.031  78.21   0.732
       BigGANno-trunc   French horn      6.9%       8.6%         5.2%         1.4   4.3–9.9    0.042  96.18   0.757
       SN-GAN           French horn      3.6%       5.0%         2.2%         1.0   1.8–5.9    0.156  196.12  0.674
Hard   BigGANtrunc      Baseball player  1.9%       1.9%         1.9%         0.7   0.8–3.5    0.049  91.31   0.853
       BigGANno-trunc   Baseball player  2.2%       3.3%         1.2%         0.6   1.3–3.5    0.026  76.71   0.838
       SN-GAN           Baseball player  2.8%       3.6%         1.9%         1.5   0.8–6.2    0.052  105.82  0.785
Table 7 Four models on CIFAR-10. StyleGANtrunc can generate realistic images from CIFAR-10

GAN            HYPE∞ (%)  Fakes error  Reals error  Std.  95% CI     KID    FID    Precision
StyleGANtrunc  23.3%      28.2%        18.5%        1.6   20.1–26.4  0.005  62.9   0.982
ProGAN         14.8%      18.5%        11.0%        1.6   11.9–18.0  0.001  53.2   0.990
BEGAN          14.5%      14.6%        14.5%        1.7   11.3–18.1  0.056  96.2   0.326
WGAN-GP        13.2%      15.3%        11.1%        2.3   9.1–18.1   0.046  104.0  0.654
only BigGAN, we find far weaker coefficients for KID (ρ = −0.151, p = 0.68), FID (ρ = −0.067, p = 0.85), and precision (ρ = −0.164, p = 0.65). This illustrates an
important flaw with these automatic metrics: their ability to correlate with humans
depends upon the generative model that the metrics are evaluating on, varying by
model and by task.
CIFAR-10
For the difficult task of unconditional generation on CIFAR-10, we use the same four
model architectures in Experiment 1: CelebA-64. Table 7 shows that HYPE∞ was
able to separate StyleGANtrunc from the earlier BEGAN, WGAN-GP, and ProGAN,
indicating that StyleGAN is the first among them to make human-perceptible progress
on unconditional object generation with CIFAR-10.
Comparison to automated metrics. Spearman rank-order correlation coeffi-
cients on all four GANs show medium, yet statistically insignificant, correlations with KID (ρ = −0.600, p = 0.40), FID (ρ = 0.600, p = 0.40), and precision (ρ = −0.800, p = 0.20).
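For reference, a rank-order comparison of this kind can be computed with scipy's spearmanr; the HYPE∞ values below are taken from Table 7, while the automated-metric values are placeholders rather than the paper's measurements.

```python
from scipy.stats import spearmanr

hype_inf = [23.3, 14.8, 14.5, 13.2]    # per-model HYPE-infinity scores (%), from Table 7
automated = [0.62, 0.48, 0.51, 0.40]   # placeholder per-model scores from an automated metric
rho, p = spearmanr(hype_inf, automated)
print(f"rho = {rho:.3f}, p = {p:.3f}")
```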
like the Inception Score (IS) [164] and Fréchet Inception Distance (FID) [68] to
evaluate images and BLEU [143], CIDEr [185] and METEOR [5] scores to evaluate
text. While we focus on how realistic generated content appears, other automatic
metrics also measure diversity of output, overfitting, entanglement, training stability,
and computational and sample efficiency of the model [6, 15, 128]. Our metric may
also capture one aspect of output diversity, insofar as human evaluators can detect
similarities or patterns across images. Our evaluation is not meant to replace existing
methods but to complement them.
Limitations of automatic metrics. Prior work has asserted that there exists a coarse correlation between human judgment and FID [68] and IS [164], leading to their widespread
adoption. Both metrics depend on the Inception-v3 Network [179], a pretrained Ima-
geNet model, to calculate statistics on the generated output (for IS) and on the real
and generated distributions (for FID). The validity of these metrics when applied to
other datasets has been repeatedly called into question [6, 15, 151, 157]. Perturba-
tions imperceptible to humans alter their values, similar to the behavior of adversarial
examples [105]. Finally, similar to our metric, FID depends on a set of real examples
and a set of generated examples to compute high-level differences between the dis-
tributions, and there is inherent variance to the metric depending on the number of
images and which images were chosen—in fact, there exists a correlation between
accuracy and budget (cost of computation) in improving FID scores, because spend-
ing a longer time and thus higher cost on compute will yield better FID scores [128].
Nevertheless, this cost is still lower than paid human annotators per image.
Human evaluations. Many human-based evaluations have been attempted with varying degrees of success in prior work, either to evaluate models directly [45, 141]
or to motivate using automated metrics [68, 164]. Prior work also used people to
evaluate GAN outputs on CIFAR-10 and MNIST and even provided immediate feed-
back after every judgment [164]. They found that generated MNIST samples have
saturated human performance—i.e. people cannot distinguish generated numbers
from real MNIST numbers, while still finding a 21.3% error rate on CIFAR-10 with
the same model [164]. This suggests that different datasets will have different levels
of complexity for crossing realistic or hyper-realistic thresholds. The closest recent
work to ours compares models using a tournament of discriminators [141]. Never-
theless, this comparison was not yet rigorously evaluated with humans, nor were human
discriminators presented experimentally. The framework we present would enable
such a tournament evaluation to be performed reliably and easily.
4.7 Discussion
Limitations. Extensions of HYPE may require different task designs. In the case of
text generation (translation, caption generation), HYPEtime will require much longer
and much higher range adjustments to the perceptual time thresholds [97, 198]. In
addition to measuring realism, other metrics like diversity, overfitting, entanglement,
training stability, and computational and sample efficiency are additional benchmarks
that can be incorporated but are outside the scope of this paper. Some may be better
suited to a fully automated evaluation [15, 128]. Similar to related work in evaluating text generation [64], we suggest that diversity can be incorporated using the automated recall score, which measures diversity independently from precision (F1/8) [162].
Conclusion. HYPE provides two human evaluation benchmarks for generative mod-
els that (1) are grounded in psychophysics, (2) provide task designs that produce
reliable results, (3) separate model performance, and (4) are cost and time efficient.
We introduce two benchmarks: HYPEtime , which uses time perceptual thresholds,
and HYPE∞ , which reports the error rate sans time constraints. We demonstrate the
efficacy of our approach on image generation across six models {StyleGAN, SN-
GAN, BigGAN, ProGAN, BEGAN, WGAN-GP}, four image datasets {CelebA-64,
FFHQ-1024, CIFAR-10, ImageNet-5 }, and two types of sampling methods {with,
without the truncation trick}.
5 Conclusion
Popular culture has long depicted vision as a primary interaction modality between
people and machines; vision is a necessary sensing capability for humanoid robots
such as C-3PO from “Star Wars,” Wall-E from the eponymous film, and even disem-
bodied Artificial Intelligence such as Samantha the smart operating system from the
movie “Her.” These fictional machines paint a potential real future where machines
can tap into the expressive range of non-intrusive information that Computer Vision
affords. Our expressions, gestures, and relative position to objects carry a wealth
of information that intelligent interactive machines can use, enabling new applica-
tions in domains such as healthcare [63], sustainability [83], human-interpretable
actions [48], and mixed-initiative interactions [73].
While Human–Computer Interaction (HCI) researchers have long discussed and
debated what human–AI interaction should look like [73, 170], we have rarely
provided concrete, immediately operational goals to Computer Vision researchers.
Instead, we’ve largely left this job up to the vision community itself, which has pro-
duced a variety of immediately operational tasks to work on. These tasks play an
important role in the AI community; some of them ultimately channel the efforts of
thousands of AI researchers and set the direction of progress for years to come. The
tasks range from object recognition [43], to scene understanding [98], to explainable
AI [1], to interactive robot training [183]. And while many such tasks have been
worthwhile endeavors, we often find that the models they produce don’t work in
practice or don’t fit end-users’ needs as hoped [24, 136].
If the tasks that guide the work of thousands of AI researchers do not reflect the
HCI community’s understanding of how humans can best interact with AI-powered
systems, then the resulting AI-powered systems will not reflect it either. We therefore
believe there is an important opportunity for HCI and Computer Vision researchers to
begin closing this gap by collaborating to directly integrate HCI’s insights and goals
into immediately actionable vision tasks, model designs, data collection protocols,
and evaluation schemes. One such example of this type of work is the HYPE bench-
mark mentioned earlier in this chapter [210], which aimed to push GAN researchers
to focus directly on a high-quality measurement of human perception when creating
their models. Another is the approach taken by the social strategies project men-
tioned earlier in this chapter [144], which aimed to push data collection protocols to
consider social interventions designed to elicit volunteer contributions.
What kind of tasks might HCI researchers work to create? First, explainable AI
aims to help people understand how computer vision models work, but methods are
developed without real consideration of how humans will ultimately use explanations
to interact with them. HCI researchers might propose design choices for how to
introduce and explain vision models grounded in human subjects experiments [23,
91]. Second, perceptual robotics can learn to complete new tasks by incorporating
human rewards, but do not consider how people actually want to provide feedback
to robots [183]. If we want robots to incorporate an understanding of how humans
want to give feedback when deployed, then HCI researchers might propose new
training paradigms with ecologically valid human interactions. Third, multi-agent vision systems [82] are developed that ignore key aspects of the human psyche, such as choosing to perform non-optimal behaviors, despite foundational work in HCI noting the perils of such assumptions in AI planning [178]. Without incorporating human behavioral priors, these multi-agent systems work well when AI agents collaborate with one another but fail when one of the agents is replaced by a human [28]. If
we want multi-agent vision systems that understand biases that people have when
performing actions, then HCI researchers might propose human–AI collaboration
tasks and benchmarks in which agents are forced to encounter realistic human actors
(indeed, non-vision work has begun to move in this direction [106]).
Acknowledgements The first project was supported by the National Science Foundation award
1351131. The second project was partially funded by the Brown Institute of Media Innovation
and by Toyota Research Institute (“TRI”). The third project was partially funded by a Junglee
Corporation Stanford Graduate Fellowship, an Alfred P. Sloan fellowship and by TRI. This chapter
solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.
References
1. Adadi A, Berrada M (2018) Peeking inside the black-box: a survey on explainable artificial
intelligence (xai). IEEE Access 6:52138–52160
2. Ambati V, Vogel S, Carbonell J (2011) Towards task recommendation in micro-task markets
25. Burke M, Kraut RE, Joyce E (2014) Membership claims and requests: some newcomer social-
ization strategies in online communities. Small Group Research
26. Burke M, Kraut R (2013) Using facebook after losing a job: Differential benefits of strong
and weak ties. In: Proceedings of the 2013 conference on computer supported cooperative
work. ACM, pp 1419–1430
27. Card SK, Newell A, Moran TP (1983) The psychology of human-computer interaction
28. Carroll M, Shah R, Ho MK, Griffiths T, Seshia S, Abbeel P, Dragan A (2019) On the utility
of learning about humans for human-ai coordination. In: Advances in neural information
processing systems, pp 5174–5185
29. Cassell J, Thórisson KR (1999) The power of a nod and a glance: envelope vs. emotional
feedback in animated conversational agents. Appl Artif Intell 13:519–538
30. Cerrato L, Ekeklint S (2002) Different ways of ending human-machine dialogues
31. Chaiken S (1989) Heuristic and systematic information processing within and beyond the
persuasion context. In: Unintended thought, pp 212–252
32. Chellappa R, Sinha P, Jonathon Phillips P (2010) Face recognition by computers and humans.
Computer 43(2):46–55
33. Cheng J, Teevan J, Bernstein MS (2015) Measuring crowdsourcing effort with error-time
curves. In: Proceedings of the 33rd annual ACM conference on human factors in computing
systems. ACM, pp 1365–1374
34. Chidambaram V, Chiang Y-H, Mutlu B (2012) Designing persuasive robots: how robots
might persuade people using vocal and nonverbal cues. In: Proceedings of the seventh annual
ACM/IEEE international conference on human-robot interaction. ACM, pp 293–300
35. Chilton LB, Little G, Edge D, Weld DS, Landay JA (2013) Cascade: crowdsourcing taxonomy
creation. In: Proceedings of the SIGCHI conference on human factors in computing systems.
ACM, pp 1999–2008
36. Cialdini R (2016) Pre-suasion: a revolutionary way to influence and persuade. Simon and
Schuster
37. Colligan L, Potts HWW, Finn CT, Sinkin RA (2015) Cognitive workload changes for nurses
transitioning from a legacy system with paper documentation to a commercial electronic
health record. Int J Med Inform 84(7):469–476
38. Cornsweet TN (1962) The staircase-method in psychophysics
39. Corti K, Gillespie A (2016) Co-constructing intersubjectivity with artificial conversational
agents: people are more likely to initiate repairs of misunderstandings with agents represented
as human. Comput Hum Behav 58:431–442
40. Dakin SC, Omigie D (2009) Psychophysical evidence for a non-linear representation of facial
identity. Vis Res 49(18):2285–2296
41. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: 2005
IEEE computer society conference on computer vision and pattern recognition (CVPR’05),
vol 1, pp 886–893
42. Darley JM, Latané B (1968) Bystander intervention in emergencies: diffusion of responsibility.
J Personal Soc Psychol 8(4p1):377
43. Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical
image database. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE,
pp 248–255
44. Deng J, Russakovsky O, Krause J, Bernstein MS, Berg A, Fei-Fei L (2014) Scalable multi-
label annotation. In: Proceedings of the SIGCHI conference on human factors in computing
systems. ACM, pp 3099–3102
45. Denton EL, Chintala S, Fergus R, et al (2015) Deep generative image models using a laplacian
pyramid of adversarial networks. In: Advances in neural information processing systems, pp
1486–1494
46. Devlin J, Chang M-W, Lee K, Toutanova K (2018) Bert: pre-training of deep bidirectional
transformers for language understanding. arXiv:1810.04805
47. Difallah DE, Demartini G, Cudré-Mauroux P (2013) Pick-a-crowd: tell me what you like, and
i’ll tell you what to do. In: Proceedings of the 22nd international conference on world wide
web, WWW ’13. ACM, New York, NY, USA, pp 367–374
48. Dragan AD, Lee KCT, Srinivasa SS (2013) Legibility and predictability of robot motion. In:
2013 8th ACM/IEEE international conference on human-robot interaction (HRI). IEEE, pp
301–308
49. Fast E, Chen B, Mendelsohn J, Bassen J, Bernstein MS (2018) Iris: a conversational agent for
complex tasks. In: Proceedings of the 2018 CHI conference on human factors in computing
systems. ACM, p 473
50. Fast E, Steffee D, Wang L, Brandt JR, Bernstein MS (2014) Emergent, crowd-scale program-
ming practice in the ide. In: Proceedings of the 32nd annual ACM conference on Human
factors in computing systems. ACM, pp 2491–2500
51. Fei-Fei L, Iyer A, Koch C, Perona P (2007) What do we perceive in a glance of a real-world
scene? J Vis 7(1):10
52. Felsenstein J (1985) Confidence limits on phylogenies: an approach using the bootstrap.
Evolution 39(4):783–791
53. Ferrara E, Varol O, Davis C, Menczer F, Flammini A (2016) The rise of social bots. Commun
ACM 59(7):96–104
54. Fraisse P (1984) Perception and estimation of time. Ann Rev Psychol 35(1):1–37
55. Geiger D, Schader M (2014) Personalized task recommendation in crowdsourcing information
systems – current state of the art. Decis Support Syst 65:3–16. Crowdsourcing and Social
Networks Analysis
56. Gilbert E, Karahalios K (2009) Predicting tie strength with social media. In: Proceedings of
the SIGCHI conference on human factors in computing systems. ACM, pp 211–220
57. Gillund G, Shiffrin RM (1984) A retrieval model for both recognition and recall. Psychol Rev
91(1):1
58. Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object
detection and semantic segmentation. In: 2014 IEEE conference on computer vision and
pattern recognition (CVPR). IEEE, pp 580–587
59. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A,
Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing
systems, pp 2672–2680
60. Gray M, Suri S (2019) Ghost work: how to stop silicon valley from building a new global
underclass. Eamon Dolan
61. Greene MR, Oliva A (2009) The briefest of glances: the time course of natural scene under-
standing. Psychol Sci 20(4):464–472
62. Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V, Courville AC (2017) Improved training of
wasserstein gans. In: Advances in neural information processing systems, pp 5767–5777
63. Haque A, Milstein A, Fei-Fei L (2020) Illuminating the dark spaces of healthcare with ambient
intelligence. Nature 585(7824):193–202
64. Hashimoto TB, Zhang H, Liang P (2019) Unifying human and statistical evaluation for natural
language generation. arXiv:1904.02792
65. Hata K, Krishna R, Fei-Fei L, Bernstein MS (2017) A glimpse far into the future: under-
standing long-term crowd worker quality. In: Proceedings of the 2017 ACM conference on
computer supported cooperative work and social computing. ACM, pp 889–901
66. Healy K, Schussman A (2003) The ecology of open-source software development. Technical report, University of Arizona, USA
67. Hempel J (2015) Facebook launches m, its bold answer to siri and cortana. Wired. Retrieved January 1, 2017
68. Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S (2017) Gans trained by a two
time-scale update rule converge to a local nash equilibrium. In: Advances in neural information
processing systems, pp 6626–6637
69. Hill BM (2013) Almost wikipedia: eight early encyclopedia projects and the mechanisms of
collective action. Massachusetts institute of technology, pp 1–38
70. Hinton GE (2002) Training products of experts by minimizing contrastive divergence. Neural
Comput 14(8):1771–1800
71. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–
1780
72. Hoffman ML (1981) Is altruism part of human nature? J Personal Soc Psychol 40(1):121
73. Horvitz E (1999) Principles of mixed-initiative user interfaces. In: Proceedings of the SIGCHI
conference on human factors in computing systems. ACM, pp 159–166
74. Huang F, Canny JF (2019) Sketchforme: composing sketched scenes from text descriptions
for interactive applications. In: Proceedings of the 32nd annual ACM symposium on user
interface software and technology, pp 209–220
75. Huang T-HK, Chang J, Bigham J (2018) Evorus: a crowd-powered conversational assistant
built to automate itself over time. In: Proceedings of the 2018 CHI conference on human
factors in computing systems. ACM, p 295
76. Hutto CJ, Gilbert E (2014) Vader: a parsimonious rule-based model for sentiment analysis of
social media text. In: Eighth international AAAI conference on weblogs and social media
77. Iordan MC, Greene MR, Beck DM, Fei-Fei L (2015) Basic level category structure emerges
gradually across human ventral visual cortex. In: Journal of cognitive neuroscience
78. Ipeirotis PG (2010) Analyzing the amazon mechanical turk marketplace. XRDS: Crossroads.
The ACM Mag Stud 17(2):16–21
79. Ipeirotis PG, Provost F, Wang J (2010) Quality management on amazon mechanical turk. In:
Proceedings of the ACM SIGKDD workshop on human computation. ACM, pp 64–67
80. Irani LC, Silberman M (2013) Turkopticon: interrupting worker invisibility in amazon
mechanical turk. In: Proceedings of the SIGCHI conference on human factors in comput-
ing systems. ACM, pp 611–620
81. Jain SD, Grauman K (2013) Predicting sufficient annotation strength for interactive foreground
segmentation. In: 2013 IEEE international conference on computer vision (ICCV). IEEE, pp
1313–1320
82. Jain U, Weihs L, Kolve E, Farhadi A, Lazebnik S, Kembhavi A, Schwing A (2020) A cor-
dial sync: Going beyond marginal policies for multi-agent embodied tasks. In: European
conference on computer vision. Springer, pp 471–490
83. Jean N, Burke M, Xie M, Davis WM, Lobell DB, Ermon S (2016) Combining satellite imagery
and machine learning to predict poverty. Science 353(6301):790–794
84. Josephy T, Lease M, Paritosh P (2013) Crowdscale 2013: crowdsourcing at scale workshop
report
85. Kamar E, Hacker S, Horvitz E (2012) Combining human and machine intelligence in large-
scale crowdsourcing. In: Proceedings of the 11th international conference on autonomous
agents and multiagent systems-volume 1. International Foundation for Autonomous Agents
and Multiagent Systems, pp 467–474
86. Karger DR, Oh S, Shah D (2011) Budget-optimal crowdsourcing using low-rank matrix
approximations. In: 2011 49th annual allerton conference on communication, control, and
computing (allerton). IEEE, pp 284–291
87. Karger DR, Oh S (2014) Shah D Budget-optimal task allocation for reliable crowdsourcing
systems. Oper Res 62(1):1–24
88. Karras T, Aila T, Laine S, Lehtinen J (2017) Progressive growing of gans for improved quality,
stability, and variation. arXiv:1710.10196
89. Karras T, Laine S, Aila T (2018) A style-based generator architecture for generative adversarial
networks. arXiv:1812.04948
90. Karras T, Laine S, Aila T (2019) A style-based generator architecture for generative adversarial
networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition,
pp 4401–4410
91. Khadpe P, Krishna R, Fei-Fei L, Hancock JT, Bernstein MS (2020) Conceptual
metaphors impact perceptions of human-ai collaboration. Proc ACM Hum-Comput Interact
4(CSCW2):1–26
92. Kittur A, Chi EH, Suh B (2008) Crowdsourcing user studies with mechanical turk. In: Proceed-
ings of the SIGCHI conference on human factors in computing systems. ACM, pp 453–456
Visual Intelligence through Human Interaction 309
93. Klein SA (2001) Measuring, estimating, and understanding the psychometric function: a
commentary. Percept Psychophys 63(8):1421–1455
94. Kramer ADI, Guillory JE, Hancock JT (2014) Experimental evidence of massive-scale emo-
tional contagion through social networks. Proc Natl Acad Sci 111(24):8788–8790
95. Kraut RE, Resnick P (2011) Encouraging contribution to online communities. Building suc-
cessful online communities: evidence-based social design, pp 21–76
96. Krishna R, Bernstein M, Fei-Fei L (2019) Information maximizing visual question generation.
In: IEEE conference on computer vision and pattern recognition
97. Krishna R, Hata K, Ren F, Fei-Fei L, Niebles JC (2017) Dense-captioning events in videos.
In: Proceedings of the IEEE international conference on computer vision, pp 706–715
98. Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li L-J, Shamma
DA et al (2017) Visual genome: connecting language and vision using crowdsourced dense
image annotations. Int J Comput Vis 123(1):32–73
99. Krishna RA, Hata K, Chen S, Kravitz J, Shamma DA, Fei-Fei L, Bernstein MS (2016) Embrac-
ing error to enable rapid crowdsourcing. In: Proceedings of the 2016 CHI conference on human
factors in computing systems. ACM, pp 3167–3179
100. Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images. Tech-
nical report, Citeseer
101. Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional
neural networks. In: Advances in neural information processing systems, pp 1097–1105
102. Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional
neural networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ (eds) Advances in neural
information processing systems 25. Curran Associates, Inc., pp 1097–1105
103. Krueger GP (1989) Sustained work, fatigue, sleep loss and performance: a review of the
issues. Work Stress 3(2):129–141
104. Kumar R, Satyanarayan A, Torres C, Lim M, Ahmad S, Klemmer SR, Talton JO (2013)
Webzeitgeist: design mining the web. In: Proceedings of the SIGCHI conference on human
factors in computing systems. ACM, pp 3083–3092
105. Kurakin A, Goodfellow I, Bengio S (2016) Adversarial examples in the physical world.
arXiv:1607.02533
106. Kwon M, Biyik E, Talati A, Bhasin K, Losey DP, Sadigh D (2020) When humans aren’t opti-
mal: robots that collaborate with risk-aware humans. In: Proceedings of the 2020 ACM/IEEE
international conference on human-robot interaction, pp 43–52
107. Laielli M, Smith J, Biamby G, Darrell T, Hartmann B (2019) Labelar: a spatial guidance
interface for fast computer vision image collection. In: Proceedings of the 32nd annual ACM
symposium on user interface software and technology, pp 987–998
108. Langer EJ, Blank A, Chanowitz B (1978) The mindlessness of ostensibly thoughtful action: the
role of “placebic” information in interpersonal interaction. J Personal Soc Psychol 36(6):635
109. Laput G, Lasecki WS, Wiese J, Xiao R, Bigham JP, Harrison C (2015) Zensors: adaptive,
rapidly deployable, human-intelligent sensor feeds. In: Proceedings of the 33rd annual ACM
conference on human factors in computing systems. ACM, pp 1935–1944
110. Lasecki W, Miller C, Sadilek A, Abumoussa A, Borrello D, Kushalnagar R, Bigham J (2012)
Real-time captioning by groups of non-experts. In: Proceedings of the 25th annual ACM
symposium on user interface software and technology. ACM, pp 23–34
111. Lasecki WS, Murray KI, White S, Miller RC, Bigham JP (2011) Real-time crowd control of
existing interfaces. In: Proceedings of the 24th annual ACM symposium on User interface
software and technology. ACM, pp 23–32
112. Lasecki WS, Wesley R, Nichols J, Kulkarni A, Allen JF, Bigham JP (2013) Chorus: a crowd-
powered conversational assistant. In: Proceedings of the 26th annual ACM symposium on
User interface software and technology. ACM, pp 151–162
113. Law E, Yin M, Goh J, Chen K, Terry MA, Gajos KZ (2016) Curiosity killed the cat, but
makes crowdwork better. In: Proceedings of the 2016 CHI conference on human factors in
computing systems. ACM, pp 4098–4110
310 R. Krishna et al.
138. Miyato T, Kataoka T, Koyama M, Yoshida Y (2018) Spectral normalization for generative
adversarial networks. arXiv:1802.05957
139. Nass C, Brave S (2007) Wired for speech: how voice activates and advances the human-
computer relationship. The MIT Press
140. Niebles JC, Wang H, Fei-Fei L (2008) Unsupervised learning of human action categories
using spatial-temporal words. Int J Comput Vis 79(3):299–318
141. Olsson C, Bhupatiraju S, Brown T, Odena A, Goodfellow I (2018) Skill rating for generative
models. arXiv:1808.04888
142. Pang B, Lee L (2008) Opinion mining and sentiment analysis. Found Trends Inf Retr 2(1–
2):1–135
143. Papineni K, Roukos S, Ward T, Zhu W-J (2002) Bleu: a method for automatic evaluation of
machine translation. In: Proceedings of the 40th annual meeting on association for computa-
tional linguistics. Association for Computational Linguistics, pp 311–318
144. Park J, Krishna R, Khadpe P, Fei-Fei L, Bernstein M (2019) Ai-based request augmentation to
increase crowdsourcing participation. Proc AAAI Conf Hum Comput Crowdsourcing 7:115–
124
145. Parkash A, Parikh D (2012) Attributes for classifier feedback. In: Computer vision–ECCV
2012. Springer, pp 354–368
146. Peng Dai MD, Weld S (2010) Decision-theoretic control of crowd-sourced workflows. In: In
the 24th AAAI conference on artificial intelligence (AAAI’10. Citeseer
147. Portilla J, Simoncelli EP (2000) A parametric texture model based on joint statistics of complex
wavelet coefficients. Int J Comput Vis 40(1):49–70
148. Potter MC (1976) Short-term conceptual memory for pictures. J Exp Psychol Hum Learn
Mem 2(5):509
149. Potter MC, Levy EI (1969) Recognition memory for a rapid sequence of pictures. J Exp
Psychol 81(1):10
150. Radford A, Metz L, Chintala S (2015) Unsupervised representation learning with deep con-
volutional generative adversarial networks. arXiv:1511.06434
151. Ravuri S, Mohamed S, Rosca M, Vinyals O (2018) Learning implicit generative models with
the method of learned moments. arXiv:1806.11006
152. Rayner K, Smith TJ, Malcolm GL, Henderson JM (2009) Eye movements and visual encoding
during scene perception. Psychol Sci 20(1):6–10
153. Reeves A, Sperling G (1986) Attention gating in short-term visual memory. Psychol Rev
93(2):180
154. Reeves B, Nass CI (1996) The media equation: how people treat computers, television, and
new media like real people and places. Cambridge university press
155. Reich J, Murnane R, Willett J (2012) The state of wiki usage in us k–12 schools: Leveraging
web 2.0 data warehouses to assess quality and equity in online learning environments. Educ
Res 41(1):7–15
156. Robert C (1984) Influence: the psychology of persuasion. William Morrow and Company,
Nowy Jork
157. Rosca M, Lakshminarayanan B, Warde-Farley D, Mohamed S (2017) Variational approaches
for auto-encoding generative adversarial networks. arXiv:1706.04987
158. Rössler A, Cozzolino D, Verdoliva L, Riess C, Thies J, Nießner M (2019) Faceforensics++:
learning to detect manipulated facial images. arXiv:1901.08971
159. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A,
Bernstein M, Berg AC, Li F-F (2014) Imagenet large scale visual recognition challenge. In:
International Journal of Computer Vision, pp 1–42
160. Russakovsky O, Li L-J, Fei-Fei L (2015) Best of both worlds: human-machine collaboration
for object annotation. In: Proceedings of the IEEE conference on computer vision and pattern
recognition, pp 2121–2131
161. Rzeszotarski JM, Chi E, Paritosh P, Dai P (2013) Inserting micro-breaks into crowdsourcing
workflows. In: First AAAI conference on human computation and crowdsourcing
312 R. Krishna et al.
162. Sajjadi MSM, Bachem O, Lucic M, Bousquet O, Gelly S (2018) Assessing generative models
via precision and recall. In: Advances in neural information processing systems, pp 5228–5237
163. Salehi N, Irani LC, Bernstein MS (2015) We are dynamo: overcoming stalling and friction in
collective action for crowd workers. In: Proceedings of the 33rd annual ACM conference on
human factors in computing systems. ACM, pp 1621–1630
164. Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A, Chen X (2016) Improved
techniques for training gans. In: Advances in neural information processing systems, pp
2234–2242
165. Sardar A, Joosse M, Weiss A, Evers V (2012) Don’t stand so close to me: users’ attitudinal
and behavioral responses to personal space invasion by robots. In: Proceedings of the seventh
annual ACM/IEEE international conference on human-robot interaction. ACM, pp 229–230
166. Schapire RE, Singer Y (2000) Boostexter: a boosting-based system for text categorization.
Mach Learn 39(2):135–168
167. Seetharaman P, Pardo B (2014) Crowdsourcing a reverberation descriptor map. In: Proceed-
ings of the ACM international conference on multimedia. ACM, pp 587–596
168. Sheng VS, Provost F, Ipeirotis PG (2008) Get another label? improving data quality and data
mining using multiple, noisy labelers. In: Proceedings of the 14th ACM SIGKDD international
conference on Knowledge discovery and data mining. ACM, pp 614–622
169. Sheshadri A, Lease M (2013) Square: a benchmark for research on computing crowd consen-
sus. In: First AAAI conference on human computation and crowdsourcing
170. Shneiderman B, Maes P (1997) Direct manipulation vs. interface agents. Interactions 4(6):42–
61 November
171. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image
recognition. CoRR, abs/1409.1556
172. Smyth P, Burl MC, Fayyad UM, Perona P (1994) Knowledge discovery in large image
databases: dealing with uncertainties in ground truth. In: KDD workshop, pp 109–120
173. Smyth P, Fayyad U, Burl M, Perona P, Baldi P (1995) Inferring ground truth from subjective
labelling of venus images
174. Snow R, O’Connor B, Jurafsky D, Ng AY (2008) Cheap and fast—but is it good?: evaluating
non-expert annotations for natural language tasks. In: Proceedings of the conference on empir-
ical methods in natural language processing. Association for Computational Linguistics, pp
254–263
175. Song Z, Chen Q, Huang Z, Hua Y, Yan S (2011) Contextualizing object detection and classifi-
cation. In: 2011 IEEE conference on computer vision and pattern recognition (CVPR). IEEE,
pp 1585–1592
176. Sperling G (1963) A model for visual memory tasks. Hum Factors 5(1):19–31
177. Su H, Deng J, Fei-Fei L (2012) Crowdsourcing annotations for visual object detection. In:
Workshops at the twenty-sixth AAAI conference on artificial intelligence
178. Suchman LA (1987) Plans and situated actions: the problem of human-machine communica-
tion. Cambridge University Press, Cambridge
179. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception archi-
tecture for computer vision. In: Proceedings of the IEEE conference on computer vision and
pattern recognition, pp 2818–2826
180. Tamuz O, Liu C, Belongie S, Shamir O, Kalai AT (2011) Adaptively learning the crowd
kernel. arXiv:1105.1033
181. Taylor PJ, Thomas S (2008) Linguistic style matching and negotiation outcome. Negot Confl
Manag Res 1(3):263–281
182. Theis L, van den Oord A, Bethge M (2015) A note on the evaluation of generative models.
arXiv:1511.01844
183. Thomaz AL, Breazeal C (2008) Teachable robots: understanding human teaching behavior
to build more effective robot learners. Artif Intell 172(6–7):716–737
184. Thomee B, Shamma DA, Friedland G, Elizalde B, Ni K, Poland D, Borth D, Li L-J (2016)
Yfcc100m: the new data in multimedia research. Commun ACM 59(2). To Appear
Visual Intelligence through Human Interaction 313
185. Vedantam R, Zitnick CL, Parikh D (2015) Cider: consensus-based image description evalua-
tion. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp
4566–4575
186. Vijayanarasimhan S, Jain P, Grauman K (2010) Far-sighted active learning on a budget for
image and video recognition. In: 2010 IEEE conference on computer vision and pattern
recognition (CVPR). IEEE, pp 3035–3042
187. Vinyals O, Toshev A, Bengio S, Erhan D (2014) Show and tell: a neural image caption
generator. arXiv:1411.4555
188. Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: a neural image caption
generator. In: Proceedings of the IEEE conference on computer vision and pattern recognition,
pp 3156–3164
189. von Ahn L, Dabbish L (2004) Labeling images with a computer game. In: Proceedings of the
SIGCHI conference on Human factors in computing systems. ACM, pp 319–326
190. von Ahn L, Dabbish L (2004) Labeling images with a computer game, pp 319–326
191. Vondrick C, Patterson D, Ramanan D (2013) Efficiently scaling up crowdsourced video anno-
tation. Int J Comput Vis 101(1):184–204
192. Wah C, Branson S, Perona P, Belongie S (2011) Multiclass recognition and part localization
with humans in the loop. In: 2011 IEEE international conference on computer vision (ICCV).
IEEE, pp 2524–2531
193. Wah C, Van Horn G, Branson S, Maji S, Perona P, Belongie S (2014) Similarity comparisons
for interactive fine-grained categorization. In: 2014 IEEE conference on computer vision and
pattern recognition (CVPR). IEEE, pp 859–866
194. Wang Y-C, Kraut RE, Levine JM (2015) Eliciting and receiving online support: using
computer-aided content analysis to examine the dynamics of online social support. J Med
Internet Res 17(4):e99
195. Warde-Farley D, Bengio Y (2016) Improving generative adversarial networks with denoising
feature matching
196. Warncke-Wang M, Ranjan V, Terveen L, Hecht B (2015) Misalignment between supply and
demand of quality content in peer production communities. In: Ninth international AAAI
conference on web and social media
197. Weichselgartner E, Sperling G (1987) Dynamics of automatic and controlled visual attention.
Science 238(4828):778–780
198. Weld DS, Lin CH, Bragg J (2015) Artificial intelligence and collective intelligence. In: Hand-
book of collective intelligence, pp. 89–114
199. Welinder P, Branson S, Perona P, Belongie SJ (2010) The multidimensional wisdom of crowds.
In: Advances in neural information processing systems, pp 2424–2432
200. Whitehill J, Wu T-f, Bergsma J, Movellan JR, Ruvolo PL (2009) Whose vote should count
more: optimal integration of labels from labelers of unknown expertise. In: Advances in neural
information processing systems, pp 2035–2043
201. Wichmann FA, Jeremy Hill N (2001) The psychometric function: I. Fitting, sampling, and
goodness of fit. Percept Psychophys 63(8):1293–1313
202. Willis CG, Law E, Williams AC, Franzone BF, Bernardos R, Bruno L, Hopkins C, Schorn
C, Weber E, Park DS et al (2017) Crowdcurio: an online crowdsourcing platform to facilitate
climate change studies using herbarium specimens. New Phytol 215(1):479–488
203. Wobbrock JO, Forlizzi J, Hudson SE, Myers BA (2002) Webthumb: interaction techniques for
small-screen browsers. In: Proceedings of the 15th annual ACM symposium on User interface
software and technology. ACM, pp 205–208
204. Xia H, Jacobs J, Agrawala M (2020) Crosscast: adding visuals to audio travel podcasts. In:
Proceedings of the 33rd annual ACM symposium on user interface software and technology,
pp 735–746
205. Yang D, Kraut RE (2017) Persuading teammates to give: systematic versus heuristic cues for
soliciting loans. Proc. ACM Hum-Comput Interact 1(CSCW):114:1–114:21
206. Yue Y-T, Yang Y-L, Ren G, Wang W (2017) Scenectrl: mixed reality enhancement via efficient
scene editing. In: Proceedings of the 30th annual ACM symposium on user interface software
and technology, pp 427–436
314 R. Krishna et al.
207. Zhang H, Sciutto C, Agrawala M, Fatahalian K (2020) Vid2player: controllable video sprites
that behave and appear like professional tennis players. arXiv:2008.04524
208. Zhang T (2004) Solving large scale linear prediction problems using stochastic gradient
descent algorithms. In: Proceedings of the twenty-first international conference on Machine
learning. ACM, p 116
209. Zhou D, Basu S, Mao Y, Platt JC (2012) Learning from the wisdom of crowds by minimax
entropy. In: Advances in neural information processing systems, pp 2195–2203
210. Zhou S, Gordon M, Krishna R, Narcomey A, Fei-Fei LF, Bernstein M (2019) Hype: a bench-
mark for human eye perceptual evaluation of generative models. In: Advances in neural
information processing systems, pp 3449–3461
ML Tools for the Web: A Way for Rapid
Prototyping and HCI Research
Abstract Machine learning (ML) has become a powerful tool with the potential
to enable new interactions and user experiences. Although the use of ML in HCI
research is growing, the process of prototyping and deploying ML remains challeng-
ing. We claim that ML tools designed to be used on the Web are suitable for fast
prototyping and HCI research. In this chapter, we review literature, current technolo-
gies, and use cases of ML tools for the Web. We also provide a case study using
TensorFlow.js, a major Web ML library, to demonstrate how to prototype with Web
ML tools in different prototyping scenarios. Finally, we discuss challenges and
future directions for designing tools for fast prototyping and research.
1 Introduction
1 https://www.tensorflow.org/.
2 https://pytorch.org/.
3 https://pandas.pydata.org/.
4 https://numpy.org/.
N. Li (B) · J. Mayes · P. Yu
Google, TensorFlow.js Team, Mountain View, CA, USA
e-mail: linazhao@google.com
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
Y. Li and O. Hilliges (eds.), Artificial Intelligence for Human Computer Interaction:
A Modern Approach, Human–Computer Interaction Series,
https://doi.org/10.1007/978-3-030-82681-9_10
5 https://github.com/tensorflow/tfjs.
large models [10]. There are also techniques to quantize model weights to make the
model size smaller [13], or to prune and optimize the model architecture for inference.
We can even leverage device hardware acceleration such as the GPU to speed things
up further.
ML is still relatively new to many HCI researchers, designers, and application
developers. We think there is also a lack of ML understanding, educational
materials, and prototyping tools to bridge the gap between ML research and the rest
of the world. The HCI community is now starting to study how to apply ML in
design [7] and software development [4]. In the developer community, a number of
supporting libraries are emerging, such as Danfo.js,6 which replicates
Pandas, along with other common utilities. Books and courses teaching how to use ML tools
in JavaScript are also now being published,7,8 but there is room for improvement in
helping people understand, discover, and share models, which we will discuss later
in the chapter.
2 Related Work
There has been growing interest in designing ML libraries for on-device learning, as
seen in the number of ML system papers [6, 14, 30]. There has also been growing
interest in applying ML in HCI research, as seen in recent CHI workshops [12,
17, 19]. To gain a full picture of what ML libraries for the Web can do and how they can
address the need for fast prototyping and research in HCI, we undertook a review of
the following areas:
• ML use cases in HCI research.
• ML libraries for the Web.
• Task-specific libraries and user-friendly libraries for non-ML experts.
• Challenges for non-ML experts.
The technical side of the HCI community has been working on applying ML to
improve interactions. We describe some common themes, categorized by input types;
see Table 2.
Current state-of-the-art computer vision models are capable of capturing eye, face,
body, and hand movement using webcams, thereby reducing the need for special
sensors and equipment.
6 https://danfo.jsdata.org/.
7 https://www.manning.com/books/deep-learning-with-javascript.
8 https://www.coursera.org/learn/browser-based-models-tensorflow.
Table 2 Inputs, data formats, TensorFlow.js models, and HCI use cases

Input | Data format | Model | Use case
Face, eye, iris | Video | FaceMesh | Eye tracking
Body | Video | BodyPix, PoseNet | Video conferencing, VR, health
Hand | Video | Handtracking | Art and design
Speech | Audio | Speech-command model | Dialog interface, accessibility
Text | Text | MobileBERT | Chat bot
Face and eye tracking. Sewell and Komogortsev [29] demonstrate how to train
a neural network to track eye gaze with only a standard webcam. Facial expressions
can be used in many scenarios; one application area is affect detection. For example,
Bosch et al. studied how ML-powered affect detection can be used to infer learning
states in the classroom [3].
Body segmentation. Body-centric design relies on body segmentation techniques
to create body-aware experiences. In video conferencing, body segments of
participants can be inserted into a co-located virtual space (see, for example, [26])
so that they can accomplish tasks that normally require pointing to or working on
a shared artifact. Body segmentation can also be used to enhance a person's virtual
presentation [15]. In health research, images or videos of body movements can be
used to capture early symptoms [27].
Hand tracking. Hand or gesture tracking also has wide applications, such as using
virtual hands to accomplish complex tasks [34]. In another study, hand and body
tracking are used to recognize aircraft-handling gestures [31].
Speech recognition. Speech or command recognition has gained increasing attention
as assistant agents have become popular. It is also an important topic in accessibility
research [28].
NLP. Language models can power much text-based research, such as chatbots [1].
NLP is also used to help diagnose mental health conditions [33], for machine translation [36],
and for spoken language interaction [22], to name a few examples.
2.2 ML Libraries
Ma et al. [20] provided a comprehensive review of ML libraries for the web in 2019.
There have been a lot of changes in this fast-growing field, so we summarize their
findings and provide an updated review here. The 2019 review examined 7 libraries;
only 3 of them are still in active development as of December 2020: TensorFlow.js,9
brain.js,10 and WebDNN.11 The other 4 libraries have not had code changes
for 2 years. Concurrently, new ML libraries for the web have emerged, such as
ONNX.js12 and Paddle.js.13 These libraries can load a model and execute it
reasonably fast in the browser. Some of them can also train new models
directly in the web browser.
The primary challenge in model loading is being able to load different model
formats generated by different training libraries. Taking the MobileNet model as
an example, the TensorFlow model format differs from the PyTorch model
format. By far, ONNX.js has the best coverage of different model formats. It can
convert PyTorch, TensorFlow, Keras, and, of course, its own format, enabling other
libraries to use ONNX.js as a bridge for converting models to a desired format. The
TensorFlow.js converter only supports TensorFlow and Keras models, but through
ONNX.js it can also convert a PyTorch model.
The primary challenge in model execution is computational efficiency. There are
a number of web technologies that can be used to accelerate computation.
One method of acceleration is parallel computing. By far, TensorFlow.js has the
widest range of parallel computing options for JavaScript-based ML. It can use the GPU to run the
computation, which not only frees up CPU resources but also takes
advantage of the GPU's architecture. On the GPU, a matrix can be stored in 2D
space, where each number in the matrix can be thought of as a point in that space.
The GPU can then apply any function to these points in parallel. This means that a
large matrix multiplication can be just as fast as a small one, because the large matrix
simply takes up more space on the GPU while the time taken to do the
computation for all points in parallel is the same.
TensorFlow.js re-purposes the 2D GPU graphics API WebGL14 for model computation.
This means that an N-dimensional tensor is flattened and stored in a
2D space, where each number is a point in that space. Operations on each point
can then be run in parallel as previously described.
Alternatively, TensorFlow.js can use a relatively new CPU computing technology
called WebAssembly (WASM).15 WASM is a cross-browser, portable assembly
and binary format for the web. Developers don't write WASM directly; instead,
existing C/C++/Rust code is compiled to WASM via tool chains such as Emscripten. Its
SIMD16 and multithreading17 features allow WASM code to use vector instructions and
multiple cores in the browser. Currently, multithreading is supported in Chrome and Firefox, while
SIMD is supported in Chrome and Firefox behind a flag that the user must enable.
9 https://github.com/tensorflow/tfjs.
10 https://github.com/BrainJS/brain.js.
11 https://github.com/mil-tokyo/webdnn.
12 https://github.com/microsoft/onnxjs.
13 https://github.com/PaddlePaddle/Paddle.js.
14 https://www.khronos.org/webgl/.
15 https://webassembly.org/.
16 https://v8.dev/features/simd.
17 https://developers.google.com/web/updates/2018/10/wasm-threads.
At the time of writing, TensorFlow.js is the only library with both SIMD and
multithreading acceleration.
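As a concrete illustration, the sketch below shows how an application might choose between these backends in TensorFlow.js; the package imports and the WebGL-first fallback order are our own choices rather than a prescribed setup.
Javascript
// Sketch: pick an acceleration backend (WebGL GPU first, WASM CPU as fallback).
import * as tf from '@tensorflow/tfjs';
import '@tensorflow/tfjs-backend-wasm'; // registers the 'wasm' backend

async function chooseBackend() {
  // setBackend resolves to false if the requested backend cannot be initialized.
  if (!(await tf.setBackend('webgl'))) {
    await tf.setBackend('wasm');
  }
  await tf.ready();
  console.log('Active backend:', tf.getBackend());
}

chooseBackend();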
It is also worth noting that on-device computation efficiency has been studied extensively
and is an active field in mobile ML system design. As noted above, existing C/C++
code can be compiled to WASM and run in the browser. Mobile frameworks
for iOS are mostly written in C/C++, so their CPU-based solutions
can be easily migrated to the web through WASM. The problems mobile developers
face when trying to run ML on device (model conversion and optimization,
acceleration on different device hardware, and keeping bundle sizes small [14]) are very
similar to those of running ML on the web; therefore, these mobile libraries are well
suited for adaptation to web solutions too. This effectively brings a lot of mature
optimization techniques to the web, which can expedite the development of ML tools for the
web. There are several such mobile libraries, including TensorFlow Lite,18 Caffe2,19
NCNN,20 MNN,21 and many more.
Task-specific libraries provide simpler APIs for a specific task, such as face landmark
recognition, pose recognition, or command recognition. These ML tasks usually
require additional steps to pre- and post-process data or to feed the data through a pipeline
of multiple models. These steps are hidden from users, so they can focus
on just the input and output for their application. There are many task-specific libraries
for web usage. One example is face-api.js.22 It takes an HTML image or video element as
input and outputs key points and bounding box coordinates from one of its high-level
APIs, such as face detection, face landmark detection, facial expression recognition,
age estimation, and gender recognition. Another example is ml5.js.23 It provides
a collection of high-level APIs for input types like image, video, sound, and text,
and can accomplish tasks such as face landmark recognition, style transfer, pitch
detection, text generation, and more. These task-specific libraries are often designed
to be very easy to integrate into existing web applications, usually
adding only 2 to 3 lines to the application code, as sketched below.
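To make the "2 to 3 lines" claim concrete, the sketch below uses ml5.js's image classifier as one such high-level API; the element id and the MobileNet classifier choice are illustrative assumptions, not part of any particular application.
Javascript
// Sketch: classify an image with a task-specific library (ml5.js),
// assuming the ml5 script has already been loaded on the page.
async function classifyPhoto() {
  const classifier = await ml5.imageClassifier('MobileNet');
  const results = await classifier.classify(document.getElementById('photo'));
  console.log(results); // e.g., [{ label: '...', confidence: 0.9 }, ...]
}
classifyPhoto();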
18 https://www.tensorflow.org/lite.
19 https://caffe2.ai/.
20 https://github.com/Tencent/ncnn.
21 https://github.com/alibaba/MNN.
22 https://github.com/justadudewhohacks/face-api.js/.
23 https://ml5js.org/.
User-friendly ML systems are also built on top of ML libraries; they provide an end-to-end
user experience for custom training a model with new data or for trying an existing
pre-trained model. These systems often provide GUIs to guide users through the
workflow. Users can choose from many options, such as which model to load, which
hyperparameters to set, which preprocessing methods to use, and more. The tool then takes care of
the rest to train and output a model.
One example of a user-friendly ML system is Runway ML.24 It allows you to
download many pre-trained models from the cloud and run them on your local machine in
minutes. From image classifiers to generative networks, you can try pretty much all
types of ML models that are popular right now. It even exposes a web server that acts
like a local API, which you can call with your own custom data from web browsers to build
ML-enabled web apps in minutes. Runway ML aims to build the next generation
of creative tools powered by machine learning, enabling more than just
researchers to use these advanced models at scale.
Another representative example is Weka [9]. It is a popular choice among data
scientists. The system provides a GUI for training classifiers.
The UI guides users through the components of the training process: uploading and
preprocessing inputs, setting up training and testing data, choosing validation methods,
selecting classifiers, and training and viewing the results with accuracy
analysis. Data is mostly presented in tabular format.
Another notable example is AutoML systems. These systems hide even more
of the steps by automating many of the processes described above. In a nutshell, they
train many models concurrently, automatically selecting hyperparameters and
comparing model accuracy, and then select the best resulting model for
a given input data set. This automatic compare-and-choose process is formally
defined as a Combined Algorithm Selection and Hyperparameter optimization
problem by Feurer et al. [8], who detail an AutoML system design.
A growing number of AutoML systems are available. The Auto-WEKA system
is built on top of Weka [16]. The Google AutoML system25 [2] is a cloud-based
service that provides classifiers specifically for images, video, and text.
Another example is Teachable Machine,26 which makes training even easier by
simplifying both ends of the training process. On the input end, it allows
data to be collected on the fly from the camera and microphone, entirely in the web
browser. On the output end, it allows users to use the newly trained model right away
in the browser, so they can validate the accuracy of the output and retrain immediately.
This simplified workflow immensely expedites the iterative training process
and allows anyone, whatever their background in ML or even programming,
to make a working ML prototype in under 5 minutes.
24 https://runwayml.com/.
25 https://cloud.google.com/automl.
26 https://teachablemachine.withgoogle.com/.
There is a large body of research on end-user programming [23], but few studies on
programming with ML. This problem is different from traditional programming, as
pointed out by Patel et al. [24]: "Despite the widespread impact in research, ML algorithms
are seldom applied in practice by ordinary software engineers. One reason
is that applying machine learning is different from traditional programming. Traditional
programming is often discrete and deterministic, but most machine learning is
stochastic. … Traditional programming is often debugged with print statements and
breakpoints, but machine learning requires analysis with visualizations and statistics."
Patel asserts that "Developers need new methods and tools to support the
task of applying machine learning to their everyday problems."
We synthesize several pioneering works in this area. Patel et al. studied the difficulties
software developers encountered in using statistical ML [25]. Yang et al. studied
how non-experts build ML models [37]. Cai et al. studied the challenges software
developers encounter in using deep learning [4]. From these studies, we identified
several themes in how non-ML experts program with ML.
Many non-ML experts come to ML because they have a data set at hand and a real-world
problem to solve. They see ML as a "black box" that, given an input, will
output a result. They describe problems as real-world situations and desired
outcomes, not as formalized ML problems. For example, they want to use ML to solve
problems like: "Does my houseplant need watering? Is there any coffee left in the
pot? What is the weather like outside? Are there any eggs in my refrigerator? Who
left those dirty dishes in the sink? Has the mail been delivered yet? How many police
cars pass by my window on a typical day?" [21], or to alert when the microwave has finished,
estimate emotion from wearables, predict machine maintenance, and answer HR policy
questions [37]. Under this mindset, they often search for publicly available working
ML examples that solve similar problems [37]. They evaluate models by visually
examining the results or by trying them in their applications [25, 37].
Another common behavior among non-ML experts is that they select and debug
models through rapid trial and error. Lacking sufficient ML knowledge, they seldom
base their decisions on model architecture; rather, they often just "tried all algorithms"
they could get and chose the one that showed the best outcome [37]. Or, as reflected
in Patel et al. [25], "We basically tried a whole bunch of Weka experimentation
and different algorithms."
We propose that fast prototyping and deployment of ML models on the Web can have a
positive spiral effect on the whole research cycle. Figure 1 depicts a general research
workflow involving applied ML. It starts with a new model. We take the new
model, use Web ML tools to convert it to a web format, and deploy it with
JavaScript code. Once deployed, we can study the experience. Feedback about the
application experience, as well as bugs and biases in the model, is exposed and
collected. The model and/or the application is then modified and redeployed taking this
feedback into account, beginning the cycle again to refine further.
By easily deploying the model and sharing it with users online, we are likely to
get feedback faster, which in turn improves the model and application to produce more
(and better) results in the future, creating a positive spiral of research and innovation
at scale. Let's dive into each part of this life cycle in more detail.
Traditionally, some research papers are accompanied by a link to a GitHub repository from which
you can clone the code and set it up in your own environment for testing or extension.
Because the expectation is to replicate the researcher's server-side environment, there are
some rather large barriers for people adopting code in this form, due to the assumptions
and dependencies involved. Let's take a look at a typical developer flow for someone who is
new to machine learning (most people are, at the time of writing)
and wants to run your example Python code:
1. Find your code from a research paper/publication/video.
2. Clone your GitHub repository to their server (if they have one; otherwise they need to
set up a suitable environment and understand Linux/terminal usage, etc.).
3. Install your ML library of choice and any other dependencies you may have used.
4. Install CUDA (not trivial) to ensure GPU acceleration for training/inference on
NVIDIA GPUs.
5. Read the instructions on how to use the model (often these are not very detailed and
assume much prior knowledge about said research).
6. If all five previous steps work out perfectly and the user understands your explanations
well enough to get the code running, they can finally try out your model. The majority of
people have probably bailed by step 4 or 5.
Contrast this with the Web ML experience:
1. Interested users visit a website which they found via a share, video, or research paper.
2. The machine learning model executes without needing to install any dependencies
and automatically selects the best hardware to execute on, based on what is
available on the client device.
3. Easy-to-adapt JavaScript code (the web page itself is a live working demo
whose source can be viewed with ease) shows how to send data to the model
in the web environment, which is easy to replicate (simply view the source).
Now that we can easily access and deploy the model in web applications, we can reach
even more participants online. With a broadened user base, we are more
likely to find people interacting with the application in ways that may
surprise us, as people may have a different vision from the one we had in mind for their own
unique experience.
3.3 Feedback
Now that more people are using the model, we are likely to find potential bugs
and biases in the model or design that we may have overlooked. Maybe we
created a speech recognition model trained on a US English data set, but when the
application is tested with users in the UK (or other countries), accuracy issues
are discovered. We can then iterate to produce a better model that takes
these new considerations into account.
Related work has revealed that task-specific APIs and user-friendly interfaces
are easier for non-ML experts to adopt. TensorFlow.js has a model garden28 with
carefully curated high-level APIs. These APIs are all task-oriented, covering object
detection, face landmark detection, hand pose, body segmentation, toxicity detection,
and more.
27 https://www.tensorflow.org/js.
28 https://github.com/tensorflow/tfjs-models.
There are two ways to load the body segmentation API in your application: through
an HTML script tag or via npm.
The snippet below shows the HTML script tag approach.
HTML
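A minimal version of this snippet, assuming the @tensorflow/tfjs and @tensorflow-models/body-pix packages served from a public CDN (versions omitted), might look like this:
<!-- Load TensorFlow.js core and the BodyPix body segmentation model. -->
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs/dist/tf.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/body-pix/dist/body-pix.min.js"></script>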
Terminal
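Alternatively, installing via npm would presumably be a single terminal command (same package names as above):
npm install @tensorflow/tfjs @tensorflow-models/body-pix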
Now the model's API is available to use. Use the API to load the body segmentation
model; this has to be done only once.
Javascript
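A sketch of the loading step, assuming the bodyPix global exposed by the script tag (or the npm import), inside an async function:
// Load the BodyPix model once; the returned network object is reused
// for every later segmentation call.
const net = await bodyPix.load();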
Javascript
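And a sketch of the segmentation call itself, using the segmentPersonParts method described below (the image element id is hypothetical):
// Run part segmentation on an image or video element.
const img = document.getElementById('person');
const partSegmentation = await net.segmentPersonParts(img);
// partSegmentation.data is the per-pixel Int32Array described below.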
The result will contain a JSON object that we can parse and then use as we wish
in our application.
Given an image with one or more people, the API's segmentPersonParts method
predicts the 24 body part segmentations for all people in the image. It returns
a PartSegmentation object encoding the body parts, per pixel, for all people found.
The PartSegmentation object returned contains a width, a height, allPoses, and an
Int32Array with one value per pixel passed to the function. Each value is in the range
−1 to 23, where −1 represents background and 0 to 23 represent one of the 24 body parts,
such as left_upper_arm_front, torso_front, and left_hand. The allPoses array contains the
key points of the body for each person found. For details, see the
API documentation.29
Figure 2 shows the outputs of the model visualized. Figure 2a is the original
image. Figure 2b renders different body parts in different colors (using the pixel-level
Int32Array that is returned). Figure 2c shows the key points of the body from
the allPoses object. The API also provides utility functions for rendering.
In this example, we have shown that ML can be integrated into a web application
with just a few lines of code. With the output from the model, we can create many
new experiences. For inspiration, we describe a few examples below (demos are
linked in the footnotes):
29 https://github.com/tensorflow/tfjs-models/tree/master/body-pix.
The opportunities to use ML in web applications go far beyond what existing
TensorFlow.js models can offer. As depicted in the positive spiral effect diagram, if any new
model developed by the ML community can be converted to a web format, it opens
up many opportunities for non-ML experts to adopt and adapt ML in applications of
their own. In this section, we provide an example of the workflow.
30 https://www.youtube.com/watch?v=kFtIddNLcuM.
31 https://github.com/yemount/pose-animator.
32 https://www.youtube.com/watch?v=x1JYnsvvaJs.
Step 1: Install the TensorFlow.js converter Python package in the terminal.
Terminal
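Assuming the converter ships as the tensorflowjs package on PyPI, the install command would be:
pip install tensorflowjs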
Step 2: Invoke the converter in the terminal and follow the instructions; it will
ask for the location of the saved model and for optional conversion configurations. After
this step, it will output a model.json file and several model weight binaries. The
model.json file specifies the model architecture as a graph, and the weight
binaries contain the weights for each edge in the graph.
Terminal
tensorflowjs_wizard
Similar to the example in the last section, you need to load the library either through
an HTML script tag or via npm. The snippet below shows how to load the library
with an HTML script tag.
HTML
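A minimal version of this snippet, loading TensorFlow.js from a public CDN (version omitted):
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs/dist/tf.min.js"></script>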
You need to host the model.json and weight files somewhere you can access, such as
on a CDN or your own web server. The snippet below shows how to use the
TensorFlow.js library to load the model.
Javascript
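A sketch of the loading call, assuming the converter produced a graph model and that model.json is hosted at a reachable URL (the URL below is hypothetical):
// Load the converted model once; tf.loadGraphModel fetches model.json
// and the weight binaries it references.
const model = await tf.loadGraphModel('https://example.com/web_model/model.json');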
Now that you have loaded the model, you can feed inputs to it and run it. The
snippet below shows the API to run the model.
Javascript
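A sketch of running the model; the input shape here is hypothetical and depends on the original model:
// Build an input tensor matching the model's expected shape, run inference,
// and read the result back from GPU/WASM memory.
const input = tf.zeros([1, 224, 224, 3]);
const output = model.predict(input);
const values = await output.data();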
In this example, we have shown that with a few steps you can use any model
in a web application, as long as the ops used by that model are also supported in
TensorFlow.js. The converter will notify us about unsupported ops when converting the
model.
Transfer learning is useful in two scenarios: (1) If an existing model is trained on one classification task, for
example to recognize sunflowers vs. tulips, but your requirement is to recognize dogs
vs. cats, you can redefine the classes and custom train the sunflower-vs.-tulip
model to recognize dogs vs. cats. (2) If you want to provide personalized results to
each of your users, you can distribute the same original model to every user and have each
user custom train their own model on their own device; users then use their
own personalized model. We provide examples of two ML tools for transfer learning. They require
little to no coding, allowing users, especially non-programmers, to focus
on ideas rather than code.
33 https://teachablemachine.withgoogle.com/.
Users need to define classes and provide some samples for each class. The samples
can be uploaded from the file system or collected in real time with the microphone (for
audio) or camera (for images). In this example, we use the microphone to collect
sound in real time. We record 20 s of audio for the 'Background Noise' class and
8 s of audio for the 'Ok Google' class; for the second class, we just say 'Ok,
Google' several times.
Click the 'Train Model' button in the middle and wait a few seconds.
Once the model is trained, it is available to test in real time on the right. Keep
speaking into the microphone and the model outputs a confidence level for
each class in real time. As seen in this example, when we say 'Ok, Google', the
confidence score for the 'Ok Google' class is 99%. So, within a few minutes, we are
able to custom train a sound recognition model that recognizes the hot word 'Ok,
Google' with very high accuracy. Note that if the accuracy is not satisfactory, we
can add more samples and retrain the model.
5.1.4 Deployment
Once you are satisfied with the model's performance, download the model and run it
with TensorFlow.js in the browser. The code to load and execute the model is the
same as in the two examples above.
Cloud AutoML34 allows you to train production-quality models for computer vision
and more by automatically trying different models and hyperparameters to figure
out the best combination for your dataset. Compared to Teachable
Machine, Cloud AutoML provides more customization and is typically suitable for more
advanced usage, when working with larger amounts of training data (think gigabytes or
more).
34 https://cloud.google.com/automl.
First, you need to upload your training data to a cloud storage bucket that can be
accessed by Cloud AutoML. In the example in Fig. 4, we have uploaded several folders
of flower images that we would like the model to recognize.
We can choose whether we would prefer the model training to optimize for higher
accuracy or for faster prediction time. Depending on the use case, you may have a
preference for one over the other. Finally, set a budget to define the upper bound
of how long the exhaustive search is allowed to take, and click on start training; see
Fig. 5.
Once the model has finished training, we will be notified; we can then log in and
export the model to TensorFlow.js format and download the model.json
and binary files required to run it in the web browser; see Fig. 6.
5.2.4 Deployment
Similar to the previous examples, we use TensorFlow.js to deploy and run the model
on the client side. For Google Cloud AutoML, an additional library needs to be
loaded, as shown below.
HTML
<script src="//cdn.jsdelivr.net/npm/@tensorflow/tfjs/dist/tf.min.js"></script>
<script src="//cdn.jsdelivr.net/npm/@tensorflow/tfjs-automl/dist/tf-automl.min.js"></script>
<script>
async function run() {
  // Load the exported AutoML image classification model
  // (model.json plus the weight binaries hosted alongside it).
  const model = await tf.automl.loadImageClassification('model.json');
  // Classify an image element on the page.
  const image = document.getElementById('daisy');
  const predictions = await model.classify(image);
}
run();
</script>
6 Deployment Considerations
The web allows us to reach a variety of devices: desktops, laptops, mobile phones,
tablets, and even IoT devices like the Raspberry Pi. Hardware support (CPU/GPU) has
improved dramatically for these devices in recent years, but resources such as
memory, battery power, and network bandwidth need to be kept top of mind when
deploying to such devices, as they vary from device to device and are not fixed.
Therefore, there are several key aspects to consider when deploying models
to the web.
Model size is an important performance factor; it is mainly determined by the number
of nodes and the weights of a model. For example, MobileNetV2, a well-known image
classification model, is around 8 MB in size.35 Another example is MobileBERT,
an NLP model for finding answers to questions in a piece of text; its size
is around 95 MB, and it takes on average 10 s to download on a 4G network.36
Reducing model size will generally lead to better performance. One way of reducing
the number of model parameters is to remove layers or reduce the size of a layer, which usually
leads to a loss of accuracy but can work for applications that can tolerate it.
Another way is to quantize the model weights. By default, weights use float32
as the data type, which takes 32 bits to represent a number. If we use int16 instead,
each number takes 16 bits, half the previous size. Converting from float32
to int16 or int8 can lead to a 2–4 times size reduction. A caveat of quantization is that
the precision loss also results in a slight loss of model accuracy.
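As an illustration, recent versions of the TensorFlow.js converter expose weight quantization as command-line options; the exact flag names vary by converter version (see tensorflowjs_converter --help), but a float16 quantization run might look like this:
Terminal
tensorflowjs_converter --input_format=tf_saved_model --quantize_float16 \
    ./saved_model ./web_model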
35 This model contains 3.47M parameters, which results in 300 million multiply-accumulate operations in every model execution.
36 This model contains around 15.1M parameters.
Besides reducing model size, there are also techniques to optimize the graph
architecture for inference. For example, parts of the graph that are useful in training
but unnecessary for inference can be pruned. We list some common optimization techniques here37:
• Constant Folding: statically infers the value of tensors when possible by folding
constant nodes in the graph and materializes the results as constants.
• Arithmetic: eliminates common subexpressions and simplifies arithmetic statements.
• Op Fusing: fuses subgraphs into more efficient implementations by replacing
commonly occurring subgraphs with optimized, fused monolithic kernels.
6.3 Benchmarking
Model inference speed can vary with the type of device, its underlying hardware,
the acceleration options, the model's specific architecture, and even the input size. As
such, there is no golden rule for when to choose what. We always need to benchmark a
model to gauge its performance on a target device before deployment. TensorFlow.js
provides a benchmarking tool40 that allows users to benchmark their models against
different backends, acceleration options, and input sizes.
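For a quick in-page measurement during development, TensorFlow.js also offers tf.time; a minimal sketch (the model and input shape are hypothetical):
Javascript
// Time a single inference call on the currently active backend.
const input = tf.zeros([1, 224, 224, 3]);
const info = await tf.time(() => model.predict(input));
console.log(`kernel: ${info.kernelMs} ms, wall: ${info.wallMs} ms`);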
37 https://www.tensorflow.org/guide/graph_optimization.
38 https://gpuweb.github.io/gpuweb/.
39 https://www.w3.org/groups/cg/webmachinelearning.
40 https://tensorflow.github.io/tfjs/e2e/benchmarks/local-benchmark/index.html.
7 Discussion
In this chapter, we have discussed using Web ML tools for fast prototyping
and research by non-ML experts, especially designers, web developers, and HCI
researchers. Throughout the chapter, we have demonstrated how Web ML tools can
be used to quickly experiment with ML ideas in web applications.
7.1 Limitations
Admittedly, ML tooling for the Web is only one way of supporting fast prototyping and
research. It still requires some level of programming skill, so it doesn't help
designers without those skills. However, tools like Teachable Machine and Runway
ML show that, in the future, there may be more tools aimed at designers, allowing
them to prototype meaningfully with limited technical knowledge of
this domain. We briefly discuss possible solutions for these other needs.
For mobile and IoT innovations, users can still use web tools like TensorFlow.js
to deploy to those platforms. For mobile devices, TensorFlow.js can be deployed in
mobile browsers, PWA apps, mini programs such as WeChat, and native apps through
WebView or frameworks like React Native. For IoT devices, TensorFlow.js can be
deployed on a Raspberry Pi through Node.js.
If the use case specifically requires a native app experience beyond what can
be achieved with the above tooling, there are several popular mobile ML libraries
built specifically for this purpose, such as TensorFlow Lite41 and MNN.42 To run models
on IoT devices efficiently, the TensorFlow Lite for Microcontrollers library43 is
specifically designed to run on small, low-powered computing devices. There are
high-level APIs for native development too; for example, TensorFlow Lite recently
launched a Task Library,44 which provides high-level APIs for vision and natural
language processing.
For designers and HCI researchers who need to design or study ML ideas in
early stages, such as in low-fidelity prototypes, we are not aware of any mainstream
tools for this type of work. Dove et al. [7] conducted an in-depth study of the challenges
designers face when working with ML and found that prototyping with ML was difficult
for designers because current prototyping tools cannot support them. The unique
challenges are that (1) ML systems have a high degree of uncertainty, so they are hard to
prototype without the actual model and data; (2) it is hard to prototype unexpected
system behavior such as false positives and false negatives; and (3) the technical complexity
of using ML systems is too high for designers. While we need to consider new tools
to overcome these challenges, we have also shown the potential of adapting currently
available Web ML tools for fast prototyping in some areas.
41 https://www.tensorflow.org/lite.
42 https://github.com/alibaba/MNN.
43 https://www.tensorflow.org/lite/microcontrollers.
44 https://www.tensorflow.org/lite/inference_with_metadata/task_library/overview.
The Teachable Machine
and Cloud AutoML vision examples require no coding at all to train a new model
for common detection tasks. High-level APIs exist that can be used directly in HTML
pages for common tasks, such as hand, face, and pose detection, body segmentation, and
more. Many interactions can easily be built with these available tools. This can be
a starting point for designers to take the lead in innovating new design concepts with
ML.
We have proposed using ML tools for the Web as a way to support fast prototyping and
research. Throughout the chapter, we have demonstrated with examples and practical
guides how to use those tools for prototyping. Here we summarize key reasons why we
think Web ML tools are well suited for fast prototyping and research.
7.2.1 Privacy
We can both train on and classify data on the client machine without ever sending it
to a third-party web server. There may be times when this is a requirement to
comply with local laws and policies, or when processing data that the user
wants to keep on their machine and not send to a third party, for example, in a medical
application.
7.2.2 Speed
Because we do not have to send data to a remote server, running the model can be faster,
as there is no round-trip time from client to server and back again. Even
better, we have direct access to the device's sensors, such as the camera, microphone,
GPS, and accelerometer, should the user grant access, opening up many
opportunities for real-time applications.
With one click, anyone in the world can open a link we send them, load the web
page in their browser, and use what we have made: a zero-install, frictionless
experience. There is no need for a complex server-side Linux setup with CUDA drivers and
much more just to use the machine learning system. Anyone in the world can try our
model with ease, no matter what their background is.
7.2.4 Cost
Having no servers means the only thing we need to pay for is cloud storage to host the
HTML, CSS, JavaScript, and model files. Cloud storage is much cheaper
than keeping a server (potentially with a graphics card attached) running 24/7.
7.2.5 Ecosystem
Machine learning for the Web is new. While this presents a great opportunity for
those helping to shape its future, there are also a few challenges due to its young age.
A few points to consider follow.
Supporting libraries that are common in Python, for example NumPy, may not yet be
available in the JavaScript ecosystem. However, this is changing fast; Danfo.js,45
for example, replicates Pandas. With time, we will gain parity here as more
JavaScript developers adopt ML into their workflows and recreate
these popular tools and libraries to make ML more efficient to use in their
applications.
Educational resources are also not as mature as their Python equivalents. Once again,
this is changing fast, although early adopters may need to spend some time finding
educational materials. We are already seeing a number of great books and courses
for TensorFlow.js. A few are listed here for further reading:
45 https://danfo.jsdata.org/.
Searching for ML models suitable for web applications can be hard, especially for
non-ML experts. There are a number of starting points for finding new models produced
by the TensorFlow.js team and the community:
• tfjs-models52: task-oriented APIs with premade models for many common use
cases.
• TF Hub53: a site that hosts many open-sourced models and allows model discovery
through browsing, searching, and filtering.
8 Conclusion
ML tooling for web engineering provides opportunities for non-ML experts, e.g.,
software engineers, web developers, designers, and HCI researchers, to quickly
prototype and research new ML-powered ideas. In this chapter, we reviewed the history
of web-based ML tooling and presented examples and practical guides showing how to
use those tools for fast prototyping. We also presented limitations to consider.
46 https://www.oreilly.com/library/view/learning-tensorflowjs/9781492090786/.
47 https://www.manning.com/books/deep-learning-with-javascript.
48 https://www.apress.com/gp/book/9781484262726.
49 https://www.apress.com/gp/book/9781484264171.
50 https://www.coursera.org/learn/browser-based-models-tensorflow.
51 https://www.pluralsight.com/courses/building-machine-learning-solutions-tensorflow-js-tfjs.
52 https://github.com/tensorflow/tfjs-models.
53 https://tfhub.dev/.
While ML on the web is a new frontier for many right now, it opens up great opportunities to deploy at scale, both for production and for HCI research. Now is a great time to take your first step and join the fast-growing web ML community with your own projects, applications, or ideas.
Acknowledgements We would like to thank Sandeep Gupta and Daniel Smilkov for their contri-
butions and valuable feedback.
Interactive Reinforcement Learning
for Autonomous Behavior Design
1 Introduction
The reinforcement learning (RL) paradigm is based on the idea of an agent that
learns by interacting with its environment [40, 96]. The learning process is achieved
by an exchange of signals between the agent and its environment; the agent can
perform actions that affect the environment, while the environment informs the agent
about the effects of its actions. Additionally, it is assumed that the agent has at
least one goal to pursue and—by observing how its interactions affect its dynamic
environment—it learns how to behave to achieve its goal. However, standard RL
methods that learn models automatically are inconvenient for applications that could
have a great impact on our everyday life, such as autonomous companion robots or
medical applications. Generally speaking, to enable the use of RL in these types of
environments, we need to overcome the following shortcomings: how to correctly specify
the problem, speed up the learning procedure, and personalize the model to particular user preferences. One way to tackle these problems is through the use of interactive RL.
The interactive RL approach adds a human-in-the-loop who can adapt the underlying learning agent to facilitate its learning or make it solve a problem in a particular manner. In this chapter, we introduce the interactive RL framework and how HCI researchers can use it. In particular, we help researchers play two main roles: (1) designing new interaction techniques and (2) proposing new applications.
To help researchers perform role (1), in Sect. 2.1 we describe how different
types of feedback can leverage an RL model for different purposes. Additionally,
in Sect. 4 we provide an analysis of recent research in interactive RL to give the
audience an overview of the state of the art. Then, in Sect. 5 we help researchers perform role (2) by proposing generic design principles that will provide a guide
to effectively implement interactive RL applications. Finally, in Sect. 6 we present
what we consider the most promising research directions in interactive RL.
This chapter is an extension of our survey paper in [11], with changes to give
it a more educational form, additional sections that give a broader view of RL and
its interactive counterpart, and the inclusion of new papers in our analysis of recent
research in interactive RL.
One of the main advantages of the RL paradigm is that it lets us define problems by assigning numeric rewards to certain situations in the environment. For instance, if our agent is learning how to drive a car, we need to specify (with a numeric reward) that running a red light is bad, while advancing when the light is green is a positive behavior. It can be difficult to frame real-world problems using RL (like designing
the rewards for an autonomous car), but once we have a good representation of the
problem, we can create effective behaviors even for previously unseen situations. That
is, RL designers don’t need to manually code a behavior for every single situation
the agent might face.
Some of the most complex applications where RL-based agents have been successful are playing most Atari games better than humans [72], defeating the world champion in Go [92], playing Dota 2 better than professional players [80], controlling a human-like robot hand that manipulates physical objects [78], and automatically cooling a real-world data center [53]. These success stories have shown that RL-based applications can achieve high performance in real-world settings where, historically, other (classic) machine learning techniques had performed better.
The most common approach to model RL agents is the Markov decision process (MDP) formalism. An MDP is an optimization model for an agent acting in a stochastic environment [83], which is defined by the tuple ⟨S, A, T, R, γ⟩, where
• S is a set of states.
• A is a set of actions.
• T : S × A × S → [0, 1] is the transition function that assigns the probability of reaching state s′ when executing action a in state s, that is, T(s′ | s, a) = P(s′ | s, a), and it follows the Markov property, which implies that the future states of this process depend only upon the current state s.
• R : S × A → ℝ is the reward function, with R(s, a) denoting the immediate numeric reward value obtained when the agent performs action a in state s.
• γ ∈ [0, 1] is the discount factor that defines the preference of the agent to seek
immediate or more distant rewards.
The reward function defines the goals of the agent in the environment, while the
transition function captures the effect of the agent’s actions for each particular state.
Although in many RL problems, it is assumed that the agent has access to a reward and
transition function, for real-world problems the reward function is usually designed
by an expert on the task at hand, and the transition function is learned by the agent
exploring its environment.
We present an example of a problem defined by the reinforcement learning frame-
work in Fig. 1. In this problem, the agent (triangle) starts in the top-left corner and
must reach the goal G in the top-right corner by taking an action a ∈ A in each state
s that it encounters. After it executes an action, the agent receives from the environment an immediate reward r ∈ ℝ and a state update. In our example, the agent receives a positive reward if it gets closer to the goal G, and a negative reward if it steps into lava. Therefore, the agent's objective is to find the optimal policy π∗ that defines the best strategy to follow in each state s of the environment, i.e., the policy π∗ maps each state to the action that maximizes the expected cumulative reward.
To compute a policy π, we can use different types of algorithms, such as dynamic programming, evolutionary algorithms (neuroevolution), Monte Carlo methods (e.g., MCTS), and other methods or combinations of them. Each family has its own limitations; for example, dynamic programming struggles to handle large state-action representations in tabular form.
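To make these pieces concrete, the following is a minimal, self-contained sketch (not taken from this chapter) of tabular Q-learning, one such method, applied to a toy grid inspired by the Fig. 1 example: the agent starts in a corner, lava cells end the episode with a negative reward, and the goal cell ends it with a positive one. The grid layout, reward values, and hyperparameters are illustrative assumptions.

import random

GRID = [
    "A..G",   # A = start, G = goal
    ".LL.",   # L = lava
    "....",
]
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Toy environment dynamics: returns (next_state, reward, done)."""
    row = min(max(state[0] + action[0], 0), len(GRID) - 1)
    col = min(max(state[1] + action[1], 0), len(GRID[0]) - 1)
    cell = GRID[row][col]
    if cell == "G":
        return (row, col), 10.0, True    # positive reward for reaching the goal
    if cell == "L":
        return (row, col), -10.0, True   # negative reward for stepping into lava
    return (row, col), -1.0, False       # small cost for every other time-step

def greedy_action(q, state):
    """The greedy policy pi(s): pick the action with the highest Q(s, a)."""
    return max(range(len(ACTIONS)), key=lambda i: q.get((state, i), 0.0))

def q_learning(episodes=3000, alpha=0.1, gamma=0.95, epsilon=0.1):
    q = {}  # tabular action-value estimates Q(s, a)
    for _ in range(episodes):
        state, done = (0, 0), False
        while not done:
            # epsilon-greedy exploration over the action set A
            if random.random() < epsilon:
                a = random.randrange(len(ACTIONS))
            else:
                a = greedy_action(q, state)
            next_state, reward, done = step(state, ACTIONS[a])
            # move Q(s, a) toward the immediate reward plus the discounted
            # value of the best action in the next state
            best_next = 0.0 if done else max(
                q.get((next_state, i), 0.0) for i in range(len(ACTIONS)))
            old = q.get((state, a), 0.0)
            q[(state, a)] = old + alpha * (reward + gamma * best_next - old)
            state = next_state
    return q

q = q_learning()
print(greedy_action(q, (0, 0)))  # with enough episodes, typically action 3 ("right")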
Regarding the performance of the RL models, it’s been proven that integrating
expert knowledge improves the accuracy of the models [31, 45]. Moreover, human
feedback includes prior knowledge about the world that can significantly improve
learning rates [27].
In short, the interactive RL approach provides HCI researchers with multiple
options to personalize and/or improve the performance of agents by integrating
knowledge about a task from different sources (other agents or human users), inter-
preting high-level feedback (e.g., voice, sketches, text, and eye gaze), and understand-
ing high-level goals (e.g., a personal preference over particular actions). For instance,
an interactive RL method can take as input voice commands or facial expressions to
teach a robot how to navigate [9], or use play traces of the user as input to personalize
the play style of a bot that plays Mario Bros [12]. Another possibility is focusing on
improving the performance of training dialogue agents using as input crowd-sourced
dialogue rewrites [91]. The (human) feedback for interactive RL methods comes in
varied forms, and we can use it to tailor different aspects of the agent’s behavior.
Finally, we can find a broad range of applications that use the interactive RL framework, such as teaching robots how to dance [69], creating adaptable task-oriented
dialogue systems [90], learning the generation of dialogue [60], and an intelligent
tutor that suggests courses to students so that their skills match the necessities of the
industry [108]. Furthermore, there is the opportunity to adapt current automatic RL
applications to an interactive setting, such as procedural content generation for video
games [43], generating music [38], and modeling social influences in bots [39].
Fig. 3 Selected testbeds for interactive RL. a Gridworld. b Mario AI benchmark. c Pac-Man. d
Nao Robot. e TurtleBot. f Sophie’s Kitchen
The Nao robot (Fig. 3d) and the TurtleBot (Fig. 3e) are popular platforms in
robotics that are perfect for testing natural ways to interact with human users. The
main disadvantage of using robots as a testbed is that usually, RL algorithms require
long periods of learning to achieve good results, which can be prohibitive in physical
platforms such as robots. On the other hand, Sophie’s kitchen (Fig. 3f) is designed
as an online tool for interactive RL that explores the impact of demonstrating uncer-
tainty to the users [103]. Finally, the main focus of the OpenAI Gym platform [16]
is designing and comparing RL algorithms.
Choosing the right way for users to communicate their needs and preferences to a computer is key to achieving effective human-computer interaction. The high-level input modalities for HCI range from a keyboard, eye gaze, and sketches to natural language. We can use any of these high-level input modalities for interactive RL applications if we encode their output in a format that an RL algorithm can use. In this section, we provide HCI researchers with the foundations to adapt the output of high-level input modalities to data that can tailor an RL algorithm (see Fig. 2). This knowledge is essential for designing novel interaction techniques for interactive RL.
In this section, we analyze how human feedback is used to tailor the low-level mod-
ules of diverse RL algorithms; we call this form of interaction between users and
RL algorithms design dimensions. Furthermore, the summary of selected works in
Table 1 presents a concise and precise classification that helps to compare different
design dimensions. For this summary, we focus on works that have been successfully
applied to RL methods that use a human-in-the-loop. We further narrow the reach of
this survey by considering only works published between 2010 and 2020.
2.1.1 Reward Function
All works classified in this design dimension tailor the reward function of the RL-based algorithm using human feedback. The main objectives of this feedback are to speed up learning, customize the behavior of the agent to fit the user's intentions, or teach the agent new skills or behaviors.
Designing a reward function by hand is trivial for simple problems where it is
enough to assign rewards as 1 for winning and 0 otherwise. For example, in the
simplest setup of a Gridworld environment, the agent receives −1 for each time-step
and 1 if it reaches the goal state. However, for most problems that an agent can face
in a real-world environment (e.g., with multiple and contradictory goals), encoding
the desired behaviors in a reward function can be challenging.
The reward design problem is therefore complex because the RL designer has
to define the agent’s behaviors as goals that are explicitly represented as rewards.
This approach can cause difficulties in complex applications where the designer
has to foresee every obstacle the agent could possibly encounter. Furthermore, an
effective reward function has to handle well trade-offs between goals. These reward
design challenges make the process an iterative task: RL designers alternate between
evaluating the reward function and optimizing it until they find it acceptable. This
alternating process is called reward shaping [79].
Reward shaping (RS) is a popular method of guiding RL algorithms with human feedback [79, 111]. In what is arguably the most common version of RS, the user adds extra rewards that enhance the environmental reward function [9, 45, 102] as R′ = R + F, where F : S × A × S → ℝ is the shaping reward function [79]. A hand-engineered shaping reward function has been demonstrated to increase learning rates [7, 9, 26, 28–30, 32, 49, 59, 68, 79, 86, 102, 109]. A similar approach consists of using only human feedback as the reward signal for the agent [10, 46, 110].
Using RS is also beneficial in sparse reward environments: the user can provide
the agent with useful information in states where there are no reward signals from the
environment or in highly stochastic environments [111]. Another advantage of RS
is that it gives researchers a tool with which to better specify, in a granular manner,
the goals in the current environment. That is, the computational problem is specified
via a reward function [95].
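As a concrete illustration of the R′ = R + F formulation above, the snippet below sketches potential-based reward shaping (PBRS, as abbreviated in Table 1), where the shaping term is derived from a potential function over states. The goal position and the distance-based potential reuse the toy Gridworld from the earlier sketch and are illustrative assumptions, not values from the works surveyed here.

GOAL = (0, 3)  # goal cell of the toy Gridworld used earlier

def phi(state):
    """Potential function: higher (less negative) values closer to the goal."""
    return -(abs(state[0] - GOAL[0]) + abs(state[1] - GOAL[1]))

def shaped_reward(env_reward, state, next_state, gamma=0.95):
    """R'(s, a, s') = R(s, a, s') + F(s, a, s'), with the potential-based form
    F(s, a, s') = gamma * phi(s') - phi(s), which preserves the optimal policy
    of the original problem."""
    return env_reward + gamma * phi(next_state) - phi(state)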
Table 1 Selected works for each design dimension. DD = design dimension, HKI = human knowledge integration, RS = reward shaping, PS = policy shaping, AcAd = action advice, VF = value function, SV = scalar-valued, HeuFun = heuristic function, Dem = demonstration, Cri = critique, GUI = graphical user interface, FE = facial expression, VC = voice command, GC = game controller, AT = artifact, CT = controller, HF = human feedback, ER = environmental reward, GEP = guided exploration process, PBRS = potential-based reward shaping, ACTG = actions containing the target goal, Mario = Super Mario Bros
DD | Testbed | Interaction | Initiative | HKI | Feedback
Reward function | Robot in maze-like environment [9] | FE | Passive | RS using HF + ER | Cri
Reward function | Navigation simulation [10] | GUI | Passive | Advantage function | Cri
Reward function | Sophie's Kitchen game [103, 105] | GUI | Passive, Active | RS using HF + ER | Cri
Reward function | Bowling game [110] | GUI | Passive | RS + HF | SV
Reward function | Shopping assistant, Gridworld [71] | GUI | Active | Active IRD | Queries
Reward function | Mario, Gridworld, Soccer simulation [86] | Coding | Passive | PBRS | HeuFun
Reward function | Navigation simulation [102] | VC | Passive | RS using HF + ER |
Reward function | Atari, robotics simulation [19] | GUI | Active | RS using HF | Queries
Policy | Gridworld, TurtleBot robot [66] | GUI, GC | Passive | PS | AcAd
Policy | Gridworld [51] | VC | Passive | PS | Cri, AcAd
Policy | Pac-Man, Frogger [32] | GUI | Passive | PS | Cri
Policy | Mario [12] | GUI | Active | PS | Dem, Cri
Exploration process | Pac-Man, Cart-Pole simulation [116] | GUI | Passive | GEP |
Exploration process | Simulated cleaning robot [22, 23] | VC | Passive | GEP | AcAd
Exploration process | Pac-Man [7] | GUI | Active | GEP | AcAd
Exploration process | Pac-Man [30] | GUI | Active | Myopic agent | AcAd
Exploration process | Sophie's Kitchen game [103] | GUI | Active | ACTG | Guidance
Exploration process | Street Fighter game [13] | Not apply | Passive | EB using Safe RL | Dem
Exploration process | Nao Robot [94] | GUI | Passive | ACTG | Guidance
Exploration process | Nexi robot [46] | AT + CT | Passive | Myopic agent | AcAd
Value function | Mountain Car simulation [44] | GUI | Passive | Weighted VF | Dem
Value function | Keepaway simulation [101] | GUI | Passive | Weighted VF | Dem
Value function | Mario, Cart Pole [17] | Not apply | Passive | Initialization of VF | Dem
On the other hand, we need to consider the credit assignment problem [3, 97].
This difficulty emerges from the fact that when a human provides feedback, it is
applied to actions that happened sometime in the past—there is always a delay
between the action’s occurrence and the human response. Furthermore, the human’s
feedback might refer to one specific state or an entire sequence of states the bot
visited.
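One simple (and purely illustrative) way to approach this credit assignment problem is to spread a delayed human signal over the most recently visited state-action pairs, weighting recent pairs more heavily; the window size and decay rate below are assumptions, not values from the surveyed works.

from collections import deque

class FeedbackBuffer:
    """Distributes a delayed human signal over recent state-action pairs."""

    def __init__(self, window=5, decay=0.7):
        self.recent = deque(maxlen=window)  # most recent (state, action) pairs
        self.decay = decay

    def record(self, state, action):
        self.recent.appendleft((state, action))

    def distribute(self, human_signal):
        """Return {(state, action): credited signal}, newest weighted highest."""
        credited, weight = {}, 1.0
        for s, a in self.recent:
            credited[(s, a)] = human_signal * weight
            weight *= self.decay
        return credited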
Reward hacking is another negative side effect resulting from a deficient reward
design [8, 35, 47, 87]. This side effect can cause non-optimal behaviors that fail
to achieve the goals or intentions of the designers. Usually, these kinds of failure
behaviors arise from reward functions that do not anticipate all trade-offs between
goals. For example, in [20] the authors present an agent that drives a boat in a race
but instead of moving forward to reach the goal, the agent learned that the policy
with the highest reward was to hit special targets along the track.
2.1.2 Policy
In the policy design dimension, we include interactive RL methods that augment the
policy of an agent using human knowledge. This process is called policy shaping
(PS) [33].
The PS approach consists of formulating human feedback as action advice that
directly updates the agent’s behavior. The user can interact with the RL algorithm
using an action advice type of human feedback. For this design dimension, we need
access to the (partial) policy of an (expert) user in the task the agent is learning to
perform. Then, we use this feedback to directly update the policy of the agent at
the corresponding states. In the RL example we present in Fig. 1, we could use a PS
approach to show the agent the best strategy to avoid stepping on lava in a particular
maze.
One advantage of the PS approach is that it does not rely on the representation
of the problem using a reward function. In real-life scenarios with many conflicting objectives, the PS approach might make it easier for the user to indicate whether the agent's policy is correct, rather than trying to express this through a reward-shaping method. Nevertheless, the user should know a near-optimal policy to improve the agent's learning
process. Even though the effect of the feedback’s quality has been investigated in
interactive RL [30], more research on this topic is needed to better understand which
design dimension is least sensitive to the quality of feedback.
It’s worth mentioning that some authors consider approaches that use a binary
critique on the actions as PS [33]. In general, these methods prune the actions labeled
as “bad” from the action set at the corresponding action-state. This pruning process
creates a bias in the action selection of the agent, not its policy. Consequently, we
categorize this method as a guided exploration process.
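The following sketch illustrates the basic mechanics of policy shaping through action advice: when the user has advised an action for the current state, the agent executes it and nudges its policy toward that action; otherwise it falls back to its own policy. The preference-bump update is an illustrative assumption rather than any specific published algorithm.

def select_action(state, agent_policy, advice, preferences, bump=1.0):
    """agent_policy: state -> action; advice: {state: advised action};
    preferences: {(state, action): score} consulted by the agent's own policy."""
    if state in advice:
        advised = advice[state]
        # directly push the agent's policy toward the advised action
        key = (state, advised)
        preferences[key] = preferences.get(key, 0.0) + bump
        return advised
    return agent_policy(state)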
In general, the idea behind the value function design dimension is creating aug-
mented value functions. We can perform this augmentation by combining two or
more value functions. To integrate expert knowledge, at least one of those value
functions must come from the expert’s feedback (i.e., a value function that describes
the expert’s optimal behavior).
Using the value function design dimension is an effective strategy to personalize
and accelerate the learning process of agents [17, 48, 101]. However, there are too
few studies on this design dimension to conclusively compare its performance to
other design dimensions [44].
The main advantage that this design dimension presents over the rest is that we can reuse an expert's value functions in multiple scenarios that share similar state-action spaces. In this manner, we can transfer the expert's knowledge in a form (a value function) that includes information about the long-term effects of taking an action at a given state.
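A minimal sketch of such an augmented value function is shown below: the agent's own Q-values are blended with Q-values derived from expert feedback, with the expert's weight decaying as learning progresses. The blending scheme and decay horizon are illustrative assumptions.

def augmented_q(q_agent, q_expert, t, horizon=10000):
    """Return a combined action-value function Q'(s, a) at training step t."""
    w = max(0.0, 1.0 - t / horizon)  # expert influence shrinks over time

    def q(state, action):
        return ((1.0 - w) * q_agent.get((state, action), 0.0)
                + w * q_expert.get((state, action), 0.0))

    return q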
So far, we have explained how RL experts can inject human knowledge through the
main components of a basic RL algorithm. There are other ways, however, to interact
with low-level features of particular RL approaches. Next, we will explain the two
main design dimension alternatives for an interactive RL setting.
Function approximation (FA) allows us to estimate value functions over continuous state spaces, in contrast to tabular approaches such as tabular Q-learning in a Gridworld domain.
For any complex domain with a continuous state space (especially for robotics), we
need to represent the value function continuously rather than in discrete tables; this
is a scalability issue. The idea behind FA is that RL engineers can identify patterns
in the state space; that is, the RL expert is capable of designing a function that can
measure the similarity between different states.
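For instance, a hand-engineered FA for the toy Gridworld used earlier could map each state-action pair to a few features (distances to the goal and to the nearest lava cell) and estimate Q as a linear function of them, so that feedback generalizes across similar states. The particular features and the linear form below are illustrative assumptions.

GOAL, LAVA = (0, 3), [(1, 1), (1, 2)]

def features(state, action):
    """Map a (state, action) pair to a small, hand-designed feature vector."""
    row, col = state
    drow, dcol = action
    dist_goal = abs(row - GOAL[0]) + abs(col - GOAL[1])
    dist_lava = min(abs(row - r) + abs(col - c) for r, c in LAVA)
    return [1.0, -dist_goal, dist_lava, float(drow), float(dcol)]

def q_value(weights, state, action):
    """Linear function approximation: Q(s, a) = w . f(s, a)."""
    return sum(w * f for w, f in zip(weights, features(state, action)))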
FA presents an alternative design dimension that can work together with any
other type of interaction channel, which means the implementation of an FA in an
interactive RL algorithm does not obstruct interaction with other design dimensions.
For example, the authors of [110] proposed a reward-shaping algorithm that also
uses an FA to enable their method to use raw pixels as input from the video game
they used as a testbed. However, their FA is created automatically.
As far as we know, the paper in [86] is the only work that has performed user-
experience experiments for an interactive RL that uses hand-engineered FAs to accel-
erate the base RL algorithm. Its authors asked the participants of the experiment
to program an FA for a soccer simulation environment. The FAs proposed by the
participants were successful: they accelerated the learning process of the base RL
algorithm. The same experiment was performed again, this time using the game
Super Mario Bros. [42] as a testbed. In the second experiment, the RL’s performance
worsened when using the FAs. This evidence suggests that designing an effective FA
is more challenging than simply using an RS technique in complex environments.
Hierarchical decomposition (HRL) is an effective approach to tackling high-dimensional state-action spaces by decomposing them into smaller sub-problems or temporally extended actions [25, 54, 99, 106]. In an HRL setting, an expert can
define a hierarchy of sub-problems that can be reused as skills in different applica-
tions. Although HRL has been successfully tested in different research areas [13, 14,
54], there are no studies from the user-experience perspective in an interactive RL
application.
The use of binary critique to evaluate an RL model's policy refers to binary feedback (positive or negative) that indicates whether the last action chosen by the agent was satisfactory. This signal of human feedback was initially the only source of reward
used [37]. This type of feedback was shown to be less than optimal because people
provide an unpredictable signal and stop providing critiques once the agent learns
the task [36].
One task for the RL designer is to determine whether a type of reward will be
effective in a given application. For example, it was shown that using binary critique
as policy information is more efficient than using a reward signal [44, 104]. Similarly,
in [33] the authors propose an algorithm that incorporates the binary critique signal
as policy information. From a user-experience perspective, it has been shown that
using critique to shape policy is unfavorable [51].
It is also worth noting that learning from binary critique is popular because it is
an easy and versatile method of using non-expert feedback; the user is required to
click only “+” and “−” buttons.
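A minimal sketch of this kind of binary critique is shown below: "+" and "-" clicks are accumulated per state-action pair, and actions the user has repeatedly labeled as bad are pruned from the action set at that state (i.e., a bias on exploration rather than a direct policy update). The pruning threshold is an illustrative assumption.

class CritiqueFilter:
    """Accumulates binary critique and prunes consistently criticized actions."""

    def __init__(self, threshold=-2):
        self.score = {}          # (state, action) -> net critique count
        self.threshold = threshold

    def critique(self, state, action, positive):
        key = (state, action)
        self.score[key] = self.score.get(key, 0) + (1 if positive else -1)

    def allowed_actions(self, state, actions):
        kept = [a for a in actions
                if self.score.get((state, a), 0) > self.threshold]
        return kept or list(actions)  # never leave the agent with no action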
The heuristic function approach [15, 67] is another example of the critique
feedback category. Instead of receiving binary feedback directly from the user, in this
approach, the critique signal comes from a hand-engineered function. This function
encodes heuristics that map state-action pairs to positive or negative critiques. The
aim of the heuristic function method is to reduce the amount of human feedback.
Empirical evidence suggests that this type of feedback can accelerate RL algorithms;
however, more research is needed to test its viability in complex environments such
as video games [86].
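A hand-engineered heuristic function of this kind can be as simple as the sketch below, which plays the role of the human critic by returning a positive or negative critique for each state-action pair; the "move toward the goal" rule for the toy Gridworld is an illustrative assumption.

GOAL = (0, 3)

def heuristic_critique(state, action):
    """Return +1 if the action moves the agent closer to the goal, else -1."""
    row, col = state
    before = abs(row - GOAL[0]) + abs(col - GOAL[1])
    after = abs(row + action[0] - GOAL[0]) + abs(col + action[1] - GOAL[1])
    return 1 if after < before else -1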
The authors of the active inverse reward design approach [71] present a query-
based procedure for inverse reward design [34]. A query is presented to the user with
a set of sub-rewards, and the user then has to choose the best among the set. The sub-
rewards are constructed to include as much information as possible about unknown
rewards in the environment, and the set of sub-rewards is selected to maximize the
understanding of different sub-optimal rewards.
In the action advice type of feedback, the human user provides the agent with the
action they believe is optimal at a given state, the agent executes the advised action,
and the base RL algorithm continues as usual. From the standpoint of user experience,
the immediate effect of the user’s advice on the agent’s policy makes this feedback
procedure less frustrating [50, 51].
There are other ways to provide action advice to the agent, such as learning from demonstration [101], inverse RL [79, 120, 121], apprenticeship learning [1], and
imitation learning [19]. All these techniques share the characteristic that the base RL
algorithm receives as input an expert demonstration of the task the agent is supposed
to learn, which is ideal, as people enjoy demonstrating how agents should behave [5,
41, 104].
2.2.4 Guidance
The last type of feedback, guidance, is based on the premise that humans find it more
natural to describe goals in an environment by specifying the object(s) of interest at
a given time-step [94, 103]. This human knowledge leverages the base RL algorithm
because the focus is on executing actions that might lead it to a goal specified by the
user, which means that the RL algorithm needs to have access to a transition function
of the dynamics in the environment. If the transition function is available, or we can
construct it, this can be an effective way to communicate which goal is the best to
follow at any given state.
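The sketch below illustrates this idea as a guided exploration step: given a transition model and a goal specified by the user, the agent restricts its exploration to actions whose predicted next state gets closer to that goal. The greedy distance criterion is an illustrative assumption.

def guided_actions(state, actions, transition_model, user_goal):
    """transition_model(state, action) -> predicted next state."""
    def dist(s):
        return abs(s[0] - user_goal[0]) + abs(s[1] - user_goal[1])

    current = dist(state)
    guided = [a for a in actions if dist(transition_model(state, a)) < current]
    return guided or list(actions)  # fall back to all actions if none helps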
Table 2 Interactive RL methods and the usual types of feedback they can use as input. Methods: reward shaping, policy shaping, guided exploration, augmented value functions. Feedback types: critique, action advice, guidance, demonstration, inverse RL
In Table 2, we summarize the typical types of feedback each interactive RL method can use as input.
For HCI researchers playing the role of designing new interaction techniques, a new project can start from the high-level feedback (e.g., voice, text, or sketches) they want to use. Then, by following our proposed design guides for interactive RL, they can choose which technique to use and its most common type of input. In this manner, it is easier to figure out how to interpret high-level feedback and which elements of it are most useful to translate into low-level feedback for the underlying RL algorithm.
In this section, we present an example of an HCI researcher, named Emily, using our
interactive RL architecture to design a novel application.
Emily has been thinking about creating a novel interactive machine learning application, but she is not yet sure which subject to focus on. Suddenly, she remembers
that in Sect. 5 she read about different design principles and the importance of cre-
ating interactive RL applications that require as little human feedback as possible.
These constraints help Emily to come up with the idea of implementing an interface
that lets her communicate using natural language with a bot that plays Mario Bros.
That is, she has decided the testbed (one from those presented in Sect. 1.3) and the
type of high-level feedback she wants to use. Then, she decides to use RL to learn
the policy for the bot because this paradigm gives her a wide range of human inte-
gration methods (see Sect. 2.1) from which she can choose. In particular, she codes
the Super Mario Bros. game as a Markov Decision Process (MDP) and then uses a
genetic algorithm (she found this implementation on the Internet) to find a policy
and, at the same time, she codes an algorithm that computes the transition function.
Now Emily has an RL-based bot that is good at playing Super Mario Bros.; the
next step is choosing the human integration methods that fit what she wants to do. After
thinking for a few days (and playing a lot of Super Mario Bros.), she decides that
it’s important for her to adapt the play style of Mario to fit her preferences. Also,
she thinks that specifying the right goal to achieve at a given time is a good commu-
nication method. After reading about the different feedback types in Sect. 2.2, she
decides that guidance is the way to go. Then, she checks Table 2 and sees that the
most common human integration methods that use guidance as input are policy shap-
ing and guided exploration. After reading Sect. 2.1, she decides that using a guided
exploration method is better because she has a transition model of the environment.
After spending a few days/weeks (and drinking a lot of coffee), she completes the
implementation of the interface and an interactive RL algorithm that exposes some
parts of the underlying MDP so Mario can express the current goal he’s pursuing
and Emily can use it as an input method to guide him to the right goal (see Fig. 4).
Now, Emily can change the behavior of Mario by writing the appropriate goal and
then pressing the button “Submit Feedback”.
The reward shaping (RS) method aims to mold the behavior of a learning agent by
modifying its reward function to encourage the behavior the RL designer wants.
Most research on interactive RL only focuses on the performance of the algo-
rithms as the evaluation metric, which leaves unclear what type of feedback and
RL algorithm are better from an HCI perspective. On the other hand, the authors of [103] analyzed the teaching style of non-expert users depending on the reward
channel they had at their disposal. Users were able to use two types of RS: positive
numerical reward and negative numerical reward. These types of feedback directly
modify the value function of the RL model. However, when the user gives negative
feedback, the agent tries to undo the last action it performed; this reveals the learn-
ing progress to the user and motivates them to use more negative feedback, which
achieves good performance with less feedback. They also found that some users give
anticipatory feedback to the bot; that is, users assume that their feedback is meant
to direct the bot in future states. This analysis displays the importance of studying
users’ teaching strategies in interactive RL. We need to better understand the user’s
preferences to teach agents, as well as how agents should provide better feedback
about their learning mechanism to foster trust and improve the quality and quantity
of users’ feedback. Doing this type of user-experience evaluation is important to get
a better idea of how humans prefer to communicate with RL-based algorithms and how exposing the behavior of the agent can help users give better feedback.
Another RS strategy is to manually create heuristic functions that encourage the
agent to perform particular actions in certain states of the environment [15, 86]. This
way, the agent automatically receives feedback from the hand-engineered heuristic
function. The type of feedback is defined by the RL designer, and it can be given
using any of the feedback types reviewed in this paper (i.e., critique or scalar value).
The experiments conducted in [86] demonstrate that using heuristic functions as
input for an interactive RL algorithm can be a natural approach to injecting human
knowledge in an RL method. The main shortcoming of heuristic functions is that
they are difficult to build and require programming skills. Although it has been
investigated how non-experts build ML models in real life [114], there are not many
studies on the use of more natural modes of communication to empower non-experts
in ML to build effective heuristic functions that generalize well. Furthermore, users found it challenging to design effective heuristic functions for a clone of Super Mario Bros. In this particular case, users found it more comfortable to use a combination of vanilla RS and heuristic functions. It is important to note that, in most interactive RL research, results obtained in small testbeds (e.g., Gridworld) are not representative of testbeds with a bigger state-action space.
The Training an Agent Manually via Evaluative Reinforcement (TAMER) algorithm [48] uses traces of demonstra-
tions as input to build a model of the user that is later used to automatically guide the
RL algorithm. Later, the authors of [9] proposed an algorithm called DQN-TAMER that combines deep Q-networks (DQN) with the TAMER approach. This novel combination
of techniques aims to improve the performance of the learning agent using both
environment and human binary feedback to shape the reward function of the model.
Furthermore, they experimented in a maze-like environment with a robot that receives
implicit feedback; in this scenario, the RS method was driven by the facial expression
of the user. Since human feedback can be imprecise and intermittent, mechanisms
were developed to handle these problems. This work is one of the few examples that
use high-level feedback (facial expressions) for RS. Furthermore, deep versions of
interactive RL methods benefit mostly from function approximation, as the use of this
technique minimizes the feedback needed to get good results. This advantage is due
to the generalization of user feedback among all similar states—human knowledge
is injected into multiple similar states instead of only one.
The policy shaping (PS) approach consists of directly molding the policy of a learning
agent to fit its behavior to what the RL designer envisions.
The authors of [32] introduced an approach that directly infers the user’s pol-
icy from critique feedback. In particular, they proposed a Bayesian approach that
computes the optimal policy from human feedback, taking as input the critiques
for each state-action pair. The results of this approach are promising, as it outper-
forms other methods, such as RS. However, PS experiments were carried out using
a simulated oracle instead of human users. Further experiments with human users
should be conducted to validate the performance of this interactive RL method from
a user-experience perspective.
On the other hand, the experiments conducted in [51] determine which type of feedback, critique or action advice, creates a better user experience in an interactive RL setting. Specifically, they compared the critique approach in [32] to the Newtonian action advice approach proposed in [52]. Compared to the critique approach, the action advice type of feedback got better overall results: it required less training time, performed objectively better, and produced a better user experience.
In [66], the Convergent Actor-Critic by Humans (COACH) interactive RL algorithm is introduced. Later, [10] presented a deep version, named deep COACH, which uses a deep neural network coupled with a replay memory buffer and an autoencoder. Unlike
the COACH implementation, deep COACH uses raw pixels from the testbed as
input. The authors argue that using this high-level representation as input means
their implementation is better suited for real scenarios. However, the testbed con-
sists of simplistic toy problems, and a recent effort demonstrated that deep neural
networks using raw pixels as input spend most of their learning capacity extracting
useful information from the scene and just a tiny part on the actual behaviors [24].
MarioMix [12] is an interactive RL method that uses high-level feedback as input to create varied bot behaviors for a clone of the Super Mario Bros. game in real time.
The user feedback is in the form of behavior demonstration from either the users
playing the game or a particular play style of a pre-computed bot in the database of
MarioMix. MarioMix uses this demonstration input to find in its behavior database
bots that play in a way that resembles what the users want (the behavior used as
input). Then, MarioMix presents to the user two different play styles from which
the user can choose and assign to a particular segment of the game stage. In this
manner, users can mix multiple policies that activate in particular segments of the
stage. Since most of the computation is made offline, users can create play styles that
align with their preferences in real-time. The main disadvantage of this approach
is that users might want a particular play style that is not part of the pre-computed
dataset. However, this approach allows short interaction cycles.
The procedure to augment a value function consists of combining the value function
of the agent with one created from human feedback.
Studies have proposed combining the human and agent value functions to accel-
erate learning [44, 101]. In [101], the authors introduce the Human-Agent Transfer
(HAT) algorithm, which is based on the rule transfer method [100]. The HAT algo-
rithm generates a policy from the recorded human traces. This policy is then used
to shape the Q-value function of the agent. This shaping procedure gives a constant bonus to the state-action pairs of the agent's Q-learning function that align with the action proposed by the previously computed human policy.
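The sketch below illustrates this shaping step in isolation: a policy derived from the human traces adds a constant bonus to the Q-values of the state-action pairs it agrees with. The bonus size is an illustrative assumption, not the value used in [101].

def shape_with_human_policy(q_table, human_policy, states, bonus=10.0):
    """q_table: {(state, action): value}; human_policy: state -> action."""
    for s in states:
        advised = human_policy(s)           # action the human traces suggest
        key = (s, advised)
        q_table[key] = q_table.get(key, 0.0) + bonus
    return q_table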
In [17], the authors present an interactive RL algorithm named RLfD2 that uses
demonstrations by an expert as input. With these demonstrations, they create a
potential-based piece-wise Gaussian function. This function has high values in state-
action pairs that have been demonstrated by the expert and 0 values where no demon-
strations were given. This function is used to bias the exploration process of a Q(λ)-
learning algorithm in two different ways. First, the Q-function of the RL algorithm
is initialized with the potential-based function values. Second, the potential-based
function is treated as a shaping function that complements the reward function from
the environment. The combination of these two bias mechanisms is meant to leverage
human knowledge from the expert throughout the entire learning process.
From the user-experience standpoint, the augmented value function design dimen-
sion has the advantage of transfer learning. For instance, a value function con-
structed by one user can be used as a baseline to bias the model of another user
trying to solve the same task—the learned knowledge from one user is transferred to
another. Multiple sources of feedback (coded as value functions) can be combined
to obtain more complete feedback in a wider range of states. It is also convenient
that a model of the environment is not essential for this approach.
Inverse reward design (IRD) is the process of inferring a true reward function from
a proxy reward function.
IRD [34, 71] is used to reduce reward hacking failures. According to the terminology proposed in [34], the hand-engineered reward function, called the proxy reward function, is just an approximation of the true reward function, the one that perfectly models real-world scenarios.
To infer the true reward function, the IRD method takes as input a proxy reward
function, the model of the test environment, and the behavior of the RL designer
who created it. Then, using Bayesian approaches, a distribution function that maps
the proxy reward function to the true reward function is inferred. This distribution
of the true reward function makes the agent aware of uncertainty when approaching
previously unseen states, so it behaves in a risk-averse manner in new scenarios. The
results of the experiments in [34] reveal that reward hacking problems lessen with
an IRD approach.
The main interaction procedure of IRD starts as regular RS, and the system then
queries the user to provide more information about their preference for states with
high uncertainty. This kind of procedure can benefit from interaction techniques to
better explain uncertainty to humans and from compelling techniques to debug and
fix problems in the model.
5.1 Feedback
User fatigue and its effects on the quantity and quality of feedback should be considered. It has been observed that humans tend to reduce the quantity of feedback
they give over time [35, 37, 47]. The quality also diminishes, as humans tend to give
less positive feedback over time. According to [66], this degradation of the quantity
and quality of human feedback also depends on the behavior exhibited by the agent.
The authors of [66] found that humans tend to give more positive feedback when
they notice that the agent is improving its policy over time: feedback is therefore
policy-dependent [70]. On the other hand, the experiments of [19] offer evidence
to support that human users gradually diminish their positive feedback when the
agent shows that it is adopting the proposed strategies. Fachantidis et al. performed
experiments to determine the impact of the quality and distribution of feedback on
the performance of their interactive RL algorithm [30].
To motivate users to give feedback, elements of gamification have been adopted with good results; gamification strategies have been shown to improve the quantity and quality of human feedback [57, 58].
Some studies focus on improving the quality and quantity of human feedback
by incorporating an active question procedure in the learning agent [7, 30, 32, 59];
that is, the agent can ask the user to give feedback in particular states. In [7], the authors present an active interactive RL algorithm in which both the agent and the
demonstrator (another agent) work together to decide when to make use of feedback
in a setting with a limited advice budget. First, the agent determines if it needs help in
a particular state and asks the demonstrator for attention. Depending on the situation
of the agent, the demonstrator determines if it will help or not. The results of these
experiments are promising because they achieve a good level of performance while
requiring less attention from the demonstrator.
Maximizing the use of feedback is necessary because interactive RL in complex
environments might require up to millions of interaction cycles to get good results
[19, 73]. It has been demonstrated that the inclusion of an approximation function that propagates human feedback to all similar states is effective in tackling the sample inefficiency of interactive RL in complex environments such as video games [110].
willingness to interact with the agent will diminish [74]. The quality and quantity of
feedback are therefore also affected.
An end-user typification using these features would help researchers select the best combination of feedback type and design dimension for a particular interactive RL application and for the expected type of end-user. For instance, although all design dimensions
assume that the user knows a policy that is at least good enough to solve the task,
some design dimensions can better handle non-optimal policies from human feedback
[32]. Empirical evidence suggests that the exploration process is the most economical
design dimension for interactive RL applications [32, 51, 103]. Nonetheless, myopic
agents—which use the policy design dimension—have demonstrated great success
in some tasks [30, 55].
The hierarchical decomposition and function approximation design dimensions
require human users with a deep understanding of the problem to make effective use
of them. Another possibility is combining different design dimensions in the same
interactive RL method. It would also be useful to design an active interactive RL that learns which design dimension best suits the application and type of user; in this way, the interactive RL algorithm can minimize the feedback needed from the user [32]. For example, a combination of the function approximation and policy design dimensions would enable the use of an interactive RL in a video game environment with only 15 min of human feedback needed to get positive results [110].
Our analysis has found some design factors that need more exploration to better understand their impact on interactive RL. These include adaptive systems that can choose between different design dimensions, and enabling the use of demonstration as feedback in complex robot environments where it is difficult for the user to show the agent how to perform the task it has to learn. The latter would make interactive RL-based applications accessible to more types of users (e.g., non-experts in RL, non-experts in the task at hand, and those with physical limitations), making this design dimension a better fit for non-experts in the task at hand.
6 Open Challenges
Below, we list what we consider the most promising research directions in interactive
RL.
a way to organize the workers and tasks that minimizes noise, incorrect feedback, and
bias. Furthermore, there are no user-experience studies to better understand which
interactive RL dimension works best for a crowd-sourced environment.
Another related setting is a framework that enables an agent to learn from both
humans and other agents (through action advice) that know the current task [113].
This approach can help reduce the amount of human feedback needed in interactive RL.
In the same vein, the use of simulated oracles in the early implementation stages of
interactive RL applications is useful to save time and human feedback [22, 32, 116].
However, it has been found that results with simulated oracles can differ from those
with human users [51, 116]. More studies are needed to determine what features would make feedback from simulated oracles or agents a better match for human feedback.
Besides action advice, humans can provide feedback to agents using more abstract
information. For instance, eye gaze is an indicator of engagement, attention, and
internal cognitive state [119] in tasks such as playing video games [118], watching
pictures [18], and in human-robot social interactions [89]. Existing works have pro-
posed to learn gaze prediction models to guide learning agents [62, 117]. Using this
approach, we do not need to learn a human-like policy which can be more challenging
in real-world environments. The HCI community can contribute to this direction by proposing other types of high-level feedback (e.g., facial expressions) that are natural, rich in information, and learnable by machine learning models.
One shortcoming of RL is that it tends to overfit; it is hard to create an RL-based
agent that generalizes well to previously unseen states. On the other hand, we can
partially overcome this by creating diverse testbeds for the learning agents [85].
Although there are RL-based procedural content generation (PCG) algorithms [43],
there is not an interactive version that takes advantage of having a human-in-the-loop.
The human user working together with the PCG algorithm could create more varied
and complex challenges for the learning agent.
policy.), but there are not many studies on the type of feedback depending on the
design dimension or depending on the feedback the bot gives to the user.
As far as we know, a formal model of users for interactive RL (or IML in general) has
not been proposed. Such a model could be used to create interactive RL applications
with an active initiative that avoids user fatigue by detecting and predicting user
behavior. That is, the interactive RL system would ask for help from the user with
the correct frequency to avoid user fatigue. Another possibility is implementing RL-
based applications that adapt the interaction channel and feedback type according
to the user’s preferences. This would require empirical studies to find a way to map
between user types and their preferred interaction channels (see next subsection).
A better understanding of the strengths and weaknesses of each design dimension
and feedback type in interactive RL would lead the community to develop effective
combinations of interactive RL approaches. Achieving this would require extensive
user-experience studies with all different combinations of design dimensions and
feedback types in testbeds with small and big state spaces. Furthermore, it would
be favorable to find a mapping that considers the type of end-user. Using this type
of mapping and a model of the end-user would enable the design of interactive RL
applications that adapt to the current end-user.
The crowd-sourced approach in interactive RL has the potential to enable the use
of RL in real-world environments. However, there is not a model that typifies users
(e.g., preferred teaching strategies and feedback accuracy) and key aspects of crowd-
sourced settings such as division of work (e.g., partitioning a task into subtasks).
Finally, high-level behaviors such as facial expressions, gestures, eye gaze, and
posture are useful to express the internal cognitive state of humans while interacting
with physical robots [4]. However, there are no user-experience studies aimed at
modeling these human-robot interactions for interactive RL applications.
lives [21]. The HCI community can contribute to this research sub-area by proposing
novel interaction channels for agents to communicate their internal state.
One effective way to communicate the internal state of agents is through visual-
ization techniques that users can interact with to trigger adaptations in the model (to
fix errors) [88]. We can expose elements of a Markov Decision Process, such as the
uncertainty of performing a given action or the main goal the agent is pursuing at a
given time. As far as we know, there are no works on this subject that evaluate the
user experience.
Another effective communication method consists of natural language [61, 63].
The main challenge here is creating explanations that provide users with relevant information to diagnose the agent's policy. Then, to repair the policy, we would need to
design an input method based on natural language.
Similar to the previous challenges, a crowd-sourced setting for debugging is relevant for gaining a broad understanding of human-AI interaction [107]. This setting has
been applied to evaluate the interpretability of explanations of predictions in super-
vised learning settings [84] and to debug the components of a pipeline in a complex
computer vision system [75, 76, 81]. Nevertheless, more advances in techniques to
speed up the evaluation of behaviors are needed.
7 Conclusion
Acknowledgements This work was supported by JST CREST Grant Number JPMJCR17A1,
Japan. Additionally, we would like to thank the reviewers of our original DIS paper. Their kind
suggestions helped to improve and clarify this manuscript.
References
5. Amershi S et al (2014) Power to the people: the role of humans in interactive machine learning.
AI Mag 35(4):105–120
6. Amir D, Amir O (2018) Highlights: summarizing agent behavior to people. In: Proceedings of
the 17th international conference on autonomous agents and multiagent systems. International
Foundation for Autonomous Agents and Multiagent Systems, pp 1168–1176
7. Amir O et al (2016) Interactive teaching strategies for agent training. In: Proceedings of IJCAI 2016. https://www.microsoft.com/en-us/research/publication/interactive-teaching-
strategies-agent-training/
8. Amodei D et al (2016) Concrete problems in AI safety. arXiv:1606.06565
9. Arakawa R et al (2018) DQN-TAMER: human-in-the-loop reinforcement learning with
intractable feedback. arXiv:1810.11748
10. Arumugam D et al (2019) Deep reinforcement learning from policy-dependent human feed-
back. arXiv:1902.04257
11. Arzate Cruz C, Igarashi T (2020) A survey on interactive reinforcement learning: design prin-
ciples and open challenges. In: Proceedings of the 2020 ACM designing interactive systems
conference, pp 1195–1209
12. Arzate Cruz C, Igarashi T (2020) MarioMix: creating aligned playstyles for bots with inter-
active reinforcement learning. In: Extended abstracts of the 2020 annual symposium on
computer-human interaction in play, pp 134–139
13. Arzate Cruz C, Ramirez Uresti J (2018) HRLB∧2: a reinforcement learning based framework
for believable bots. Appl Sci 8(12):2453
14. Bai A, Wu F, Chen X (2015) Online planning for large markov decision processes with
hierarchical decomposition. ACM Trans Intell Syst Technol (TIST) 6(4):45
15. Bianchi RAC et al (2013) Heuristically accelerated multiagent reinforcement learning. IEEE
Trans Cybern 44(2):252–265
16. Brockman G et al (2016) OpenAI Gym. arXiv:1606.01540
17. Brys T et al (2015) Reinforcement learning from demonstration through shaping. In: Pro-
ceedings of the 24th international conference on artificial intelligence. IJCAI'15. Buenos Aires,
Argentina: AAAI Press, pp 3352–3358. isbn: 978-1-57735-738-4. http://dl.acm.org/citation.
cfm?id=2832581.2832716
18. Cerf M et al (2008) Predicting human gaze using low-level saliency combined with face
detection. Adv Neural Inf Process Syst 20:1–7
19. Christiano PF et al (2017) Deep reinforcement learning from human preferences. In: Advances
in neural information processing systems, pp 4299–4307
20. Clark J, Amodei D (2016) Faulty reward functions in the wild. Accessed: 2019–08-21. https://
openai.com/blog/faulty-reward-functions/
21. European Commission (2018) 2018 reform of EU data protection rules. Accessed: 2019–
06-17. https://ec.europa.eu/commission/sites/beta-political/files/data-protection-factsheet-
changes_en.pdf
22. Cruz F et al (2015) Interactive reinforcement learning through speech guidance in a domestic
scenario. In: 2015 international joint conference on neural networks (IJCNN). IEEE, pp 1–8
23. Cruz F et al (2016) Training agents with interactive reinforcement learning and contextual
affordances. IEEE Trans Cogn Dev Syst 8(4):271–284
24. Cuccu G, Togelius J, Cudré-Mauroux P (2019) Playing atari with six neurons. In: Proceedings
of the 18th international conference on autonomous agents and multiagent systems. Interna-
tional Foundation for Autonomous Agents and Multiagent Systems, pp 998–1006
25. Dietterich TG (2000) Hierarchical reinforcement learning with the MAXQ value function
decomposition. J Artif Intell Res 13:227–303
26. Dodson T, Mattei N, Goldsmith J (2011) A natural language argumentation interface for expla-
nation generation in Markov decision processes. In: International conference on algorithmic
decision theory. Springer, pp 42–55
27. Dubey R et al (2018) Investigating human priors for playing video games. arXiv:1802.10217
28. Elizalde F, Enrique Sucar L (2009) Expert evaluation of probabilistic explanations. In: ExaCt,
pp 1–12
29. Elizalde F et al (2008) Policy explanation in factored Markov decision processes. In: Pro-
ceedings of the 4th European workshop on probabilistic graphical models (PGM 2008), pp
97–104
30. Fachantidis A, Taylor ME, Vlahavas I (2018) Learning to teach reinforcement learning agents.
Mach Learn Knowl Extr 1(1):21–42. issn: 2504–4990. https://www.mdpi.com/2504-4990/1/
1/2. https://doi.org/10.3390/make1010002
31. Fails JA, Olsen Jr DR (2003) Interactive machine learning. In: Proceedings of the 8th inter-
national conference on intelligent user interfaces. ACM, pp 39–45
32. Griffith S et al (2013) Policy shaping: integrating human feedback with reinforcement learning.
In: Advances in neural information processing systems, pp 2625–2633
33. Griffith S et al (2013) Policy shaping: integrating human feedback with reinforcement learning.
In: Proceedings of the international conference on neural information processing systems
(NIPS)
34. Hadfield-Menell D et al (2017) Inverse reward design. In: Guyon I et al (eds) Advances in
neural information processing systems, vol 30. Curran Associates Inc, pp 6765–6774. http://
papers.nips.cc/paper/7253-inverse-reward-design.pdf
35. Ho MK et al (2015) Teaching with rewards and punishments: reinforcement or communica-
tion? In: CogSci
36. Isbell CL et al (2006) Cobot in LambdaMOO: an adaptive social statistics agent. Auton Agents
Multi-Agent Syst 13(3):327–354
37. Isbell Jr CL, Shelton CR (2002) Cobot: a social reinforcement learning agent. In: Advances
in neural information processing systems, pp 1393–1400
38. Jaques N et al (2016) Generating music by fine-tuning recurrent neural networks with rein-
forcement learning
39. Jaques N et al (2018) Social influence as intrinsic motivation for multi-agent deep reinforce-
ment learning. arXiv:1810.08647
40. Kaelbling LP, Littman ML, Moore AW (1996) Reinforcement learning: a survey. J Artif Intell
Res 4:237–285
41. Kaochar T et al (2011) Towards understanding how humans teach robots. In: International
conference on user modeling, adaptation, and personalization. Springer, pp 347–352
42. Karakovskiy S, Togelius J (2012) The mario ai benchmark and competitions. IEEE Trans
Comput Intell AI Games 4(1):55–67
43. Khalifa A et al (2020) PCGRL: procedural content generation via reinforcement learning.
arXiv:2001.09212
44. Knox WB, Stone P (2010) Combining manual feedback with subsequent MDP reward signals
for reinforcement learning. In: Proceedings of the 9th international conference on autonomous
agents and multiagent systems: volume 1-Volume 1. International Foundation for Autonomous
Agents and Multiagent Systems, pp 5–12
45. Knox WB, Stone P (2012) Reinforcement learning from simultaneous human and MDP
reward. In: Proceedings of the 11th international conference on autonomous agents and mul-
tiagent systems-volume 1. International Foundation for Autonomous Agents and Multiagent
Systems, pp 475–482
46. Knox WB, Stone P, Breazeal C (2013) Training a robot via human feedback: a case study. In:
International conference on social robotics. Springer, pp 460–470
47. Knox WB et al (2012) How humans teach agents. Int J Soc Robot 4(4):409–421
48. Knox WB, Stone P (2009) Interactively shaping agents via human reinforcement: the TAMER
framework. In: The fifth international conference on knowledge capture. http://www.cs.
utexas.edu/users/ai-lab/?KCAP09-knox
49. Korpan R et al (2017) Why: natural explanations from a robot navigator. arXiv:1709.09741
50. Krening S, Feigh KM (2019) Effect of interaction design on the human experience with inter-
active reinforcement learning. In: Proceedings of the 2019 on designing interactive systems
conference. ACM, pp 1089–1100
51. Krening S, Feigh KM (2018) Interaction algorithm effect on human experience with rein-
forcement learning. ACM Trans Hum-Robot Interact (THRI) 7(2):16
52. Krening S, Feigh KM (2019) Newtonian action advice: integrating human verbal instruc-
tion with reinforcement learning. In: Proceedings of the 18th international conference on
autonomous agents and multiagent systems. International Foundation for Autonomous Agents
and Multiagent Systems, pp 720–727
53. Lazic N et al (2018) Data center cooling using model-predictive control
54. Lee Y-S, Cho S-B (2011) Activity recognition using hierarchical hidden markov models
on a smartphone with 3D accelerometer. In: International conference on hybrid artificial
intelligence systems. Springer, pp 460–467
55. Leike J et al (2018) Scalable agent alignment via reward modeling: a research direction.
arXiv:1811.07871
56. Lelis LHS, Reis WMP, Gal Y (2017) Procedural generation of game maps with human-in-
the-loop algorithms. IEEE Trans Games 10(3):271–280
57. Lessel P et al (2019) “Enable or disable gamification” analyzing the impact of choice in a
gamified image tagging task. In: Proceedings of the 2019 CHI conference on human factors
in computing systems. CHI '19. ACM, Glasgow, Scotland, UK, pp 150:1–150:12. isbn: 978-1-4503-5970-2. https://doi.org/10.1145/3290605.3300380
58. Li G et al (2018) Social interaction for efficient agent learning from human reward. Auton
Agents Multi-Agent Syst 32(1):1–25. issn: 1573–7454. https://doi.org/10.1007/s10458-017-
9374-8
59. Li G et al (2013) Using informative behavior to increase engagement in the tamer framework.
In: Proceedings of the 2013 international conference on autonomous agents and multi-agent
systems. AAMAS ’13. International Foundation for Autonomous Agents and Multiagent Sys-
tems, St. Paul, MN, USA, pp 909–916. isbn: 978-1-4503-1993-5. https://dl.acm.org/citation.
cfm?id=2484920.2485064
60. Li J et al (2016) Deep reinforcement learning for dialogue generation. arXiv:1606.01541
61. Li TJ-J et al (2019) Pumice: a multi-modal agent that learns concepts and conditionals from
natural language and demonstrations. In: Proceedings of the 32nd annual ACM symposium
on user interface software and technology, pp 577–589
62. Li Y, Liu M, Rehg JM (2018) In the eye of beholder: joint learning of gaze and actions in first
person video. In: Proceedings of the European conference on computer vision (ECCV), pp
619–635
63. Little G, Miller RC (2006) Translating keyword commands into executable code. In: Pro-
ceedings of the 19th annual ACM symposium on User interface software and technology, pp
135–144
64. Liu Y et al (2019) Experience-based causality learning for intelligent agents. ACM Trans
Asian Low-Resour Lang Inf Process (TALLIP) 18(4):45
65. Liu Y et al (2019) Experience-based causality learning for intelligent agents. ACM Trans
Asian Low-Resour Lang Inf Process 18(4):45:1–45:22. issn: 2375–4699. https://doi.org/10.
1145/3314943
66. MacGlashan J et al (2017) Interactive learning from policy-dependent human feedback. In:
Proceedings of the 34th international conference on machine learning-volume 70. JMLR. org,
pp 2285–2294
67. Martins MF, Bianchi RAC (2013) Heuristically accelerated reinforcement learning: a compar-
ative analysis of performance. In: Conference towards autonomous robotic systems. Springer,
pp 15–27
68. McGregor S et al (2017) Interactive visualization for testing markov decision processes:
MDPVIS. J Vis Lang Comput 39:93–106
69. Meng Q, Tholley I, Chung PWH (2014) Robots learn to dance through interaction with
humans. Neural Comput Appl 24(1):117–124
70. Miltenberger RG (2011) Behavior modification: principles and procedures. Cengage Learning
71. Mindermann S et al (2018) Active inverse reward design. arXiv:1809.03060
72. Mnih V et al (2015) Human-level control through deep reinforcement learning. Nature
518(7540):529–533
73. Mnih V et al (2015) Human-level control through deep reinforcement learning. Nature
518(7540):529
74. Morales CG et al (2019) Interaction needs and opportunities for failing robots. In: Proceedings
of the 2019 on designing interactive systems conference, pp 659–670
75. Mottaghi R et al (2013) Analyzing semantic segmentation using hybrid human-machine crfs.
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3143–
3150
76. Mottaghi R et al (2015) Human-machine CRFs for identifying bottlenecks in scene under-
standing. IEEE Trans Pattern Anal Mach Intell 38(1):74–87
77. Myers CM et al (2020) Revealing neural network bias to non-experts through interactive
counterfactual examples. arXiv:2001.02271
78. Nagabandi A et al (2020) Deep dynamics models for learning dexterous manipulation. In:
Conference on robot learning. PMLR, pp 1101–1112
79. Ng AY, Harada D, Russell S (1999) Policy invariance under reward transformations: theory
and application to reward shaping. In: ICML, vol. 99, pp 278–287
80. OpenAI et al (2019) Dota 2 with large scale deep reinforcement learning. arXiv: 1912.06680
81. Parikh D, Zitnick C (2011) Human-debugging of machines. NIPS WCSSWC 2(7):3
82. Peng B et al (2016) A need for speed: adapting agent action speed to improve task learning
from non-expert humans. In: Proceedings of the 2016 international conference on autonomous
agents & multiagent systems. International Foundation for Autonomous Agents and Multia-
gent Systems, pp 957–965
83. Puterman ML (2014) Markov decision processes: discrete stochastic dynamic programming.
Wiley
84. Ribeiro MT, Singh S, Guestrin C (2016) Why should i trust you? Explaining the predictions
of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on
knowledge discovery and data mining, pp 1135–1144
85. Risi S, Togelius J (2020) Increasing generality in machine learning through procedural content
generation. Nat Mach Intell 2(8):428–436
86. Rosenfeld A et al (2018) Leveraging human knowledge in tabular reinforcement learning: a
study of human subjects. In: The knowledge engineering review 33
87. Russell SJ, Norvig P (2016) Artificial intelligence: a modern approach. Pearson Education
Limited, Malaysia
88. Sacha D et al (2017) What you see is what you can change: human-centered machine learning
by interactive visualization. Neurocomputing 268:164–175
89. Saran A et al (2018) Human gaze following for human-robot interaction. In: 2018 IEEE/RSJ
international conference on intelligent robots and systems (IROS). IEEE, pp 8615–8621
90. Shah P, Hakkani-Tur D, Heck L (2016) Interactive reinforcement learning for task-oriented
dialogue management
91. Shah P et al (2018) Bootstrapping a neural conversational agent with dialogue self-play,
crowdsourcing and on-line reinforcement learning. In: Proceedings of the 2018 conference of
the North American chapter of the association for computational linguistics: human language
technologies, volume 3 (Industry Papers), pp 41–51
92. Silver D et al (2017) Mastering the game of go without human knowledge. Nature
550(7676):354–359
93. Sørensen PD, Olsen JM, Risi S (2016) Breeding a diversity of super mario behaviors through
interactive evolution. In: 2016 IEEE conference on computational intelligence and games
(CIG). IEEE, pp 1–7
94. Suay HB, Chernova S (2011) Effect of human guidance and state space size on interactive
reinforcement learning. In: 2011 Ro-Man. IEEE, pp 1–6
95. Sutton R, Littman M, Paris A (2019) The reward hypothesis. Accessed: 2019–08-21. http://
incompleteideas.net/rlai.cs.ualberta.ca/RLAI/rewardhypothesis.html
96. Sutton RS (1996) Generalization in reinforcement learning: successful examples using sparse
coarse coding. In: Advances in neural information processing systems, pp 1038–1044
97. Sutton RS (1985) Temporal credit assignment in reinforcement learning
Sketch-Based Creativity Support Tools Using Deep Learning
Forrest Huang, Eldon Schoop, David Ha, Jeffrey Nichols, and John Canny
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
Y. Li and O. Hilliges (eds.), Artificial Intelligence for Human Computer Interaction: A Modern Approach, Human–Computer Interaction Series, https://doi.org/10.1007/978-3-030-82681-9_12
1 Introduction
Sketching is a natural and effective means to express novel artistic and functional
concepts. It is an integral part of the creative process for many artists, engineers, and
educators. The abstract yet expressive nature of sketches enables sketchers to quickly
communicate conceptual and high-level ideas visually while leaving out unnecessary
details. These characteristics are most notably manifested in the use of sketches in
design processes, where sketches are used by designers to iteratively discuss and
critique high-level design concepts and ideas.
Recent advances in deep-learning (DL) models have greatly improved machines' abilities to perform Computer Vision tasks. In particular, convolutional neural networks,
recurrent neural networks, and attention mechanisms dramatically outperform prior
state-of-the-art methods in comprehending and generating visual content. They can
perform these tasks even conditioned on user-specified natural language or other
accompanying semantic information. These architectures provide great opportuni-
ties for developing creativity support applications of the type we have argued for.
They support sketch inputs and outputs for applications such as sketch-based image
retrieval and sketch generation systems.
This chapter explores and surveys multiple facets of research in deep-learning-
based sketching systems that support creative processes. We describe three research
projects spanning design and artistic applications, targeting amateur and professional
users, and using sketches as inputs and outputs. We outline three important aspects in
this area of research: (1) collecting appropriate sketch-based data; (2) adapting exist-
ing architectures and tasks to the sketch domain; and, (3) formulating new tasks that
support novel interactions using state-of-the-art model architectures and deploying
and evaluating these novel systems.
The key projects that we will include in this chapter are as follows:
Sketching plays a pivotal role in many types of creative activities because of its
highly visual nature and its flexibility for creation and manipulation: users can create
and imagine any kind of visual content, and continuously revise it without being
constrained by unnecessary details. Sketches are an independent art form, but are
also extensively used to draft and guide other forms of artistic expression such as oil
painting, or storyboarding in films and motion graphics. Moreover, because sketches
effectively communicate visual ideas, they are well-suited for design processes such
as User Interface (UI) or User Experience (UX) Design.
For these reasons, a plethora of research systems and tools that use sketches have
been developed by the HCI community. We survey several notable systems in the
domains of artistic sketches and design sketches in this section.
Prior works that aim to support artistic sketches have mostly taken the form of
real-time assistance to directly improve the final sketching product, or to generate
sketching tutorials for improving users’ sketching proficiency. A number of sketching
assistants use automatically-generated and crowd-sourced drawing guidance. Shad-
owDraw [23] and EZ-sketching [36] use edge images traced from natural images to
suggest realistic sketch strokes to users. PortraitSketch provides sketching assistance
specifically on facial sketches by adjusting geometry and stroke parameters [39].
Real-time, crowd-sourced feedback has also been used to correct and improve users' sketched strokes [25].
In addition to assisted sketching tools, researchers have developed tutorial systems
to improve users’ sketching proficiency. How2Sketch automatically generates multi-
step tutorials for sketching 3D objects [14]. Sketch-sketch revolution provides first-
hand experiences created by sketch experts for novice sketchers [10].
Designers use sketches to expand novel ideas, visualize abstract concepts, and rapidly
compare alternatives [4]. They are commonplace in the modern design workspace
and typically require minimal effort for designers to produce. They are also some-
times preferred over high-fidelity artifacts because the details left out by abstract
sketches imply and indicate incompleteness of designs. They encourage designers to more freely imagine and provide alternatives based on current designs without being concerned about committing to an existing design. In some of our informal conversations with designers, we observed that designers will trace high-fidelity design renderings using rough sketch strokes to solicit higher-level, creative feedback.
Research in the HCI community has produced interfaces that use drawing input for
creating interactive design prototypes. SILK is the first system that allows designers to
author interactive, low-fidelity UI prototypes by sketching [21]. DENIM allows web
designers to prototype with sketches at multiple detail levels [26]. More recently,
Apparition uniquely allows users to sketch their desired interfaces while having
crowdworkers translate sketches into workable prototypes in near real time [22].
In this section, we explore the first step toward developing a novel deep-learning-
driven creativity support application: collecting a large-scale sketch dataset. We
specifically target the task of drawing correspondence between two types of visual
artifacts commonly used in the early stage of UI design: low-fidelity design sketches
and high-fidelity UI examples.
Both sketches and UI design examples are commonly used in the UI design process, as reported by a variety of prior studies [15, 28] and our informal conversations with designers. Designers search, consult, and curate design examples to gain inspi-
ration, explore viable alternatives and form the basis for comparative evaluations [3,
15]. Similarly, designers frequently use sketches to expand novel ideas, visualize
abstract concepts, and rapidly compare alternatives [4]. As such, understanding cor-
respondences between these two modalities would allow machines to rapidly retrieve
popular visual illustrations, common flow patterns, and high-fidelity layout imple-
mentations [6] from large corpuses, which can greatly augment various design tasks
[20, 29, 37].
To solve this task using deep-learning-based approaches, we decided to collect a
dataset of actual sketches stylistically and semantically similar to designers’ sketches
of UIs. This also allows us to leverage large-scale UI datasets recently introduced
by mobile-interaction mining applications [6], such that we would only need to
collect sketches newly created by designers based on screenshots of original UIs
in the Rico dataset [7]. The dataset is now publicly available at https://github.com/
huang4fstudio/swire.
We recruited four designers, all with prior experience in UI/UX design and degrees in design-related fields. They were compensated 20 USD per hour and worked for 60–73 h.
We collected 3702 sketches¹ of 2201 UI examples from 167 popular apps in the
Rico dataset. Each sketch was created with pen and paper in 4.1 min on average.
Many UI examples were sketched by multiple designers. 66.5% of the examples were
sketched by 2 designers, 32.7% of the examples were sketched by 1 designer and
the remaining examples (<1%) were sketched by 3 designers in our dataset. Our 4
designers sketched 455/1017/1222/1008 UIs, respectively, based on their availability.
We allocated batches of examples to different combinations of designers to ensure
the generality of the dataset.
We did not have the resources to generate sketches for every UI in the Rico
dataset, so we curated a diverse subset of well-designed UI examples that cover 23
app categories in the Google Play Store and were of average to high design quality.
We omitted poorly designed UIs from the dataset because of the relatively small size
of the dataset for neural network training. Noise introduced into training by poor
designs would have the potential to negatively impact the training time and quality
of our model.
We supplied the screenshots of our curated UI examples to the recruited designers and
asked them to create sketches corresponding to the screenshots with pen and paper.
They were prompted to reconstruct a low-fidelity sketch from each screenshot as if
they were the designers of the interfaces. We instructed them to replace all actual
image content with a sketched placeholder (a square with a cross or a mountain)
and replace dynamic text with template text in each screenshot as shown in Fig. 1.
We added these instructions to obtain sketches with a more unified representation
focused on the design layout of various UIs. These instructions also make it easier
for the neural network to learn the concepts of images and text within the constraints
of our small dataset.
In order to efficiently collect and calibrate sketches created by multiple designers
in various formats of photos and scans, we supplied them with paper templates with
frames for them to sketch on as shown in Fig. 1. These frames are annotated with
four ArUco codes [27] at the corners to allow perspective correction. All photos and scans of the sketches are corrected with an affine transformation and thresholded to obtain binary sketches as the final examples in the dataset.
¹ This total of 3702 sketches differs from the original Swire publication [17]. We discovered that 100 trial sketches from a pilot study were accidentally included in the original stated total and we have corrected the numbers in this chapter.
Fig. 1 Data collection procedure. We first send a UI screenshot (left) and paper templates with ArUco markers to designers. Designers then sketch on the template and send back a photo or a scan of the completed sketch (middle). We then post-process the photo using computer vision techniques to obtain the final clean sketch dataset (right)
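As a concrete illustration of this post-processing step, the following is a minimal OpenCV sketch (assuming the legacy cv2.aruco.detectMarkers API from opencv-contrib-python, a hypothetical marker-ID-to-corner convention, and a perspective rather than strictly affine warp; it is not the pipeline used to build the dataset).

import cv2
import numpy as np

def rectify_sketch(photo_bgr, out_size=(1024, 1024)):
    # Detect the four ArUco markers printed on the paper template.
    gray = cv2.cvtColor(photo_bgr, cv2.COLOR_BGR2GRAY)
    dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
    corners, ids, _ = cv2.aruco.detectMarkers(gray, dictionary)
    ids = ids.flatten()

    # Assume marker IDs 0-3 sit at the template's top-left, top-right,
    # bottom-right, and bottom-left corners (hypothetical convention).
    centers = {i: c[0].mean(axis=0) for i, c in zip(ids, corners)}
    src = np.float32([centers[0], centers[1], centers[2], centers[3]])
    w, h = out_size
    dst = np.float32([[0, 0], [w, 0], [w, h], [0, h]])

    # Warp the photo onto the template frame, then threshold to a binary sketch.
    warped = cv2.warpPerspective(gray, cv2.getPerspectiveTransform(src, dst), out_size)
    _, binary = cv2.threshold(warped, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary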
This training scheme is shown to be useful for sketch-based image retrieval [33]. In
the querying phase, we use Swire’s trained neural network to encode a user’s sketch
query and retrieve UIs with the closest output to the sketch query’s output.
The model is trained with a Triplet Loss function [34, 40] that involves the neural-network outputs of three inputs: an 'anchor' sketch s, a 'positive' matching screenshot i, and a 'negative' mismatched screenshot i′. This forms two pairs of inputs during training. The positive pair p(s, i)+ consists of a sketch-screenshot pair that correspond to each other. The negative pair p(s, i′)− consists of a sketch-screenshot pair that does not correspond. The negative pair is obtained with the same sketch from the positive pair and a random screenshot sampled from the mini-batch.
Fig. 2 Network architecture of Swire's neural network. Swire's neural network consists of two identical sub-networks similar to the VGG-A deep convolutional neural network. These networks have different weights and attempt to encode matching pairs of screenshots and sketches with similar values
During training, each pair p(s, i) is passed through the two sub-networks: the sketch sample s is passed through the sketch sub-network to produce an embedding f_s(s), and we similarly obtain the neural-network output f_i(i) of the screenshot from the screenshot sub-network. We compute the Euclidean distance D between the neural-network outputs of each pair. We maintain a margin m between the positive and negative pairs to prevent the network from learning trivial solutions (zero embeddings for all examples).
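One standard way to write such a margin-based objective over the two pairs (a sketch of the usual hinge-style triplet formulation; the chapter's exact equation may differ) is

$$\mathcal{L}(s, i, i') = \max\Big(0,\; D\big(f_s(s), f_i(i)\big) - D\big(f_s(s), f_i(i')\big) + m\Big),$$

which pulls the matching screenshot embedding toward the sketch embedding while pushing the mismatched one at least a margin m further away.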
We train our network using the dataset described in Sect. 4. Since the sketches are
created by four separate designers, we split the data and used data collected from
three designers for training and from one designer for testing. This is to ensure that
the model generalizes across sketches produced by different designers. In addition,
we do not repeat interfaces from the same apps between the training and test sets.
This creates 1722 matching sketch-screenshot pairs for training and 276 pairs for
testing.
During training, the sketches and screenshots are resized to 224 × 224 pixels,
and the pixel values are normalized between (−1, 1) centered at 0. The network is
trained using a Stochastic Gradient Descent Optimizer with a mini-batch size of 32.
The learning rate is 1 × 10^−2. The margin is 0.2 in all models. All hyper-parameters
listed above were determined by empirical experiments on the training set.
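The configuration above can be summarized in a short PyTorch-style sketch (a minimal illustration only; torchvision's vgg11 stands in for the VGG-A-style sub-networks, and the 64-dimensional embedding head is a simplification of the architecture in Fig. 2).

import torch
import torchvision
import torchvision.transforms as T

# Two VGG-A-style encoders with separate weights (vgg11 follows the VGG-A
# configuration); the final layer is replaced to produce a 64-d embedding.
def make_encoder():
    net = torchvision.models.vgg11(weights=None)  # torchvision >= 0.13
    net.classifier[-1] = torch.nn.Linear(4096, 64)
    return net

sketch_net, screenshot_net = make_encoder(), make_encoder()

# Preprocessing: resize to 224 x 224, scale to [0, 1], then shift to (-1, 1).
preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

# Optimization settings reported above.
optimizer = torch.optim.SGD(
    list(sketch_net.parameters()) + list(screenshot_net.parameters()), lr=1e-2)
margin, batch_size = 0.2, 32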
5.4 Querying
When a user makes a query with a drawn sketch, the model computes an output by
passing the sketch through the sketch sub-network. This output is then compared
with all neural-network outputs of the screenshots of UI examples in the dataset
using a nearest neighbor search. The UI results are ranked by the distance between
their outputs and the user’s sketch’s output.
5.5 Results
5.5.1 Baselines
We use a test set that consists of 276 UI examples to compare Top-1 and Top-
10 performances of BoW-HOG filters and Swire. The results are summarized in
Table 1. We observe that Swire significantly outperforms BoW-HOG filters for Top-
10 performance at 60.9%. For Top-1 accuracy, Swire achieves an accuracy of 15.9%
which only slightly outperformed the strong baseline of BoW-HOG filters at 15.6%.
This shows Swire to be particularly effective for retrieving complex examples from
the dataset compared to the BoW-HOG filters. We believe deep-learning-based Swire
is advantageous compared to BoW-HOG filters that rely on matching edge-maps
because UI sketches have semantic complexities that are not captured by edge-maps
of screenshots.
Table 1 Top-k accuracy of various models on the test set. Swire significantly outperforms BoW-
HOG filters
Technique Top-1 (%) Top-10 (%)
(Chance) 0.362 3.62
BoW-HOG filters 15.6 38.8
Swire 15.9 60.9
We visualize query results from the test set to qualitatively understand the perfor-
mance of Swire in Fig. 3. Swire is able to retrieve relevant menu-based interfaces
(Example a) despite the difference in visual appearance of the menu items. Swire
is also able to retrieve pop-up windows (Example b) implemented in various ways
despite the drastic difference in the dimensions of the pop-up windows. We observe
similar efficacy in retrieving settings (Example c), list-based layouts (Example f), and
login layouts (Example e). Nevertheless, we observe that Swire sometimes ignores
smaller details of the interfaces described by sketched elements. This limitation will
be further discussed in Sect. 7.1.
(a) (d)
(b) (e)
(c) (f)
Fig. 3 Query results for complete sketches. Swire is able to retrieve common types of UIs such as
sliding menus (a), settings (c), and login (e) layouts
On the other hand, some designers considered the ‘poor’ results unsatisfactory.
For example, designers were less satisfied with the model’s performance on a sign-
up sketch, commenting that the model only gathered screens with similar element
layouts while ignoring the true nature of the input fields and buttons in the query
(D3). However, D4 considered ‘rows of design elements’ common in the results
relevant to the sketch, and D1 considered two similar sign-up screens retrieved by
the model as strong results even though they did not match up perfectly with the sketch.
In general, we observed that designers were more satisfied with the results when
the model was able to retrieve results that are semantically similar at a high-level
instead of those with matching low-level element layouts. Notably, D1 commented
that we ‘probably already considered the common UI sketch patterns and train[ed]
[our] system to match it up with image results,’ which reflects the effectiveness of
Swire in detecting common UI patterns in some instances, even though it was not
specifically trained to recognize these patterns. All designers also considered Swire to
be potentially useful in their workflows for researching, ideating, and implementing
novel designs.
5.6 Applications
In Sect. 5.5, we evaluated and validated Swire’s effectiveness for generally finding
design examples through sketch-based queries. Since both sketches and UI design
examples are commonly used in the early stages of the UI design process as reported
Fig. 4 Alternative design query results. Swire is able to retrieve similar UIs in the dataset from
queries of complete, high-fidelity UI screenshots
by a variety of prior studies [15, 28], we explore the potential usage of Swire through
several design applications in this section. Prototypes of these applications imple-
mented with Jupyter Notebook are available at https://github.com/huang4fstudio/
swire.
Sketches are often used for rapid exploration of potential design solutions [4]. Design-
ers use partial sketches to express core ideas, while leaving out other parts of the interface.
Fig. 5 Query results for a incomplete sketches and b flow queries. Swire is able to retrieve interfaces
only based on parts specified by users’ sketches while remaining agnostic to other parts of the UIs.
Swire is also able to retrieve user flows by querying multiple UIs in sequences concurrently
Beyond querying for single UIs, designers also use sketches to illustrate user expe-
rience at multiple scales [28], such as conveying transitions and animations between
multiple interfaces. Since the Rico dataset also includes user interaction data, we use
this data to enable flow querying with Swire. Designers can use this application to
interact with interaction design examples that can accelerate the design of effective
user flows.
To query flow examples in the dataset, since Swire creates a single embedding
for each UI, we can match an arbitrary number of interfaces in arbitrary order by
concatenating the embedding values during the ranking process of querying. We
qualitatively observe that Swire is able to retrieve registration (Fig. 5b) and ‘closing
menu’ flows that are commonly implemented by designers. Since Rico also contains
transition details between consequent UIs, these examples can demonstrate popular
animation patterns [6] that provide inspiration to interaction and animation designers.
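As a sketch of this ranking step (illustrative names; candidate flows are assumed to be pre-extracted screen sequences of the same length as the query):

import numpy as np

def flow_query(sketch_embeddings, candidate_flow_embeddings, k=10):
    # sketch_embeddings:         list of (64,) embeddings, one per sketch in the flow
    # candidate_flow_embeddings: (N, len(flow), 64) embeddings of candidate UI sequences
    # Per-screen embeddings are concatenated in order, so screen j of the query
    # is compared against screen j of each candidate flow.
    query_vec = np.concatenate(sketch_embeddings)
    candidates = candidate_flow_embeddings.reshape(len(candidate_flow_embeddings), -1)
    distances = np.linalg.norm(candidates - query_vec, axis=1)
    return np.argsort(distances)[:k]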
The creation of complex sketches often begins with semantic planning of scene
objects. Sketchers often construct high-level scene layouts before filling in low-level
details. Modeling ML systems after this high-to-low-level workflow has been shown
to be beneficial for transfer learning from other visual domains and for supporting
interactive interfaces for human users [16]. Inspired by this high-to-low-level process,
Scones adopts a hierarchical workflow that first proposes a scene-level composition
layout of objects using its Composition Proposer, then generates individual object
sketches, conditioned on the scene-level information, using its Object Generators
(Fig. 6).
Fig. 6 Overall architecture of Scones. Scones takes a two-stage approach toward generating and modifying sketched scenes based on users' instructions
The Composition Proposer in Scones uses text instructions to place and configure
objects in the scene. It also considers recent past iterations of text instructions and
scene context at each conversation turn. As text instructions and sketch components
occur sequentially in time, each with a variable length of tokens and objects, respec-
tively, we formulate composition proposal as a sequence modeling task. We use
a self-attention-only decoder component of the Transformer [38], a recent deep-
learning model architecture with high performance for this task.
To produce the output scene S_i at turn i, the Composition Proposer takes as input the n = 10 previous scenes S_{i−n}, ..., S_{i−1} and text instructions C_{i−n}, ..., C_{i−1} as recent context of the conversation. Each output scene S_i contains l_i objects o_{i,1}, ..., o_{i,l_i} ∈ S_i and special tokens o_s marking the beginning and o_e marking the end of the scene. Each text instruction C_i contains m_i text tokens t_{i,1}, ..., t_{i,m_i} ∈ C_i that consist of words and punctuation marks.
We represent each object o as a 102-dimensional vector:
o = [1_s, 1_e, e(o), e(u), e(s), e(f), x, y].
The first two dimensions 1_s, 1_e are Boolean attributes reserved for the start and end of the scene object sequences. e(o) is a 58-dimensional one-hot vector² representing one of 58 classes of the scene object. e(u) is a 35-dimensional one-hot vector representing one of 35 sub-types (minor variants) of the scene objects. e(s) is a three-dimensional one-hot vector representing one of three sizes of the scene objects. e(f) is a two-dimensional one-hot vector representing whether the object is flipped horizontally in the x-direction. The last two dimensions x, y ∈ [0, 1] represent the x- and y-position of the center of the object. This representation is very similar to that of the CoDraw dataset the model was trained on, which is described in detail in Sect. 6.2.1. For each text token t, we use a 300-dimensional GLoVe vector trained on 42B tokens from the Common Crawl dataset [30] to semantically represent these words in the instructions.
2 An encoding of class information that is an array of bits where only the corresponding position
for the class to be encoded is 1, and all other bits are 0s.
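A small sketch of how such a 102-dimensional vector can be assembled (the helper below is illustrative and not part of the Scones codebase):

import numpy as np

N_CLASS, N_SUBTYPE, N_SIZE, N_FLIP = 58, 35, 3, 2

def one_hot(index, length):
    vec = np.zeros(length, dtype=np.float32)
    vec[index] = 1.0
    return vec

def object_vector(cls, subtype, size, flipped, x, y, is_start=False, is_end=False):
    # 2 start/end flags + 58-d class + 35-d sub-type + 3-d size + 2-d flip + (x, y)
    # = 102 dimensions in total.
    return np.concatenate([
        np.array([float(is_start), float(is_end)], dtype=np.float32),
        one_hot(cls, N_CLASS),
        one_hot(subtype, N_SUBTYPE),
        one_hot(size, N_SIZE),
        one_hot(int(flipped), N_FLIP),
        np.array([x, y], dtype=np.float32),
    ])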
Fig. 7 The scene layout generation process using the Transformer model of the Composition
Proposer
To train the Transformer network with the heterogeneous inputs of o and t across
the two modalities, we create a unified representation of cardinality |o| + |t| = 402
and adapt o and t to this representation by simply padding additional dimensions in
the representations with zeros as shown in Eq. 1.
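A minimal sketch of this zero-padding scheme (the slot ordering, objects first and text second, is an assumption made here for illustration):

import numpy as np

OBJ_DIM, TEXT_DIM = 102, 300   # |o| and |t|; the unified dimension is 402

def pad_object(object_vec):
    # Object vectors occupy the first 102 slots; the text slots stay zero.
    return np.concatenate([object_vec, np.zeros(TEXT_DIM, dtype=np.float32)])

def pad_text(glove_vec):
    # GLoVe word vectors occupy the last 300 slots; the object slots stay zero.
    return np.concatenate([np.zeros(OBJ_DIM, dtype=np.float32), glove_vec])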
Since the outputs of the Composition Proposer are scene layouts consisting of high-
level object specifications, we generate the final raw sketch strokes for each of these
objects based on their specifications with Object Generators. We adapt Sketch-RNN
to generate sketches of individual object classes to present to users for evaluation and revision in the next conversation turn. Each sketched object Q consists of h strokes q_{1...h}. The strokes are encoded using the Stroke-5 format [13]. Each stroke q = [Δx, Δy, p_d, p_u, p_e] represents the state of a pen performing the sketching process. The first two properties Δx and Δy are offsets from the previous point that the pen moved from. The last three elements [p_d, p_u, p_e] are a one-hot vector representing the state of the pen after the current point (pen down, pen up, and end of sketch, respectively). All sketches begin with the initial stroke q_1 = [0, 0, 1, 0, 0].
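The Stroke-5 encoding can be illustrated with a short helper that converts absolute pen positions into offset form (illustrative code, not taken from the Sketch-RNN implementation):

import numpy as np

def to_stroke5(points, pen_up_after):
    # points:       list of absolute (x, y) pen positions, in drawing order
    # pen_up_after: set of indices i such that the pen is lifted after points[i]
    rows = [[0, 0, 1, 0, 0]]                  # initial stroke q_1
    for i in range(1, len(points)):
        dx = points[i][0] - points[i - 1][0]  # offsets from the previous point
        dy = points[i][1] - points[i - 1][1]
        pen_up = 1 if (i - 1) in pen_up_after else 0
        rows.append([dx, dy, 1 - pen_up, pen_up, 0])
    rows.append([0, 0, 0, 0, 1])              # end-of-sketch state
    return np.array(rows, dtype=np.float32)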
Since Sketch-RNN does not constrain aspect ratios, directions and poses of its
output sketches, we introduce two additional conditions for the sketch generation
process: masks m and aspect ratios r . These conditions ensure our Object Generators
generate sketches with appearances that follow the object specifications generated
by the Composition Proposer. For each object sketch, we compute the aspect ratio r = Δy/Δx by taking the distance between the leftmost and rightmost stroke as Δx and the distance between the topmost and bottommost stroke as Δy. To compute the object mask m, we first render the strokes into a pixel bitmap, then mark all pixels as 1 if they are in between the leftmost pixel p_{y,x_min} and rightmost pixel p_{y,x_max} that are passed through by any strokes for each row y, or if they are in between the bottommost pixel p_{x,y_min} and topmost pixel p_{x,y_max} that are passed through by any strokes for each column x (Eq. 3). As this mask-building algorithm only involves
pixel computations, we can use the same method to build masks for clip art objects
(used to train the Composition Proposer) to generate sketches with poses matching
the Composition Proposer’s object representations.
$$m_{(x,y)} = \begin{cases} 1 & \text{if } p_{y,x_{\max}} \ge x \ge p_{y,x_{\min}}, \text{ or} \\ 1 & \text{if } p_{x,y_{\max}} \ge y \ge p_{x,y_{\min}} \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$
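Because the mask only involves row- and column-wise pixel spans, it can be computed directly from a rendered binary bitmap; the following is a small numpy sketch of Eq. 3 together with the aspect ratio r (illustrative helper names):

import numpy as np

def build_mask(bitmap):
    # bitmap: 2-D array with 1 where a stroke passes through a pixel.
    mask = np.zeros_like(bitmap, dtype=np.uint8)
    height, width = bitmap.shape
    for y in range(height):                    # fill horizontal spans per row
        xs = np.flatnonzero(bitmap[y])
        if xs.size:
            mask[y, xs[0]:xs[-1] + 1] = 1
    for x in range(width):                     # fill vertical spans per column
        ys = np.flatnonzero(bitmap[:, x])
        if ys.size:
            mask[ys[0]:ys[-1] + 1, x] = 1
    return mask

def aspect_ratio(bitmap):
    # r = dy / dx over the bounding box of all stroke pixels.
    ys, xs = np.nonzero(bitmap)
    return (ys.max() - ys.min()) / max(xs.max() - xs.min(), 1)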
$$q_{1\ldots h} = \text{Sketch-RNN Decoder}([m, r, z]), \quad z \sim \mathcal{N}(0, 1)^{128}; \qquad z = \text{Sketch-RNN Encoder}(q_{1\ldots h}). \qquad (4)$$
At each time step, the decoder outputs the parameters of a Gaussian mixture model (GMM) which will be sampled to obtain Δx and Δy. It also outputs probabilities for a categorical distribution that will be sampled to obtain p_d, p_u and p_e. This generation process and the architecture of the model are illustrated in Fig. 8, and are described in the Sketch-RNN paper [13].
We used the CoDraw dataset [19] to train the Composition Proposer to generate
high-level scene layout proposals from text instructions. The task used to collect
this data involves two human users taking on the roles of Drawer and Teller in each
session. First, the Teller is presented with an abstract scene containing multiple clip
art objects in certain configurations, and the Drawer is given a blank canvas. The
Teller provides instructions using only text in a chat interface to instruct the Drawer
on how to modify clip art objects in the scene. The Teller has no access to the Drawer’s
canvas in most conversation turns, except in one of the turns when they can decide to
‘peek’ at the Drawer’s canvas. The dataset consists of 9993 sessions of conversation
records, scene modifications, and ground-truth scenes.
Using this dataset, we trained the Composition Proposer to respond to users’
instructions given past instructions and scenes. We used the same training/validation/
test split as the original dataset. Our model is trained to optimize a loss function L_cm that is a weighted combination of terms corresponding to various attributes of the scene objects in the training set. L_c is the cross-entropy loss between the one-hot vector of the true class label and the output probabilities predicted by the model. Similarly, L_flip and L_size are cross-entropy losses for the horizontal orientation and size of the object. L_xy is the Euclidean distance between the predicted position and the true position of the scene object.
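One formulation consistent with the weights reported below (assuming the class term is left unweighted and that a sub-type cross-entropy term L_sub accompanies the weight λ_sub; the exact form is an assumption of this sketch) is

$$L_{cm} = L_c + \lambda_{sub} L_{sub} + \lambda_{flip} L_{flip} + \lambda_{size} L_{size} + \lambda_{xy} L_{xy}.$$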
We trained the model using an Adam optimizer with a learning rate of lr = 1 × 10^−4 for 200 epochs. We set λ_sub = 5.0 × 10^−2, λ_flip = 5.0 × 10^−2, λ_size = 5.0 × 10^−2, λ_xy = 1.0. These hyper-parameters were tuned based on empirical experiments
on the validation split of the dataset.
The Quick, Draw! dataset consists of sketch strokes of 345 concept categories created by human users in a game within 20 s [18]. We trained our 34 Object Generators on 34
categories of Quick, Draw! data to create sketches of individual objects.
Each sketch stroke in Quick, Draw! was first converted to the Stroke-5 format. The Δx and Δy offsets of the sketch strokes were normalized with their standard deviations for all sketches in their respective categories. Each category consists of 75000/2500/2500
sketches in the training/validation/test set.
The loss function of the conditional Sketch-RNN L_s consists of the reconstruction loss L_R and the KL loss L_KL:
L_s = λ_KL L_KL + L_R    (6)
³ For some object categories, we found that increasing the KL weight to 1.0 improves the authors'
perceived quality of generated sketches.
6.3 Results
To evaluate the output of the Composition Proposer against the models introduced
with the CoDraw dataset, we adapted its output to match that expected by the well-
defined evaluation metrics proposed by the original CoDraw paper [19]. The original
task described in the CoDraw paper involves only proposing and modifying high-
level object representations in scenes agnostic to their appearance. The performance
of a ‘Drawer’ (a human or machine which generates scene compositions) can be
quantified by a similarity metric constrained between 0 and 5 (higher is more similar)
by comparing properties of and relations between objects in the generated scene and
objects in the ground truth from the dataset.
Running our Composition Proposer on the CoDraw test set, we achieved an aver-
age similarity metric of 3.55. This exceeded existing state-of-the-art performance
(Table 2) on the iterative scene authoring task using replayed text instructions (script)
from CoDraw.
To provide an illustrative example of our Composition Proposer’s output on this
task, we visualize two example scenes generated from the CoDraw validation set in
Fig. 9. In scene (a), the Composition Proposer extracted the class (slide), direction
(faces right), and position relative to parts of the object (ladder along left edge) from
the text instruction, to place a slide in the scene. Similarly, it was able to place the bear
in between the oak and pine trees in scene (b), with the bear touching the left edge of
the pine tree. It is important to note that the Composition Proposer completely regenerates
the entire scene at each conversation turn. This means it correctly preserved object
attributes from previous scenes while making the requested modifications from the
current turn. In these instances, the sun in scene (a) and the trees in scene (b) were
left mostly unchanged while other attributes of the scenes were modified.
Fig. 9 Example scenes for the scene layout modification task. The Composition Proposer was able
to improve state-of-the-art performance for modifying object representations in scene compositions
6.3.2 Sketches with Clip Art Objects as Mask and Ratio Guidance
The Object Generators are designed to generate sketches which respect high-level
scene layout information under the guidance of the mask and aspect ratio conditions.
To inform generated object sketches with pose suggestions from scene composition
layouts, we built outline masks from clip art objects and computed aspect ratios using
the same method as building them for training sketches described in Sect. 6.1.2. We
demonstrate the Object Generators’ performance in two important scenarios that
allow Scones to adapt to specific subclass and pose contexts.
Generating Objects for Closely Related Classes
While the Composition Proposer classifies objects as one distinct class out of 58,
some of these classes are closely related and are not differentiated by the Object
Generators. In these cases, object masks can be used by an Object Generator to
effectively disambiguate the desired output subclass. For instance, the Composition
Proposer generates trees as one of three classes: Oak tree (tall and with curly edges),
Apple tree (round and short), and Pine tree (tall and pointy); while there is only a
single Object Generator trained on a general class of all types of tree objects. We
generated three different masks and aspect ratios based on three clip art images and
used them as inputs to a single tree-based Object Generator to generate appropriate
tree objects (by sampling z ∼ N(0, 1)^128). The Object Generator was able to sketch
trees with configurations corresponding to input masks from clip art objects (Fig. 10).
The generated sketches for pine trees were pointy; for apple trees, had round leaves;
and for oak trees, had curvy edges.
Fig. 10 Sketch generation results of trees conditioned on masks. The Object Generator was able
to sketch trees of three different classes based on mask and aspect ratio inputs
Fig. 11 Sketch generation results of racquets conditioned on masks. The Object Generator was
able to sketch racquets at two orientations consistent with the masks
We show the usage of Scones in six turns of conversation from multiple sessions
in Fig. 12. We curated these sessions by interacting with the system ourselves to
demonstrate various capabilities of Scones. In session (a), Scones was able to draw
and move the duck to the left, sketch a cloud in the middle, and place and enlarge the
tree on the right, following instructions issued by the user. In session (b), Scones was
similarly able to place and move a cat, a tree, a basketball and an airplane, but at
different positions from session (a). For instance, the tree was placed on the left as
opposed to the right, and the basketball was moved to the bottom. We also show the
ability of Scones to flip objects horizontally in session (b), such that the plane was
flipped horizontally and regenerated given the instructions of ‘flip the plane to point
to the right instead’. This flipping action demonstrates the Object Generator’s ability
to generate objects with the require poses by only sharing the latent vectors z, such
that the flipped airplane exhibits similar characteristics as the original airplane. In
both sessions, Scones was able to correlate multiple scene objects, such as placing
the owl on the tree in session (a), and basketball under the tree in session (b).
We can further verify the relationships between text and object representations
learned by the model by visualizing attention weights computed by the Transformer
model of the Composition Proposer. These weights also create the unique possibility
of generalizing and prompting for sketches of new objects specified by users.
The Transformer model in the Composition Proposer uses masked self-attention
to attend to scene objects and instructions from previous time steps most relevant to
generating the object specification at the current time step. We explore the attention
weights of the first two turns of a conversation from the CoDraw validation set. In
the first turn, the user instructed the system, ‘top left is an airplane medium size
pointing left’. When the model generated the first object, it attended to the ‘airplane’
and ‘medium’ text tokens to select class and output size. In the second turn, the
user instructed the model to place a slide facing right under the airplane. The model
similarly attended to the ‘slide’ token the most. It also significantly attended to the
‘under’ and ‘plane’ text tokens, and the airplane object. These objects and tokens are
important for situating the slide object at the desired location relative to the existing
airplane object (Figs. 13 and 14).
These attention weights could potentially be used to handle unknown scene objects
encountered in instructions. When the model does not output any scene objects, but
only an o_e (scene-end) token, we can inspect the attention weights for generating this
token to identify a potentially unknown object class, and ask the user for clarification.
For example, when a user requests an unsupported class, such as a ‘sandwich’ or
‘parrot’ (Fig. 15), Scones could identify this unknown object by taking the text token
with the highest attention weight, and prompting the user to sketch it by name.
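A minimal sketch of this heuristic (an illustrative function, not part of the Scones implementation):

import numpy as np

def unknown_object_prompt(text_tokens, attention_weights, generated_objects):
    # text_tokens:        instruction tokens, e.g. ["add", "a", "parrot"]
    # attention_weights:  attention over those tokens at the scene-end step
    # generated_objects:  objects emitted this turn (empty if only o_e was emitted)
    # If no scene object was generated, ask the user to sketch the token that
    # received the highest attention weight when the scene-end token was produced.
    if len(generated_objects) == 0:
        candidate = text_tokens[int(np.argmax(attention_weights))]
        return f"I don't know how to draw '{candidate}' yet. Could you sketch it for me?"
    return None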
Fig. 13 Attention map of the Transformer across object and text tokens for the generation of an
airplane, the first object in the scene
Fig. 14 Attention map of the Transformer across object and text tokens for the generation of
slide in the second turn of conversation. We observed that the Transformer model attended to the
corresponding words and objects related to the newly generated ‘slide’ object
Fig. 15 Attention map of the Transformer for text instructions that specify unseen objects
To determine how effectively Scones can assist users in creating sketches from nat-
ural language, we conducted an exploratory evaluation of Scones. We recruited 50
participants from English-speaking countries on Amazon Mechanical Turk (AMT)
for our study. We collected quantitative and qualitative results from user trials with
Scones, as well as suggestions for improving Scones. Participants were given a maxi-
mum of 20 min to complete the study and were compensated $3.00 USD. Participants
were only allowed to complete the task once.
6.4.1 Method
The participants were asked to recreate one of five randomly chosen target scene
sketches by providing text instructions to Scones in the chat window. Each target
scene had between four and five target objects from a set of 17 possible scene objects.
Participants were informed that the final result did not have to be pixel perfect to
the target scene, and to mark the sketch as complete once they were happy with the
result. Instructions supplied in the chat window were limited to 500 characters, and
submitting an instruction was considered as taking a ‘turn’. The participants were
only given the sketch strokes of the target scene without class labels, to elicit natural
instructions.
Participants were first shown a short tutorial describing the canvas, chat inter-
face, and target scene in the Scones interface (Fig. 16), and were asked to give simple
instructions in the chat window to recreate the target scene. Only two sample instruc-
tions were given in the tutorial: ‘add a tree’, and ‘add a cat next to the table’. At each
turn, participants were given the option to redraw objects which remained in the scene
for over three turns using a paintbrush-based interface. After completing the sketch,
participants filled out an exit survey with Likert-scale questions on their satisfaction with the sketch and enjoyment of the system, and open-ended feedback on the system.
6.4.2 Results
Fig. 18 Recreated scenes during the user study. Users combined Scones-generated outputs with
their own sketch strokes to reproduce the target scenes presented to them
Participants offered suggestions for how they would improve Scones, providing
avenues for future work.
Object Translations and Spatial Relationships
A major theme of dissatisfaction came from the limited ability of our system to
respond to spatial relationships and translation-related instructions at times: ‘It does
not appear to understand spatial relationships that well’ (P35); ‘you are not able
to use directional commands very easily’ (P11). These situations largely originate
from the CoDraw dataset [19], in which users had a restricted view of the canvas,
resulting in limited relative spatial instructions. This limitation is discussed further
in Sect. 7.1.
To improve the usability of Scones, participants suggest its interface could benefit
from the addition of direct manipulation features, such as selecting and manually
transforming objects in the scene: ‘I think that I would maybe change how different
items are selected in order to change of modify an object in the picture’ (P33);
‘maybe there should be a move function, where we keep the drawing the same but
move it’ (P40). Moreover, some participants also recommended adding an undo
feature, ‘Maybe a separate button to get back’ (P31), or the ability to manually
invoke Scones to redraw an object, ‘I’d like a way to ask the computer to redraw
a specific object’ (P3). These features could help participants express corrective
feedback to Scones, potentially creating sketches that better match their intent.
More Communicative Output
Some participants expected Scones to provide natural language output and feedback
to their instructions. Some participants asked questions directly to elicit Scones’s
capabilities: ‘In the foreground is a table, with a salad bowl and a jug of what
may be lemonade. In the upper left is a roughly-sketched sun. Drifting down from the
top-center is a box, tethered to a parachute, Did you need me to feed you smaller sen-
tences? …’ (P38). P23 explicitly suggested users should be able to ask Scones ques-
tions to refine their intentions: ‘I would like the system to ask more questions if it
does not understand or if I asked for several revisions. I feel that could help narrow
down what I am asking to be drawn’. Other participants used praise between their
sketching instructions, which could be used as a cue to preserve the sketch output
and guide further iteration: ‘…Draw an airplane, Good try, Draw a table …’ (P1);
‘Draw a sun in the upper left corner, The sun looks good! Can you draw a hot air
balloon in the middle of the page, near the top? …’ (P15). Providing additional nat-
ural language output and prompts from Scones could enable users to refine Scones’s
understanding of their intent and understand the system’s capabilities. A truly con-
versational interface with a sketching support tool could pave the way for advanced
mixed-initiative collaborative design tools.
Through the development of a sketch dataset and two sketch-based creativity support
systems, we identified some limitations of our work to date and opportunities for
future work in this area.
One crucial area for improvement for sketch-based applications is the scale and qual-
ity of the datasets. Although the dataset we described in Sect. 4 can be used to train
a sketch-based UI retrieval model described in Sect. 5, we observed that the perfor-
mance of the system has been limited by the diversity and complexity of sketches
and UIs in the dataset. This is demonstrated by two major failure modes of Swire: it
struggles to handle rare, custom UI elements (Example a in Fig. 19), and it fails to
understand UIs with diverse colors, such as those with image backgrounds (Example b
in Fig. 19). We believe that with an increased amount of training data, Swire can better
generalize to more complex and colorful UIs.
Similarly, we observe that Scones (Sect. 6) has been constrained by the differences
between the task protocol used to collect the CoDraw dataset (the dataset Scones was
trained on) and the user interactions in Scones. Each conversation in CoDraw only
offers the Teller one chance to ‘peek’ at the Drawer’s canvas, which significantly
decreases the number of modifications made to existing scene objects. As a result,
Scones performs well at adding objects of correct classes at appropriate locations
and sizes, but is not as advanced at modifying or removing objects. Moreover, the
current dataset was not collected end-to-end, so it contains only a small, fixed number
of styles of sketched objects, which reduces Scones' ability to handle stylistic instructions.
The ideal training data for this system would directly record iterative master-apprentice
interactions in creating, critiquing, modifying, and removing highly variable sketched
scene objects at a large scale. Nevertheless, such datasets are difficult to collect due to
the high sketching skill required of crowdworkers [42], such
Fig. 19 Failure modes of UI retrieval using Swire. Swire failed to understand a) custom and b)
colorful UI elements
that even single-turn multi-object sketched scenes are difficult for crowdworkers to
create.
We believe a significant research direction is to enable the collection of legitimate
sketching data at a large scale, especially in domains that require prior expertise such
as UI and mechanical design. This can be achieved by either lowering the skill barrier,
or by using tools like Scones and Swire to support parts of joint tasks. For example
in the case of Scones, we can lower the skill barrier by decomposing each scene into
object components allowing crowdworkers to only sketch a single object at a time in
context. Alternatively, future research could explore other means of data collection
beyond the crowd. Tools that are used for joint tasks provide natural incentives
for designers during realistic design use-cases, and can allow the live collection
of realistic design data from professional users. Researchers can also investigate
sourcing sketches from students in sketching courses at institutions that offer design
education, as these students would possess higher sketching expertise.
Another significant area of research is to create systems that tightly integrate with
design and artistic applications in realistic use-cases. While many current research
projects (including those described in this book chapter) demonstrate deep-neural-
networks’ capability and potential in supporting design applications, these appli-
cations are currently rough prototypes that are not yet suitable for everyday use
by designers and artists. Further developing these applications and exploring how
they integrate into design and artistic processes will reveal important usability issues
and inform future design and implementation choices of similar tools. For instance,
to successfully support UI design processes with Swire, we need to carefully con-
sider the visual representation of UI examples in the application and the underlying
datasets to be queried.
Moreover, some of the capabilities of these tools can be best demonstrated when
applied to professional domains. For instance, Scones could participate in the UI/UX
design process by iteratively suggesting multiple possible modifications of UI design
sketches according to design critique. To enable this interaction, we could consider
complete UI sketches as ‘scenes’ and UI components as ‘scene objects’. Scones could
be trained on this data along with text critiques of UI designs to iteratively generate
and modify UI mockups from text. To allow Scones to generate multiple design
candidates, we can modify the current architecture to model probabilistic outputs
for both sketch strokes (which is currently probabilistic) and scene object locations.
While datasets of UI layouts and components, such as those presented in Sect. 4,
suggest this as a near possibility, this approach may generalize to other domains as
well, such as industrial design. Nevertheless, this requires significant data support
by solving the issues mentioned in Sect. 7.1.
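To make this proposal concrete, a minimal Python sketch of how UI sketches and their components might be represented as 'scenes' and 'scene objects' paired with a text critique is shown below; all class and field names are hypothetical illustrations and are not part of Scones or its datasets.

from dataclasses import dataclass, field
from typing import List

@dataclass
class UIComponent:
    """A UI component treated as a 'scene object' (hypothetical schema)."""
    component_class: str          # e.g. 'button', 'text_field', 'image'
    x: float                      # normalized canvas position
    y: float
    width: float
    height: float
    strokes: List[List[tuple]] = field(default_factory=list)  # sketched outline, if any

@dataclass
class UISketchScene:
    """A complete UI sketch treated as a 'scene', paired with a text critique."""
    components: List[UIComponent]
    critique: str                 # e.g. 'move the login button below the form'

# A toy training pair: a Scones-like model would learn to map (scene, critique)
# to a modified scene.
example = UISketchScene(
    components=[UIComponent('button', 0.4, 0.2, 0.2, 0.08)],
    critique='move the button to the bottom of the screen',
)
print(len(example.components), example.critique)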
8 Conclusion
We hope the dataset and systems presented in this chapter will allow creators to focus
on creative and innovative tasks in creative processes. We also hope these projects
can provide entirely new means for creative expression and rapid ideation. We are
excited to continue designing for this future of design, art, and engineering.
References
1. von Ahn L, Dabbish L (2008) Designing games with a purpose. Commun ACM 51(8):58–67.
https://doi.org/10.1145/1378704.1378719
2. Aksan E, Deselaers T, Tagliasacchi A, Hilliges O (2020) CoSE: compositional stroke embed-
dings. Adv Neural Inf Process Syst 33
3. Bonnardel N (1999) Creativity in design activities: the role of analogies in a constrained
cognitive environment. In: Proceedings of the 3rd conference on creativity & cognition, C&C
’99. ACM, New York, NY, USA, pp 158–165. https://doi.org/10.1145/317561.317589
4. Buxton B (2007) Sketching user experiences: getting the design right and the right design.
Morgan Kaufmann Publishers Inc., San Francisco
5. Canny J (1986) A computational approach to edge detection. IEEE Trans Pattern Anal Mach
Intell 6:679–698
6. Deka B, Huang Z, Kumar R (2016) ERICA: interaction mining mobile apps. In: Proceedings of
the 29th annual symposium on user interface software and technology, UIST’16. ACM, New
York, NY, USA, pp 767–776. https://doi.org/10.1145/2984511.2984581
7. Deka B, Huang Z, Franzen C, Hibschman J, Afergan D, Li Y, Nichols J, Kumar R (2017)
Rico: a mobile app dataset for building data-driven design applications. In: Proceedings of the
30th annual ACM symposium on user interface software and technology, UIST’17. ACM, New
York, NY, USA, pp 845–854. https://doi.org/10.1145/3126594.3126651
8. Dow SP, Glassco A, Kass J, Schwarz M, Schwartz DL, Klemmer SR (2010) Parallel prototyping
leads to better design results, more divergence, and increased self-efficacy. ACM Trans Comput-
Hum Interact 17(4):18:1–18:24. https://doi.org/10.1145/1879831.1879836
9. Eitz M, Hays J, Alexa M (2012) How do humans sketch objects? ACM Trans Graph (Proc
SIGGRAPH) 31(4):44:1–44:10
10. Fernquist J, Grossman T, Fitzmaurice G (2011) Sketch-sketch revolution: an engaging tutorial
system for guided sketching and application learning. In: Proceedings of the 24th annual ACM
symposium on user interface software and technology, UIST’11. ACM, New York, NY, USA,
pp 373–382. https://doi.org/10.1145/2047196.2047245
11. Gao C, Liu Q, Xu Q, Wang L, Liu J, Zou C (2020) SketchyCOCO: image generation from
freehand scene sketches. In: Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition, pp 5174–5183
12. Gervais P, Deselaers T, Aksan E, Hilliges O (2020) The DIDI dataset: digital ink diagram data.
arXiv:2002.09303
13. Ha D, Eck D (2018) A neural representation of sketch drawings. In: 6th international conference
on learning representations, ICLR 2018, conference track proceedings, Vancouver, BC, Canada,
April 30–May 3, 2018. https://openreview.net/forum?id=Hy6GHpkCW
14. Hennessey JW, Liu H, Winnemöller H, Dontcheva M, Mitra NJ (2017) How2Sketch: generating
easy-to-follow tutorials for sketching 3D objects. In: Symposium on interactive 3D graphics
and games
15. Herring SR, Chang CC, Krantzler J, Bailey BP (2009) Getting inspired!: understanding how and
why examples are used in creative design practice. In: Proceedings of the SIGCHI conference
on human factors in computing systems, CHI’09. ACM, New York, NY, USA, pp 87–96.
https://doi.org/10.1145/1518701.1518717
16. Huang F, Canny JF (2019) Sketchforme: composing sketched scenes from text descriptions for
interactive applications. In: Proceedings of the 32nd annual ACM symposium on user interface
software and technology, UIST’19. Association for Computing Machinery, New York, NY,
USA, pp 209–220. https://doi.org/10.1145/3332165.3347878
17. Huang F, Canny JF, Nichols J (2019) Swire: sketch-based user interface retrieval. In: Proceed-
ings of the 2019 CHI conference on human factors in computing systems, CHI’19. Association
for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3290605.3300334
18. Jongejan J, Rowley H, Kawashima T, Kim J, Fox-Gieg N (2016) The quick, draw! - AI exper-
iment. https://quickdraw.withgoogle.com/
19. Kim JH, Kitaev N, Chen X, Rohrbach M, Zhang BT, Tian Y, Batra D, Parikh D (2019) CoDraw:
collaborative drawing as a testbed for grounded goal-driven communication. In: Proceedings
of the 57th annual meeting of the association for computational linguistics. Association for
Computational Linguistics, Florence, Italy, pp 6495–6513. https://doi.org/10.18653/v1/P19-
1651
20. Kumar R, Talton JO, Ahmad S, Klemmer SR (2011) Bricolage: example-based retargeting for
web design. In: Proceedings of the SIGCHI conference on human factors in computing sys-
tems, CHI’11. ACM, New York, NY, USA, pp 2197–2206. https://doi.org/10.1145/1978942.
1979262
21. Landay JA (1996) SILK: sketching interfaces like krazy. In: Conference companion on human
factors in computing systems, CHI’96. ACM, New York, NY, USA, pp 398–399. https://doi.
org/10.1145/257089.257396
22. Lasecki WS, Kim J, Rafter N, Sen O, Bigham JP, Bernstein MS (2015) Apparition: crowd-
sourced user interfaces that come to life as you sketch them. In: Proceedings of the 33rd annual
ACM conference on human factors in computing systems, CHI’15. ACM, New York, NY,
USA, pp 1925–1934. https://doi.org/10.1145/2702123.2702565
23. Lee YJ, Zitnick CL, Cohen MF (2011) ShadowDraw: real-time user guidance for freehand
drawing. ACM Trans Graph 30(4):27:1–27:10. https://doi.org/10.1145/2010324.1964922
24. Li M, Lin Z, Mech R, Yumer E, Ramanan D (2019) Photo-sketching: inferring contour drawings
from images. In: 2019 IEEE winter conference on applications of computer vision (WACV).
IEEE, pp 1403–1412
25. Limpaecher A, Feltman N, Treuille A, Cohen M (2013) Real-time drawing assistance
through crowdsourcing. ACM Trans Graph 32(4):54:1–54:8. https://doi.org/10.1145/2461912.
2462016
26. Lin J, Newman MW, Hong JI, Landay JA (2000) DENIM: finding a tighter fit between tools
and practice for web site design. In: Proceedings of the SIGCHI conference on human factors
in computing systems, CHI’00. ACM, New York, NY, USA, pp 510–517. https://doi.org/10.
1145/332040.332486
27. Munoz-Salinas R (2012) ArUco: a minimal library for augmented reality applications based
on OpenCV. Universidad de Córdoba
28. Newman MW, Landay JA (2000) Sitemaps, storyboards, and specifications: a sketch of web
site design practice. In: Proceedings of the 3rd conference on designing interactive systems:
processes, practices, methods, and techniques, DIS’00. ACM, New York, NY, USA, pp 263–
274. https://doi.org/10.1145/347642.347758
29. Nguyen TA, Csallner C (2015) Reverse engineering mobile application user interfaces with
REMAUI (T). In: 2015 30th IEEE/ACM international conference on automated software engi-
neering (ASE), pp 248–259. https://doi.org/10.1109/ASE.2015.32
30. Pennington J, Socher R, Manning CD (2014) GloVe: global vectors for word representation.
In: Empirical methods in natural language processing (EMNLP), pp 1532–1543. http://www.
aclweb.org/anthology/D14-1162
31. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A,
Bernstein MS, Berg AC, Li F (2014) ImageNet large scale visual recognition challenge. CoRR
abs/1409.0575. arXiv:1409.0575
32. Sain A, Bhunia AK, Yang Y, Xiang T, Song YZ (2020) Cross-modal hierarchical modelling for
fine-grained sketch based image retrieval. In: Proceedings of the 31st British machine vision
virtual conference (BMVC 2020). British Machine Vision Association, pp 1–14
33. Sangkloy P, Burnell N, Ham C, Hays J (2016) The sketchy database: learning to retrieve badly
drawn bunnies. ACM Trans Graph 35(4):119:1–119:12. https://doi.org/10.1145/2897824.
2925954
34. Schroff F, Kalenichenko D, Philbin J (2015) FaceNet: a unified embedding for face recogni-
tion and clustering. In: Proceedings of the IEEE conference on computer vision and pattern
recognition, pp 815–823
35. Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image
recognition. In: International conference on learning representations
36. Su Q, Li WHA, Wang J, Fu H (2014) EZ-sketching: three-level optimization for error-tolerant
image tracing. ACM Trans Graph 33(4):54:1–54:9. https://doi.org/10.1145/2601097.2601202
37. Swearngin A, Dontcheva M, Li W, Brandt J, Dixon M, Ko AJ (2018) Rewire: interface design
assistance from examples. In: Proceedings of the 2018 CHI conference on human factors in
computing systems, CHI’18. ACM, New York, NY, USA, pp 504:1–504:12. https://doi.org/
10.1145/3173574.3174078
38. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I
(2017) Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R,
Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30.
Curran Associates, Inc., pp 5998–6008. http://papers.nips.cc/paper/7181-attention-is-all-you-
need.pdf
39. Xie J, Hertzmann A, Li W, Winnemöller H (2014) PortraitSketch: face sketching assistance
for novices. In: Proceedings of the 27th annual ACM symposium on user interface software
and technology, UIST’14. ACM, New York, NY, USA, pp 407–417. https://doi.org/10.1145/
2642918.2647399
40. Yu Q, Liu F, Song YZ, Xiang T, Hospedales T, Loy CC (2016) Sketch me that shoe. In:
Computer vision and pattern recognition
41. Zhu JY, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-
consistent adversarial networks. In: 2017 IEEE international conference on computer vision
(ICCV)
42. Zou C, Yu Q, Du R, Mo H, Song YZ, Xiang T, Gao C, Chen B, Zhang H (2018)
SketchyScene: richly-annotated scene sketches. In: ECCV. Springer International Pub-
lishing, pp 438–454. https://doi.org/10.1007/978-3-030-01267-0_26, https://github.com/
SketchyScene/SketchyScene
Generative Ink: Data-Driven
Computational Models for Digital Ink
Abstract Digital ink promises to combine the flexibility of pen and paper interaction
and the versatility of digital devices. Computational models of digital ink often
focus on recognition of the content by following discriminative techniques such
as classification, albeit at the cost of ignoring or losing personalized style. In this
chapter, we propose augmenting the digital ink framework via generative modeling
to achieve a holistic understanding of the ink content. Our focus particularly lies in
developing novel generative models to gain fine-grained control by preserving user
style. To this end, we model the inking process and learn to create ink samples similar
to users. We first present how digital handwriting can be disentangled into style and
content to implement editable digital ink, enabling content synthesis and editing.
Second, we address a more complex setup of free-form sketching and propose a novel
approach for modeling stroke-based data efficiently. Generative ink promises novel
functionalities, leading to compelling applications to enhance the inking experience
for users in an interactive and collaborative manner.
1 Introduction
Writing and drawing have served for centuries as our primary means of communication
and a cornerstone of our education and culture, and are often considered a form of
art [77], being among the most expressive ways of reflecting personal style. They have
been shown to be beneficial in tasks such as note-taking [63], reading in conjunction
with writing [79] and may have a positive impact on short- and long-term memory
[7]. Handwriting and sketching have also been effective drafting tools to ideate and
design [28, 76].
As one of the most natural communication media, inking has found its place on digital
devices. The digital pen, i.e., the stylus, has become a versatile alternative to keyboard
and mouse in digital content creation. However, traditional note-taking with pen
and paper has yet to be fully replaced by its digital counterparts, despite the flexibility
and versatility provided by digital platforms [3]. The discrepancy between analog
and digital ink could be attributed to aesthetics or to mechanical limitations such as
stylus accuracy, latency, and texture [4]. Though more work is needed to meet or
exceed the pen-and-paper experience, digital ink offers unique affordances not possible
with analog inking, such as dynamic editing, annotations [76], and
sketch-based content retrieval [55, 97].
Digital ink is arguably the most convenient medium for interacting with reference material.
It allows us to take notes, highlight content, and organize information swiftly
and effectively. Consider Fig. 1 illustrating a user taking notes on an image, augment-
ing the content via sketches and organizing the semantic groups by relating them via
arrows. A holistic experience is not always limited to a single type of digital ink such
as handwriting or drawings only. Instead, a more natural experience often involves
various inking modalities and digital media. We envision a system that is capable of
controlling the ink content semantically (i.e., segmentation and editing of entities),
modeling the digital ink environment holistically (i.e., relating different contents)
and even parsing the reference material to understand the user’s mental state and
construct a context internally, allowing for collaboration between the model and the
user. This vision motivates us to explore novel computational modeling techniques
for digital ink. The current digital ink framework consists of a hardware layer involving
the sensors and a software layer providing basic functionality such as editing, coloring,
and beautification. Our goal is to extend this framework with a new layer offering
semantic understanding and fine-grained control of digital ink.
Fig. 1 Digital note taking involves handwritten text, sketching, and interaction with digital context
Our focus lies on generative modeling of ink that learns the data creation process
(i.e., how we actually write and draw), which we dub generative ink.
Deep generative models have been shown to be powerful for content creation and manipulation
in various domains such as speech synthesis [67], image editing [56, 68], and
music creation [44]. Graves [35] and later Ha and Eck [38] pioneered generative
modeling of ink data for handwriting and doodle drawings, respectively, and have
shown the potential it bears. Generative ink presents new opportunities for both the
developers and the users and has potential for compelling applications. Learning to
create the content enables editing of the available user data and synthesizing novel
content to support the users in a collaborative manner.
In this chapter, we present two techniques addressing the computational challenges
of digital ink. We show how to develop generative models tailored to the
underlying problems by leveraging domain priors. We particularly focus on
modeling the data in a time-series representation rather than as images, allowing
us to operate on the raw ink representation. This enables interfacing the models
directly with applications.
In the first part, we focus on editable digital ink. Our goal is to develop a system for
handwritten text that allows for editing the content while preserving the user’s hand-
writing style. To process digital handwriting, one typically has to resort to character
recognition techniques (e.g., [30]), thus invariably losing the personalized aspect of
written text. Because digital ink and character recognition systems [17, 58] are
decoupled and require careful integration, fully digital text remains easier to process,
search, and manipulate than handwritten text, which has led to the dominance of
typed text. To enhance the user experience with digital ink, we propose a gener-
ative model that is capable of maintaining the author’s original style, thus allowing
for a seamless transition between handwritten and digital text. This is a challenging
problem: while each user has a unique handwriting style [82, 103], the parameters
that determine style are not well defined. Moreover, handwriting style is not fixed but
changes temporally based on context, writing speed, and other factors [82]. Hence
it has so far been elusive to algorithmically recreate style faithfully while retaining
control over the content.
In the second part, we explore a novel approach in a more complex setup: free-form
sketching, including flowchart diagrams, objects, and handwriting. We ask how a
model could complete a drawing in a collaborative manner. The answer to this
question is highly context-sensitive and requires reasoning at both the local (i.e.,
stroke) and global (i.e., drawing) level. To this end, we focus on learning the
compositional structure of the strokes, modeling the local appearance of the strokes
and the global structure explicitly. We consider this a first step toward a holistic
approach.
Before diving into generative model examples, we present the related work cov-
ering the computational ink models, provide a formal definition of the digital ink
data, and introduce the datasets we use in our works.
2 Related Work
Our work touches on the research areas of human–computer interaction and machine
learning. In the following, we review previous work on computational models for
digital ink and relevant machine learning studies.
Research into the recognition of handwritten text has led to drastic accuracy improve-
ments [27, 71] and such technology can now be found in mainstream UIs (e.g., Win-
dows, Android, iOS). However, converting digital ink into ASCII characters removes
individual style. Understanding what exactly constitutes style has been the subject
of much research to inform font design [16, 26, 66] and the related understanding of
human reading has served as a source of inspiration for the modern parametric-font
systems [45, 50, 80]. Nonetheless, no equivalent parametric model of handwritten
style exists, and the analysis and description of style remain an inexact science [66, 73].
The variety of styles also poses a challenge for the handwriting recognition systems.
Bhunia et al. [9] present a few-shot learning approach to adapt handwritten text
recognition models to novel user styles at test time.
Given the naturalness of the medium [21, 79], pen-based interfaces have seen endur-
ing interest in both the graphics and HCI literature [84]. Ever since Ivan Sutherland’s
Sketchpad [85] researchers have explored sensing and input techniques for small
screens [47, 99], tablets [41, 70] and whiteboards [64, 69, 96] and have proposed
ways of integrating paper with digital media [12, 40]. Furthermore many domain
specific applications have been proposed. For instance, manipulation of hand-drawn
diagrams [6] and geometric shapes [5], note-taking (e.g., NiCEBook [12]), sharing
of notes (e.g., NotePals [25]), browsing and annotation of multimedia content [91],
including digital documents [100] using a stylus. Others have explored creation, man-
agement and annotation of handwritten notes on large screen displays [69, 96]. Typi-
cally such approaches do not convert ink into characters to preserve individual style.
Zitnick [104] proposes a method for beautification of digital ink by exploiting the
smoothing effect of geometric averaging of multiple instances of the same stroke.
While generating convincing results, this method requires several samples of the
same text for a single user. A supervised machine learning method to remove slope
and slant from handwritten text and to normalize its size [29] has been proposed.
Zanibbi et al. [102] introduce a tool to improve legibility of handwritten equations
by applying style-preserving morphs on user-drawn symbols. Lu et al. [60] propose
to learn style features from trained artists, and to subsequently transfer the strokes of a
different writer to the learnt style and therefore inherently remove the original style.
A large body of work is dedicated to the synthesis of handwritten text (for a compre-
hensive survey see [27]). Attempts have been made to formalize plausible biological
models of the processes underlying handwriting [42] or by learning sequences of
motion primitives [92]. Such approaches primarily validate bio-inspired models but
do not produce convincing sequences. In [8, 72] sigma-lognormal models are pro-
posed to synthesize handwriting samples by parameterizing rapid human movements
and hence reflecting writer’s fine motor control capability. The model can naturally
synthesize variances of a given sample, but it lacks control of the content.
Realistic handwritten characters such as Japanese Kanji or individual digits can
be synthesized from learned statistical models of stroke similarity [18] or control
point positions (requiring characters to be converted to splines) [89]. Follow-up
work has proposed methods that connect such synthetic characters [19, 90] using
a ligature model. Haines et al. [39] take character-segmented images of a single
author's writing and attempt to replicate the style via dynamic programming. These
approaches either ignore style entirely or learn to imitate a single reference style
from a large corpus of data. With the introduction of neural networks to the handwriting
synthesis task, models have achieved generalization across various user styles.
The works [10, 24] present approaches for offline handwritten text synthesis. The
proposed models generate images of handwritten characters and text by imitating
a given user style. In [31], the authors use handwriting synthesis to augment and
increase the amount of training data for handwritten text recognition systems.
Graves [35] proposes an autoregressive model with long short-term memory (LSTM)
recurrent neural networks to generate complex sequences with long-range structure
such as handwritten text. The work demonstrates synthesis of handwritten text
in specific styles; however, it lacks a notion of disentangling content from style. In
our work [1], we decouple the content from style and represent them via separate
random variables to achieve fine-grained control over the written text and the style.
In [51], our disentanglement assumption is further applied to style component itself
via decoupled style descriptors for character- and writer-level styles. This approach
considers writer-independent and writer-dependent character representations as well
as global writer styles, mitigating the loss of fine-grained style attributes in the syn-
thesized text.
Free-form sketching includes a diverse set of tasks from drawings of basic doodles
to more complex structures such as flowcharts. Previous works address recognition
and synthesis of free-form sketches.
Ha and Eck [38] and Ribeiro et al. [75] build LSTM/VAE-based and Transformer-based
models, respectively, to generate samples from the QuickDraw dataset [34].
These approaches model the entire drawing as a single sequence of points. The
different categories of drawings are modeled holistically without taking their internal
structure into account.
Costagliola et al. [23] present a parsing-based approach using a grammar of shapes
and symbols where shapes and symbols are independently recognized and the results
are combined using a non-deterministic grammar parser. Bresler et al. [13, 14] inves-
tigated flowchart and diagram recognition using a multi-stage approach including
multiple independent segmentation and recognition steps. The authors of [101] applied graph
attention networks to 1,300 diagrams from [13, 14, 23] for text/non-text classification
using a hand-engineered stroke feature vector. Yang et al. [98] use graph convolu-
tional networks for semantic segmentation at the stroke level on extensions of the
QuickDraw data [54, 95]. For an in-depth treatment of free-form sketch models,
we refer the reader to the recent survey by Xu et al. [97].
3 Background
In the following, we formally define the digital ink data as time-series representations,
present datasets we utilize in this chapter, and provide dataset statistics.
Fig. 3 Strokes are illustrated in different colors for various ink samples, namely flowchart (left),
cat and elephant drawings (right), and handwritten text (bottom). A stroke may correspond to either
semantically meaningful or arbitrary building blocks. A semantic entity (e.g., shapes or letters) may
consist of a single or multiple strokes
3.2 Datasets
In this section, we provide a summary of the digital ink datasets we use throughout
the chapter. Our models are trained on IAM-OnDB, Deepwriting, DiDi, and
QuickDraw datasets.
In our work presented in Sect. 5, we use the samples containing shapes only and ignore
the textual labels. We re-sample all data points using the available timestamps
so that the sampling frequency becomes 20. Samples with fewer than 4 strokes and
points are discarded during training.
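As a rough illustration of this preprocessing, the following Python sketch resamples a stroke's points using its timestamps and filters out small samples; the function names and the interpretation of the sampling frequency as 20 Hz are our own assumptions rather than the authors' exact pipeline.

import numpy as np

def resample_ink(points, timestamps, freq_hz=20.0):
    """Resample a stroke's (x, y) points to a fixed sampling frequency
    using the recorded timestamps (a rough sketch; a real pipeline may also
    handle pen-up events and per-stroke boundaries)."""
    t = np.asarray(timestamps, dtype=float)
    xy = np.asarray(points, dtype=float)
    t = t - t[0]                                   # start each stroke at time 0
    new_t = np.arange(0.0, t[-1], 1.0 / freq_hz)   # uniform time grid
    x = np.interp(new_t, t, xy[:, 0])
    y = np.interp(new_t, t, xy[:, 1])
    return np.stack([x, y], axis=1)

def keep_sample(strokes, min_strokes=4, min_points=4):
    """Filter heuristic mirroring the text: drop drawings that are too small."""
    n_points = sum(len(s) for s in strokes)
    return len(strokes) >= min_strokes and n_points >= min_points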
We explore novel ways to combine the benefits of digital ink with the versatility
and efficiency of typing, allowing for a seamless transition between handwritten and
digital text. Our focus lies on a generative handwriting model to achieve editable
digital ink representation via disentanglement of style and content.
We seek a model that is capable of capturing and reproducing the local variability of
handwriting and can mimic different user styles with high fidelity. Importantly, the
model is expected to provide full control over the content of the synthetic sequences,
Fig. 7 Editable digital ink enables applications synthesizing handwriting from typed text while
giving users control over the visual appearance (a), transferring style across handwriting samples
(b, solid line box synthesized sample, dotted line box reference style), and editing handwritten
samples at the word level (c)
enabling processing and editing of digital ink at the word level and unlocking compelling
applications such as beautification, synthesis, spell-checking, and correction (Fig. 7).
To make digital ink fully editable, one has to overcome a number of technical
problems such as character recognition and synthesis of realistic handwriting. None
is more important than the disentanglement of style and content. Each author has
a unique style of handwriting [82, 103], but at the same time, they also display a
lot of intra-variability, such as mixing connected and disconnected styles, variance
in usage of glyphs, character spacing and slanting (see Fig. 8). Hence, it is hard to
define or predict the appearance of a character, as its appearance is often strongly
influenced by its context. A comprehensive approach to handwriting synthesis must
be able to maintain global style while preserving local variability and context (e.g.,
many users mix cursive and disconnected styles dynamically).
Embracing this challenge we follow a data-driven approach capable of disentan-
gling handwritten text into their content and style components, necessary to enable
editing and synthesis of novel handwritten samples in a user-specified style. The
key idea underlying our approach is to treat style and content as two separate latent
random variables (Fig. 9a). While the content component is defined as the set of
alphanumeric characters and punctuation marks, the style term is an abstraction of
the factors defining appearance. It is learned by the model and projected into a
continuous-valued latent space. One can make use of the content and style variables to
edit either style, content or both, or one can generate entirely new samples (Fig. 9b).
In this section, we present a conditional variational recurrent neural network
architecture to disentangle the handwritten text into content and style. First, we
Fig. 8 Example of intra-author variation that leads to entanglement of style and content, making
conditional synthesis of realistic digital ink very challenging
Fig. 9 High-level representation of our approach. x, z and π are random variables corresponding to
handwritten text, style and content, respectively. a A given handwritten sample can be decomposed
into style and content components. b Similarly, a sample can be synthesized using style and content
components. c Our model learns to infer and use the latent variables by reconstructing handwriting
samples. g^{inp} and g^{out} are feed-forward networks projecting the input into an intermediate
representation and predicting outputs, respectively
We use the point-wise data representation u_t = (x_t, y_t, p_t) presented in Sect. 3.1 and
introduce labels for the character c_t, the end of a character e_t, and the beginning of a new word
w_t. The c_t takes one of the categorical labels determined by the alphabet. The e_t and
Fig. 10 Schematic overview of our handwriting model in the training (a) and sampling (b) phases,
operating at the point level. Subscripts denote time step t. Superscripts correspond to layer names
such as the input, latent, and output layers, or to the distributions of the random variables, such as
z_t^q ∼ q(z_t | u_t) and z_t^p ∼ p(z_t | u_t). (τ and h) An RNN cell and its output. (g) A multi-layer
feed-forward neural network. (Arrows) Information flow, color-coded with respect to source.
(Colored circles) Latent random variables; outgoing arrows represent a sample of the random variable.
(Green branch) Gaussian latent space capturing style-related information along with the latent RNN
cell τ_t^{latent} at individual time steps t. (Blue branch) Categorical and GMM random variables
capturing content information. (Small black nodes) Auxiliary nodes for concatenation of incoming nodes
w_t are binary and set to 1 only if the corresponding criterion is met. The beginning-of-word
label w_t corresponds to the first point of a new word.
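The following minimal Python snippet illustrates one possible encoding of this point-wise representation together with the auxiliary labels; the helper function and the toy alphabet are purely illustrative and are not the authors' code.

import numpy as np

# Each time step t holds the pen offset (x_t, y_t), the binary pen event p_t,
# plus the auxiliary labels: character id c_t, end-of-character e_t, and
# begin-of-word w_t (names mirror the text, implementation is our own).
ALPHABET = list("abcdefghijklmnopqrstuvwxyz ")

def encode_point(x, y, pen, char, end_of_char, begin_of_word):
    c = ALPHABET.index(char)          # categorical content label c_t
    return np.array([x, y, pen, c, end_of_char, begin_of_word], dtype=np.float32)

u_t = encode_point(0.12, -0.03, pen=0, char='h', end_of_char=0, begin_of_word=1)
print(u_t)  # [x_t, y_t, p_t, c_t, e_t, w_t]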
We propose a novel autoregressive neural network (NN) architecture that contains
continuous and categorical latent random variables. The continuous latent variable,
which captures the appearance properties, is modeled by an isotropic Normal
distribution (Fig. 10, green), whereas the content information is captured via a Gaussian
mixture model (GMM), where each character in the dataset is represented by an
isotropic Gaussian (Fig. 10, blue). We train the model by reconstructing a
given handwritten sample x (Fig. 9c). Handwriting is inherently a temporal domain.
4.2 Background
Multi-layer recurrent neural networks (RNNs) [35] and variational RNN (VRNN)
[22] are most related to our work. We briefly recap these and highlight differences.
In our notation, superscripts correspond to layer information such as input, latent, or
output, while the subscript t denotes the time step. Moreover, we drop the parametrization
for brevity; readers should assume that all probability distributions are modeled
using neural networks. An autoregressive RNN factorizes the probability of a point
sequence x = (u_1, ..., u_T) as
p(x) = \prod_{t=1}^{T} p(u_{t+1} \mid u_t), \qquad
p(u_{t+1} \mid u_t) = g^{out}(h_t), \qquad
h_t = \tau(u_t, h_{t-1}), \qquad (1)

\mathcal{L}_{rnn}(x) = \log p(x) = \sum_{t=1}^{T} \log p(u_{t+1} \mid u_t) \qquad (2)
Multi-layered LSTMs with a GMM output distribution have been used for hand-
writing modeling [35]. While capable of conditional synthesis, they cannot disentangle
style from content due to the lack of latent random variables.
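For illustration, a minimal PyTorch sketch of the autoregressive factorization in Eqs. (1)-(2) is given below; the module and its simple Gaussian output head are our own simplifications (the models discussed in this chapter use richer mixture outputs), not the original implementation.

import torch
import torch.nn as nn

class AutoregressiveInkRNN(nn.Module):
    """Minimal sketch of Eqs. (1)-(2): h_t = tau(u_t, h_{t-1}) and
    p(u_{t+1} | u_t) = g_out(h_t), with a diagonal Gaussian output."""
    def __init__(self, point_dim=3, hidden=256):
        super().__init__()
        self.tau = nn.LSTM(point_dim, hidden, batch_first=True)   # transition function
        self.g_out = nn.Linear(hidden, 2 * point_dim)             # mean and log-variance

    def log_likelihood(self, u):                  # u: (B, T, point_dim)
        h, _ = self.tau(u[:, :-1])                # h_t for t = 1 .. T-1
        mean, log_var = self.g_out(h).chunk(2, dim=-1)
        target = u[:, 1:]                         # next points u_{t+1}
        # Gaussian log p(u_{t+1} | u_t), up to an additive constant,
        # summed over time and dimensions (Eq. 2).
        ll = -0.5 * (((target - mean) ** 2) / log_var.exp() + log_var).sum(dim=(1, 2))
        return ll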
The model parameters are optimized by jointly maximizing the variational lower
bound:
\log p(x) \ge \sum_{t=1}^{T} \Big( \mathbb{E}_{q(z_t \mid u_t)}\big[\log p(u_t \mid z_t)\big] - KL\big(q(z_t \mid u_t) \,\|\, p(z_t)\big) \Big), \qquad (5)
While multi-layer RNNs and VRNNs have appealing properties, neither is directly
capable of full conditional handwriting synthesis. For example, one can synthesize a
given text in a given style by using RNNs, but the samples will lack natural variability;
or one can generate high-quality novel samples with VRNNs, which, however, lack
control over what is written. Neither model has inference networks to decouple
style and content, which lies at the core of our work.
We overcome this issue by introducing a new set of latent random variables, z
and π, capturing the style and content of handwriting samples. More precisely, our new
model describes the data as being generated by two latent variables z and π (Fig. 9)
such that
p(u_t \mid z_t, \pi_t) = g^{out}(z_t, \pi_t),
p(z_t) = g^{p,z}(h^{latent}_{t-1}),
p(\pi_t) = g^{p,\pi}(h^{latent}_{t-1}), \qquad (7)
h^{latent}_t = \tau^{latent}(u_t, z_t, \pi_t, h^{latent}_{t-1}),

q(z_t \mid u_t) = g^{q,z}(u_t, h^{latent}_{t-1}),
q(\pi_t \mid u_t) = g^{q,\pi}(u_t, h^{latent}_{t-1}), \qquad (8)
Since we aim to decouple style and content in handwriting, we assume that the
approximate distribution has the factorized form q(z_t, \pi_t \mid u_t) = q(z_t \mid u_t)\, q(\pi_t \mid u_t). Both
q(\pi \mid x) and q(z \mid x) are used to infer the content and style components of a given sample
x, as described earlier.
We optimize the following variational lower bound:
\log p(x) \ge \mathcal{L}_{lb}(\cdot) = \sum_{t=1}^{T} \Big( \mathbb{E}_{q(z_t, \pi_t \mid u_t)}\big[\log p(u_t \mid z_t, \pi_t)\big] - KL\big(q(z_t \mid u_t) \,\|\, p(z_t)\big) - KL\big(q(\pi_t \mid u_t) \,\|\, p(\pi_t)\big) \Big), \qquad (9)
where the first term ensures that the input point is reconstructed by using its latent
samples. We model the output by using bivariate Gaussian and Bernoulli distributions
for 2D-pixel coordinates and binary pen-up events, respectively.
Note that our output function g^{out} does not employ the internal cell state h. By
using only the latent variables z and π for synthesis, we force the model to capture all
relevant patterns in these latent variables.
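A hedged PyTorch sketch of the per-time-step latent structure behind Eqs. (7)-(9) is shown below; the layer choices, dimensionalities, and KL computations are illustrative stand-ins for g^{p,z}, g^{q,z}, g^{p,π}, and g^{q,π}, not the original implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CVRNNStep(nn.Module):
    """Sketch of the disentangled latents: a Gaussian style variable z_t and a
    categorical content variable pi_t, each with a prior conditioned on the
    latent RNN state and a posterior conditioned on the input point."""
    def __init__(self, point_dim=3, hidden=256, z_dim=32, n_chars=70):
        super().__init__()
        self.prior_z = nn.Linear(hidden, 2 * z_dim)              # g^{p,z}
        self.post_z = nn.Linear(point_dim + hidden, 2 * z_dim)   # g^{q,z}
        self.prior_pi = nn.Linear(hidden, n_chars)               # g^{p,pi}
        self.post_pi = nn.Linear(point_dim + hidden, n_chars)    # g^{q,pi}

    def kl_terms(self, u_t, h_prev):
        inp = torch.cat([u_t, h_prev], dim=-1)
        # KL( q(z_t | u_t) || p(z_t) ) between two diagonal Gaussians
        q_mu, q_logvar = self.post_z(inp).chunk(2, dim=-1)
        p_mu, p_logvar = self.prior_z(h_prev).chunk(2, dim=-1)
        kl_z = 0.5 * (p_logvar - q_logvar
                      + (q_logvar.exp() + (q_mu - p_mu) ** 2) / p_logvar.exp()
                      - 1.0).sum(-1)
        # KL( q(pi_t | u_t) || p(pi_t) ) between two categorical distributions
        q_log = F.log_softmax(self.post_pi(inp), dim=-1)
        p_log = F.log_softmax(self.prior_pi(h_prev), dim=-1)
        kl_pi = (q_log.exp() * (q_log - p_log)).sum(-1)
        return kl_z, kl_pi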
The C-VRNN architecture as discussed so far enables the crucial separation of continuous
components from categorical aspects (i.e., characters), which would potentially be
sufficient to conditionally synthesize individual characters. However, to fully address the
entire handwriting task, several extensions are necessary to control important aspects
such as word spacing and to improve the quality of the predictions.
Although we assume that the latent random variables z and π capture style and
content information, respectively, and make a conditional independence assumption,
in practice full disentanglement is an ambiguous task. Since we essentially ask the
model to learn by itself what style and content are, we found further guidance at
training time to be necessary.
To prevent divergence during training, we make use of character labels and add an
additional cross-entropy classification loss \mathcal{L}_{classification} on the content component
q(\pi_t \mid u_t).
Fig. 11 (top, green) Input samples used to infer style. (middle, red) Synthetic samples of a model
with π only. They are generated using one-hot-encoded character labels, causing problems with
pen-up events and with character placement. (bottom, blue) Synthetic samples of our model with
GMM latent space
Conditioning generative models is typically done via one-hot encoded labels. While
we could directly use samples from q(\pi_t \mid u_t), we prefer using a continuous representation.
We hypothesize and experimentally validate (see Fig. 11) that the synthesis
model can shape the latent space with respect to the loss caused by the content aspect.
For this purpose, we use a Gaussian mixture model where each of the K characters in the
alphabet is represented by an isotropic Gaussian:

p(\varphi_t) = \sum_{k=1}^{K} \pi_{t,k} \, \mathcal{N}(\varphi_t \mid \mu_k, \sigma_k), \qquad (10)
where \mathcal{N}(\varphi_t \mid \mu_k, \sigma_k) is the probability of sampling from the corresponding mixture
component k, and π corresponds to the content variable in Eq. (8), here interpreted
as the weights of the mixture components. This means that we use q(\pi_t \mid u_t) to
select a particular Gaussian component for a given point sample u_t. We then sample
\varphi_t from the k-th Gaussian component and apply the “re-parametrization trick” [37,
48] so that gradients can flow through the random variables, enabling the learning
of the GMM parameters via standard backpropagation.
\varphi_t = \mu_k + \sigma_k \epsilon, \qquad (11)

where \epsilon \sim \mathcal{N}(0, 1). Our continuous content representation results in similar letters
being located closer together in the latent space, while dissimilar letters or infrequent symbols
are pushed away. This effect is visualized in Fig. 12.
Importantly, the GMM parameters are sampled from a time-invariant distribution.
That is, they remain the same for all data samples and across the time steps of a given
input x, whereas z_t is dynamic and employs new parameters per time step. For
each Gaussian component in \varphi, we initialize \mu_k, 1 \le k \le K, randomly using a
uniform distribution \mathcal{U}(-1, 1) and \sigma_k with a fixed value of 1. The GMM components
are trained alongside the other network parameters.
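The following short PyTorch sketch illustrates the component selection and re-parametrization of Eqs. (10)-(11), including the uniform initialization of the means described above; the shapes and the argmax-based selection are illustrative simplifications, not the authors' code.

import torch

def sample_content_embedding(pi_logits, mu, sigma):
    """Sketch of Eqs. (10)-(11): pick a GMM component per time step from the
    content weights pi and draw phi_t = mu_k + sigma_k * eps with the
    re-parametrization trick. Shapes: pi_logits (B, K), mu/sigma (K, D)."""
    weights = torch.softmax(pi_logits, dim=-1)   # q(pi_t | u_t)
    k = torch.argmax(weights, dim=-1)            # selected component (the text uses
                                                 # ground-truth labels c_t at training time)
    eps = torch.randn(pi_logits.shape[0], mu.shape[-1], device=mu.device)
    phi = mu[k] + sigma[k] * eps                 # Eq. (11)
    return phi

K, D, B = 70, 32, 4
mu = torch.empty(K, D).uniform_(-1.0, 1.0)       # U(-1, 1) initialization from the text
sigma = torch.ones(K, D)                         # sigma_k fixed to 1 at initialization
phi_t = sample_content_embedding(torch.randn(B, K), mu, sigma)
print(phi_t.shape)  # torch.Size([4, 32])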
In order to increase model convergence speed and to improve results, we use
ground-truth character labels during training. More precisely, the GMM components
Fig. 12 Illustration of our GMM latent space \varphi^{gmm}. We select a small subset of our alphabet
and draw 500 samples from the corresponding GMM components. We use the t-SNE algorithm [61]
to visualize the 32-dimensional samples in 2D space. Note that the t-SNE algorithm finds an arbitrary
placement and hence the positioning does not reflect the true latent space. Nevertheless, letters form
separate clusters
are selected by using the ground-truth character labels c instead of the predictions of the
inference network q(\pi_t \mid u_t). Hence, q(\pi_t \mid u_t) is trained only with the classification
loss \mathcal{L}_{classification} and is not affected by the gradients of the GMM with respect to \pi_t.
At sampling time the model needs to automatically infer word spacing and which
character to synthesize (these are a priori unknown). In order to control when to
leave a space between words or when to start synthesizing the next character, we
introduce two additional signals during training, namely e and w signaling the end
of a character and the beginning of a word, respectively. These labels are obtained from
ground-truth character-level segmentation.
The w signal is fed as an input to the output function g^{out}, conditioning the output
distribution of our handwriting synthesis on it and forcing the model to learn when to
leave empty space at training and sampling time.
The e signal, on the other hand, is provided to the model at training time so that it can
learn to predict when to stop synthesizing a given character. It is included in the loss function
in the form of a Bernoulli log-likelihood \mathcal{L}_e. Along with the reconstruction of the input
point u_t, the e_t label is predicted.
Our model consists of two LSTM cells, one at the latent layer and one at the input layer. Note that
the latent cell already contributes via the transition function \tau^{latent} in Eq. (6).
Using an additional cell at the input layer increases model capacity (similar to multi-layered
RNNs) and adds a new transition function \tau^{inp}. Thus, the synthesis model
can capture and modulate temporal patterns at the input level. Intuitively, this is
motivated by the strong temporal consistency in handwriting, where the previous
letter influences the appearance of the current one (cf. Fig. 8).
We now use a temporal representation h_t^{inp} of the input points u_t. With these
cumulative modifications, we arrive at our full C-VRNN architecture.
In our style transfer applications, we first pass a reference sample to the model and
obtain the internal state of the latent LSTM cell h^{latent}, which carries the style information. We
then initialize the sampling phase (see Fig. 10) by calculating the style and content
via the corresponding prior distributions as in Eqs. 14 and 15.
By disentangling content from style, our approach makes digital ink truly editable.
This allows the generation of novel writing in user-defined styles and, similarly to
typed text, of seamless editing of handwritten text. Further, it enables a wide range
of exciting application scenarios, of which we discuss proof-of-concept implemen-
tations.
Fig. 13 Handwritten text synthesized from the paper abstract. Each sentence is “written” in the
style of a different author. For full abstract, see Appendix
Our model can furthermore transfer existing handwritten samples to novel styles,
thus preserving their content while changing their appearance. We implemented an
interactive tablet application that allows users to recast their own handwriting into a
selected style (see Fig. 14 for results). After scribbling on the canvas and selecting
an author’s handwriting sample, users see their strokes morphed to that style in
Fig. 14 Style transfer. The input sequence (top) is transferred to a selected reference style (black ink,
dotted outlines). The results (blue ink, solid outline) preserve the input content, and its appearance
matches the reference style
real time. Such a solution could be beneficial in a variety of domains. For example,
artists and comic authors could include specific handwritten lettering in their work,
or preserve an author's style when localizing content to a foreign language.
Beautification
When using the user's own input style as the target style, our model re-generates smoother
versions of the original strokes while maintaining natural variability and diversity,
thus obtaining an averaging effect that suppresses local noise and preserves global
style features. The resulting strokes are beautified (see Fig. 17), in line with
previous work that relied solely on token averaging for beautification (e.g., [104]) or
denoising (e.g., [15]).
At the core of our technique lies the ability to edit digital ink at the same level of
fidelity as typed text, allowing users to change, delete, or replace individual words.
Figure 15 illustrates a simple prototype allowing users to edit handwritten content,
while preserving the original style when re-synthesizing it. Our model recognizes
individual words and characters and renders them as (editable) overlays. The user may
select individual words, change the content, and regenerate the digital ink reflecting
the edits while maintaining a coherent visual appearance. We see many applications,
for example, in note-taking apps, which require frequent edits but currently do not allow
for this without losing the visual appearance.
Handwriting Spell-checking and Correction
A further application of the ability to edit digital ink at the word level is the possibility
to spell-check and correct handwritten text. As a proof of concept, we implemented
a functional handwriting spell-checker that can analyze digital ink, detect spelling
mistakes, and correct the written samples by synthesizing the corrected sentence in
the original style (see Fig. 16). For the implementation we rely on existing spell-
checking APIs, feeding recognized characters into it and re-rendering the retrieved
corrections.
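A minimal Python sketch of such a pipeline is given below; the recognizer, spell-checker, and synthesizer callables are placeholders standing in for the character recognition, external spell-checking API, and style-preserving synthesis components, not a real API.

def correct_handwriting(ink_words, recognizer, spellchecker, synthesizer):
    """Proof-of-concept pipeline mirroring the description above. All callables
    are illustrative placeholders: `recognizer` maps an ink word to text and a
    style code, `spellchecker` returns a corrected string, and `synthesizer`
    renders text in a given style."""
    corrected_ink = []
    for ink in ink_words:
        text, style = recognizer(ink)
        fixed = spellchecker(text)
        # Re-synthesize only if the spell-checker changed the word, so
        # untouched words keep their original strokes.
        corrected_ink.append(synthesizer(fixed, style) if fixed != text else ink)
    return corrected_ink

# Toy usage with stub components:
demo = correct_handwriting(
    ink_words=['<ink: helo>', '<ink: world>'],
    recognizer=lambda ink: (ink[6:-1], 'style-A'),
    spellchecker=lambda w: {'helo': 'hello'}.get(w, w),
    synthesizer=lambda text, style: f'<ink: {text} ({style})>',
)
print(demo)  # ['<ink: hello (style-A)>', '<ink: world>']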
So far we have introduced our neural network architecture and have evaluated its
capability to synthesize digital ink. We now shift our focus to an initial evaluation of
users' perception and the usability of our method. To this end, we conducted a
preliminary user study gathering quantitative and qualitative data on two separate
tasks. Throughout the experiment, 10 subjects (M = 27.9; SD = 3.34; 3 female)
from our institution evaluated our model using an iPad Pro and Apple Pencil.
Handwriting Beautification
Fig. 17 Task 1. Top: Experiment Interface. Participants input on the left; beautified version on the
right. Bottom: Confidence interval plot on a 5-point Likert scale
The first part of our experiment evaluates text beautification. Users were asked to
compare their original handwriting with its beautified counterpart. Specifically, we
asked our subjects to repeatedly write extracts from the LOB corpus [46], for a
total of 12 trials each. In each trial, the participant copied down the sample, and we
beautified the strokes with the results being shown side-by-side (see Fig. 17, top).
Users were then asked to rate the aesthetics of their own script (Q: I find my own
handwriting aesthetically pleasing) and the beautified version (Q: I find the beautified
handwriting aesthetically pleasing), using a 5-point Likert scale. Importantly, these
were treated as independent questions (i.e., users were allowed to like both).
Handwriting Spell-Checking
In the second task, we evaluate the spell-checking utility (see Fig. 16). We randomly
sampled from the LOB corpus and perturbed individual words such that they con-
tained spelling mistakes. Participants then used our tool to correct the written text
(while maintaining its style), and subsequently were asked to fill in a standard system
usability scale (SUS) questionnaire and take part in an exit interview.
Results
Our results, summarized in Fig. 17 (bottom), indicate that users’ reception of our
technique is overall positive. The beautified strokes were on average rated higher than
participants' own handwriting (M = 3.65, 95% CI [3.33–3.97]), with non-overlapping
confidence intervals. The SUS results further support this trend, with our system scoring
positively (SUS = 85). Following the analysis technique suggested in [53], our system can be classified
as Rank A, indicating that users are likely to recommend it to others.
The above results are also echoed by participants’ comments during the exit
interviews (e.g., I have never seen anything like this, and Finally others can read my
notes.). Furthermore, some suggested additional applications that would naturally
fit our model capabilities (e.g., This would be very useful to correct bad or illegible
handwriting, I can see this used a lot in education, especially when teaching how
Fig. 18 Potential starting positions are illustrated as heatmaps for the next stroke for handwritten
text (left), animal drawings (middle), and a flowchart sample (right)
to write to kids and This would be perfect for note taking, as one could go back in
their notes and remove mistakes, abbreviations and so on). Interestingly, the ability
to preserve style while editing content was mentioned frequently as the most valued
feature of our approach (e.g., Having a spell-checker for my own handwriting feels
like writing personalized text messages!).
We have shown that generative modeling of handwritten text promises to combine
the flexibility and aesthetics of handwriting with the ability to process, search, and edit
digital text, offering an improved user experience. Handwriting is the basis of note
taking, yet note taking is usually much more involved, as we often rely on sketching as well
(Fig. 1). Handwriting is a relatively structured task, as we follow an order while
writing. In a free-form sketch, strokes are composed into more complex structures
following underlying semantics. The order of strokes in a drawing can be arbitrary, and
yet very similar sketches can be achieved; the sequences x representing the drawing
will nevertheless differ due to the ordering of the strokes. This observation
highlights the compositional nature of drawings, introducing a new challenge for the
data representations and models we have been using for the more structured
handwriting data.
The existing work, including our handwriting model, considers the entire drawing
as a single sequence of points [1, 17, 38, 75]. Instead, in our paper [2], we explore
a novel compositional generative model, called CoSE, for complex stroke-based
data such as drawings, diagrams, and sketches. To this end, we treat a drawing
sample x as an unordered collection of strokes x = \{s_k\}_{k=1}^{K}. Our key insight is to factor the
local appearance of a stroke from the global structure of the drawing. Since the
stroke ordering does not impact the semantic meaning of the diagram, this modeling
decision has profound implications. In our approach, the model does not need to
understand the difference between the (K −1)! potential orderings of the previous
strokes to predict the k-th stroke, leading to a much more efficient utilization of
modeling capacity.
Fig. 19 Architecture overview—(left) the input drawing as a collection of strokes {sk }; (middle)
our embedding architecture, consisting of a shared encoder Eθ , a shared decoder Dθ , and a relational
model Rθ ; (right) the input drawing with the next stroke s4 and its starting position s̄4 predicted by
Rθ and decoded by Dθ . Note that the relational model Rθ is permutation-invariant
In our generative modeling task, we follow a predictive setup where the model is
expected to complete a given sample. We consider a collaborative scenario between
the model and the user, such that the model follows the user's drawing, understands the
scene, and provides strokes when asked. Such a task requires knowing where and
what to draw next, which depends heavily on the context. In handwriting, localization
of the next stroke is rather easy and often determined by the previous stroke, while
in sketches the next stroke depends on the semantic category and the order of the
strokes is determined by the user. For a diagram sample, on the other hand, the next
stroke is not tied to global semantics, and the start position is an important degree of
freedom (Fig. 18).
We demonstrate the predictive capabilities via a proof-of-concept interactive
demo2 in which the model suggests diagram completions based on initial user input.
We also show that our model outperforms existing models quantitatively and qualita-
tively and we analyze the learned latent space to provide insights into how predictions
are formed.
2 https://eth-ait.github.io/cose.
Fig. 20 Stroke embedding—The input stroke s is passed to the encoder, which produces a latent
code λ. The decoder parameterizes a Gaussian mixture model for arbitrary positions t ∈ [0, 1] from
which we sample points on the stroke. We only visualize the mixture model associated with t = 0.66
(non-grayed-out arrow)
p(x; \theta) = \prod_{k=1}^{K} p(s_k, \bar{s}_k \mid s_{<k}, \bar{s}_{<k}; \theta), \qquad (21)
with \bar{s}_k referring to the starting position of the k-th stroke, and <k denoting \{1, \ldots, k-1\}.
Note that we assume a fixed, but not necessarily chronological, ordering of the K strokes. An encoder E_\theta first
encodes each stroke s to its corresponding latent code λ. A decoder Dθ reconstructs
the corresponding s, given a code λ and the starting position s̄. A transformer-based
relational model Rθ processes the latent codes {λ<k } and their corresponding starting
positions {s̄<k } to generate the next stroke starting position s̄k and embedding λk ,
from which Dθ reconstructs the output stroke sk . Overall, our architecture factors
into a stroke embedding model (Eθ and Dθ ) and a relational model (Rθ ).
Stroke Embedding
We force the embedding model to capture local information such as the shape, size, or
curvature by preventing it from accessing any global information such as the canvas
position or existence of other strokes and their inter-dependencies. The autoencoder
generates an abstraction of the variable-length strokes s by encoding them into fixed-
length embeddings (λ, s̄)=Eθ (s) and decoding them into strokes s=Dθ (λ, s̄).
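To make this interface concrete, the following is a minimal sketch (PyTorch) of a stroke-embedding autoencoder under the constraints described above: the encoder sees only the translated points of a single stroke, and the decoder maps a latent code and a curve parameter t back to 2D points. The class name, layer sizes, and the mean-pooled Transformer encoding are illustrative assumptions rather than the exact architecture, which additionally relies on positional encodings and a mixture-density decoder as described below.

```python
# A minimal sketch of the stroke-embedding interface, assuming a stroke is a
# float tensor of shape (num_points, 2). Layer choices are illustrative.
import torch
import torch.nn as nn

class StrokeAutoencoder(nn.Module):
    def __init__(self, d_model=64, latent_dim=8):
        super().__init__()
        self.point_proj = nn.Linear(2, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.to_latent = nn.Linear(d_model, latent_dim)
        # Decoder: maps (latent code, curve parameter t) to a 2D point.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + 1, d_model), nn.ReLU(), nn.Linear(d_model, 2))

    def encode(self, stroke):                       # stroke: (T, 2)
        start = stroke[0]                           # starting position s_bar
        h = self.point_proj((stroke - start).unsqueeze(0))
        lam = self.to_latent(self.encoder(h).mean(dim=1)).squeeze(0)
        return lam, start                           # fixed-length (lambda, s_bar)

    def decode(self, lam, start, num_points=32):
        t = torch.linspace(0.0, 1.0, num_points).unsqueeze(1)   # curve parameter
        inp = torch.cat([lam.expand(num_points, -1), t], dim=-1)
        return self.decoder(inp) + start            # points on the stroke
```

Because the decoder is queried with a continuous curve parameter, the same embedding can be rendered at arbitrary resolutions by changing num_points, which is the behavior illustrated in Fig. 21.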
Relational Model
Our relational model learns how to compose individual strokes to create a sketch by
considering the relationship between latent codes. Given an input drawing encoded
as x={(λ<k , s̄<k )}, we predict: i) a starting position for the next stroke s̄k , and ii)
its corresponding embedding λk . Introducing the embeddings into Eq. 21, we obtain
our compositional stroke embedding model that decouples local drawing information
from global semantics:
$$p(x; \theta) = \prod_{k=1}^{K} p(\lambda_k, \bar{s}_k \mid \lambda_{<k}, \bar{s}_{<k}; \theta) \tag{22}$$
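The relational model that realizes this factorization can be sketched as a Transformer encoder applied to the set of (λ, s̄) pairs without positional encoding, so that the given strokes are treated as an unordered collection; it first predicts the next start position and then the next embedding conditioned on it, mirroring the decomposition formalized in Eq. 24 below. The sketch outputs point estimates for brevity, whereas the model described here predicts distribution (GMM) parameters; names and layer sizes are illustrative.

```python
# A minimal sketch of a permutation-invariant relational model (PyTorch).
import torch
import torch.nn as nn

class RelationalModel(nn.Module):
    def __init__(self, latent_dim=8, d_model=64):
        super().__init__()
        self.embed = nn.Linear(latent_dim + 2, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        # No positional encoding is added, so the given strokes form a set.
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.pos_head = nn.Linear(d_model, 2)                 # next start position
        self.code_head = nn.Linear(d_model + 2, latent_dim)   # next latent code

    def forward(self, lams, starts):                # (K, latent_dim), (K, 2)
        tokens = self.embed(torch.cat([lams, starts], dim=-1)).unsqueeze(0)
        ctx = self.encoder(tokens).mean(dim=1)      # set summary, order-invariant
        next_start = self.pos_head(ctx)
        # The next embedding is predicted conditioned on the predicted position.
        next_code = self.code_head(torch.cat([ctx, next_start], dim=-1))
        return next_start.squeeze(0), next_code.squeeze(0)
```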
Fig. 21 Sampling frequency in decoding. For arrow (top) and circle (bottom) shapes, we decode the corresponding stroke embedding λ at different resolutions by controlling the number of output points
$$\arg\max_{\theta} \; \mathbb{E}_{t \sim [0,1]} \sum_{m=1}^{M} \pi_{t,m}\, \mathcal{N}\big(s_k(t) \mid \mu_{t,m}, \sigma_{t,m}\big), \qquad \{\mu_{t,m}, \sigma_{t,m}, \pi_{t,m}\} = D_\theta(t \mid E_\theta(s)) \tag{23}$$
where we use mixture densities [11, 35] with M Gaussians with mixture coefficients
π , mean μ and variance σ ; t ∈ [0, 1] is the curve parameter. Note that we use log-
likelihood rather than the Chamfer Distance as in [36]. While we do interpret strokes as
2D curves, we observe that modeling of prediction uncertainty is commonly done
in the ink modeling literature [1, 35, 38] and has been shown to result in better
regression performance compared to minimizing an L2 metric [52].
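As a concrete illustration of the mixture-density objective in Eq. 23, the sketch below evaluates the negative log-likelihood of a ground-truth point s_k(t) under decoder outputs {π, μ, σ} at a single curve parameter t. Assuming isotropic 2D Gaussians and the helper name are simplifications made here for brevity.

```python
# A minimal sketch (NumPy) of the per-point mixture-density NLL.
import numpy as np

def gmm_nll(point, pi, mu, sigma):
    """point: (2,); pi: (M,) mixture weights; mu: (M, 2); sigma: (M,) std devs."""
    sq_dist = np.sum((point - mu) ** 2, axis=-1)                  # (M,)
    log_norm = -np.log(2.0 * np.pi * sigma ** 2)                  # isotropic 2D normalizer
    log_comp = np.log(pi) + log_norm - 0.5 * sq_dist / sigma ** 2
    m = log_comp.max()                                            # log-sum-exp for stability
    return -(m + np.log(np.exp(log_comp - m).sum()))

# The training loss for a stroke averages this NLL over sampled t in [0, 1],
# with {pi, mu, sigma} produced by the decoder D_theta(t | E_theta(s)).
pi = np.array([0.5, 0.5]); mu = np.zeros((2, 2)); sigma = np.ones(2)
print(gmm_nll(np.array([0.1, -0.2]), pi, mu, sigma))
```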
CoSE Encoder—Eθ (s)
We encode a stroke by viewing it as a sequence of 2D points, and generate the
corresponding latent code with a transformer Tθt , where the superscript t denotes
use of positional encoding in the temporal dimension [88]. The encoder outputs
(s̄, λ)=Eθ (s), where s̄ is the starting position of a stroke and λ=Tθt (s − s̄). The
use of positional encoding induces a point ordering and emphasizes the geometry,
whereas most sequence models focus strongly on capturing the drawing dynamics.
Furthermore, avoiding the modeling of explicit temporal dependencies between time
steps allows for inherent parallelism and is hence computationally advantageous over
RNNs.
$$p(\lambda_k, \bar{s}_k \mid \lambda_{<k}, \bar{s}_{<k}; \theta) = \underbrace{p(\bar{s}_k \mid \lambda_{<k}, \bar{s}_{<k}; \theta)}_{\text{starting position prediction}} \; \underbrace{p(\lambda_k \mid \bar{s}_k, \lambda_{<k}, \bar{s}_{<k}; \theta)}_{\text{latent code prediction}} \tag{24}$$
Fig. 22 Relational model—A few snapshots from our live demo. (Left) Given a drawing, our
model proposes several starting positions for auto-completion (we draw the most likely strokes
associated with the two most likely starting positions (red, gray)). (Right) Given a starting position,
our model can predict several stroke alternatives; here we show the top 3 most likely predictions
(orange, light blue, dark blue)
5.4 Training
Given a random target pair (λk , s̄k ) and a subset of the remaining inputs {(λ≠k , s̄≠k )}, we make
a prediction for the position and the embedding of the target stroke. This subset is
obtained by selecting H ∈[1, K ] strokes from the drawing. We either pick the H strokes
(i) in order or (ii) at random. This allows the model to be good at completing existing
partial drawings while also being robust to arbitrary subsets of strokes. During training, the
model has access to the ground-truth positions s̄ (like teacher forcing [94]). Note that
while we train all three sub-modules (encoder, relational model, decoder) in parallel,
we found that the performance is slightly better if gradients from the relational model
(Eq. 22) are not backpropagated through the stroke embedding model. We apply
augmentations in the form of random rotation and re-scaling of the entire drawing
(see supplementary for details).
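A minimal sketch of one training step is given below, assuming modules with the interfaces of the hypothetical sketches above. It illustrates the three details that matter here: picking the H given strokes either in drawing order or at random, teacher forcing with ground-truth start positions, and detaching embeddings so that relational-model gradients do not flow back into the embedding model. The L2 losses stand in for the actual log-likelihood objectives.

```python
# A minimal sketch of a CoSE-style training step (PyTorch-style).
import random
import torch

def training_step(autoencoder, relational, strokes):
    """strokes: list of (T_i, 2) tensors forming one drawing."""
    encoded = [autoencoder.encode(s) for s in strokes]     # [(lam, s_bar), ...]
    K = len(strokes)
    k = random.randrange(K)                                # target stroke index
    pool = [i for i in range(K) if i != k]
    H = random.randint(1, len(pool)) if pool else 0
    # Pick H given strokes either in order or at random.
    given = pool[:H] if random.random() < 0.5 else random.sample(pool, H)

    # Reconstruction loss for the embedding model (L2 here for brevity).
    lam_k, start_k = encoded[k]
    recon = autoencoder.decode(lam_k, start_k, num_points=strokes[k].shape[0])
    recon_loss = ((recon - strokes[k]) ** 2).mean()

    # Prediction loss for the relational model, on detached embeddings so its
    # gradients are not backpropagated through the stroke embedding model.
    if given:
        lams = torch.stack([encoded[i][0] for i in given]).detach()
        starts = torch.stack([encoded[i][1] for i in given]).detach()
        pred_start, pred_code = relational(lams, starts)
        pred_loss = ((pred_start - start_k.detach()) ** 2).mean() + \
                    ((pred_code - lam_k.detach()) ** 2).mean()
    else:
        pred_loss = torch.zeros(())

    return recon_loss + pred_loss
```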
5.5 Experiments
We evaluate our model on the recently released DiDi dataset [33]. In contrast to
existing drawing [34] or handwriting datasets [62], this task requires learning the
compositional structure of flowchart diagrams, which consist of several shapes. In this
paper, we focus on the predictive setting in which an existing (partial) drawing is
extended by adding more shapes or by connecting already drawn ones. State-of-the-
art techniques in ink modeling treat the entire drawing as a single sequence. Our
experiments demonstrate that this approach does not scale to complex structures
such as flowchart diagrams (cf. Fig. 25). We compare our method to the state-of-
the-art [38] via the Chamfer Distance [74] between the ground-truth strokes and the
model outputs (i.e. reconstructed or predicted strokes).
The task is inherently stochastic as the next stroke highly depends on where it is
drawn. To account for the high variability in the predictions across different genera-
tive models, the ground-truth starting positions are passed to the models in our quantitative
analysis (note that the qualitative results rely only on the predicted starting positions).
Moreover, similar to most predictive tasks, there is no single, correct prediction in
the stroke prediction task (see Fig. 22). To account for this multi-modality of fully
generative models, we employ a stochastic variant of the Chamfer distance (CD):
$$\min_{\hat{\lambda}_k \sim p(\lambda_k \mid \bar{s}_k, \lambda_{<k}, \bar{s}_{<k}; \theta)} \mathrm{CD}\big(D_\theta(t \mid \hat{\lambda}_k),\, s_k\big). \tag{25}$$
We evaluate our models by sampling one λk from each mixture component of the
relational model’s prediction, which are then decoded into 10 candidate strokes (see Fig. 27). This
results in a broader exploration of the predicted strokes than a strict Gaussian mix-
ture sampling. Note that while our training objective is NLL (as is common in ink
modeling), the Chamfer Distance allows for a fairer comparison since it enables
comparing models trained on differently processed data (i.e., positions vs. offsets).
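The evaluation metric of Eq. 25 can be sketched as a symmetric Chamfer distance between point sets, minimized over a set of candidate strokes decoded from sampled embeddings; how the candidates are generated (one per mixture component of the relational model) is assumed from the description above.

```python
# A minimal sketch (NumPy) of the stochastic Chamfer distance used for evaluation.
import numpy as np

def chamfer_distance(a, b):
    """a: (N, 2) and b: (M, 2) point sets."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)   # pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def stochastic_chamfer(candidate_strokes, target_stroke):
    """Minimum CD over decoded candidates, as in Eq. 25."""
    return min(chamfer_distance(c, target_stroke) for c in candidate_strokes)
```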
We first evaluate the performance in the stroke prediction setting. Given a set of
strokes and a target position, the task is to predict the next stroke. For each drawing,
we start with a single stroke and incrementally add more strokes from the original
drawing (in the order they were drawn) to the set of given strokes and predict the
subsequent one. In this setting, we evaluate our method in an ablation study, where
we replace components of our model with standard RNN-based models: a sequence-
to-sequence (seq2seq) architecture [86] for stroke embeddings, and an autoregressive
RNN for the relational model. Furthermore, following the setting in [38], we compare
to the decoder-only setup from Sketch-RNN (itself conceptually similar to Graves et
al. [35]). For the seq2seq-based embedding model we use bi-directional LSTMs
[43] as the encoder, and a uni-directional LSTM as decoder. Informally, we deter-
mined that a deterministic encoder with a non-autoregressive decoder outperformed
other seq2seq architectures; see Sect. 5.5.2. The RNN-based relational model is an
autoregressive sequence model [35].
Analysis
The results are summarized in Table 2. While the stroke-wise reconstruction per-
formance across all models differs only marginally, the predictive performance of
our proposed model is substantially better. This indicates that a standard seq2seq
model is able to learn an embedding space that is suitable for accurate reconstruc-
tion; this embedding space, however, does not lend itself to predictive modeling.
Table 2 Stroke prediction—We evaluate reconstruction (i.e., CD(Dθ (Eθ (s)), s)) and prediction (i.e., CD(Rθ (λ<k ), sk )) for a number of different models. Note that performing well on reconstruction does not necessarily correlate with good prediction performance

Eθ /Dθ                     Rθ                 Recon. CD↓   Pred. CD↓
seq2seq                    RNN                0.0144       0.0794
seq2seq                    CoSE-Rθ            0.0138       0.0540
CoSE-Eθ /Dθ                RNN                0.0139       0.0713
CoSE-Eθ /Dθ                CoSE-Rθ (Ord.)     0.0143       0.0696
CoSE-Eθ /Dθ                CoSE-Rθ            0.0136       0.0442
Sketch-RNN Decoder [38]                       N/A          0.0679
Fig. 24 tSNE Embedding—Visualization of the latent spaces for different models (for quantitative
analysis see Table 3). We employ k-means in latent space (k = 10) and color by cluster ID. While a
VAE-regularized objective leads to an overall compact latent space, its clusters are not well separated;
ours produces the most compact clusters (from left to right), which we show to be correlated with
prediction quality
The combination of our embedding model (CoSE-Eθ /Dθ ) with our relational model
(CoSE-Rθ ) outperforms all other models in terms of predicting consecutive strokes,
giving an indication that the learned embedding space is better suited for the pre-
dictive downstream tasks. The results also indicate that the contributions of both
are necessary to attain the best performance. This can be seen by the increase in
prediction performance of the seq2seq when augmented with our relational model
(CoSE-Rθ ). However, a significant gap remains to the full model (cf. rows 2 and 5).
We also evaluate our full model with additional positional encoding in the relational
model, CoSE-Rθ (Ord.). The results support our hypothesis that an order-invariant
model is beneficial for the task of modeling compositional structures. A similar effect
is observed when the stroke embeddings are modeled sequentially with an RNN (row 3).
Similarly, our model outperforms Sketch-RNN, which treats drawings as a single sequence.
We show a comparison of flowchart completions by Sketch-RNN and CoSE in
Fig. 25. Our model is more robust when making longer predictions.
Fig. 25 Comparison with Sketch-RNN—For each pair of samples the first two strokes (denoted
by 1 and 2 in blue color) are given as context, the remaining strokes (in color) are model outputs,
numbers indicate prediction step. While Sketch-RNN produces meaningful completions for the first
few predictions, its performance quickly decreases with increasing complexity. In contrast, CoSE
is capable of predicting plausible continuations even over long prediction horizons
Our analysis in Sect. 5.5.1 revealed that good reconstruction accuracy is not nec-
essarily indicative of an embedding space that is useful for fully autoregressive
predictions. We now investigate the structure of our embedding space with qualitative
and quantitative measures by analyzing the performance of clustering algorithms
on the embedded data. Since there is only a limited number of shapes that occur
in diagrams, the expectation is that a well-shaped latent space should form clusters
consisting of similar shapes, while maintaining sufficient variation.
Silhouette Coefficient (SC)
This coefficient is a quantitative measure that assesses the quality of a clustering by
jointly measuring tightness of exemplars within clusters versus separation between
clusters [78]. It does not require ground-truth cluster labels (e.g. whether a stroke
is a box, arrow, arrow tip), and takes values in [−1, 1], where a higher value
indicates tighter and better-separated clusters. The exact number of clusters
Table 3 Embedding space analysis—(Top) Variants of our model with different embedding
dimensionalities and a variant of our model with VAE. (Bottom) Results for a sequence-to-sequence
stroke autoencoder (seq2seq) and its variational (VAE) and/or autoregressive (AR) variants. All
stroke embedding models use our Transformer relational model Rθ . D indicates the dimension-
ality of the embedding space. CD and SC denote Chamfer Distance and Silhouette Coefficient,
respectively
Eθ /Dθ                     Rθ    D    Recon. CD↓   Pred. CD↓   SC↑
CoSE-Eθ /Dθ (Table 2)      TR    8    0.0136       0.0442      0.361
CoSE-Eθ /Dθ                TR    16   0.0091       0.0481      0.335
CoSE-Eθ /Dθ                TR    32   0.0081       0.0511      0.314
CoSE-Eθ /Dθ -VAE           TR    8    0.0198       0.0953      0.197
seq2seq (Table 2)          TR    8    0.0138       0.0540      0.276
seq2seq                    TR    16   0.0076       0.0783      0.253
seq2seq                    TR    32   0.0047       0.0848      0.261
seq2seq-VAE                TR    8    0.0161       0.0817      0.180
seq2seq-AR                 TR    8    0.0432       0.0855      0.249
seq2seq-AR-VAE             TR    8    0.2763       0.1259      0.151
is not known and we therefore compute the SC for the clustering result of k-means
and spectral clustering [81] with varying numbers of clusters ({5, 10, 15, 20, 25})
with both Euclidean and cosine distance on the embeddings of all strokes in the test
data. This leads to a total of 20 different clustering results. In Table 3, we report the
average SC across these 20 clustering experiments for a number of different model
configurations along with the Chamfer distance (CD) for stroke reconstruction and
prediction. Note that the Pearson correlation between the SC and the prediction accuracy
is 0.92, indicating a strong correlation between the two.
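A simplified sketch of this protocol using scikit-learn is shown below: the test-set stroke embeddings are clustered with k-means and spectral clustering for each cluster count, and the Silhouette Coefficient is averaged under Euclidean and cosine metrics. Unlike the exact protocol, the distance here is varied only in the silhouette computation rather than in the clustering itself, so it should be read as an approximation.

```python
# A minimal sketch of the embedding-space evaluation via the Silhouette Coefficient.
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import silhouette_score

def average_silhouette(embeddings, cluster_counts=(5, 10, 15, 20, 25)):
    scores = []
    for k in cluster_counts:
        for algo in (KMeans(n_clusters=k, n_init=10),
                     SpectralClustering(n_clusters=k)):
            labels = algo.fit_predict(embeddings)
            for metric in ("euclidean", "cosine"):
                scores.append(silhouette_score(embeddings, labels, metric=metric))
    return float(np.mean(scores))      # average over 20 clustering configurations

# Example with random 8-D vectors as a stand-in for stroke embeddings.
print(average_silhouette(np.random.randn(500, 8)))
```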
Influence of the Embedding Dimensionality (D)
We performed experiments with different values of D—the dimensionality of the
latent codes. Table 3 shows that this parameter directly affects all components of the
task: while a high-dimensional embedding space improves reconstruction accuracy,
it is harder to predict valid embeddings in such a high-dimensional space and in
consequence both the prediction performance and SC deteriorate. We observe a
similar pattern with sequence-to-sequence architectures which benefit most from the
increased embedding capacity by achieving the lowest reconstruction error (Recon.
CD for seq2seq, D = 32). However, it also leads to a significantly higher prediction
error. Higher-dimensional embeddings result in a less compact representation space,
making the prediction task more challenging.
Architectural Variants
In order to obtain a smoother latent space, we also introduce a KL-divergence regularizer [49] and follow the same annealing strategy as Ha et al. [38]. It is perhaps surprising that a VAE regularizer (row CoSE-Eθ /Dθ -VAE) hurts both reconstruction accuracy and the interpretability of the embedding space. Note that the prediction task does not require interpolation or latent-space walks, since latent codes represent entire strokes that can be combined in a discrete fashion.
Fig. 26 Attention visualization over time—(top) with and (bottom) without conditioning on the start position to make a prediction for the next stroke’s (in red) embedding. Attention weights correspond to the average of all attention layers across the network
The results further indicate that
our architecture yields a better-behaved embedding space while retaining good
reconstruction accuracy. This is indicated by (i) reconstruction quality increasing with
larger D while prediction accuracy and SC deteriorate; (ii) CoSE obtaining much
better prediction accuracy and SC at similar reconstruction accuracy; (iii) smoothing
the embedding space with a VAE regularizer hurting reconstruction accuracy,
prediction accuracy, and SC; and (iv) the autoregressive approach hurting reconstruction and
prediction accuracy as well as SC, because autoregressive models tend to overfit
to the ground-truth data (i.e., teacher forcing) and fail when forced to complete
drawings based on their own predictions.
Visualizations
To further analyse the latent space properties, we provide a tSNE visualization [61]
of the embedding space with color coding for cluster IDs as determined by k-means
with k = 10 in Fig. 24. The plots indicate that the VAE objective encourages a latent
space with overlapping clusters, whereas for CoSE, the clusters are better separated
and more compact. An interesting observation is that the smooth and regularized VAE
latent space does not translate into improved performance on either reconstruction or
inference, which is in line with prior findings on the connection between latent space behav-
ior and downstream behavior [59]. Clearly, the embedding spaces learned using a
CoSE model have different properties and are more suitable for predictive tasks that
are conducted in the embedding space. This qualitative finding is in line with the quan-
titative results of the SC and the correlated performance in the stroke prediction task.
Fig. 27 Pred. CD performance of our model Rθ when using different numbers of components in the GMM for embedding predictions
5.5.3 Ablations
The quantitative results from Table 2 indicate that our model performs better in
the predictive modeling of complex diagrams compared to the baselines. Figure 25
provides further indication that this is indeed the case. We show predictions of
Sketch-RNN [38], which performs well on structures with very few strokes but strug-
gles to predict more complex drawings. In contrast, ours continues to produce mean-
ingful predictions even for complex diagrams. This is further illustrated in Fig. 28,
which shows a number of qualitative results from a model trained on the DiDi (left),
IAM-OnDB (center), and QuickDraw (right) datasets, respectively. Note that
all predictions are in the autoregressive setting, where only the first stroke (in light
blue) is given as input. All other strokes are model predictions (numbers indicate
steps).
Fig. 28 Qualitative examples from CoSE—Drawings were sampled from the model given the
first stroke. Numbers denote the drawing order of the strokes
Advances in deep generative modeling techniques have given rise to a new class
of interactive and collaborative applications in various domains by enabling manip-
ulation of user data or content synthesis for users. Inspired by these novel
interaction techniques and motivated by the increasing utility of digital inking, we
address the problem of designing generative models for digital ink data. To this end,
we introduce the generative ink layer for the digital ink framework, which aims to
augment inking platforms with more fine-grained control. We present mod-
els that are able to generate realistic ink data, allowing us to create applications in
interactive and collaborative scenarios.
In our handwriting work, our focus lies on the personalization of the digital ink
data while preserving the versatility and efficiency of digital text. We have built a
variety of proof-of-concept applications, including conditional synthesis and editing
of digital ink at the word level.3 Initial user feedback, while preliminary, indicates
that users are largely positive about the capability to edit digital ink—in one’s own
handwriting or in the style of another author. The key idea underlying our approach
is to treat style and content as two separate latent random variables with distributions
learned during training.
A major challenge in learning disentangled representations and producing realis-
tic samples of digital handwriting is that the model needs to perform auxiliary tasks,
such as controlling the spacing in between words, character segmentation and recog-
nition. It implies that such a fine-grained control requires a powerful architecture
and extensive data labeling effort as we performed in our Deepwriting dataset.
Considering the high amount of training data required by the deep neural networks
and the cost of labeling operation, scaling of such fine-grained labeling to applica-
tion domains beyond handwriting is tedious. QuickDraw transforms this expensive
labeling process into a game for the users, harvesting over 50 million sketching sam-
ples, albeit to more coarse-grained and noisy labels. More work is needed to develop
efficient data collection strategies as well as data-efficient models in parallel.
A potential approach for improving the data efficiency is incorporating domain
priors into model design. In our second work, CoSE, we introduce compositionality
as an inductive bias and show that we mitigate the complexity induced by the com-
positional nature of the strokes, particularly in free-form drawings. This is achieved
by treating the digital ink data as a collection of strokes rather than a sequence of
points as in the previous works. We follow a hierarchical design to model the local
stroke appearance and global drawing semantics. We demonstrate experimentally
that our model outperforms baselines and previous approaches on complex drawings.
3 https://www.youtube.com/watch?v=NVF-1csvVvc.
4 https://www.youtube.com/watch?v=GENck9zmpMY.
References
1. Aksan E, Pece F, Hilliges O (2018) DeepWriting: making digital Ink editable via deep gener-
ative modeling, association for computing machinery, New York, NY, USA, pp 1–14. https://
doi.org/10.1145/3173574.3173779
2. Aksan E, Deselaers T, Tagliasacchi A, Hilliges O (2020) CoSE: compositional stroke embed-
dings. arXiv:200609930
3. Annett M (2017) (digitally) inking in the 21st century. IEEE Comput Graph Appl 37(1):92–99.
https://doi.org/10.1109/MCG.2017.1
4. Annett M, Anderson F, Bischof WF, Gupta A (2014) The pen is mightier: Understanding
stylus behaviour while inking on tablets. In: Proceedings of graphics interface 2014, Canadian
information processing society, CAN, GI ’14, pp 193–200
5. Arvo J, Novins K (2000) Fluid sketches: continuous recognition and morphing of simple
hand-drawn shapes. In: Proceedings of the 13th annual ACM symposium on User interface
software and technology. ACM, pp 73–80
6. Arvo J, Novins K (2005) Appearance-preserving manipulation of hand-drawn graphs. In: Pro-
ceedings of the 3rd international conference on Computer graphics and interactive techniques
in Australasia and South East Asia. ACM, pp 61–68
7. Berninger VW (2012) Strengthening the mind’s eye: the case for continued handwriting
instruction in the 21st century. Principal 91:28–31
8. Bhattacharya U, Plamondon R, Chowdhury SD, Goyal P, Parui SK (2017) A sigma-lognormal
model-based approach to generating large synthetic online handwriting sample databases. Int
J Doc Anal Recogn (IJDAR) 1–17
9. Bhunia AK, Ghose S, Kumar A, Chowdhury PN, Sain A, Song YZ (2021a) Metahtr: towards
writer-adaptive handwritten text recognition. arXiv:210401876
10. Bhunia AK, Khan S, Cholakkal H, Anwer RM, Khan FS, Shah M (2021b) Handwriting
transformers. arXiv:210403964
11. Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press Inc,
USA
12. Brandl P, Richter C, Haller M (2010) Nicebook: supporting natural note taking. In: Proceed-
ings of the SIGCHI conference on human factors in computing systems. ACM, New York,
NY, USA, CHI ’10, pp 599–608. https://doi.org/10.1145/1753326.1753417
13. Bresler M, Phan TV, Průša D, Nakagawa M, Hlaváč V (2014) Recognition system for on-line
sketched diagrams. In: ICFHR
14. Bresler M, Průša D, Hlaváč V (2016) Online recognition of sketched arrow-connected dia-
grams. IJDAR
15. Buades A, Coll B, Morel JM (2005) A non-local algorithm for image denoising. In: 2005
IEEE computer society conference on computer vision and pattern recognition, CVPR’05,
vol 2, pp 60–65. https://doi.org/10.1109/CVPR.2005.38
16. Burgert HJ (2002) The calligraphic line: thoughts on the art of writing. H-J Burgert, translated
by Brody Neuenschwander
17. Carbune V, Gonnet P, Deselaers T, Rowley HA, Daryin A, Calvo M, Wang LL, Keysers D,
Feuz S, Gervais P (2020) Fast multi-language LSTM-based online handwriting recognition.
IJDAR
18. Chang WD, Shin J (2012) A statistical handwriting model for style-preserving and variable
character synthesis. Int J Doc Anal Recogn 15(1):1–19. https://doi.org/10.1007/s10032-011-
0147-7
19. Chen HI, Lin TJ, Jian XF, Shen IC, Chen BY (2015) Data-driven handwriting synthesis in a
conjoined manner. Comput Graph Forum 34(7):235–244. https://doi.org/10.1111/cgf.12762
20. Cheng Y, Wang D, Zhou P, Zhang T (2017) A survey of model compression and acceleration
for deep neural networks. arXiv:171009282
21. Cherubini M, Venolia G, DeLine R, Ko AJ (2007) Let’s go to the whiteboard: how and
why software developers use drawings. In: Proceedings of the SIGCHI conference on human
factors in computing systems. ACM, New York, NY, USA, CHI ’07, pp 557–566. https://doi.
org/10.1145/1240624.1240714
22. Chung J, Kastner K, Dinh L, Goel K, Courville AC, Bengio Y (2015) A recurrent latent
variable model for sequential data. arXiv:1506.02216
23. Costagliola G, Deufemia V, Risi M (2006) A multi-layer parsing strategy for on-line recog-
nition of hand-drawn diagrams. In: Visual languages and human-centric computing
24. Davis B, Tensmeyer C, Price B, Wigington C, Morse B, Jain R (2020) Text and style condi-
tioned gan for generation of offline handwriting lines. arXiv:200900678
25. Davis RC, Landay JA, Chen V, Huang J, Lee RB, Li FC, Lin J, Morrey CB III, Schleimer B,
Price MN, Schilit BN (1999) Notepals: Lightweight note sharing by the group, for the group.
In: Proceedings of the SIGCHI conference on human factors in computing systems. ACM,
New York, NY, USA, CHI ’99, pp 338–345. https://doi.org/10.1145/302979.303107
26. Drucker J (1995) The alphabetic labyrinth: the letters in history and imagination. Thames and
Hudson
27. Elarian Y, Abdel-Aal R, Ahmad I, Parvez MT, Zidouri A (2014) Handwriting synthesis:
classifications and techniques. Int J Doc Anal Recogn 17(4):455–469. https://doi.org/10.
1007/s10032-014-0231-x
28. Elsen C, Häggman A, Honda T, Yang MC (2012) Representation in early stage design: an
analysis of the influence of sketching and prototyping in design projects. Int Des Eng Tech
Conf Comput Inf Eng Conf Am Soc Mech Eng 45066:737–747
29. Espana-Boquera S, Castro-Bleda MJ, Gorbe-Moya J, Zamora-Martinez F (2011) Improving
offline handwritten text recognition with hybrid hmm/ann models. Trans Pattern Recogn Mach
Intell 33(4):767–779
30. Evernote Corporation (2017) How evernotes image recognition works. http://blog.evernote.
com/tech/2013/07/18/how-evernotes-image-recognition-works/. Accessed 10 Aug 2017
31. Fogel S, Averbuch-Elor H, Cohen S, Mazor S, Litman R (2020) Scrabblegan: semi-supervised
varying length handwritten text generation. In: Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pp 4324–4333
32. Gadelha M, Wang R, Maji S (2020) Deep manifold prior
33. Gervais P, Deselaers T, Aksan E, Hilliges O (2020) The DIDI dataset: digital ink diagram data
34. Google Creative Lab (2017) Quick, draw! The data. https://quickdraw.withgoogle.com/data.
Accessed 01 May 2020
35. Graves A (2013) Generating sequences with recurrent neural networks. arXiv:1308.0850
36. Groueix T, Fisher M, Kim V, Russell B, Aubry M (2018) Atlasnet: a papier-mâché approach
to learning 3D surface generation. In: CVPR
37. Gurumurthy S, Sarvadevabhatla RK, Radhakrishnan VB (2017) Deligan: generative adver-
sarial networks for diverse and limited data. arXiv:170602071
38. Ha D, Eck D (2017) A neural representation of sketch drawings
39. Haines TS, Mac Aodha O, Brostow GJ (2016) My text in your handwriting. In: Transactions
on graphics
40. Haller M, Leitner J, Seifried T, Wallace JR, Scott SD, Richter C, Brandl P, Gokcezade A,
Hunter S (2010) The nice discussion room: Integrating paper and digital media to support
co-located group meetings. In: Proceedings of the SIGCHI conference on human factors in
computing systems. ACM, New York, NY, USA, CHI ’10, pp 609–618. https://doi.org/10.
1145/1753326.1753418
41. Hinckley K, Pahud M, Benko H, Irani P, Guimbretière F, Gavriliu M, Chen XA, Matulic F,
Buxton W, Wilson A (2014) Sensing techniques for tablet+stylus interaction. In: Proceedings
of the 27th annual ACM symposium on user interface software and technology. ACM, New
York, NY, USA, UIST ’14, pp 605–614. https://doi.org/10.1145/2642918.2647379
42. Hinton G, Nair V (2005) Inferring motor programs from images of handwritten digits. In:
Proceedings of the 18th international conference on neural information processing systems.
MIT Press, Cambridge, MA, USA, NIPS’05, pp 515–522. http://dl.acm.org/citation.cfm?
id=2976248.2976313
43. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–
1780
44. Huang CZA, Vaswani A, Uszkoreit J, Shazeer N, Simon I, Hawthorne C, Dai AM, Hoffman
MD, Dinculescu M, Eck D (2018) Music transformer. arXiv:180904281
45. Hussain F, Zalik B (1999) Towards a feature-based interactive system for intelligent font
design. In: Proceedings of the 1999 IEEE international conference on information visualiza-
tion, pp 378–383. https://doi.org/10.1109/IV.1999.781585
46. Johansson S, Eric A, Roger G, Geoffrey L (1986) The tagged LOB corpus: user’s manual.
Norwegian computing centre for the humanities, Bergen, Norway
47. Kienzle W, Hinckley K (2013) Writing handwritten messages on a small touchscreen. In:
Proceedings of the 15th international conference on human-computer interaction with mobile
devices and services. ACM, New York, NY, USA, MobileHCI ’13, pp 179–182. https://doi.
org/10.1145/2493190.2493200
48. Kingma DP, Welling M (2013a) Auto-encoding variational bayes. In: Proceedings of the 2nd
international conference on learning representations (ICLR), 2014
49. Kingma DP, Welling M (2013b) Auto-encoding variational bayes
50. Knuth DE (1986) The metafont book. Addison-Wesley Longman Publishing Co Inc, Boston,
MA, USA
51. Kotani A, Tellex S, Tompkin J (2020) Generating handwriting via decoupled style descriptors.
In: European conference on computer vision. Springer, pp 764–780
52. Kumar A, Marks TK, Mou W, Feng C, Liu X (2019) UGLLI face alignment: estimating
uncertainty with gaussian log-likelihood loss. In: ICCV workshops, pp 0–0
53. Lewis JR, Sauro J (2009) The factor structure of the system usability scale. In: Kurosu M
(ed) Proceedings of the human centered design: first international conference, HCD 2009.
Springer, Berlin, Heidelberg, pp 94–103. https://doi.org/10.1007/978-3-642-02806-9_12
54. Li K, Pang K, Song YZ, Xiang T, Hospedales T, Zhang H (2019) Toward deep universal
sketch perceptual grouper. Trans image processing
55. Li Y, Li W (2018) A survey of sketch-based image retrieval. Mach Vis Appl 29(7):1083–1100
56. Liu G, Reda FA, Shih KJ, Wang TC, Tao A, Catanzaro B (2018) Image inpainting for irregular
holes using partial convolutions. In: Proceedings of the European conference on computer
vision (ECCV), pp 85–100
57. Liwicki M, Bunke H (2005) IAM-OnDB: an on-line English sentence database acquired from
handwritten text on a whiteboard. In: In Proceedings of the 8th international conference on
document analysis and recognition, pp 956–961
58. Liwicki M, Graves A, Bunke H, Schmidhuber J (2007) A novel approach to on-line handwrit-
ing recognition based on bidirectional long short-term memory networks. In: Proceedings of
the 9th international conference on document analysis and recognition, ICDAR 2007
59. Locatello F, Bauer S, Lucic M, Rätsch G, Gelly S, Schölkopf B, Bachem O (2018) Challenging
common assumptions in the unsupervised learning of disentangled representations
60. Lu J, Yu F, Finkelstein A, DiVerdi S (2012) Helpinghand: example-based stroke stylization.
ACM Trans Graph 31(4):46:1–46:10. https://doi.org/10.1145/2185520.2185542
61. Maaten LVD, Hinton G (2008) Visualizing data using t-sne. JMLR 9(Nov):2579–2605
62. Marti UV, Bunke H (2002) The IAM-database: an English sentence database for offline
handwriting recognition. IJDAR 5(1):39–46
63. Mueller PA, Oppenheimer DM (2014) The pen is mightier than the keyboard: Advantages of
longhand over laptop note taking. Psychol Sci. https://doi.org/10.1177/0956797614524581,
http://pss.sagepub.com/content/early/2014/04/22/0956797614524581.abstract
64. Mynatt ED, Igarashi T, Edwards WK, LaMarca A (1999) Flatland: new dimensions in office
whiteboards. In: Proceedings of the SIGCHI conference on human factors in computing
systems. ACM, New York, NY, USA, CHI ’99, pp 346–353. https://doi.org/10.1145/302979.
303108
65. MyScript (2016) MyScript: the power of handwriting. http://myscript.com/. Accessed 04 Oct
2016
66. Noordzij G (2005) The stroke: theory of writing. Hyphen, translated from the Dutch, London
90. Wang J, Wu C, Xu YQ, Shum HY (2005) Combining shape and physical models for
online cursive handwriting synthesis. Int J Doc Anal Recogn (IJDAR) 7(4):219–227. https://
doi.org/10.1007/s10032-004-0131-6
91. Weibel N, Fouse A, Emmenegger C, Friedman W, Hutchins E, Hollan J (2012) Digital pen
and paper practices in observational research. In: Proceedings of the SIGCHI conference on
human factors in computing systems. ACM, New York, NY, USA, CHI ’12, pp 1331–1340.
https://doi.org/10.1145/2207676.2208590
92. Williams BH, Toussaint M, Storkey AJ (2007) Modelling motion primitives and their timing
in biologically executed movements. In: Proceedings of the 20th international conference on
neural information processing systems, Curran associates Inc, USA, NIPS’07, pp 1609–1616.
http://dl.acm.org/citation.cfm?id=2981562.2981764
93. Williams F, Trager M, Panozzo D, Silva C, Zorin D, Bruna J (2019) Gradient dynamics of
shallow univariate relu networks. In: NeurIPS, pp 8376–8385
94. Williams RJ, Zipser D (1989) A learning algorithm for continually running fully recurrent
neural networks. Neural Comput
95. Wu X, Qi Y, Liu J, Yang J (2018) SketchSegNet: a RNN model for labeling sketch strokes.
In: MLSP
96. Xia H, Hinckley K, Pahud M, Tu X, Buxton B (2017) Writlarge: Ink unleashed by unified
scope, action, and zoom. In: Proceedings of the 2017 CHI conference on human factors in
computing systems. ACM, New York, NY, USA, CHI ’17, pp 3227–3240. https://doi.org/10.
1145/3025453.3025664
97. Xu P, Hospedales TM, Yin Q, Song YZ, Xiang T, Wang L (2020) Deep learning for free-hand
sketch: a survey.
98. Yang L, Zhuang J, Fu H, Zhou K, Zheng Y (2020) SketchGCN: semantic sketch segmentation
with graph convolutional networks
99. Yoon D, Chen N, Guimbretière F (2013) Texttearing: opening white space for digital ink
annotation. In: Proceedings of the 26th annual ACM symposium on user interface software
and technology. ACM, New York, NY, USA, UIST ’13, pp 107–112. https://doi.org/10.1145/
2501988.2502036
100. Yoon D, Chen N, Guimbretière F, Sellen A (2014) Richreview: Blending ink, speech, and
gesture to support collaborative document review. In: Proceedings of the 27th annual ACM
symposium on user interface software and technology. ACM, New York, NY, USA, UIST
’14, pp 481–490. https://doi.org/10.1145/2642918.2647390
101. Yun XL, Zhang YM, Ye JY, Liu CL (2019) Online handwritten diagram recognition with
graph attention networks. In: ICIG
102. Zanibbi R, Novins K, Arvo J, Zanibbi K (2001) Aiding manipulation of handwritten mathe-
matical expressions through style-preserving morphs. Graph Interf 2001:127–134
103. Zhang B, Srihari SN, Lee S (2003) Individuality of handwritten characters. In: Proceedings
of the 7th international conference on document analysis and recognition, pp 1086–1090
104. Zitnick CL (2013) Handwriting beautification using token means. ACM Trans Graph
32(4):53:1–53:8. https://doi.org/10.1145/2461912.2461985
Bridging Natural Language and
Graphical User Interfaces
1 Introduction
While natural language dominates how we humans communicate with each other
in everyday life, Graphical User Interfaces (GUIs), with direct manipulation, are the
common way for us to converse with a computer system. There are inherent connec-
tions between these two communication mediums. In natural language, the semantics
is realized via a sequence of word tokens, and in GUIs, a task is accomplished by
manipulating a set of graphical objects, each of which fulfills a building-block action. The
attempt to combine Natural Language Processing (NLP) and Human–Computer
Interaction (HCI) research dates back to the early work on conversational
agents, which was an excellent example of the synergy between the HCI and
AI fields. With the advances in modern AI methods, such a vision has become more
attainable than ever.
In this chapter, we particularly focus on two directions of work, which have been
pursued in our research group, for bridging the gap between natural language and
graphical user interfaces—we will review two recent papers on natural language
grounding [13] and generation [15] in the context of mobile user interfaces. There
are a rich collection of interaction scenarios where mobile interaction can benefit
from natural language grounding techniques. For example, mobile devices offer a
myriad of functionalities that can assist in our everyday activities. However, many of
these functionalities are not easily discoverable or accessible to users, forcing users to
look up how to perform a specific task, e.g., how to turn on the traffic mode in maps
or change notification settings in YouTube. While searching the web for detailed
instructions for these questions is an option, it is still up to the user to follow these
instructions step by step and navigate UI details through a small touchscreen, which
can be tedious and time-consuming, and results in reduced accessibility. As such,
it is important to develop a computational agent to turn these language instructions
into actions and automatically execute them on the user’s behalf.
On the other hand, we want to equip mobile interfaces, which are highly graphi-
cal and unconventional compared to traditional desktop applications, with language
descriptions so that they can be communicated to users verbally. We refer to the
language description of a UI element as a widget caption. For example, accessibility
services such as screen readers rely on widget captions to make UI elements acces-
sible to visually impaired users via text-to-speech technologies. In general, widget
captions are a foundation for conversational agents on GUIs where UI elements are
building blocks. The lack of widget captions has stood out as a primary issue for
mobile accessibility [25, 26]. More than half of image-based elements have missing
captions [26]. Beyond image-based ones, our analysis of a UI corpus showed
that a wide range of elements have missing captions.
For the rest of the chapter, we will dive into two projects, one for each of these
research directions. For each project, we start with problem formulation and dataset
creation. We then detail the design and training of deep models, and lastly report
and discuss our findings from experiments.
Given an instruction of a multi-step task, I = t1:n = (t1 , t2 , ..., tn ), where ti is the ith
token in instruction I , we want to generate a sequence of automatically executable
actions, a1:m , over a sequence of user interface screens S, with initial screen s1 and
screen transition function s j =τ (a j−1 , s j−1 ):
$$p(a_{1:m} \mid s_1, \tau, t_{1:n}) = \prod_{j=1}^{m} p(a_j \mid a_{<j}, s_1, \tau, t_{1:n}) \tag{1}$$
Fig. 1 Our model extracts the phrase tuple that describes each action, including its operation, object
and additional arguments, and grounds these tuples as executable action sequences in the UI
$$p(a_{1:m} \mid s_1, \tau, t_{1:n}) = \prod_{j=1}^{m} p(a_j \mid s_j, t_{1:n}) \tag{2}$$
s j , from which o j is chosen. λ j defines the structural relationship between the objects.
This is often a tree structure such as the View hierarchy for an Android interface2
(similar to a DOM tree for web pages).
An instruction I describes (possibly multiple) actions. Let ā j denote the phrases
in I that describe action a j . ā j = [r̄ j , ō j , ū j ] represents a tuple of descriptions, with
each corresponding to a span—a subsequence of tokens—in I . Accordingly, ā1:m
represents the description tuple sequence that we refer to as ā for brevity. We also
define Ā as all possible description tuple sequences of I , thus ā ∈ Ā.
$$p(a_j \mid s_j, t_{1:n}) = \sum_{\bar{a} \in \bar{A}} p(a_j \mid \bar{a}, s_j, t_{1:n})\, p(\bar{a} \mid s_j, t_{1:n}) \tag{3}$$
Because a j is independent of the rest of the instruction given its current screen
s j and description ā j , and ā is only related to the instruction t1:n , we can simplify (3)
as (4).
$$p(a_j \mid s_j, t_{1:n}) = \sum_{\bar{a} \in \bar{A}} p(a_j \mid \bar{a}_j, s_j)\, p(\bar{a} \mid t_{1:n}) \tag{4}$$
This defines the action phrase extraction model, which is then used by the grounding
model:
$$p(a_{1:m} \mid t_{1:n}, S) \approx \prod_{j=1}^{m} p(a_j \mid \hat{a}_j, s_j)\, p(\hat{a}_j \mid \hat{a}_{<j}, t_{1:n}) \tag{7}$$
p(â j |â< j , t1:n ) identifies the description tuples for each action. p(a j |â j , s j ) grounds
each description to an executable action given the screen.
2 https://developer.android.com/reference/android/view/View.html.
Fig. 2 PixelHelp example: Open your device’s Settings app. Tap Network and Internet. Click
Wi-Fi. Turn on Wi-Fi. The example uses the App Drawer and Settings on the Google Pixel phone.
The instruction is paired with actions, each of which is shown as a red dot on a specific screen
2.2 Data
The ideal dataset would have natural instructions that have been executed by people
using the UI. Such data can be collected by having annotators perform tasks according
to instructions on a mobile platform, but this is difficult to scale. It requires significant
investment to instrument: different versions of apps have different presentation and
behaviors, and apps must be installed and configured for each task. Due to this, we
create a small dataset of this form, PixelHelp, for full task evaluation. For model
training at scale, we create two other datasets: AndroidHowTo for action phrase
extraction and RicoSCA for grounding. Our datasets are targeted for English. We
hope that starting with a high-resource language will pave the way to creating similar
capabilities for other languages.
Pixel Phone Help pages3 provide instructions for performing common tasks on
Google Pixel phones, such as switching Wi-Fi settings (Fig. 2) or checking emails. Help
pages can contain multiple tasks, with each task consisting of a sequence of steps.
We pulled instructions from the help pages and kept ones that can be automatically
executed. Instructions that require additional user input, such as Tap the app you
want to uninstall, are discarded. Also, instructions that involve actions on a physical
button, such as Press the Power button for a few seconds, are excluded because these
events cannot be executed on mobile platform emulators.
3 https://support.google.com/pixelphone.
No datasets exist that support learning the action phrase extraction model, p(â j |â< j ,
t1:n ), for mobile UIs. To address this, we extracted English instructions for operating
Android devices by processing web pages to identify candidate instructions for how-
to questions such as how to change the input method for Android. A web crawling
service scrapes instruction-like content from various websites. We then filter the web
contents using both heuristics and manual screening by annotators.
Annotators identified phrases in each instruction that describe executable actions.
They were given a tutorial on the task and were instructed to skip instructions that
are difficult to understand or label. For each component in an action description,
they select the span of words that describes the component using a web annotation
interface. The interface records the start and end positions of each marked span. Each
instruction was labeled by three annotators: three annotators agreed on 31% of full
instructions and at least two agreed on 84%. For the consistency at the tuple level,
the agreement across all the annotators is 83.6% for operation phrases, 72.07% for
object phrases, and 83.43% for input phrases. The discrepancies are usually small,
e.g., a description marked as your Gmail address or Gmail address.
The final dataset includes 32,436 data points from 9,893 unique How-To instruc-
tions and is split into training (8K), validation (1K), and test (900) sets. All test examples
have perfect agreement across all three annotators for the entire sequence. In total,
there are 190K operation spans, 172K object spans, and 321 input spans labeled.
The lengths of the instructions range from 19 to 85 tokens, with a median of 59. They
describe a sequence of actions from one to 19 steps, with a median of 5.
Training the grounding model, p(a j |â j , s j ) involves pairing action tuples a j along
screens s j with action description â j . It is very difficult to collect such data at scale. To
get past the bottleneck, we exploit two properties of the task to generate a synthetic
command-action dataset, RicoSCA. First, we have precise structured and visual
knowledge of the UI layout, so we can spatially relate UI elements to each other and
the overall screen. Second, a grammar grounded in the UI can cover many of the
commands and kinds of reference needed for the problem. This does not capture all
manners of interacting conversationally with a UI, but it proves effective for training
the grounding model.
Rico is a public UI corpus with 72K Android UI screens mined from 9.7K Android
apps [6]. Each screen in Rico comes with a screenshot image and a view hierarchy
of a collection of UI objects. Each individual object, c j,k , has a set of properties,
including its name (often an English phrase such as Send), type (e.g., Button,
Image, or Checkbox), and bounding box position on the screen. We manually
removed screens whose view hierarchies do not match their screenshots by asking
annotators to visually verify whether the bounding boxes of view hierarchy leaves
match each UI object on the corresponding screenshot image. This filtering results
in 25K unique screens.
For each screen, we randomly select UI elements as target objects and synthesize
commands for operating them. We generate multiple commands to capture different
expressions describing the operation r̂ j and the target object ô j . For example, the
Tap operation can be referred to as tap, click, or press. The template for referring to
a target object has slots Name, Type, and Location, which are instantiated using
the following strategies:
• Name-Type: the target’s name and/or type (the OK button or OK).
• Absolute-Location: the target’s screen location (the menu at the top right corner).
• Relative-Location: the target’s relative location to other objects (the icon to the
right of Send).
Because all commands are synthesized, the span that describes each part of an action,
â j with respect to t1:n , is known. Meanwhile, a j and s j , the actual action and the
associated screen, are present because the constituents of the action are synthesized.
In total, RicoSCA contains 295,476 single-step synthetic commands for operating
177,962 different target objects across 25,677 Android screens.
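The synthesis procedure can be sketched as simple template filling, where the span of each description component is known by construction because the command string is assembled from its parts. The templates, operation vocabulary, and function names below are illustrative, not the exact grammar used to build RicoSCA.

```python
# A minimal sketch of template-based command synthesis with known spans.
import random

OPERATION_WORDS = {"CLICK": ["tap", "click", "press"]}

def synthesize_command(obj):
    """obj: dict with 'name', 'type', and 'location' strings for a target object."""
    op_word = random.choice(OPERATION_WORDS["CLICK"])
    object_phrase = random.choice([
        f"the {obj['name']} {obj['type']}",             # Name-Type
        f"the {obj['type']} at the {obj['location']}",  # Absolute-Location
    ])
    command = f"{op_word} {object_phrase}"
    tokens = command.split()
    # Spans are (start, end) token indices, known because we built the string.
    spans = {"operation": (0, 0), "object": (1, len(tokens) - 1)}
    return command, spans

print(synthesize_command(
    {"name": "OK", "type": "button", "location": "top right corner"}))
```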
Equation 7 has two parts. p(â j |â< j , t1:n ) finds the best phrase tuple that describes the
action at the jth step given the instruction token sequence. p(a j |â j , s j ) computes
the probability of an executable action a j given the best description of the action, â j ,
and the screen s j for the jth step.
A common choice for modeling the conditional probability p(ā j |ā< j , t1:n ) (see Eq.
5) is an encoder–decoder architecture such as LSTMs [10] or Transformers [28]. The output of our
Fig. 3 The Phrase Tuple Extraction model encodes the instruction’s token sequence and then
outputs a tuple sequence by querying into all possible spans of the encoded sequence. Each tuple
contains the span positions of three phrases in the instruction that describe the action’s operation,
object, and optional arguments, respectively, at each step. ∅ indicates the phrase is missing in the
instruction and is represented by a special span encoding
Outside (BIO) [24]—commonly used to indicate spans in tasks such as named entity
recognition—marks whether each token is beginning, inside, or outside a span. How-
ever, BIO is not ideal for our task because subsequences for describing different
actions can overlap, e.g., in click X and Y, click participates in both actions click
X and click Y. In our experiments, we consider several recent, more flexible span
representations [11, 12, 14] and show their impact in Sect. 2.4.2.
With fixed-length span representations, we can use common alignment techniques
in neural networks [2, 19]. We use the dot product between the query vector and
the span representation: $\alpha(q_j^{y}, h_{b:d}) = q_j^{y} \cdot h_{b:d}$. At each step of decoding, we feed the
previously decoded phrase tuples, ā< j into the decoder. We can use the concatenation
of the vector representations of the three elements in a phrase tuple or sum their vector
representations as the input for each decoding step. The entire phrase tuple extraction
model is trained by minimizing the softmax cross-entropy loss between the predicted
and ground-truth spans of a sequence of phrase tuples.
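The span-querying step can be sketched as follows: sum-pool token encodings into candidate span representations h_{b:d}, score each span with a dot product against the decoder's query vector, and normalize the scores with a softmax. The NumPy sketch below illustrates the mechanism only; the function name and the span-length cap are assumptions.

```python
# A minimal sketch of scoring all candidate spans against a decoder query.
import numpy as np

def score_spans(token_enc, query, max_len=10):
    """token_enc: (n, d) encoder outputs; query: (d,) decoder query vector."""
    n = token_enc.shape[0]
    spans, scores = [], []
    for b in range(n):
        for d_idx in range(b, min(n, b + max_len)):
            h_span = token_enc[b:d_idx + 1].sum(axis=0)   # sum-pooled h_{b:d}
            spans.append((b, d_idx))
            scores.append(query @ h_span)                 # alignment score
    scores = np.array(scores)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                                  # softmax over spans
    return spans, probs

enc, q = np.random.randn(6, 16), np.random.randn(16)
spans, probs = score_spans(enc, q)
print(spans[int(np.argmax(probs))])                       # most likely span
```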
Having computed the sequence of tuples that best describe each action, we connect
them to executable actions based on the screen at each step with our grounding model
(Fig. 4). In step-by-step instructions, each part of an action is often clearly stated.
Thus, we assume the probabilities of the operation r j , object o j , and argument u j
are independent given their description and the screen.
We simplify with two assumptions: (1) an operation is often fully described by its
instruction without relying on the screen information, and (2) in mobile interaction
tasks, an argument is only present for the Text operation, so u j =û j .
Fig. 4 The grounding model grounds each phrase tuple extracted by the phrase extraction model
as an operation type, a screen-specific object ID, and an argument if present, based on a contextual
representation of UI objects for the given screen. A grounded action tuple can be automatically
executed
We parameterize p(r j |r̂ j ) as a feedforward neural network:

$$p(r_j \mid \hat{r}_j) = \mathrm{softmax}\big(\phi(\hat{r}_j, \theta_r)\, W_r\big) \tag{10}$$
$$\hat{o}_j = \phi\Big(\sum_{k=b}^{d} e(t_k), \theta_o\Big) W_o \tag{12}$$
where b and d are the start and end indices of the object description ô j . θo and Wo
are trainable parameters with Wo ∈R |φo |×|o| , where |φo | is the output dimension of
φ(·, θo ) and |o| is the dimension of the latent representation of the object description.
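A minimal sketch of Eqs. 10 and 12 is given below: the operation phrase is encoded and classified with a softmax over operation types, and the object description is embedded by summing its token embeddings and passing them through a small feedforward layer; the resulting vector ô_j is then compared against the contextual object representations described next. Treating φ as a single ReLU layer and using random parameters are assumptions made for illustration.

```python
# A minimal sketch (NumPy) of the operation classifier (Eq. 10) and the object
# description embedding (Eq. 12). Dimensions are illustrative.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMB, HID, NUM_OPS, OBJ_DIM = 100, 16, 32, 4, 8
E = rng.normal(size=(VOCAB, EMB))                      # token embeddings e(t_k)
W_phi_r, W_r = rng.normal(size=(EMB, HID)), rng.normal(size=(HID, NUM_OPS))
W_phi_o, W_o = rng.normal(size=(EMB, HID)), rng.normal(size=(HID, OBJ_DIM))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def operation_probs(op_phrase_tokens):                 # Eq. 10
    phi = np.maximum(E[op_phrase_tokens].sum(0) @ W_phi_r, 0.0)
    return softmax(phi @ W_r)

def object_description_embedding(obj_phrase_tokens):   # Eq. 12
    phi = np.maximum(E[obj_phrase_tokens].sum(0) @ W_phi_o, 0.0)
    return phi @ W_o    # compared against screen-object representations later

print(operation_probs([3, 17]))                        # e.g., token ids of "tap on"
```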
Contextual Representation of UI Objects. To compute latent representations of
each candidate object, c j,k , we use both the object’s properties and its context, i.e.,
the structural relationship with other objects on the screen. There are different ways
for encoding a variable-sized collection of items that are structurally related to each
other, including Graph Convolutional Networks (GCN) [20] and Transformers [28].
GCNs use an adjacency matrix predetermined by the UI structure to regulate how the
latent representation of an object should be affected by its neighbors. Transformers
allow each object to carry its own positional encoding, and the relationship between
objects can be learned instead.
The input to the Transformer encoder is a combination of the content embed-
ding and the positional encoding of each object. The content properties of an object
include its name and type. We compute the content embedding by concatenating the
name embedding, which is the average embedding of the bag of tokens in the object
name, and the type embedding. The positional properties of an object include both
474 Y. Li et al.
its spatial position and structural position. The spatial positions include the top, left,
right, and bottom screen coordinates of the object. We treat each of these coordinates
as a discrete value and represent it via an embedding. Such a feature representa-
tion for coordinates was used in the Image Transformer to represent pixel positions in
an image [22]. The spatial embedding of the object is the sum of these four coor-
dinate embeddings. To encode structural information, we use the index positions
of the object in the preorder and the postorder traversal of the view hierarchy tree,
and represent these index positions as embeddings in a similar way as representing
coordinates. The content embedding is then summed with positional encodings to
form the embedding of each object. We then feed these object embeddings into a
Transformer encoder model to compute the latent representation of each object, c j,k .
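The object embedding described above can be sketched as follows: a content embedding (averaged name-token embeddings concatenated with a type embedding) is summed with embeddings of the four discretized screen coordinates and of the pre-/post-order traversal indices, and the resulting per-object vectors are then fed to a Transformer encoder. Vocabulary sizes, dimensions, and class names are illustrative assumptions.

```python
# A minimal sketch (PyTorch) of building per-object embeddings for the screen encoder.
import torch
import torch.nn as nn

class UIObjectEmbedder(nn.Module):
    def __init__(self, vocab=1000, num_types=15, num_coords=100,
                 max_nodes=200, d_model=64):
        super().__init__()
        self.token_emb = nn.Embedding(vocab, d_model // 2)
        self.type_emb = nn.Embedding(num_types, d_model // 2)
        self.coord_emb = nn.Embedding(num_coords, d_model)   # shared for t/l/r/b
        self.order_emb = nn.Embedding(max_nodes, d_model)    # pre/post-order index

    def forward(self, name_tokens, obj_type, coords, tree_pos):
        """name_tokens: (L,), obj_type: scalar, coords: (4,), tree_pos: (2,) long tensors."""
        content = torch.cat([self.token_emb(name_tokens).mean(0),
                             self.type_emb(obj_type)], dim=-1)
        spatial = self.coord_emb(coords).sum(0)        # top/left/right/bottom
        structural = self.order_emb(tree_pos).sum(0)   # preorder + postorder
        # The sum is the object embedding fed to the Transformer screen encoder.
        return content + spatial + structural
```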
The grounding model is trained by minimizing the cross-entropy loss between the
predicted and ground-truth object and the loss between the predicted and ground-truth
operation.
2.4 Experiments
Our goal is to develop models and datasets to map multi-step instructions into auto-
matically executable actions given the screen information. As such, we use Pixel-
Help’s paired natural instructions and action-screen sequences solely for testing. In
addition, we investigate the model quality on phrase tuple extraction tasks, which is
a crucial building block for the overall grounding quality.4
We use two metrics that measure how a predicted tuple sequence matches the ground-
truth sequence.
• Complete Match: The score is 1 if two sequences have the same length and have
the identical tuple [r̂ j , ô j , û j ] at each step, otherwise 0.
• Partial Match: The number of steps of the predicted sequence that match the
ground-truth sequence divided by the length of the ground-truth sequence (ranging
between 0 and 1).
We train and validate using AndroidHowTo and RicoSCA, and evaluate on
PixelHelp. During training, single-step synthetic command-action examples are
dynamically stitched to form sequence examples with a certain length distribution.
To evaluate the full task, we use Complete and Partial Match on grounded action
sequences a1:m where a j =[r j , o j , u j ].
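The two metrics can be computed directly on grounded action sequences, treated as lists of (operation, object, argument) tuples; the example values below are made up for illustration.

```python
# A minimal sketch of the Complete Match and Partial Match metrics.
def complete_match(pred, gold):
    # 1 only if both sequences have the same length and identical tuples.
    return 1.0 if pred == gold else 0.0

def partial_match(pred, gold):
    # Fraction of ground-truth steps matched at the same position.
    matched = sum(1 for p, g in zip(pred, gold) if p == g)
    return matched / len(gold)

gold = [("CLICK", 3, None), ("TEXT", 7, "hello"), ("CLICK", 9, None)]
pred = [("CLICK", 3, None), ("TEXT", 7, "hi"), ("CLICK", 9, None)]
print(complete_match(pred, gold), round(partial_match(pred, gold), 2))  # 0.0 0.67
```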
The token vocabulary size is 59K, which is compiled from both the instruction
corpus and the UI name corpus. There are 15 UI types, including 14 common UI
object types, and a type to catch all less common ones. The output vocabulary for
operations includes CLICK, TEXT, SWIPE, and EOS.
Tuple Extraction. For the action-tuple extraction task, we use a six-layer Trans-
former for both the encoder and the decoder. We evaluate three different span rep-
resentations. Area Attention [14] provides a parameter-free representation of each
possible span (one-dimensional area), by summing up the encoding of each token
in the subsequence: $h_{b:d} = \sum_{k=b}^{d} h_k$. The representation of each span can be com-
puted in constant time invariant to the length of the span, using a summed area
table. Previous work concatenated the encoding of the start and end tokens as the
span representation, h b:d = [h b ; h d ] [12] and a generalized version of it [11]. We
evaluated these three options and implemented the representation in [11] using a
summed area table similar to the approach in area attention for fast computation. For
hyperparameter tuning and training details, refer to the appendix.
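The constant-time-per-span computation can be sketched with a one-dimensional summed area table, i.e. a prefix sum over the token encodings, from which any sum-pooled span representation h_{b:d} is the difference of two prefix entries. The max_len cap is an assumption for illustration.

```python
# A minimal sketch (NumPy) of sum-pooled span representations via prefix sums.
import numpy as np

def span_sums(token_enc, max_len=10):
    """token_enc: (n, dim). Returns {(b, d): h_{b:d}} for spans up to max_len tokens."""
    n, dim = token_enc.shape
    prefix = np.concatenate([np.zeros((1, dim)), np.cumsum(token_enc, axis=0)])
    spans = {}
    for b in range(n):
        for d in range(b, min(n, b + max_len)):
            spans[(b, d)] = prefix[d + 1] - prefix[b]   # sum of h_b .. h_d
    return spans

enc = np.random.randn(5, 4)
assert np.allclose(span_sums(enc)[(1, 3)], enc[1:4].sum(axis=0))
```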
Table 1 gives results on AndroidHowTo’s test set. All the span representations
perform well. Encodings of each token from a Transformer already capture sufficient
information about the entire sequence, so even only using the start and end encodings
yields strong results. Nonetheless, area attention provides a small boost over the
others. As a new dataset, there is also considerable headroom remaining, particularly
for complete match.
Grounding. For the grounding task, we compare the Transformer-based screen
encoder for generating object representations with two baseline methods based
on graph convolutional networks. The Heuristic baseline matches extracted phrases
against object names directly using BLEU scores. Filter-1 GCN performs graph
convolution without using adjacent nodes (objects), so the representation of each
object is computed only based on its own properties. Distance GCN uses the dis-
tance between objects in the view hierarchy, i.e., the number of edges to traverse from
one object to another following the tree structure. This contrasts with the traditional GCN definition based on adjacency, but is needed because UI objects are often leaves in the tree; as such, they are not adjacent to each other structurally but instead are connected through non-terminal (container) nodes. Both Filter-1 GCN and Distance GCN use the same number of parameters (see the appendix for details).

Table 1  AndroidHowTo phrase tuple extraction test results using different span representations h_{b:d} in (8). ê_{b:d} = Σ_{k=b}^{d} w(h_k) e(t_k), where w(·) is a learned weight function for each token embedding [11]. See the pseudocode for fast computation of these in the Appendix

Span Rep. h_{b:d}                      Partial   Complete
SumPooling  Σ_{k=b}^{d} h_k            92.80     85.56
StartEnd  [h_b; h_d]                   91.94     84.56
[h_b; h_d; ê_{b:d}; φ(d − b)]          91.11     84.33

Table 2  PixelHelp grounding accuracy. The differences are statistically significant based on a t-test over 5 runs (p < 0.05)

Screen encoder    Partial   Complete
Heuristic         62.44     42.25
Filter-1 GCN      76.44     52.41
Distance GCN      82.50     59.36
Transformer       89.21     70.59
To train the grounding model, we first train the Tuple Extraction sub-model on
AndroidHowTo and RicoSCA. For the latter, only language-related features (com-
mands and tuple positions in the command) are used in this stage, so screen and action
features are not involved. We then freeze the Tuple Extraction sub-model and train
the grounding sub-model on RicoSCA using both the command- and screen-action-
related features. The screen token embeddings of the grounding sub-model share
weights with the Tuple Extraction sub-model.
Table 2 gives full task performance on PixelHelp. The Transformer screen
encoder achieves the best result with 70.59% accuracy on Complete Match and
89.21% on Partial Match, which sets a strong baseline result for this new dataset
while leaving considerable headroom. The GCN-based methods perform poorly,
which shows the importance of contextual encodings of the information from other
UI objects on the screen. Distance GCN does attempt to capture context for UI objects
that are structurally close; however, we suspect that the distance information that is
derived from the view hierarchy tree is noisy because UI developers can construct
the structure differently for the same UI.5 As a result, the strong bias introduced by
the structure distance does not always help. Nevertheless, these models still outper-
formed the heuristic baseline that achieved 62.44% for partial match and 42.25% for
complete match.
2.5 Analysis
To explore how the model grounds an instruction on a screen, we analyze the relation-
ship between words in the instruction language that refer to specific locations on the
screen, and actual positions on the UI screen. We first extract the embedding weights
from the trained phrase extraction model for words such as top, bottom, left, and
right. These words occur in object descriptions such as “the check box at the top of the screen.” We also extract the embedding weights of object screen positions, which are used to create object positional encoding. We then calculate the correlation between word embedding and screen position embedding using cosine similarity. Figure 5 visualizes the correlation as a heatmap, where brighter colors indicate higher correlation. The word top is strongly correlated with the top of the screen, but the trend for other location words is less clear. While left is strongly correlated with the left side of the screen, other regions on the screen also show high correlation. This is likely because left and right are not only used for referring to absolute locations on the screen, but also for relative spatial relationships, such as “the icon to the left of the button.” For bottom, the strongest correlation does not occur at the very bottom of the screen because many UI objects in our dataset do not fall in that region. The region is often reserved for system actions and the on-screen keyboard, which are not covered in our dataset.

Fig. 5  Correlation between location-related words in instructions and object screen position embeddings

5 While it is possible to directly use screen visual data for grounding, detecting UI objects from raw pixels is nontrivial. It would be ideal to use both structural and visual data.
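A minimal sketch of the correlation analysis visualized in Fig. 5, assuming the trained embedding matrices have been exported as NumPy arrays (variable names are illustrative):

import numpy as np

def cosine_correlation(word_embs, pos_embs):
    # word_embs: [n_words, dim] embeddings of location words (top, bottom, left, right).
    # pos_embs: [n_positions, dim] object screen position embeddings.
    w = word_embs / np.linalg.norm(word_embs, axis=1, keepdims=True)
    p = pos_embs / np.linalg.norm(pos_embs, axis=1, keepdims=True)
    return w @ p.T  # rows = words, columns = positions; plot as a heatmap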
The phrase extraction model passes phrase tuples to the grounding model. When
phrase extraction is incorrect, it can be difficult for the grounding model to predict a
correct action. One way to mitigate such cascading errors is to use the hidden state of the phrase decoding model at each step, q_j. Intuitively, q_j is computed with access to the encoding of each token in the instruction via the Transformer encoder–decoder attention, so it can potentially serve as a more robust span representation. However, in our early exploration, we found that grounding with q_j performs remarkably well on RicoSCA validation examples, but performs poorly on PixelHelp. The
learned hidden state likely captures characteristics in the synthetic instructions and
action sequences that do not manifest in PixelHelp. As such, using the hidden state
to ground remains a challenge when learning from unpaired instruction-action data.
The phrase model failed to extract correct steps for 14 tasks in PixelHelp. In
particular, it resulted in extra steps for 11 tasks and extracted incorrect steps for 3
tasks, but did not skip steps for any tasks. These errors could be caused by different
language styles manifested by the three datasets. Synthesized commands in RicoSCA
Fig. 6 Widget captioning is a task to generate language descriptions for UI elements that have
missing captions, given multimodal input of UI structures and screenshot images. These captions
are crucial for accessibility and language-based interaction in general. The illustration uses a screen
from a Music Player app
3.1 Data
We first create a mobile UI corpus, and then ask crowd workers to create captions for
UI elements that have missing captions, which is followed by a thorough analysis of
the dataset.
We create a mobile UI corpus based on RICO [6]. We expand the dataset using a crawling robot that performs random clicks on mobile interfaces, which adds 12K novel screens to our corpus. Each screen comes with both a screenshot JPG/PNG
image and a view hierarchy9 in JSON. The view hierarchy is a structural tree representation of the UI, where each node has a set of properties such as content description, Android class information, visibility, and bounding box.

Fig. 7  The percentage of elements that have missing captions (red) for each category and elements labeled by crowd workers (green). The numbers in parentheses are total counts of the elements
Preprocessing the UI Corpus. We first exclude UI screens with missing or inac-
curate view hierarchies, which could occur when Android logging is out of sync.
This filtering step was conducted by asking crowd workers to visually examine each
UI and confirm that the bounding boxes of all the leaf nodes in the hierarchy match
the UI elements shown on the screenshot image. We focus on leaf nodes because
most interactive elements are leaf nodes. The filtering process resulted in 24,571
unique screens from 6,853 mobile apps.
We then select UI elements that are visible and clickable because they are responsible for many of the interaction tasks. Similar to previous work, we consider an element to be missing a caption when both its contentDescription and text properties in the view hierarchy are missing, according to the Android accessibility guideline.10 Screen readers such as the TalkBack service rely on these fields to announce the widget. Overall, in our dataset, there are 74,379 UI elements with missing captions,
across 10 categories of UI elements (see Fig. 7).
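A minimal sketch of this selection step over a view hierarchy in JSON; the property names follow the description above, but the exact schema of the corpus may differ:

def elements_missing_captions(node, results=None):
    # Collect visible, clickable leaf nodes whose contentDescription and text
    # are both missing, i.e., elements a screen reader cannot announce.
    if results is None:
        results = []
    children = node.get("children") or []
    if not children:
        if (node.get("visibility") == "visible" and node.get("clickable")
                and not node.get("contentDescription") and not node.get("text")):
            results.append(node)
    for child in children:
        elements_missing_captions(child, results)
    return results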
Understanding Missing Captions. Previous work analyzed missing captions for
image-based elements [26]. We include all types of elements in our dataset and anal-
ysis. The results from analyzing image-based elements in our corpus are comparable
to previous analysis, i.e., 95% of Floating Action Buttons, 83% of Image Views,
and 57% of Image Buttons have missing captions. Beyond these image-based ele-
ments, we found that missing captions is a serious issue for other types of elements
as well (see Fig. 7). More than 50% of the Switch, Compound Button, Check Box, and Toggle Button elements have missing captions, and 24.3% of the screens have no pre-existing captions at all.

10 https://developer.android.com/guide/topics/ui/accessibility/apps.
Crowdsourcing Widget Captions. To best match the target scenario of predicting
for elements with missing captions, we asked crowd workers to create captions for
these elements, which are used as labels for training and testing. Because pre-existing
captions in the corpus are not always correct, they are used as model input, to provide
the context, but not as output.
We developed a web interface for crowd workers to create language descriptions
for UI elements that have missing captions. The interface shows a screenshot of
the mobile interface, with the UI element that needs to be captioned highlighted.
Workers can input the caption using a text field, or indicate that they cannot describe
the element. In the annotation guidelines, we asked the workers to caption each element so that vision-impaired users can understand its functionality and purpose. The captions need to be concise but more descriptive than generic words such as “button” or “image.” We recruited over 5,454 workers from Amazon Mechanical Turk11 over multiple batches. While the elements to be labeled by each worker are randomly selected, we instrumented the task such that a worker could label each unique element only once, and each element is labeled by three different workers.
Data Analyses. Human workers could skip elements when they were not sure how to describe them. Across all the elements of each type given to workers, the percentage of elements being captioned ranges from 75% to 94% (see Fig. 7). The View type has the lowest labeling ratio of 75%; we suspect this is because elements of the View type, a generic widget type, tend to be quite arbitrary and are difficult for the workers to understand. We only kept the elements that received at least two
captions (from different workers). On average, each element received 2.66 captions.
In total, we collected 162,859 captions for 61,285 UI elements across 21,750 unique
screens, from 6,470 mobile apps.
To measure inter-annotator agreement, we computed the word-level precision and
recall for all the words with two or more occurrences in the collected captions (see
Fig. 8), as in the COCO image captioning dataset [5]. The results were generated on the roughly 6K most frequent words, which account for 98.6% of all the word occurrences
in the captions. Figure 8 shows that our corpus has reasonable word-level agreement
among the captions of the same widget. Specifically, for the 6K most frequent words,
we report the mean precision and recall of every 10 consecutive words in the vocab-
ulary. Therefore, we have 600 data points, each representing precision/recall of 10
words. The ranks of the words in the vocabulary sorted by word frequency are used
to color the data points. Lower rank indicates higher word frequencies in the corpus.
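A sketch of the word-level agreement computation; this is one plausible reading of the COCO-style procedure (leave-one-out over annotators), not the authors' exact evaluation script:

from collections import Counter

def word_level_agreement(captions_per_element, min_occurrences=2):
    # captions_per_element: a list of caption lists, one list per UI element.
    tp, fp, fn = Counter(), Counter(), Counter()
    for captions in captions_per_element:
        token_sets = [set(c.lower().split()) for c in captions]
        for i, cand in enumerate(token_sets):
            refs = set().union(*(token_sets[:i] + token_sets[i + 1:]))
            for w in cand:
                if w in refs:
                    tp[w] += 1   # the word is also used by another annotator
                else:
                    fp[w] += 1
            for w in refs - cand:
                fn[w] += 1
    scores = {}
    for w in set(tp) | set(fp) | set(fn):
        if tp[w] + fp[w] + fn[w] < min_occurrences:
            continue   # keep only words with two or more occurrences
        precision = tp[w] / (tp[w] + fp[w]) if tp[w] + fp[w] else 0.0
        recall = tp[w] / (tp[w] + fn[w]) if tp[w] + fn[w] else 0.0
        scores[w] = (precision, recall)
    return scores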
Caption Phrase Analysis. We analyzed the distribution of caption lengths created by human workers (see Fig. 9). We found that most captions are brief, i.e., two to three words, but a significant number of captions have more words; these are often long-tail captions. The average length of captions from human workers is 2.72 words. Overall, the length distribution of captions created by human workers is similar to that of the pre-existing captions in the UI corpus, which come from app developers. The latter will be used as a feature input to the model, which we will discuss later.

11 https://www.mturk.com.

Fig. 8  The distribution of precision and recall for the top 6K words of the collected captions

Fig. 9  The distribution of caption lengths created by human workers (caption length, from 1 to ≥10, versus number of captions, ×10^4)
The captions in our dataset include a diverse set of phrases. The most frequent caption is “go back,” which accounts for 4.0% of the distribution. The other popular captions among the top 5 are “advertisement” (2.4%), “go to previous” (0.8%), “search” (0.7%), and “enter password” (0.6%).
A common pattern of the phrases we observe is Predicate + Object. Table 3 lists
the seven common predicates and their most frequent objects. As we can see, the
phrases describe highly diverse functionalities of the UI elements. It is difficult to classify them into a few common categories.

Table 3  In our dataset, the popular predicates are often associated with a diverse set of objects that are contextually determined

Predicate     Object
Search        Location, contact, app, music, map, image, people, recipe, flight, hotel
Enter         Password, email, username, phone, last name, first name, zip code, location, city
Select        Image, ad, color, emoji, app, language, folder, location, ringtone, theme
Toggle        Autoplay, favorite, menu, sound, advertisement, power, notification, alarm, microphone
Share (to)    Article, Facebook, Twitter, image, app, video, Instagram, recipe, location, Whatsapp
Download      App, sound, song, file, image, video, theme, game, wallpaper, effect
Close         Window, ad, screen, tab, menu, pop-up, notification, file, settings, message

This linguistic characteristic motivated
us to choose sequence decoding for caption generation instead of classification based
on a predefined phrase vocabulary. The diversity of caption phrases indicates that
widget captioning is a challenging machine learning task.
Furthermore, to distinguish different objects for the same predicate, it is necessary
to take into account the screen context that the element belongs to. For example,
Fig. 10 shows two examples of the “search” predicate. The two UI elements have
very similar images (magnifiers) although they are for searching different objects.
Thus, context information is critical for models to decode the correct objects.
View Hierarchy Complexities. A unique modality in widget captioning is UI
structures as represented by view hierarchy trees. To better understand the complexity
of the UI structures, we analyze the size and depth of the view hierarchy of each UI.
The size of a view hierarchy is the total number of nodes in the hierarchy tree,
including both non-terminal nodes (i.e., layout containers) and leaf nodes. The size distribution is highly skewed, with a long tail toward large view hierarchies (see the left of Fig. 11). The median size of view hierarchies is 61, with a minimum of
6 and a maximum of 1,608 nodes. Many view hierarchies have a large depth with a
median depth of 11, a minimum of 3, and a maximum of 26 (see the right of Fig. 11).
These show that view hierarchies are complex and contain rich structural information
about a user interface.
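A minimal sketch of how these two statistics can be computed from a view hierarchy in JSON (assuming each node stores its children under a "children" key):

def tree_size_and_depth(node):
    # Returns (number of nodes, depth) of the view hierarchy rooted at `node`.
    children = node.get("children") or []
    if not children:
        return 1, 1
    sizes, depths = zip(*(tree_size_and_depth(c) for c in children))
    return 1 + sum(sizes), 1 + max(depths)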
Fig. 10 Two UI elements (outlined in red) of “search” predicate. Left: search contact; Right: search
music. Both screens are from the RICO dataset [6]: the left screen is from a call app and the right
screen is from a music player app
Fig. 11 The histogram of log10 transformed view hierarchy sizes on the left, and the histogram of
tree depths on the right
3.2 Model
We encode the UI elements on a screen with a Transformer encoder, which captures the relations between the elements using multi-head neural attention. The input to a Transformer model requires both
the content embedding and positional encoding. Similar to previous work [13], we
derive these embeddings for each element on the screen in the following manner.
Each UI element in the view hierarchy consists of a tuple of properties. The widget_text property includes a collection of words possessed by the element. We acquire the embedding of the widget_text property of the i-th element on the screen, e_i^X, by max pooling over the embedding vector of each word in the property. When the widget_text property is empty, i.e., the element is missing a caption, a special embedding, e_∅, is used. Together with e_i^T, the embedding of the widget_type property (see Fig. 7), and e_i^C, the embedding of whether the widget is clickable, [e_i^X; e_i^T; e_i^C] forms the content embedding of the element.
The widget_bounds property contains four coordinate values on the screen: left, top, right, and bottom, which are normalized to the range of [0, 100). The widget_dom property contains three values describing the element's position in the view hierarchy tree: its sequence position in the preorder traversal, its sequence position in the postorder traversal, and its depth in the tree. These are all treated as categorical values and represented as embedding vectors. The sum of these coordinate embeddings forms the positional embedding vector of the element, e_i^B.
The concatenation of all these embeddings forms the representation of a UI element: e_i = [e_i^X; e_i^T; e_i^C; e_i^B] W^E, where W^E is the parameter matrix that linearly projects the concatenation to the dimension expected by the Transformer model. The output of the Transformer encoder model, h_i, is the structural encoding of the i-th element on the screen.
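A sketch of how the content and positional embeddings above could be assembled in PyTorch; dimensions, vocabulary sizes, and the handling of the DOM position embeddings are assumptions, not the authors' exact implementation:

import torch
import torch.nn as nn

class ElementEmbedder(nn.Module):
    # Builds e_i = [e_i^X; e_i^T; e_i^C; e_i^B] W^E for a single UI element.
    def __init__(self, n_words=10000, n_types=15, dim=128, d_model=128):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, dim)        # widget_text words
        self.empty_emb = nn.Parameter(torch.zeros(dim))   # e_∅ for missing text
        self.type_emb = nn.Embedding(n_types, dim)        # widget_type
        self.click_emb = nn.Embedding(2, dim)             # clickable flag
        self.coord_emb = nn.Embedding(100, dim)           # coordinates in [0, 100)
        self.proj = nn.Linear(4 * dim, d_model)           # W^E

    def forward(self, word_ids, type_id, clickable, bounds):
        # e_i^X: max pooling over word embeddings, or e_∅ if widget_text is empty.
        e_x = (self.word_emb(word_ids).max(dim=0).values
               if word_ids.numel() > 0 else self.empty_emb)
        e_t = self.type_emb(type_id)             # type_id: scalar LongTensor
        e_c = self.click_emb(clickable)          # clickable: scalar LongTensor (0/1)
        # e_i^B: sum of the coordinate embeddings (left, top, right, bottom);
        # the three widget_dom position embeddings would be summed in here as well.
        e_b = self.coord_emb(bounds).sum(dim=0)  # bounds: LongTensor of shape [4]
        return self.proj(torch.cat([e_x, e_t, e_c, e_b], dim=-1))

The element representations e_i for all elements on a screen would then be fed to a standard Transformer encoder to produce the structural encodings h_i.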
The image of an element is cropped from the UI screenshot and rescaled to a fixed dimension, which results in a 64 × 64 × 1 tensor, where 64 × 64 are the spatial dimensions and 1 is the grayscale color channel. This image dimension strikes a good balance for representing both small and large elements: it preserves enough detail for large elements after they are scaled down, and keeps the memory footprint manageable for model training and serving.
We use a ResNet (CNN) [9] to encode an element image. Each layer in the image encoder consists of a block of three sub-layers with a residual connection—the input of the first sub-layer is added to the input of the third sub-layer. No pooling is used; instead, the last sub-layer of each block uses a stride of 2, which halves both the vertical and horizontal spatial dimensions after each layer. At the same time, each layer doubles the channel dimension, starting from a channel dimension of 4 in the first layer. Most sub-layers use a kernel size of 3 × 3, except the initial and final sub-layers of the first layer, which use a kernel size of 5 × 5. We discuss further details of the model configuration for the image encoder in the experiment section. The output of the multi-layer CNN is the encoding vector of the element image, which we refer to as g_i for the i-th element.
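A sketch of one such block in PyTorch, following the description above (batch normalization, which the experiment section mentions is applied to each convolutional layer, is omitted for brevity; other details are assumptions):

import torch.nn as nn

class ImageEncoderBlock(nn.Module):
    # Three convolutional sub-layers; the block input is added to the input of
    # the third sub-layer; the last sub-layer halves the spatial dimensions
    # (stride 2) and doubles the channel dimension.
    def __init__(self, in_ch, kernel=3):
        super().__init__()
        pad = kernel // 2
        self.conv1 = nn.Conv2d(in_ch, in_ch, kernel, padding=pad)
        self.conv2 = nn.Conv2d(in_ch, in_ch, kernel, padding=pad)
        self.conv3 = nn.Conv2d(in_ch, 2 * in_ch, kernel, stride=2, padding=pad)
        self.act = nn.ReLU()

    def forward(self, x):
        h = self.act(self.conv1(x))
        h = self.act(self.conv2(h))
        h = h + x                      # residual connection
        return self.act(self.conv3(h))

In the full encoder, blocks like this would be stacked on the 64 × 64 × 1 element image, with the first layer mapping to 4 channels, and the final feature map flattened into the image encoding g_i.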
We form the latent representation of the i-th element on the screen by combining its structural and image encodings: z_i = σ([h_i; g_i], θ_z) W_z, where σ(·) is the non-linear activation function parameterized by θ_z and W_z are the trainable weights for linear projection. Based on this encoding, we use a Transformer [28] decoder model to generate a variable-length caption for the element.
a^l_{i,1:M} = Masked_ATTN(x^l_{i,1:M}, W_d^Q, W_d^K, W_d^V)
x^{l+1}_{i,1:M} = FFN(a^l_{i,1:M} + z_i, θ_d),
where 0 ≤ l ≤ L is the layer index and M is the number of word tokens to decode. x^0_{i,1:M}, the input to the decoder model, is the token embedding with the sequence positional encoding. W_d^Q, W_d^K, and W_d^V are trainable parameters for computing the queries, keys, and values. Masked_ATTN in a Transformer decoder allows multi-head attention to attend only to previous token representations. The element encoding, z_i, is added to the attention output of each decoding step, a^l_{i,1:M}, before being fed into the position-wise multi-layer perceptron (FFN), parameterized by θ_d. The probability distribution over each token of the caption is finally computed using a softmax over the output of the last Transformer layer: y_{i,1:M} = softmax(x^L_{i,1:M} W_d^y), where W_d^y is a trainable parameter matrix.
There is one instance of the decoder model for each element to be captioned. The
captions for all the elements with missing captions on the same screen are decoded in
parallel. The entire model, including both the encoder and decoder, is trained end to
end, by minimizing L_screen, the average cross-entropy loss for decoding each token
of each element caption over the same screen.
L_screen = (1/|∇|) Σ_{i∈∇} (1/M) Σ_{j=1}^{M} Cross_Entropy(y_{i,j}, ŷ_{i,j}),
where ∇ is the set of elements on the same screen with missing captions and ŷ_{i,j} is the ground-truth token. Training is conducted in a teacher-forcing manner where
the ground-truth caption words are fed into the decoder. During prediction time, the
model decodes autoregressively.
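A sketch of the screen-level training objective with teacher forcing (illustrative tensor shapes; `decoder` stands for the Transformer decoder described above and is not a real API):

import torch.nn.functional as F

def screen_loss(decoder, z, gold_tokens):
    # z: [n_elements, d_model] encodings of the elements with missing captions
    #    on one screen; gold_tokens: [n_elements, M] ground-truth token ids.
    inputs = gold_tokens[:, :-1]     # teacher forcing: feed gold tokens
    targets = gold_tokens[:, 1:]     # predict the next token at each step
    logits = decoder(inputs, z)      # [n_elements, M-1, vocab_size]
    # Cross-entropy averaged over all elements and tokens, as in L_screen.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))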
3.3 Experiments
We first discuss the experimental setup, and then report the accuracy of our model
as well as an analysis of the model behavior.
3.3.1 Datasets
We split our dataset into training, validation, and test set for model development and
evaluation, as shown in Table 4. The UIs of the same app may have a similar style.
To avoid information leaks, the split was done app-wise so that all the screens from
the same app will not be shared across different splits. Consequently, all the apps
and screens in the test dataset are unseen during training, which allows us to examine how each model configuration generalizes to unseen conditions at test time.
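A minimal sketch of such an app-wise split (hypothetical field names; the split ratios are illustrative):

import random

def app_wise_split(screens, train_frac=0.8, val_frac=0.1, seed=0):
    # Every screen of a given app ends up in exactly one split.
    apps = sorted({s["app_id"] for s in screens})
    random.Random(seed).shuffle(apps)
    n_train = int(len(apps) * train_frac)
    n_val = int(len(apps) * val_frac)
    split_of = {a: ("train" if i < n_train
                    else "val" if i < n_train + n_val else "test")
                for i, a in enumerate(apps)}
    buckets = {"train": [], "val": [], "test": []}
    for s in screens:
        buckets[split_of[s["app_id"]]].append(s)
    return buckets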
Our vocabulary includes the 10,000 most frequent words (which cover more than 95% of the words in the dataset); the remaining words encountered in the training dataset are assigned a special unknown token <UNK>. During validation and testing,
any <UNK> in the decoded phrase is removed before evaluation. Since each element
has more than one caption, one of its captions is randomly sampled each time during
training. For testing, all the captions of an element constitute its reference set for
computing automatic metrics such as CIDEr.
The training, validation, and test datasets have a similar ratio of 40% for caption
coverage, i.e., the number of elements with pre-existing captions with respect to the
total number of elements on each screen, with no statistically significant difference (p > 0.05). Screens with no pre-existing captions exist in all the splits.
vector. We used batch normalization for each convolutional layer. The final encoding z_i of an element is a 128-dimensional vector that is used for decoding.
We report our accuracy based on the BLEU (unigram and bigram) [21], CIDEr [29], ROUGE-L [16], METEOR [7], and SPICE [1] metrics (see Table 5). For all these metrics, a higher number means better captioning accuracy, i.e., a closer match between the predicted and the ground-truth captions.
We investigate how model variations impact the overall accuracy of captioning
(Table 5). Template Matching is an obvious baseline, which predicts the caption of
an element based on its image similarity with elements that come with a caption.
We use pixel-wise cosine similarity to compare the element images. Although this
heuristic-based method is able to predict captions for certain elements, it performs
poorly compared to the rest of the models that use deep architectures. The Pixel Only model, which uses only the image encoding of an element, performs significantly better than Template Matching, which indicates that the image encoding, g_i, is a much more efficient representation than raw pixels.
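A sketch of the Template Matching baseline as described above (nearest-neighbor caption transfer under pixel-wise cosine similarity; image preprocessing details are assumptions):

import numpy as np

def template_matching_caption(query_img, labeled_imgs, labeled_captions):
    # Copy the caption of the most similar labeled element image.
    q = query_img.reshape(-1).astype(np.float32)
    q /= (np.linalg.norm(q) + 1e-8)
    best_caption, best_sim = None, -1.0
    for img, caption in zip(labeled_imgs, labeled_captions):
        v = img.reshape(-1).astype(np.float32)
        v /= (np.linalg.norm(v) + 1e-8)
        sim = float(q @ v)
        if sim > best_sim:
            best_caption, best_sim = caption, sim
    return best_caption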
Pixel+Local, which uses both the image encoding, g_i, and the structural representation computed only from the properties of the element itself, offers a further improvement in accuracy. Our full model, Pixel+Local+Context, uses both the image encoding, g_i, and the screen context encoding, h_i. It achieves the best results, which indicates that screen context carries useful information about an element for generating its caption. Among all the structural features, the widget_text property plays an important role.

Table 5  The accuracy of each model configuration on the full set and the predicate–object subset of the test dataset

Model configuration          BLEU-1   BLEU-2   ROUGE   CIDEr   METEOR   SPICE
Full test set
Template Matching            20.2     11.2     20.9    38.0    13.2     6.5
Pixel Only                   35.6     24.6     35.6    71.3    24.9     11.2
Pixel+Local                  42.6     29.4     42.0    87.3    29.4     15.3
Pixel+Local+Context (PLC)    44.9     32.2     44.7    97.0    31.7     17.6
PLC Classification           36.2     25.7     36.9    78.9    26.0     13.6
Predicate–object subset
Template Matching            20.8     11.2     21.3    34.5    12.6     7.5
Pixel Only                   39.4     27.2     39.1    69.6    25.8     14.2
Pixel+Local                  48.5     34.8     47.4    94.7    32.3     19.9
Pixel+Local+Context (PLC)    52.0     38.8     51.3    110.1   36.4     23.3
PLC Classification           38.5     27.0     38.4    78.9    26.3     16.8
In addition to examining the impact of input modality on captioning quality, we
compare strategies of caption generation: sequence decoding based on word tokens
versus classification based on common caption phrases. The PLC Classification model uses the same input modality and encoding as Pixel+Local+Context but decodes a single predefined phrase from a vocabulary of the top 10K caption phrases—the
same size as the token vocabulary for decoding. It performed poorly compared to
the decoding-based approach.
To further validate the usefulness of the context and of the information from the view hierarchy, we evaluate the models on a subset of UI elements for which at least one of the reference caption phrases follows the Predicate + Object pattern (see Table 3). This subset consists of about 40% of the UI elements in the test set. All the models achieve better accuracy on it because the predicate–object subset consists of more common words. Pixel+Local+Context remains the champion model and, more importantly, achieves the most significant gain across all the metrics (see Table 5). This indicates that context information plays a crucial role in generating this type of caption, whose object part needs to be contextually determined. In contrast, PLC Classification still
performs worse than the champion decoding-based model. While the subset contains
more common words, their combination can form long-tail phrases. A classification-
based method such as PLC Classification is more vulnerable to the data sparsity of
long-tail phrases.
To assess the quality of the generated phrases with human judges, we asked another group of
crowd workers to manually verify the model generated captions for the entire test
set, by presenting each human rater a caption and its corresponding element in a
UI screenshot. For each phrase, we asked three raters to verify whether the caption
phrase correctly describes the functionality and purpose of the element. We compared
two of our models and the results are listed in Table 6. The overall endorsement of
raters for generated captions is 78.64% for the full model and 62.42% for the Pixel
Only model. These results indicate that our model can generate meaningful captions
for UI elements. We found shorter captions tend to receive more rater endorsements
than longer ones. The model with context still outperforms the one without context,
which is consistent with automatic evaluation.
Table 6  The human evaluation results. N+ in the header means that N or more raters judged that the caption correctly describes the element

Model                  1+      2+      3+
Pixel Only             81.9    61.7    43.6
Pixel+Local+Context    93.9    81.1    61.0
3.4 Analysis
• Nearby Elements (21): The model is confused by nearby elements on the screen,
e.g., outputting “enter phone number” for “write street address” on a sign-up
screen.
• Similar Appearance (10): The model is confused by elements with a similar appear-
ance, e.g., predicting “delete” for an X-shaped image that is labeled as “close.”
• Too Generic (9): The model generates captions that are too generic, e.g., “toggle
on” instead of “flight search on/off.”
• Model Correct (10): The model produces semantically correct captions that are nevertheless treated as errors due to the limitations of automatic evaluation, e.g., “close” for “exit.”
There are two directions for future improvement. One is to improve the encoders for UI images and view hierarchies to better represent UI elements. The other is to address data sparsity: we want to better handle long-tail phrases by expanding the dataset and having more elements and screens labeled.
4 Conclusion
In this chapter, we reviewed two projects for bridging natural language and graphical
user interfaces. In the first project, we discussed natural language grounding in mobile
user interfaces where a multi-step natural language instruction is grounded as a
sequence of executable actions on the user interfaces. Such a capability allows a
computational agent to automatically execute a multi-step task on behalf of the user,
which is valuable for accessibility and UI automation in general. In the second project,
we discussed natural language generation for user interface elements such that these
elements can be announced and communicated to mobile users. Both projects address
important interaction scenarios where natural language is an essential communication
medium of mobile user experiences. From these projects, we showcase research
strategies at the intersection of Natural Language Processing (NLP) and Human–
Computer Interaction (HCI), where we formulated novel machine learning tasks,
created new datasets and deep learning models, and made these resources available
to the public. These provide useful benchmarks for future research at the intersection
of NLP and HCI.
Acknowledgements We would like to thank Jiacong He, Yuan Zhang, Jason Baldridge, Song
Wang, Justin Cui, Christina Ou, Luheng He, Jingjie Zheng, Hong Li, Zhiwei Guan, Ashwin Kakarla,
and Muqthar Mohammad who contributed to these projects.
References
1. Anderson, P., Fernando, B., Johnson, M., and Gould, S. SPICE: semantic propositional image
caption evaluation. CoRR abs/1607.08822 (2016)
2. Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align
and translate. CoRR abs/1409.0473 (2014)
3. Branavan, S., Zettlemoyer, L., and Barzilay, R. Reading between the lines: Learning to map
high-level instructions to commands. In Proceedings of the 48th Annual Meeting of the Asso-
ciation for Computational Linguistics, Association for Computational Linguistics (Uppsala,
Sweden, July 2010), 1268–1277
4. Branavan, S. R. K., Chen, H., Zettlemoyer, L. S., and Barzilay, R. Reinforcement learning
for mapping instructions to actions. In Proceedings of the Joint Conference of the 47th Annual
Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing
of the AFNLP: Volume 1 - Volume 1, ACL ’09, Association for Computational Linguistics
(Stroudsburg, PA, USA, 2009), 82–90
5. Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Dollár, P., and Zitnick, C. L. Microsoft COCO captions: Data collection and evaluation server. CoRR abs/1504.00325 (2015)
6. Deka, B., Huang, Z., Franzen, C., Hibschman, J., Afergan, D., Li, Y., Nichols, J., and Kumar,
R. Rico: A mobile app dataset for building data-driven design applications. In Proceedings
of the 30th Annual ACM Symposium on User Interface Software and Technology, UIST ’17,
ACM (New York, NY, USA, 2017), 845–854
7. Denkowski, M., and Lavie, A. Meteor universal: Language specific translation evaluation for
any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation,
Association for Computational Linguistics (Stroudsburg, PA, USA, 2014), 376–380
8. Gur, I., Rueckert, U., Faust, A., and Hakkani-Tur, D. Learning to navigate the web. In Inter-
national Conference on Learning Representations (2019)
9. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. CoRR abs/1512.03385 (2015)
10. Hochreiter, S., and Schmidhuber, J. Long short-term memory. Neural Computation 9, 8 (Nov. 1997), 1735–1780
11. Lee, K., He, L., Lewis, M., and Zettlemoyer, L. End-to-end neural coreference resolution. In
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing,
Association for Computational Linguistics (Copenhagen, Denmark, Sept. 2017), 188–197
12. Lee, K., Kwiatkowski, T., Parikh, A. P., and Das, D. Learning recurrent span representations
for extractive question answering. CoRR abs/1611.01436 (2016)
13. Li, Y., He, J., Zhou, X., Zhang, Y., and Baldridge, J. Mapping natural language instructions to
mobile ui action sequences. In ACL 2020: Association for Computational Linguistics (2020)
14. Li, Y., Kaiser, L., Bengio, S., and Si, S. Area attention. In Proceedings of the 36th Interna-
tional Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97 of
Proceedings of Machine Learning Research, PMLR (Long Beach, California, USA, 09–15 Jun
2019), 3846–3855
15. Li, Y., Li, G., He, L., Zheng, J., Li, H., and Guan, Z. Widget captioning: Generating natural lan-
guage description for mobile user interface elements. In Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing (EMNLP), Association for Computational
Linguistics (Online, Nov. 2020), 5495–5510
16. Lin, C.-Y., and Och, F. J. Orange: A method for evaluating automatic evaluation metrics for
machine translation. In Proceedings of the 20th International Conference on Computational
Linguistics, COLING ’04, Association for Computational Linguistics (Stroudsburg, PA, USA,
2004)
17. Lin, T., Maire, M., Belongie, S. J., Bourdev, L. D., Girshick, R. B., Hays, J., Perona, P.,
Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: common objects in context.
CoRR abs/1405.0312 (2014)
18. Liu, E. Z., Guu, K., Pasupat, P., Shi, T., and Liang, P. Reinforcement learning on web interfaces
using workflow-guided exploration. In International Conference on Learning Representations
(ICLR) (2018)
19. Luong, T., Pham, H., and Manning, C. D. Effective approaches to attention-based neural
machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natu-
ral Language Processing, Association for Computational Linguistics (Lisbon, Portugal, Sept.
2015), 1412–1421
20. Niepert, M., Ahmed, M., and Kutzkov, K. Learning convolutional neural networks for graphs.
In Proceedings of The 33rd International Conference on Machine Learning, M. F. Balcan and
K. Q. Weinberger, Eds., vol. 48 of Proceedings of Machine Learning Research, PMLR (New
York, New York, USA, 20–22 Jun 2016), 2014–2023
21. Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. BLEU: a method for automatic evaluation
of machine translation. In Proceedings of the 40th Annual Meeting on Association for Compu-
tational Linguistics, ACL ’02, Association for Computational Linguistics (USA, July 2002),
311–318
22. Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., and Tran, D. Image
transformer. In Proceedings of the 35th International Conference on Machine Learning, J. Dy
and A. Krause, Eds., vol. 80 of Proceedings of Machine Learning Research, PMLR (Stock-
holmsmässan, Stockholm Sweden, 10–15 Jul 2018), 4055–4064
23. Pennington, J., Socher, R., and Manning, C. D. Glove: Global vectors for word representation.
In Empirical Methods in Natural Language Processing (EMNLP) (2014), 1532–1543
24. Ramshaw, L., and Marcus, M. Text chunking using transformation-based learning. In Third
Workshop on Very Large Corpora (1995)
25. Ross, A. S., Zhang, X., Fogarty, J., and Wobbrock, J. O. Epidemiology as a framework for large-
scale mobile application accessibility assessment. In Proceedings of the 19th International ACM
SIGACCESS Conference on Computers and Accessibility, ASSETS ’17, ACM (New York, NY,
USA, 2017), 2–11
26. Ross, A. S., Zhang, X., Fogarty, J., and Wobbrock, J. O. Examining image-based button label-
ing for accessibility in android apps through large-scale analysis. In Proceedings of the 20th
International ACM SIGACCESS Conference on Computers and Accessibility, ASSETS ’18,
ACM (New York, NY, USA, 2018), 119–130
27. Sarsenbayeva, Z. Situational impairments during mobile interaction. In Proceedings of the
ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (2018), 498–503
28. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and
Polosukhin, I. Attention is all you need. CoRR abs/1706.03762 (2017)
29. Vedantam, R., Zitnick, C. L., and Parikh, D. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2015), 4566–4575
30. Vinyals, O., Fortunato, M., and Jaitly, N. Pointer networks. In Advances in Neural Information
Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett,
Eds. Curran Associates, Inc., 2015, 2692–2700
31. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A. C., Salakhutdinov, R., Zemel, R. S., and
Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. CoRR
abs/1502.03044 (2015)
Demonstration + Natural Language:
Multimodal Interfaces for GUI-Based
Interactive Task Learning Agents
Abstract We summarize our past five years of work on designing, building, and
studying Sugilite, an interactive task learning agent that can learn new tasks and rel-
evant associated concepts interactively from the user’s natural language instructions
and demonstrations leveraging the graphical user interfaces (GUIs) of third-party
mobile apps. Through its multi-modal and mixed-initiative approaches for Human-
AI interaction, Sugilite made important contributions in improving the usability,
applicability, generalizability, flexibility, robustness, and shareability of interactive
task learning agents. Sugilite also represents a new human-AI interaction paradigm
for interactive task learning, where it uses existing app GUIs as a medium for users
to communicate their intents with an AI agent instead of the interfaces for users
to interact with the underlying computing services. In this chapter, we describe the
Sugilite system, explain the design and implementation of its key features, and
show a prototype in the form of a conversational assistant on Android.
1 Introduction
Interactive task learning (ITL) is an emerging research topic that focuses on enabling
task automation agents to learn new tasks and their corresponding relevant concepts
through natural interaction with human users [69]. This topic is also related to the
concept of end-user development (EUD) for task automation [65, 115]. Work in this
domain includes both physical agents (e.g., robots) that learn tasks that might involve
sensing and manipulating objects in the real world [7, 28], as well as software agents
that learn how to perform tasks through software interfaces [3, 10, 68, 75]. This
chapter focuses on the latter category.
A particularly useful application of ITL is on conversational virtual assistants
(e.g., Apple Siri, Google Assistant) running on mobile phones. With the widespread
popularity of mobile apps, users are utilizing them to complete a wide variety of
tasks [27, 150]. These apps interact with users through graphical user interfaces
(GUIs), where users usually provide inputs by direct manipulation, and read outputs
from the GUI display. Most GUIs are designed with usability in mind, providing
non-expert users low learning barriers to commonly-used computing tasks. App
GUIs also often follow certain design patterns that are familiar to users, which helps
them easily navigate around GUI structures to locate the desired functionalities [2,
38, 130].
However, GUI-based mobile apps have several limitations. First, performing tasks
on GUIs can be tedious. For example, the current version of the Starbucks app
on Android requires 14 taps to order a cup of venti Iced Cappuccino with skim
milk, and even more if the user does not have the account information stored. For
such tasks, users would often like to have them automated [6, 105, 127]. Second,
direct manipulation of GUIs is often not feasible or convenient in some contexts.
Third, many tasks require coordination among many apps. But nowadays, data often
remain siloed in individual apps [29]. Lastly, while some app GUIs provide certain
mechanisms of personalization (e.g., remembering and pre-filling the user’s home
location), they are mostly hard-coded. Users have few means of creating customized
rules and specifying personalized task parameters to reflect their preferences beyond
what the app developers have explicitly designed for.
Recently, intelligent agents have become popular solutions to the limitations
of GUIs. They can be activated by speech commands to perform tasks on the user’s
behalf [102]. This interaction style allows the user to focus on the high-level spec-
ification of the task while the agent performs the low-level actions, as opposed to
the usual direct manipulation GUI in which the user must select the correct objects,
execute the correct operations, and control the environment [25, 135]. Compared
with traditional GUIs, intelligent agents can reduce user burden when dealing with
repetitive tasks, and alleviate redundancy in cross-app tasks. The speech modality in
intelligent agents can support hands-free contexts when the user is physically away
from the device, cognitively occupied by other tasks (e.g., driving), or on devices
with little or no screen space (e.g., wearables) [101]. The improved expressiveness
in natural language also affords more flexible personalization in tasks.
Nevertheless, current prevailing intelligent agents have limited capabilities. They
invoke underlying functionalities by directly calling back-end services. Therefore,
agents need to be specifically programmed for each supported application and ser-
vice. By default, they can only invoke built-in apps (e.g., phone, message, calendar,
music) and some integrated external apps and web services (e.g., web search, weather,
Wikipedia), lacking the capability of controlling arbitrary third-party apps and ser-
vices. To address this problem, providers of intelligent agents, such as Apple, Google,
and Amazon, have released developer kits for their agents, so that the developers of
third-party apps can integrate their apps into the agents to allow the agents to invoke
these apps from user commands. However, such integration requires significant cost
and engineering effort from app developers; therefore, only some of the most popular
tasks in popular apps have been integrated into prevailing intelligent agents so far.
The “long-tail” of tasks and apps have not been supported yet, and will likely not get
supported due to the cost and effort involved.
Prior literature [143] showed that “long-tail” apps make up a significant portion of user app usage. Smartphone users also have highly diverse usage patterns
within apps [150] and wish to have more customizability over how agents perform
their tasks [36]. Therefore, relying on third-party developers’ effort to extend the
capabilities of intelligent agents is not sufficient for supporting diverse user needs.
It is not feasible for end users to develop for new tasks in prevailing agents on their
own either, due to (i) their lack of the required technical expertise, and (ii) the limited
availability of openly accessible application programming interfaces (APIs) for many
back-end services. Therefore, adding the support for interactive task learning from
end users in intelligent agents is particularly useful.
1Sugilite is named after a purple gemstone, and stands for: Smartphone Users Generating
Intelligent Likeable Interfaces Through Examples.
1.2 Contributions
We argue that a key problem in the ITL process is to facilitate effective Human-
AI collaboration. In the traditional view, programming is viewed as the process of
transforming a user’s existing mental plan into a programming language that the
computer can execute. However, in end-user ITL, this is not an accurate model. The
user often starts with only a vague idea of what to do and needs an intelligent system’s
help to clarify their intents. We view ITL as a joint activity where the user and the
agent share the same goal in a human-AI collaboration framework. In such mixed-
initiative interactions, the user’s goals and inputs come with uncertainty [4, 51]. The
agent needs to show guesses of user goals, assist the user to provide more effective
inputs, and engage in multi-turn dialogs with the user to resolve any uncertainties
and ambiguities.
Significant progress has been made on this topic in recent years in both AI and HCI.
Specifically on the AI side, advances in natural language processing (NLP) enable
the agents to process users’ instructions of task procedures, conditionals, concepts
definitions, and classifiers in natural language [2, 6, 10], to ground the instructions
(e.g., [12]), and to have dialog with users based on GUI-extracted task models (e.g.,
[11]). Reinforcement learning techniques allow the agent to more effectively explore
action sequences on GUIs to complete tasks [13]. Large GUI datasets such as RICO
[4] allow the analysis of GUI patterns at scale, and the construction of generalized
models for extracting semantic information from GUIs.
The HCI community also has presented new study findings, design implications,
and interaction designs in this domain. A key direction has been the design of multi-
modal interfaces that leverage both natural language instructions and GUI demon-
strations [1, 7]. Prior work also explored how users naturally express their task intents
[10, 15, 17] and designed new interfaces to guide the users to provide more effective
inputs (e.g., [8]).
On one hand, AI-centric task flow exploration and program synthesis techniques
often lack transparency for users to understand the internal process, and they pro-
vide the users with little control over the task fulfillment process to reflect their
personal preferences. On the other hand, machine intelligence is desired because the
users’ instructions are often incomplete, vague, ambiguous, or even incorrect. There-
fore, the system needs to provide adequate assistance to guide the users to provide
effective inputs to express their intents, while retaining the users’ agency, trust, and
control of the process. While relevant design principles have been discussed in early
foundational works in mixed-initiative interaction [5] and demonstrational interfaces
[16], incorporating these ideas into the design and implementation of actual systems
remains an important challenge.
A crucial factor in human-AI collaboration is the medium. Sugilite presents a
new human-AI interaction paradigm for interactive task learning, where it uses the
GUIs of the existing third-party mobile apps as the medium for users to communicate
their intents with an AI agent instead of the interfaces for users to interact with the
underlying computing services. Among common mediums for agent task learning,
app GUIs sit at a nice middle ground between (1) programming language, which can
be easily processed by a computing system but imposes significant learning barri-
ers to non-expert users; and (2) unconstrained visual demonstrations in the physical
world and natural language instructions, which are natural and easy to use for users
but infeasible for computing systems to fully understand without significant human-
annotated training data and task domain restrictions, given the current state of the art.
3 Related Work
can only replay exactly the same procedure that was demonstrated, without the abil-
ity to generalize the demonstration to perform similar tasks. They are also brittle
to any UI changes in the app. Sikuli [147], VASTA [132], and Hilc [57] used the
visual features of GUI entities to identify the target entities for actions—while this
approach has some advantages over Sugilite’s approach, such as being able to work
with graphic entities without textual labels or other appropriate identifiers, the visual
approach does not use the semantics of GUI entities, which also limits its generaliz-
ability.
In human–robot interaction, PBD is often used in interactive task learning where
a robot learns new tasks and procedures from the user’s demonstration with physical
objects [7, 20, 45, 64]. The demonstrations are sometimes also accompanied by nat-
ural language instructions [112, 133], similar to Sugilite. While much recent work has been done to enhance computing systems’ capabilities for parsing human activities (e.g., [126]), modeling human intents (e.g., [44]), and representing knowledge (e.g., [149]) from visual information in the physical world, it remains a major
AI challenge to recognize, interpret, represent, learn from, and reason with visual
demonstrations. In comparison, Sugilite avoids this grand challenge by using exist-
ing app GUIs as the alternative medium for task instruction, which retains the user
familiarity, naturalness, and domain generality of visual demonstration but is much
easier to comprehend for a computing system.
Sugilite uses natural language as one of the two primary modalities for end users
to program task automation scripts. The idea of using natural language inputs for
programming has been explored for decades [11, 18, 95, 109]. In NLP and AI
communities, this approach is also known as learning by instruction [10, 33, 68, 96].
The foremost challenge in supporting natural language programming is to deal
with the inherent ambiguities and vagueness in natural language [141]. To address
this challenge, a prior approach was to require users to use similar expression styles
that resembled conventional programming languages (e.g., [11, 77, 125]), so that
the system could directly translate user instructions into code. Although the user instructions used in this approach seemed like natural language, it did not allow much flexibility in expression. This approach is not adequate for end-user development,
because it has a high learning barrier for users without programming expertise—users
have to adapt to the system by learning new syntax, keywords, and structures.
Another approach for handling ambiguities and vagueness in natural language
inputs is to seek user clarification through conversations. For example, Iris [43] asks follow-up questions and presents possible options through conversations when ini-
tial user inputs are incomplete or unclear. This approach lowered the learning barrier
for end users, as it did not require them to clearly define everything up front. It
also allowed users to form complex commands by combining multiple natural lan-
guage instructions in conversational turns under the guidance of the system. This
Multi-modal interfaces process two or more user input modes in a coordinated manner
to provide users with greater expressive power, naturalness, flexibility, and porta-
bility [120]. Sugilite combines speech and touch to enable a “speak and point”
interaction style, which has been studied since early multi-modal systems like Put-
that-there [23]. Prior systems such as CommandSpace [1], Speechify [60], Quick-
Set [121], SMARTBoard [113], and PixelTone [70] investigated multi-modal inter-
faces that can map coordinated natural language instructions and GUI gestures to
system commands and actions. In programming, similar interaction styles have also
been used for controlling robots (e.g., [55, 104]). But the use of these systems is limited to specific first-party apps and task domains, in contrast to Sugilite, which
aims to be general-purpose.
When Sugilite addresses the data description problem (details in Sect. 5.2),
demonstration is the primary modality; verbal instructions are used for disambiguat-
ing demonstrated actions. A key pattern used in Sugilite is mutual disambigua-
tion [119]. When the user demonstrates an action on the GUI with a simultaneous
verbal instruction, our system can reliably detect what the user did and on which UI
object the user performed the action. The demonstration alone, however, does not
explain why the user performed the action, and any inferences on the user’s intent
would be fundamentally unreliable. Similarly, from verbal instructions alone, the
system may learn about the user’s intent, but grounding it onto a specific action may
be difficult due to the inherent ambiguity in natural language. Sugilite utilizes these
complementary inputs to infer robust and generalizable scripts that can accurately
represent user intentions in PBD. A similar multi-modal approach has been used
for handling ambiguities in recognition-based interfaces [103], such as correcting
speech recognition errors [138] and assisting the recognition of pen-based handwrit-
ing [67]. The recent DoThisHere [144] system uses a similar multi-modal interface
for cross-app data query and transfer between multiple mobile apps.
In the parameterization and concept teaching components of Sugilite, the nat-
ural language instructions come first. During parameterization, the user first verbally describes the task and then demonstrates it, from which Sugilite infers the parameters in the initial verbal instruction and their corresponding possible values. In
concept teaching, the user starts with describing an automation rule at a high-level in
natural language, and then recursively defines any ambiguous or vague concepts by
referring to app GUIs. Sugilite’s approach builds upon prior work like Plow [3],
which uses user verbal instructions to hint possible parameters, to further explore
how GUI and GUI-based demonstrations can help enhance natural language inputs.
Some prior techniques specifically focus on the visual aspect of GUIs. The Rico
dataset [38] shows that it is feasible to train a GUI layout embedding with a large
screen corpus, and retrieve screens with similar layouts using such embeddings.
Chen et al.’s work [31] and Li et al.’s work [91] show that trained machine learning
models can generate semantically meaningful natural language descriptions for GUI
components based on their visual appearances and hierarchies. Compared with them,
the Screen2Vec method (Sect. 5.6) used in Sugilite provides a more holistic
representation of GUI screens by encoding textual content, GUI component class
types, and app-specific metadata in addition to the visual layout.
Another category of work in this area focuses on predicting GUI actions for
completing a task objective. Pasupat et al.’s work [122] maps the user’s natural
language commands to target elements on web GUIs. Li et al.’s work [90] goes a
step further by generating sequences of actions based on natural language commands.
These works use a supervised approach that requires a large amount of manually annotated training data, which limits their applicability. In comparison, the Screen2Vec
method used in Sugilite uses a self-supervised approach that does not require any
manual data annotation of user intents and tasks. Screen2Vec also does not need
any annotation on the GUI screens themselves, unlike [148] which requires additional
developer annotations for the metadata of GUI components.
Sugilite faces a unique challenge—in Sugilite, the user talks about the under-
lying task of an app in natural language while making references to the app’s GUI.
The system needs to have sufficient understanding about the content of the app GUI
to be able to handle these verbal instructions to learn the task. Therefore, the goal
of Sugilite in understanding app interfaces is to abstract the semantics of GUIs
from their platform-specific implementations, while being sufficiently aligned with
the semantics of users’ natural language instructions, so that it can leverage the GUI
representation to help understanding the user’s instruction of the underlying task.
4 System Overview
We present the prototype of a new task automation agent named Sugilite.2 This
prototype integrates and implements the results from several of our prior research
works [80–82, 84, 86–89]. The implementation of our system is also open-sourced
on GitHub.3 This section explains how Sugilite learns new tasks and concepts from
the multi-modal interactive instructions from the users.
The user starts with speaking a command. The command can describe either an
action (e.g., “check the weather”) or an automation rule with a condition (e.g., “If
it is hot, order a cup of Iced Cappuccino”). If the agent has no prior knowledge of any of the involved task domains, it will recursively resolve the unknown concepts and procedures used in the command.

Fig. 1  An example dialog structure while Sugilite learns a new task that contains a conditional and new concepts. The numbers indicate the sequence of the utterances. The screenshot on the right shows the conversational interface during these steps

Although it does not know these concepts, it can recognize the structure of the command (e.g., a conditional), and
parse each part of the command into the corresponding typed resolve functions, as
shown in Fig. 1. Sugilite uses a grammar-based executable semantic parsing archi-
tecture [92]; therefore, its conversation flow operates on the recursive execution of the
resolve functions. Since the resolve functions are typed, the agent can generate
prompts based on their types (e.g., “How do I tell whether…” for resolveBool
and “How do I find out the value for…” for resolveValue).
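To make this concrete, the following Python sketch (our illustration, not Sugilite’s actual implementation; the type names and prompt templates are assumptions based on the description above) shows how typed resolve functions could drive the prompts in the dialog of Fig. 1.

from dataclasses import dataclass

# Illustrative prompt templates keyed by the type of the resolve function.
PROMPT_TEMPLATES = {
    "Bool": "How do I tell whether {name}?",             # e.g., resolveBool
    "Value": "How do I find out the value for {name}?",  # e.g., resolveValue
    "Procedure": "How do I {name}?",                     # assumed name for procedure resolution
}

@dataclass
class Resolve:
    type_name: str   # "Bool", "Value", or "Procedure"
    name: str        # the unknown concept or procedure mentioned by the user

    def prompt(self) -> str:
        # Generate a follow-up question purely from the function's type.
        return PROMPT_TEMPLATES[self.type_name].format(name=self.name)

# "If it is hot, order a cup of Iced Cappuccino" parses into two unknown parts,
# which the agent resolves recursively, prompting the user for each one.
for part in [Resolve("Bool", "it is hot"),
             Resolve("Procedure", "order a cup of Iced Cappuccino")]:
    print(part.prompt())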
When the Sugilite agent reaches the resolve function for a value query or
a procedure, it asks the users if they can demonstrate them. The users can then
demonstrate how they would normally look up the value, or perform the procedure
manually with existing mobile apps on the phone by direct manipulation (Fig. 3a).
For any ambiguous demonstrated action, the user verbally explains the intent behind
the action through multi-turn conversations with the help of an interaction proxy
overlay that guides the user to focus on providing more effective input (see Fig. 3,
more details in Sect. 5.2). When the user demonstrates a value query (e.g., finding out
the value of the temperature), Sugilite highlights the GUI elements showing values
with the compatible types (see Fig. 2) to assist the user in finding the appropriate
GUI element during the demonstration.
All user-instructed value concepts, Boolean concepts, and procedures automati-
cally get generalized by Sugilite. The procedures are parameterized so that they can
be reused with different parameter values in the future. For example, for Utterance
8 in Fig. 1, the user does not need to demonstrate again since the system can invoke
the newly-learned order_Starbucks function with a different parameter value
(details in Sect. 5.3). The learned concepts and value queries are also generalized so
that the system recognizes the different definitions of concepts like “hot” and value
queries like “temperature” in different contexts (details in Sect. 5.4).
5 Key Features
Sugilite allows users to use demonstrations to teach the agent any unknown proce-
dures and concepts in their natural language instructions. As discussed earlier, a major
challenge in ITL is that understanding natural language instructions and carrying out
the tasks accordingly require having knowledge in the specific task domains. Our use
of programming by demonstration (PBD) is an effective way to address this “out-of-
domain” problem in both the task fulfillment and the natural language understanding
processes [85]. In Sugilite, procedural actions are represented as sequences of GUI
operations, and declarative concepts can be represented as references to GUI content.
This approach supports ITL for a wide range of tasks—virtually anything that can
be performed with one or more existing third-party mobile apps.
Our prior study [88] also found that the availability of app GUI references can
result in end users providing clearer natural language commands. In one study where
we asked participants to instruct an intelligent agent to complete everyday computing
tasks in natural language, the participants who saw screenshots of relevant apps
used fewer unclear, vague, or ambiguous concepts in their verbal instructions than
those who did not see the screenshots. By using demonstrations in natural language
instructions, our multi-modal approach also makes understanding the user’s natural
language instructions easier by naturally constraining the user’s expressions.
A major limitation of demonstrations is that they are too literal, and are, therefore,
brittle to any changes in the task context. They encapsulate what the user did, but not
why the user did it. When the context changes, the agent often may not know what
to do, due to this lack of understanding of the user intents behind their demonstrated
actions. This is known as the data description problem in the PBD community, and it
is regarded as a key problem in PBD research [37, 94]. For example, just looking at
the action shown in Fig. 3a, one cannot tell if the user meant “the restaurant with the
most reviews”, “the promoted restaurant”, “the restaurant with 1,000 bonus points”,
“the cheapest Steakhouse”, or any other criteria, so the system cannot generate a
description for this action that accurately reflects the user’s intent. A prior approach
is to ask for multiple examples from the users [106], but this is often not feasible
due to the user’s inability to come up with useful and complete examples, and the
number of examples required for complex tasks [74, 116].
Sugilite’s approach is to ask users to verbally explain their intent for the demon-
strated actions using speech. Our formative study [84] with 45 participants found that
end users were able to provide useful and generalizable explanations for the intents
of demonstrated actions. They also commonly used semantic references to GUI content
in their utterances (e.g., “the close by restaurant” for an entry showing the text
“596 ft”) and implicit spatial references (e.g., “the score for Lakers” for a text object
that contains a numeric value and is right-aligned to another text object “Lakers”).
Based on these findings, we designed and implemented a multi-modal mixed-
initiative intent clarification mechanism for demonstrated actions. As shown in Fig. 3,
the user describes their intention in natural language, and iteratively refines the
descriptions to remove ambiguity with the help of an interactive overlay (Fig. 3d).
The overlay highlights the result from executing the current data description query,
and helps the user focus on explaining the key differences between the target object
(highlighted in red) and the false positives (highlighted in yellow) of the query.
Fig. 3 The screenshots of Sugilite’s demonstration mechanism and its multi-modal mixed-
initiative intent clarification process for the demonstrated actions
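The following Python sketch illustrates this clarification loop under simplified assumptions: the dictionary-based query representation and the restaurant entries are hypothetical stand-ins for Sugilite’s graph queries and GUI data.

def execute_query(elements, query):
    # Return all GUI elements whose attributes satisfy every query constraint.
    return [e for e in elements if all(e.get(k) == v for k, v in query.items())]

def clarification_feedback(elements, query, demonstrated_target):
    # Split the query results into the demonstrated target (highlighted in red)
    # and the false positives (highlighted in yellow) that still need explaining.
    results = execute_query(elements, query)
    target = [e for e in results if e is demonstrated_target]
    false_positives = [e for e in results if e is not demonstrated_target]
    return target, false_positives

restaurants = [
    {"name": "Steak 48", "category": "Steakhouse", "reviews": 1200},
    {"name": "Prime Rib Co.", "category": "Steakhouse", "reviews": 340},
]
demonstrated = restaurants[1]
target, false_positives = clarification_feedback(
    restaurants, {"category": "Steakhouse"}, demonstrated)
print(len(target), len(false_positives))   # 1 target, 1 false positive: ask the user to refine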
To ground the user’s natural language explanations about GUI elements, Sugilite
represents each GUI screen as a UI snapshot graph. This graph captures the GUI
elements’ text labels, meta-information (including screen position, type, and package
name), and the spatial (e.g., nextTo), hierarchical (e.g., hasChild), and semantic
relations (e.g., containsPrice) among them (Fig. 4). A semantic parser translates
the user’s explanation into a graph query on the UI snapshot graph, executes it on
the graph, and verifies if the result matches the correct entity that the user originally
demonstrated. The goal of this process is to generate a query that uniquely matches
the target UI element and also reflects the user’s underlying intent.
Fig. 4 Sugilite’s instruction parsing and grounding process for intent clarifications illustrated on
an example UI snapshot graph constructed from a simplified GUI snippet
Appinite also recognizes some semantic information from the raw strings found in
the GUI to support grounding the user’s high-level linguistic inputs (e.g., “item with
the lowest price”). To achieve this, Appinite applies a pipeline of data extractors
on each string entity in the graph to extract structured data (e.g., phone number,
email address) and numerical measurements (e.g., price, distance, time, duration),
and saves them as new entities in the graph. These new entities are connected to the
original string entities by contains relations (e.g., containsPrice). Values
in each category of measurements are normalized to the same units so they can be
directly compared, allowing flexible computation, filtering, and aggregation.
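As a rough illustration of this pipeline (not Appinite’s actual code; the entity identifiers, relation names, and the single price extractor below are simplified assumptions), a UI snapshot graph can be modeled as entities plus typed relations, with extracted measurements attached via contains relations:

import re
from collections import defaultdict

class UISnapshotGraph:
    def __init__(self):
        self.entities = {}                  # entity id -> properties (text, class, bounds, ...)
        self.relations = defaultdict(set)   # (subject id, relation name) -> set of objects/values

    def add_entity(self, eid, **props):
        self.entities[eid] = props

    def add_relation(self, subject, relation, obj):
        self.relations[(subject, relation)].add(obj)

    def extract_prices(self):
        # For each string entity, extract a normalized price (in dollars) and attach
        # it with a containsPrice relation, mirroring the extractor pipeline above.
        for eid, props in self.entities.items():
            match = re.search(r"\$(\d+(?:\.\d+)?)", props.get("text", ""))
            if match:
                self.add_relation(eid, "containsPrice", float(match.group(1)))

graph = UISnapshotGraph()
graph.add_entity("item_1", text="Iced Cappuccino  $4.25", cls="TextView")
graph.add_entity("item_2", text="Caffe Latte  $3.95", cls="TextView")
graph.add_relation("item_1", "nextTo", "item_2")
graph.extract_prices()
# "the item with the lowest price" can now be grounded by comparing containsPrice values.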
5.2.2 Parsing
Our semantic parser uses a Floating Parser architecture [123] and is implemented
with the SEMPRE framework [16]. We represent UI snapshot graph queries in a
simple but flexible LISP-like query language (S-expressions) that can represent joins,
conjunctions, superlatives and their compositions, constructed by the following 7
grammar rules:
where Q is the root non-terminal of the query expression, e is a terminal that rep-
resents a UI object entity, r is a terminal that represents a relation, and the rest of
the non-terminals are used for intermediate derivations. Sugilite’s language forms a
subset of a more general formalism known as Lambda Dependency-based Composi-
tional Semantics [93], which is a notationally simpler alternative to lambda calculus
that is particularly well-suited for expressing queries over knowledge graphs. More
technical details and the user evaluation are discussed in [84].
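To convey the flavor of such queries, the toy Python reader and evaluator below handles one illustrative superlative form over a two-entity screen; the query syntax, relation name, and data are assumptions for illustration only and do not reproduce Sugilite’s actual grammar.

# A toy S-expression query over a tiny screen: select the entity with the lowest price.
ENTITIES = {
    "item_1": {"text": "Iced Cappuccino", "containsPrice": 4.25},
    "item_2": {"text": "Caffe Latte", "containsPrice": 3.95},
}

def tokenize(s):
    return s.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    # Recursively build a nested list from the token stream.
    tok = tokens.pop(0)
    if tok == "(":
        expr = []
        while tokens[0] != ")":
            expr.append(parse(tokens))
        tokens.pop(0)
        return expr
    return tok

def evaluate(expr):
    # (argmin containsPrice) -> the entity id whose containsPrice value is minimal
    if isinstance(expr, list) and expr[0] == "argmin":
        relation = expr[1]
        candidates = {e: p[relation] for e, p in ENTITIES.items() if relation in p}
        return min(candidates, key=candidates.get)
    raise ValueError(f"unsupported expression: {expr}")

query = parse(tokenize("(argmin containsPrice)"))
print(evaluate(query))   # -> "item_2", i.e., "the item with the lowest price"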
Another way Sugilite leverages GUI groundings in the natural language instructions
is to infer task parameters and their possible values. This allows the agent to learn
generalized procedures (e.g., to order any kind of beverage from Starbucks) from a
demonstration of a specific instance of the task (e.g., ordering an iced cappuccino).
Sugilite achieves this by comparing the user utterance (e.g., “order a cup of
iced cappuccino”) against the data descriptions of the target UI elements (e.g., click
on the menu item that has the text “Iced Cappuccino”) and the arguments (e.g., put
“Iced Cappuccino” into a search box) of the demonstrated actions for matches. This
process grounds different parts in the utterances to specific actions in the demon-
strated procedure. It then analyzes the hierarchical structure of GUI at the time of
demonstration, and looks for alternative GUI elements that are in parallel to the orig-
inal target GUI element structurally. In this way, it extracts the other possible values
for the identified parameter, such as the names of all the other drinks displayed in
the same menu as “Iced Cappuccino”.
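A minimal sketch of this matching step is shown below, assuming string containment as a (much simplified) matching heuristic and hypothetical menu data.

def find_parameter(utterance, demonstrated_value, sibling_texts):
    # If a value used in a demonstrated action also appears in the utterance,
    # treat it as a parameter and collect alternatives from parallel GUI elements.
    if demonstrated_value.lower() in utterance.lower():
        return {
            "parameter_value": demonstrated_value,
            "possible_values": sibling_texts,   # e.g., the other drinks on the same menu
        }
    return None

print(find_parameter(
    "order a cup of iced cappuccino",
    "Iced Cappuccino",
    ["Iced Cappuccino", "Caffe Latte", "Caramel Macchiato"],
))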
The extracted sets of possible parameter values are also used for disambiguating
the procedures to invoke, such as invoking the order_Starbucks procedure
for the command “order a cup of latte”, but invoking the order_PapaJohns
procedure for the command “order a cheese pizza.”
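The sketch below illustrates this disambiguation heuristic; the procedure names come from the running example, while the value sets and the containment-based matching are simplified assumptions.

PROCEDURES = {
    "order_Starbucks": {"latte", "iced cappuccino", "caramel macchiato"},
    "order_PapaJohns": {"cheese pizza", "pepperoni pizza"},
}

def choose_procedure(command):
    # Pick the learned procedure whose extracted parameter values match the command.
    command = command.lower()
    for name, values in PROCEDURES.items():
        if any(v in command for v in values):
            return name
    return None

print(choose_procedure("order a cup of latte"))    # -> order_Starbucks
print(choose_procedure("order a cheese pizza"))    # -> order_PapaJohns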
For Boolean concepts, Sugilite assumes that the type of the Boolean operation
and the types of the arguments stay the same, but the arguments themselves may dif-
fer. For example, for the concept “hot” in Fig. 1, it should still mean that a temperature
(of something) is greater than another temperature. But the two in comparison can be
different constants, or from different value queries. For example, suppose after the
interactions in Fig. 1, the user instructs a new rule “if the oven is hot, start the cook
timer.” Pumice can recognize that “hot” is a concept that has been instructed before in
a different context, so it asks “I already know how to tell whether it is hot when deter-
mining whether to order a cup of Iced Cappuccino. Is it the same here when determin-
ing whether to start the cook timer?” After responding “No”, the user can instruct
how to find out the temperature of the oven, and the new threshold value for the
condition “hot” either by instructing a new value concept, or using a constant value.
The generalization mechanism for value concepts works similarly. Pumice sup-
ports value concepts that share the same name to have different query implemen-
tations for different task contexts. For example, following the “if the oven is hot,
start the cook timer” example, suppose the user defines “hot” for this new context as
“The temperature is above 400 degrees.” Pumice realizes that there is already a value
concept named “temperature”, so it will ask “I already know how to find out the value
for temperature using the Weather app. Should I use that for determining whether
the oven is hot?”, to which the user can say “No” and then demonstrate querying the
temperature of the oven using the corresponding app (assuming the user has a smart
oven with an in-app display of its temperature).
This mechanism allows learned concepts like “hot” to be reused at three differ-
ent levels: (i) exactly the same (e.g., the temperature of the weather is greater than
85°F); (ii) with a different threshold (e.g., the temperature of the weather is greater
than x); and (iii) with a different value query (e.g., the temperature of something else
is greater than x).
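The three reuse levels can be pictured with the following Python sketch, in which the value queries and thresholds are hypothetical placeholders for what the user would teach through the corresponding apps.

from dataclasses import dataclass
from typing import Callable

@dataclass
class BoolConcept:
    name: str
    value_query: Callable[[], float]   # e.g., weather temperature vs. oven temperature
    threshold: float

    def evaluate(self) -> bool:
        # The comparison structure stays fixed; the query and threshold may be rebound.
        return self.value_query() > self.threshold

weather_temperature = lambda: 88.0    # value query taught with the Weather app (placeholder)
oven_temperature = lambda: 425.0      # value query taught with a smart-oven app (placeholder)

hot_for_coffee = BoolConcept("hot", weather_temperature, 85.0)   # original definition
hot_for_oven = BoolConcept("hot", oven_temperature, 400.0)       # same name, new context
print(hot_for_coffee.evaluate(), hot_for_oven.evaluate())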
Fig. 5 The interface of Sovite: a Sovite shows an app GUI screenshot to communicate its state
of understanding. The yellow highlight overlay specifies the task slot value. The user can drag the
overlay to fix slot value errors. b To fix intent detection errors, the user can refer to an app that
represents their desired task. Sovite will match the utterance to an app on the phone (with its icon
shown), and look for intents that use or are relevant to this app. c If the intent is still ambiguous
after referring to an app, the user can show a specific app screen relevant to the desired task
reported similar findings of the types of breakdowns encountered by users and the
common repair strategies. In a taxonomy of conversational breakdown repair strate-
gies by Ashktorab et al. [8], repair strategies can be categorized into dimensions of:
(1) whether there is evidence of breakdown (i.e., whether the system makes users
aware of the breakdown); (2) whether the system attempts to repair (e.g., provide
options of potential intents), and (3) whether assistance is provided for user self-
repair (e.g., highlight the keywords that contribute to the intent classifier’s decision).
Among them, the most preferred option by the users was to have the system attempt
to help with the repair process by providing options of potential intents. However,
as discussed, this approach requires domain-specific “deep knowledge” about the
task and error handling flows manually programmed by the developers [5, 107],
and therefore, is not practical for user-instructed tasks. The second most preferred
strategy in [8] was for the system to provide more transparency into the cause of the
breakdown, such as highlighting the keywords that contribute to the results.
Informed by these results, we developed Sovite,4 a new interface for Sugilite
that helps users discover, identify the causes of, and recover from conversational
breakdowns using an app-grounded multi-modal approach (Fig. 5). Compared with
the domain-specific approaches that require “deep knowledge”, our approach does
not require any additional effort from the developers. It only requires “shallow
knowledge” from a domain-general language model to map user intents to the
corresponding app screens.
4 Sovite is named after a type of rock. It is also an acronym for System for Optimizing Voice
Interfaces to Tackle Errors.
“I don’t understand the command.”), the user can fix the error by indicating the correct
apps and app screens for their desired task.
References to Apps After the user says that the detected intent is incorrect after
seeing the app GUI screenshots, or when the system fails to detect an intent, Sovite
asks the user “What app should I use to perform…[the task]?”, for which the
user can say the name of an app for the intended task (shown in Fig. 5b). Sovite
looks up the collection of all supported task intents for not only the intents that use
this underlying app, but also intents that are semantically related to the supplied app.
References to App Screens In certain situations, the user’s intent can still be ambigu-
ous after the user indicates the name of an app; there can be multiple intents asso-
ciated with the app (for example, if the user specifies “Expedia” which can be used
for booking flights, cruises, or rental cars), or there can be no supported task intent
in the user-provided app and no intent that meets the threshold of being sufficiently
“related” to the user-provided app. In these situations, Sovite will ask the user a
follow-up question “Can you show me which screen in…[the app] is most rel-
evant to…[the task]?” (shown in Fig. 5c). Sovite then launches the app and
asks the user to navigate to the target screen in the app. Sovite then finds intents
that are the most semantically related to this app screen among the ambiguous ones,
or asks the user to teach it a new one by demonstration.
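The following sketch illustrates the overall re-ranking idea rather than Sovite’s implementation: the intent descriptions are invented, and a toy token-overlap score stands in for the semantic relatedness model used to compare user-referenced apps and screens against supported intents.

def relatedness(text_a, text_b):
    # Toy token-overlap score standing in for a semantic similarity model.
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / max(1, len(a | b))

INTENTS = {
    "book_flight": "Expedia search and book flights",
    "book_rental_car": "Expedia reserve rental cars",
    "order_coffee": "Starbucks order drinks",
}

def rank_intents_by_reference(reference_text):
    # Re-rank supported intents by how related they are to the referenced app or screen.
    return sorted(INTENTS, key=lambda i: relatedness(INTENTS[i], reference_text), reverse=True)

print(rank_intents_by_reference("Expedia"))                       # still ambiguous among Expedia intents
print(rank_intents_by_reference("Expedia flight search screen"))  # the screen reference narrows it down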
Ease of Transition to Out-of-Domain Task Instructions An important advantage
of Sovite’s intent disambiguation approach is that it supports the easy transition to
the user instruction of a new task when the user’s intended task is out of scope.
An effective approach to support handling out of scope errors is programming-
by-demonstration (PBD) [85]. Sovite’s approach can directly connect to the user
instruction mode in Sugilite. Since at this point, Sovite already knows the most
relevant app and app screen for the user’s intended task and how to navigate to this
screen in the app, it can simply ask the user “Can you teach me how to…[the
task] using…[the app] in this screen”, switch back to this screen, and have
the user continue demonstrating the intended task to teach the agent how to fulfill
the previously out of scope task intent. The user may also start over and demonstrate
from scratch if they do not want to start the instruction from this screen.
Design Rationale The main design rationale of supporting intent detection repairs
with app GUI references is to make Sovite’s mechanism of fixing intent detection
errors consistent with how users discover the errors from Sovite’s display of intent
detection results. When users discover the intent detection errors by seeing the wrong
apps or the wrong screens displayed in the confirmation screenshots, the most intu-
itive way for them to fix these errors is to indicate the correct apps and screens that
should be used for the intended tasks. Their references to the apps and the screens
also allow Sovite to extract richer semantic context (e.g., the app store descriptions
and the text labels found on app GUI screens) than having the user simply rephrase
their utterances, helping with finding semantically related task intents.
Fig. 6 Sovite provides multiple ways to fix text-input slot value errors: LEFT: the user can click
the corresponding highlight overlay and change its value by adjusting the selection in the original
utterance, speaking a new value, or just typing in a new value. RIGHT: the user can drag the overlays
on the screenshot to move a value to a new slot, or swap the values between two slots
what the users would do for the intended task, the users can directly fix these incon-
sistencies through simple physical actions such as drag-and-drop and text selection
gestures, and see immediate feedback on the screenshots, which are major advantages
of direct manipulation [134].
With the rise of data-driven computational methods for modeling user interactions
with graphical user interfaces (GUIs), the GUI screens have become not only inter-
faces for human users to interact with the underlying computing services, but also
valuable data sources that encode the underlying task flow, the supported user inter-
actions, and the design patterns of the corresponding apps, which have proven useful
for AI-powered applications. For example, programming-by-demonstration (PBD)
intelligent agents such as [80, 88, 132] use task-relevant entities and hierarchical
structures extracted from GUIs to parameterize, disambiguate, and handle errors in
user-demonstrated task automation scripts. Erica [39] mines a large repository of
mobile app GUIs to enable user interface (UI) designers to search for example design
patterns to inform their own design. Kite [89] extracts task flows from mobile app
GUIs to bootstrap conversational agents.
We present a new self-supervised technique Screen2Vec for generating seman-
tic representations of GUI screens and components using their textual content, visual
design and layout patterns, and app context metadata. Screen2Vec’s approach
is inspired by the popular word embedding method Word2Vec [111], where
the embedding vector representations of GUI screens and components are gener-
ated through the process of training a prediction model. But unlike Word2Vec,
Screen2Vec uses a two-layer pipeline informed by the structures of GUIs and
GUI interaction traces and incorporates screen- and app-specific metadata.
The embedding vector representations produced by Screen2Vec can be used in
a variety of useful downstream tasks such as nearest neighbor retrieval, composability-
based retrieval, and representing mobile tasks. The self-supervised nature of
Screen2Vec allows its model to be trained without any manual data labeling
efforts—it can be trained with a large collection of GUI screens and the user inter-
action traces on these screens such as the Rico [38] dataset.
Screen2Vec addresses an important gap in prior computational
HCI research. The lack of comprehensive semantic representations of GUI screens
and components has been identified as a major limitation in prior work in GUI-based
interactive task learning (e.g., [88, 132]), intelligent suggestive interfaces (e.g., [30]),
assistive tools (e.g., [19]), and GUI design aids (e.g., [72, 139]). Screen2Vec
embeddings can encode the semantics, contexts, layouts, and patterns of GUIs, pro-
viding representations of these types of information in a form that can be easily and
effectively incorporated into popular modern machine learning models.
Fig. 7 The two-level architecture of Screen2Vec for generating GUI component and screen
embeddings. The weights for the steps in teal color are optimized during the training process
a pre-trained Sentence-BERT [128] model. These GUI and layout vectors are com-
bined using a linear layer, resulting in a 768-dimensional vector. After training, the
description embedding vector is concatenated onto it, resulting in the 1536-dimensional
GUI screen embedding vector (if included in the training, the description dominates
the entire embedding, overshadowing information specific to that screen within the
app). The weights in the RNN layer for combining GUI component embeddings and
the weights in the linear layer for producing the final output vector are similarly
trained on a CBOW prediction task on a large number of interaction traces (each
represented as a sequence of screens). For each trace, a sliding window moves over
the sequence of screens. The model tries to use the representation of the context (the
surrounding screens) to predict the screen in the middle.
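A condensed PyTorch sketch of this two-level pipeline is shown below; the dimensions, the GRU choice, and the random placeholder inputs are assumptions for illustration, and the actual model details are in [87].

import torch
import torch.nn as nn

class ScreenEncoder(nn.Module):
    def __init__(self, text_dim=768, layout_dim=64, hidden_dim=768):
        super().__init__()
        # RNN combines the GUI component embeddings of one screen into a single vector.
        self.component_rnn = nn.GRU(text_dim, hidden_dim, batch_first=True)
        # Linear layer fuses the combined component vector with the layout vector.
        self.fuse = nn.Linear(hidden_dim + layout_dim, 768)

    def forward(self, component_vecs, layout_vec):
        # component_vecs: (1, num_components, text_dim); layout_vec: (1, layout_dim)
        _, h = self.component_rnn(component_vecs)
        screen_vec = self.fuse(torch.cat([h[-1], layout_vec], dim=-1))
        return screen_vec                     # 768-d screen embedding (before concatenation)

encoder = ScreenEncoder()
components = torch.randn(1, 12, 768)          # e.g., Sentence-BERT vectors of component texts
layout = torch.randn(1, 64)                   # e.g., an encoded layout vector
screen = encoder(components, layout)
description = torch.randn(1, 768)             # app-store description embedding (added after training)
final_embedding = torch.cat([screen, description], dim=-1)   # 1536-d, as described above
print(final_embedding.shape)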
In the training process, we trained Screen2Vec5 on the open-sourced Rico6
dataset [38]. The Rico dataset contains interaction traces on 66,261 unique GUI
screens from 9,384 free Android apps collected using a hybrid crowdsourcing plus
automated discovery approach. The models are trained with a cross-entropy loss
function and the Adam optimizer [63]. In training the GUI screen embedding model,
we use negative sampling [110, 111] so that we do not have to recalculate and
update every screen’s embedding on every training iteration, which is computation-
ally expensive and prone to over-fitting. In each iteration, the prediction is compared
to the correct screen and to a sample of negative data consisting of a random sample
of 128 other screens, the other screens in the batch, and the screens in the
same trace as the correct screen. We specifically include
the screens in the same trace to promote screen-specific learning in this process: this
way, we disincentivize screen embeddings that are based solely on the app7, and
emphasize having the model learn to differentiate the different screens within the
same app. See [87] for details on the training process.
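The objective can be sketched as follows (a simplification of the actual training procedure described in [87]; the vectors here are random placeholders): the averaged context-screen embeddings must score the true center screen above the sampled negatives.

import torch
import torch.nn.functional as F

def cbow_negative_sampling_loss(context_vecs, center_vec, negative_vecs):
    # context_vecs: (window - 1, d); center_vec: (d,); negative_vecs: (k, d)
    prediction = context_vecs.mean(dim=0)                     # predicted center-screen embedding
    candidates = torch.cat([center_vec.unsqueeze(0), negative_vecs], dim=0)
    logits = candidates @ prediction                          # similarity scores, shape (1 + k,)
    target = torch.tensor([0])                                # index 0 is the true center screen
    return F.cross_entropy(logits.unsqueeze(0), target)

d = 768
loss = cbow_negative_sampling_loss(torch.randn(4, d), torch.randn(d), torch.randn(128, d))
print(float(loss))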
Prediction Task Results
In the screen prediction task, the Screen2Vec model performs better than three
baseline models (TextOnly, LayoutOnly, and VisualOnly; see [87] for
details on the baseline models) in top-1 prediction accuracy, top-k prediction accu-
racy, and the normalized rooted mean square error (RMSE) of the predicted screen
embedding vector. See [87] for details on the results and the relevant discussions.
7 …embedding, the prediction task favors having information about the specific app (i.e., the app
store description embedding) dominate the embedding.
Nearest Neighbors
The nearest neighbor task is useful for data-driven design, where the designers want
to find examples for inspiration and for understanding the possible design solu-
tions [38]. The task focuses on the similarity between GUI screen embeddings: for
a given screen, what are the top-N most similar screens in the dataset? A similar
technique can also be used for unsupervised clustering of the dataset to infer different
types of GUI screens. In our context, this task also helps demonstrate the different
characteristics between Screen2Vec and the three baseline models.
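Operationally, the query is a cosine-similarity search over the screen embeddings, as in the small NumPy sketch below (the embeddings are random stand-ins for a real screen corpus).

import numpy as np

def top_n_neighbors(query, corpus, n=3):
    # Rank corpus screens by cosine similarity to the query screen embedding.
    corpus_norm = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = corpus_norm @ query_norm
    return np.argsort(-scores)[:n]

screens = np.random.randn(1000, 1536)       # Screen2Vec-style embeddings of a screen corpus
source_screen = np.random.randn(1536)       # e.g., the Lyft "request ride" screen
print(top_n_neighbors(source_screen, screens))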
We conducted a study with 79 Mechanical Turk workers, where we compared the
human-rated similarity of the nearest neighbors results generated by Screen2Vec
with the baseline models on 5,608 pairs of screen instances. The Mechanical Turk
workers rated the nearest neighbor screens generated by the Screen2Vec model
to be, on average, more similar ( p < 0.0001) to their source screens than the nearest
neighbor screens generated by the baseline models (details on study design and
results in [87]).
Subjectively, when looking at the nearest neighbor results, we can see the different
aspects of the GUI screens that each different model captures. Screen2Vec can cre-
ate more comprehensive representations that encode the textual content, visual design
and layout patterns, and app contexts of the screen compared with the two baselines,
which only capture one or two aspects. For example, Fig. 8 shows the example nearest
neighbor results for the “request ride” screen in the Lyft app. Screen2Vec model
retrieves the “get direction” screen in the Uber Driver app, “select navigation type”
screen in the Waze app, and “request ride” screen in the Free Now (My Taxi) app.
In terms of visual and component layout, the result screens all feature a menu/information
card in the bottom 1/3 to 1/4 of the screen, with a MapView taking up the majority of the
screen space. In terms of content and app domain, all these screens are from transportation-
related apps that allow the user to configure a trip. In comparison, the TextOnly
model retrieves the “request ride” screen from the zTrip app, the “main menu” screen
from the Hailo app (both zTrip and Hailo are taxi hailing apps), and the home screen
of the Paytm app (a mobile payment app in India). The commonality of these screens
is that they all include text strings that are semantically similar to “payment” (e.g.,
add payment type, wallet, pay, add money), and texts that are semantically similar
to “destination” and “trips” (e.g., drop off location, trips, bus, flights). But the model
neither considers the visual layout and design patterns of the screens, nor the app
context. Therefore, the result contains the “main menu” (a quite different type of
screen) in the Hailo app and the “home screen” in the Paytm app (a quite different
type of screen in a different type of app). The LayoutOnly model, on the other
hand, retrieves the “exercise logging” screens from the Map My Walk app and the
Map My Ride app, and the tutorial screen from the Clever Dialer app. We can see
that the content and app-context similarity of the results of the LayoutOnly model
is considerably lower than that of the Screen2Vec and TextOnly models. However,
the result screens all share similar layout features as the source screen, such as the
menu/information card at the bottom of the screen and the screen-wide button at the
bottom of the menu (Fig. 8).
Fig. 8 The example nearest neighbor results for the Lyft “request ride” screen generated by the
Screen2Vec, TextOnly, and LayoutOnly models
Embedding Composability
A useful property of embeddings is that they are composable—meaning that we can
add, subtract, and average embeddings to form a meaningful new one. This prop-
erty is commonly used in word embeddings. For example, in Word2Vec, analogies
such as “man is to woman as brother is to sister” are reflected in that the vector
(man − woman) is similar to the vector (brother − sister). Besides represent-
ing analogies, this embedding composability can also be utilized for generative
purposes—for example, (brother − man + woman) results in an embedding vector
that represents “sister”.
This property is also useful in screen embeddings. For example, we can run
a nearest neighbor query on the composite embedding of (Marriott app’s “hotel
booking” screen + (Cheapoair app’s “search result” screen − Cheapoair app’s “hotel
booking” screen)). The top result is the “search result” screen in the Marriott app.
Fig. 9 An example showing the composability of Screen2Vec embeddings: running the nearest
neighbor query on the composite embedding of (Marriott app’s hotel booking page + Cheapoair
app’s search result page − Cheapoair app’s hotel booking page) can match the Marriott app’s search
result page, and the similar pages of a few other travel apps
When we filter the result to focus on screens from apps other than Marriott, we get
screens that show list results of items from other travel-related apps such as Booking,
Last Minute Travel, and Caesars Rewards.
The composability can make Screen2Vec particularly useful for GUI design
purposes—the designer can leverage the composability to find inspiring examples
of GUI designs and layouts.
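A sketch of such a composite query is shown below with placeholder vectors; with real Screen2Vec embeddings, the Marriott search-result screen would be expected near the top of the ranking.

import numpy as np

rng = np.random.default_rng(0)
screens = {name: rng.standard_normal(1536) for name in [
    "marriott_booking", "marriott_results", "cheapoair_booking", "cheapoair_results"]}

# (Marriott booking) + ((Cheapoair results) - (Cheapoair booking)) ~ (Marriott results)
composite = screens["marriott_booking"] + (
    screens["cheapoair_results"] - screens["cheapoair_booking"])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

ranking = sorted(screens, key=lambda name: cosine(composite, screens[name]), reverse=True)
print(ranking)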
GUI screens are useful not only as individual data sources on their own, but also as
building blocks to represent a user’s task. A task in an app, or across multiple apps, can
be represented as a sequence of GUI screens that makes up the user interaction trace
of performing this task through app GUIs. We conducted a preliminary evaluation of
the effectiveness of embedding mobile tasks as sequences of Screen2Vec screen
embedding vectors (details in [87]).
While the task embedding method we explored is quite primitive, it illustrates
that the Screen2Vec technique can be used to effectively encode mobile tasks
into the vector space where semantically similar tasks are close to each other. For
the next steps, we plan to further explore this direction. For example, the current
method of averaging all the screen embedding vectors does not consider the order
of the screens in the sequence. In the future, we may collect a dataset of human
annotations of task similarity, and use techniques that can encode the sequences of
items, such as recurrent neural networks (RNN) and long short-term memory (LSTM)
networks, to create the task embeddings from sequences of screen embeddings. We
may also incorporate the Screen2Vec embeddings of the GUI components that
were interacted with (e.g., the button that was clicked on) to initiate the screen change
into the pipeline for embedding tasks.
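The current task embedding is essentially the order-agnostic average sketched below (with random vectors standing in for real screen embeddings), which is what the RNN- or LSTM-based alternatives would replace.

import numpy as np

def embed_task(screen_embeddings):
    # Average the Screen2Vec embeddings of the screens visited while doing the task.
    return np.mean(np.stack(screen_embeddings), axis=0)

trace = [np.random.randn(1536) for _ in range(5)]   # placeholder screens of one interaction trace
task_vector = embed_task(trace)
print(task_vector.shape)   # (1536,)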
This section describes several potential applications where the new Screen2Vec
technique can be useful based on the downstream tasks described in Sect. 5.6.2.
Screen2Vec can enable new GUI design aids that take advantage of the nearest
neighbor similarity and composability of Screen2Vec embeddings. Prior work
such as [38, 52, 66] has shown that data-driven tools that enable designers to
curate design examples are quite useful for interface designers. Unlike [38], which
uses a content-agnostic approach that focuses on the visual and layout similarities,
Screen2Vec considers the textual content and app metadata in addition to the visual
and layout patterns, often leading to different nearest neighbor results, as discussed
above. This new type of similarity result will also be useful when focusing on
interface design beyond just visual and layout issues, as the results enable designers
to query, for example, designs that display similar content or screens that are used
in apps in a similar domain. The composability in Screen2Vec embeddings enables
querying for design examples at a finer granularity. For example, suppose a designer
wishes to find examples for inspiring the design of a new checkout page for app A.
They may query for the nearest neighbors of the synthesized embedding App A’s
order page + (App B’s checkout page − App B’s order page). Compared with simply
querying for the nearest neighbors of App B’s checkout page, this synthesized query
can encode the interaction context (i.e., the desired page should be the checkout page
for App A’s order page) in addition to the “checkout” semantics.
The Screen2Vec embeddings can also be useful in generative GUI models.
Recent models such as the neural design network (NDN) [73] and LayoutGAN [79]
can generate realistic GUI layouts based on user-specified constraints (e.g., align-
ments, relative positions between GUI components). Screen2Vec can be used in
these generative approaches to incorporate the semantics of GUIs and the contexts
of how each GUI screen and component gets used in user interactions. For exam-
ple, the GUI component prediction model can estimate the likelihood of each GUI
component given the context of the other components in a generated screen, pro-
viding a heuristic for how well the GUI components fit with each other.
Similarly, the GUI screen prediction model may be used as a heuristic to synthesize
GUI screens that can better fit with the other screens in the planned user interaction
flows. Since Screen2Vec has been shown effective in representing mobile tasks
in Sect. 5.6.2, where similar tasks will yield similar embeddings, one may also use
the task embeddings of performing the same task on an existing app to inform the
6 User Evaluations
We conducted several lab user studies to evaluate the usability, efficiency, and effec-
tiveness of Sugilite. The results of these studies showed that end users without signif-
icant programming expertise were able to successfully teach the agent the procedures
of performing common tasks (e.g., ordering pizza, requesting Uber, checking sports
score, ordering coffee) [80], conditional rules for triggering the tasks [88], and con-
cepts relevant to the tasks (e.g., the weather is hot, the traffic is heavy) [88] using
Sugilite. The users were also able to clarify their intents when ambiguities arise [84]
and successfully discover, identify the sources of, and repair conversational break-
downs caused by natural language understanding errors [82]. Most of our participants
found Sugilite easy and natural to use [80, 84, 88]. In terms of efficiency, teaching a task
usually took the user 3–6 times longer than performing the task manually in our
studies [80]; since this one-time teaching cost is amortized over subsequent automated
executions, using Sugilite can still save time for many repetitive tasks.
7 Limitations
7.1 Platform
Sugilite and its follow-up work have been developed and tested only on Android
phones. Sugilite retrieves the hierarchical tree structure of the current GUI screen
and manipulates the app GUI through Android’s Accessibility API. However, the
approach used in Sugilite should apply to any GUI-based apps with hierarchical
structures (e.g., the hierarchical DOM structures in web apps). In certain
platforms like iOS, while the app GUIs still use hierarchical tree structures, the
access to extracting information from and sending inputs to third-party apps has
been restricted by the operating system due to security and privacy concerns. In such
platforms, implementing a Sugilite-like system likely requires collaboration with
the OS provider (e.g., Apple) or limiting the domain to first-party apps. We also
expect working with desktop apps to be more challenging than with mobile apps due
to the increased difficulty in inferring their GUI semantics, as the desktop apps often
have more complex layouts and more heterogeneous design patterns.
7.3 Expressiveness
Lyft ride”, but not “if the price of an Uber ride is at least 10 dollars more expensive than
the price of a Lyft ride.”) mostly due to the extra complication in semantic parsing.
Correctly parsing the user’s natural language description of arithmetic operations
into our DSL would likely require a more complicated parsing architecture with a
much larger training corpus. It also does not support loops in automation (e.g., “order
one of each item in the “Espresso Drinks” category in the Starbucks app”). This is
due to Sugilite’s limited capability to capture the internal “states” within the apps
and to return to a specific previous state. For example, in the “ordering one of each
item” task, the agent needs to return to the GUI state showing the list of items after
completing the ordering of the first item in order to order the second item. This
cannot be easily done with the current Sugilite agent. Even if Sugilite were able to
find the “same” (visually similar or having the same activity name) screen, Sugilite
cannot know if the internal state of the underlying app has changed (e.g., adding the
first item to the cart affects what other items are available for purchase).
Another limitation in expressiveness is due to the input modalities that Sugilite
tracks in the user demonstrations—it only records a set of common input types
(clicks, long-clicks, text entries, etc.) on app GUIs. Gestures (e.g., swipes, flicks),
sensory inputs (e.g., tilting or shaking the phone detected by the accelerometer and
the gyroscope, auditory inputs from the microphone), and visual inputs (from the
phone camera) are not recorded.
7.4 Brittleness
While many measures have been taken to help Sugilite handle minor changes in
app GUIs, Sugilite scripts can still be brittle after a change in the underlying
app GUI due to either an app update or an external event. As discussed in Sect. 5.2,
Sugilite uses a graph query to locate the correct GUI element to operate on when
executing an automation script. Instead of using the absolute (x, y) coordinates for
identifying a GUI element like some prior systems do, Sugilite picks one or more
features such as the text label, the ordinal position in a list (e.g., first item in the search
result), or the relative position to another GUI element (e.g., the “book” button next to
the cheapest flight) that corresponds to the user’s intent. Therefore, if a GUI change
does not affect the result of the graph query, the automation should still work. In the
future, it may be possible to further enhance Sugilite’s capability of understanding screen
semantics, so that it can automatically detect and handle, without user intervention, some of
these unexpected GUI changes that do not affect the task.
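The contrast with coordinate-based matching can be sketched as follows; the element dictionaries and feature names are simplified assumptions rather than Sugilite’s actual data descriptions.

def find_element(elements, text=None, ordinal_in_list=None):
    # Match on stable semantic features (text label, position in a list) instead
    # of absolute (x, y) coordinates, and require the match to be unique.
    matches = [e for e in elements
               if (text is None or e.get("text") == text)
               and (ordinal_in_list is None or e.get("index") == ordinal_in_list)]
    return matches[0] if len(matches) == 1 else None

search_results = [
    {"text": "Book", "index": 0, "x": 40, "y": 900},    # "book" button of the first (cheapest) flight
    {"text": "Book", "index": 1, "x": 40, "y": 1100},
]
print(find_element(search_results, text="Book", ordinal_in_list=0))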
8 Future Work
Another future direction is to study the user adoption of Sugilite through a lon-
gitudinal field study. While the usability and the effectiveness of Sugilite have
been validated through task-based lab studies, deploying it to actual users can still
be useful for (i) further validating the feasibility and robustness of the system in
various contexts, (ii) measuring the usefulness of Sugilite in real-life scenarios,
and (iii) studying the characteristics of how users use Sugilite. The key goal of the
deployment is to study Sugilite within its intended context of use.
9 Conclusion
We described Sugilite, a task automation agent that can learn new tasks and rele-
vant concepts interactively from users through their GUI-grounded natural language
instructions and demonstrations. This system provides capabilities such as intent
clarification, task parameterization, concept generalization, breakdown repairs, and
embedding the semantics of GUI screens. Sugilite shows the promise of using app
GUIs for grounding natural language instructions, and the effectiveness of resolv-
ing unknown concepts, ambiguities, and vagueness in natural language instructions
using a mixed-initiative multi-modal approach.
Acknowledgements This research was supported in part by Verizon through the Yahoo! InMind
project, a J.P. Morgan Faculty Research Award, NSF grant IIS-1814472, AFOSR grant
FA95501710218, and Google Cloud Research Credits. Any opinions, findings or recommenda-
tions expressed here are those of the authors and do not necessarily reflect views of the sponsors.
We thank Amos Azaria, Yuanchun Li, Fanglin Chen, Igor Labutov, Xiaohan Nancy Li, Xiaoyi
Zhang, Wenze Shi, Wanling Ding, Marissa Radensky, Justin Jia, Kirielle Singarajah, Jingya Chen,
Brandon Canfield, Haijun Xia, and Lindsay Popowski for their contributions to this project.
References
4. Allen JF, Guinn CI, Horvtz E (1999) Mixed-initiative interaction. IEEE Intell Syst Appl
14(5):14–23
5. Amazon: Alexa Design Guide (2020). https://developer.amazon.com/en-US/docs/alexa/
alexa-design/get-started.html
6. Antila V, Polet J, Lämsä A, Liikka J (2012) RoutineMaker: towards end-user automation
of daily routines using smartphones. In: 2012 IEEE international conference on pervasive
computing and communications workshops (PERCOM workshops), pp 399–402. https://doi.
org/10.1109/PerComW.2012.6197519
7. Argall BD, Chernova S, Veloso M, Browning B (2009) A survey of robot learning from demon-
stration. Robot Auton Syst 57(5):469–483. https://doi.org/10.1016/j.robot.2008.10.024
8. Ashktorab Z, Jain M, Liao QV, Weisz JD (2019) Resilient chatbots: repair strategy preferences
for conversational breakdowns. In: Proceedings of the 2019 CHI conference on human factors
in computing systems, p 254. ACM
9. Auer S, Bizer C, Kobilarov G, Lehmann J, Cyganiak R, Ives Z (2007) Dbpedia: a nucleus
for a web of open data. The semantic web, pp 722–735. http://www.springerlink.com/index/
rm32474088w54378.pdf
10. Azaria A, Krishnamurthy J, Mitchell TM (2016) Instructable intelligent personal agent. In:
Proceedings of the 30th AAAI conference on artificial intelligence (AAAI), vol 4
11. Ballard BW, Biermann AW (1979) Programming in natural language “NLC” as a prototype.
In: Proceedings of the 1979 annual conference, ACM ’79, pp 228–237. ACM, New York, NY,
USA. https://doi.org/10.1145/800177.810072. http://doi.acm.org/10.1145/800177.810072
12. Banovic N, Grossman T, Matejka J, Fitzmaurice G (2012) Waken: reverse engineering usage
information and interface structure from software videos. In: Proceedings of the 25th annual
ACM symposium on user interface software and technology, UIST ’12, pp 83–92. ACM,
New York, NY, USA. https://doi.org/10.1145/2380116.2380129. http://doi.acm.org/10.1145/
2380116.2380129
13. Barman S, Chasins S, Bodik R, Gulwani S (2016) Ringer: web automation by demonstration.
In: Proceedings of the 2016 ACM SIGPLAN international conference on object-oriented
programming, systems, languages, and applications, OOPSLA 2016, pp 748–764. ACM,
New York, NY, USA. https://doi.org/10.1145/2983990.2984020. http://doi.acm.org/10.1145/
2983990.2984020
14. Beneteau E, Richards OK, Zhang M, Kientz JA, Yip J, Hiniker A (2019) Communication
breakdowns between families and alexa. In: Proceedings of the 2019 CHI conference on
human factors in computing systems, CHI ’19, pp 243:1–243:13. ACM, New York, NY, USA.
https://doi.org/10.1145/3290605.3300473. http://doi.acm.org/10.1145/3290605.3300473
15. Bentley F, Luvogt C, Silverman M, Wirasinghe R, White B, Lottridge D (2018) Understanding
the long-term use of smart speaker assistants. Proc ACM Interact Mob Wearable Ubiquitous
Technol 2(3). https://doi.org/10.1145/3264901
16. Berant J, Chou A, Frostig R, Liang P (2013) Semantic parsing on freebase from question-
answer pairs. In: Proceedings of the 2013 conference on empirical methods in natural language
processing, pp 1533–1544
17. Bergman L, Castelli V, Lau T, Oblinger D (2005) DocWizards: a system for authoring follow-
me documentation wizards. In: Proceedings of the 18th annual ACM symposium on user
interface software and technology, UIST ’05, pp 191–200. ACM, New York, NY, USA.
https://doi.org/10.1145/1095034.1095067. http://doi.acm.org/10.1145/1095034.1095067
18. Biermann AW (1983) Natural Language Programming. In: Biermann AW, Guiho G (eds)
Computer program synthesis methodologies, NATO advanced study institutes series. Springer,
Netherlands, pp 335–368
19. Bigham JP, Lau T, Nichols J (2009) Trailblazer: enabling blind users to blaze trails through
the web. In: Proceedings of the 14th international conference on intelligent user interfaces,
IUI ’09, pp 177–186. ACM, New York, NY, USA. https://doi.org/10.1145/1502650.1502677
20. Billard A, Calinon S, Dillmann R, Schaal S (2008) Robot programming by demonstration. In:
Springer handbook of robotics, pp 1371–1394. Springer. http://link.springer.com/10.1007/
978-3-540-30301-5_60
21. Bohus D, Rudnicky AI (2005) Sorry, I didn’t catch that!-An investigation of non-
understanding errors and recovery strategies. In: 6th SIGdial workshop on discourse and
dialogue
22. Bollacker K, Evans C, Paritosh P, Sturge T, Taylor J (2008) Freebase: a collaboratively created
graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD
international conference on Management of data, pp 1247–1250. ACM. http://dl.acm.org/
citation.cfm?id=1376746
23. Bolt RA (1980) “Put-that-there”: voice and gesture at the graphics interface. In: Proceedings
of the 7th annual conference on computer graphics and interactive techniques, SIGGRAPH
’80, pp 262–270. ACM, New York, NY, USA
24. Bosselut A, Rashkin H, Sap M, Malaviya C, Celikyilmaz A, Choi Y (2019) COMET: common-
sense transformers for automatic knowledge graph construction. In: Proceedings of the 57th
annual meeting of the association for computational linguistics, pp 4762–4779. ACL, Flo-
rence, Italy. https://doi.org/10.18653/v1/P19-1470. https://www.aclweb.org/anthology/P19-
1470
25. Brennan SE (1991) Conversation with and through computers. User Model User-Adap Int
1(1):67–86. https://doi.org/10.1007/BF00158952
26. Brennan SE (1998) The grounding problem in conversations with and through computers.
Social and cognitive approaches to interpersonal communication, pp 201–225
27. Böhmer M, Hecht B, Schöning J, Krüger A, Bauer G (2011) Falling asleep with angry birds,
facebook and kindle: a large scale study on mobile application usage. In: Proceedings of
the 13th international conference on human computer interaction with mobile devices and
services, MobileHCI ’11, pp 47–56. ACM, New York, NY, USA. https://doi.org/10.1145/
2037373.2037383. http://doi.acm.org/10.1145/2037373.2037383
28. Chai JY, Gao Q, She L, Yang S, Saba-Sadiya S, Xu G (2018) Language to action: towards
interactive task learning with physical agents. In: IJCAI, pp 2–9
29. Chandramouli V, Chakraborty A, Navda V, Guha S, Padmanabhan V, Ramjee R (2015) Insider:
towards breaking down mobile app silos. In: TRIOS workshop held in conjunction with the
SIGOPS SOSP 2015
30. Chen F, Xia K, Dhabalia K, Hong JI (2019) Messageontap: a suggestive interface to facilitate
messaging-related tasks. In: Proceedings of the 2019 CHI conference on human factors in
computing systems, CHI ’19. ACM, New York, NY, USA. https://doi.org/10.1145/3290605.
3300805
31. Chen J, Chen C, Xing Z, Xu X, Zhu L, Li G, Wang J (2020) Unblind your apps: predicting
natural-language labels for mobile gui components by deep learning. In: Proceedings of the
42nd international conference on software engineering, ICSE ’20
32. Chen JH, Weld DS (2008) Recovering from errors during programming by demonstration.
In: Proceedings of the 13th international conference on intelligent user interfaces, IUI ’08, pp
159–168. ACM, New York, NY, USA. https://doi.org/10.1145/1378773.1378794. http://doi.
acm.org/10.1145/1378773.1378794
33. Chkroun M, Azaria A (2019) Lia: a virtual assistant that can be taught new commands by
speech. Int J Hum–Comput Interact 1–12
34. Cho J, Rader E (2020) The role of conversational grounding in supporting symbiosis between
people and digital assistants. Proc ACM Hum-Comput Interact 4(CSCW1)
35. Clark HH, Brennan SE (1991) Grounding in communication. In: Perspectives on socially
shared cognition, pp 127–149. APA, Washington, DC, US. https://doi.org/10.1037/10096-
006
36. Cowan BR, Pantidi N, Coyle D, Morrissey K, Clarke P, Al-Shehri S, Earley D, Bandeira
N (2017) “what can i help you with?”: Infrequent users’ experiences of intelligent personal
assistants. In: Proceedings of the 19th international conference on human-computer interaction
with mobile devices and services, MobileHCI ’17, pp 43:1–43:12. ACM, New York, NY, USA.
https://doi.org/10.1145/3098279.3098539. http://doi.acm.org/10.1145/3098279.3098539
37. Cypher A, Halbert DC (1993) Watch what I do: programming by demonstration. MIT Press
53. Huang THK, Azaria A, Bigham JP (2016) InstructableCrowd: creating IF-THEN rules via
conversations with the crowd, pp 1555–1562. ACM Press. https://doi.org/10.1145/2851581.
2892502. http://dl.acm.org/citation.cfm?doid=2851581.2892502
54. Hutchins EL, Hollan JD, Norman DA (1986) Direct manipulation interfaces
55. Iba S, Paredis CJJ, Khosla PK (2005) Interactive multimodal robot programming. Int J Robot
Res 24(1):83–104. https://doi.org/10.1177/0278364904049250
56. IFTTT (2016) IFTTT: connects the apps you love. https://ifttt.com/
57. Intharah T, Turmukhambetov D, Brostow GJ (2019) Hilc: domain-independent pbd system via
computer vision and follow-up questions. ACM Trans Interact Intell Syst 9(2-3):16:1–16:27.
https://doi.org/10.1145/3234508. http://doi.acm.org/10.1145/3234508
58. Jain M, Kumar P, Kota R, Patel SN (2018) Evaluating and informing the design of chatbots.
In: Proceedings of the 2018 designing interactive systems conference, pp 895–906. ACM
59. Jiang J, Jeng W, He D (2013) How do users respond to voice input errors?: lexical and phonetic
query reformulation in voice search. In: Proceedings of the 36th international ACM SIGIR
conference on research and development in information retrieval, pp 143–152. ACM
60. Kasturi T, Jin H, Pappu A, Lee S, Harrison B, Murthy R, Stent A (2015) The cohort and
speechify libraries for rapid construction of speech enabled applications for android. In:
Proceedings of the 16th annual meeting of the special interest group on discourse and dialogue,
pp 441–443
61. Kate RJ, Wong YW, Mooney RJ (2005) Learning to transform natural to formal lan-
guages. In: Proceedings of the 20th national conference on artificial intelligence - volume 3,
AAAI’05, pp 1062–1068. AAAI Press, Pittsburgh, Pennsylvania. http://dl.acm.org/citation.
cfm?id=1619499.1619504
62. Kim D, Park S, Ko J, Ko SY, Lee SJ (2019) X-droid: a quick and easy android prototyping
framework with a single-app illusion. In: Proceedings of the 32nd annual ACM symposium
on user interface software and technology, UIST ’19, pp 95–108. ACM, New York, NY, USA.
https://doi.org/10.1145/3332165.3347890
63. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Bengio Y, LeCun Y
(eds) 3rd international conference on learning representations, ICLR 2015, San Diego, CA,
USA, May 7–9, 2015, Conference Track Proceedings. http://arxiv.org/abs/1412.6980
64. Kirk J, Mininger A, Laird J (2016) Learning task goals interactively with visual demonstra-
tions. Biol Inspired Cogn Archit 18:1–8
65. Ko AJ, Abraham R, Beckwith L, Blackwell A, Burnett M, Erwig M, Scaffidi C, Lawrance J,
Lieberman H, Myers B, Rosson MB, Rothermel G, Shaw M, Wiedenbeck S (2011) The state
of the art in end-user software engineering. ACM Comput Surv 43(3), 21:1–21:44. https://
doi.org/10.1145/1922649.1922658. http://doi.acm.org/10.1145/1922649.1922658
66. Kumar R, Satyanarayan A, Torres C, Lim M, Ahmad S, Klemmer SR, Talton JO (2013)
Webzeitgeist: design mining the web. In: Proceedings of the SIGCHI conference on human
factors in computing systems, CHI ’13, pp 3083–3092. ACM, New York, NY, USA. https://
doi.org/10.1145/2470654.2466420
67. Kurihara K, Goto M, Ogata J, Igarashi T (2006) Speech pen: predictive handwriting based
on ambient multimodal recognition. In: Proceedings of the SIGCHI conference on human
factors in computing systems, pp 851–860. ACM
68. Labutov I, Srivastava S, Mitchell T (2018) Lia: a natural language programmable personal
assistant. In: Proceedings of the 2018 conference on empirical methods in natural language
processing: system demonstrations, pp 145–150
69. Laird JE, Gluck K, Anderson J, Forbus KD, Jenkins OC, Lebiere C, Salvucci D, Scheutz M,
Thomaz A, Trafton G, Wray RE, Mohan S, Kirk JR (2017) Interactive task learning. IEEE
Intell Syst 32(4):6–21. https://doi.org/10.1109/MIS.2017.3121552
70. Laput GP, Dontcheva M, Wilensky G, Chang W, Agarwala A, Linder J, Adar E (2013) Pixel-
Tone: a multimodal interface for image editing. In: Proceedings of the SIGCHI conference on
human factors in computing systems, CHI ’13, pp 2185–2194. ACM, New York, NY, USA.
https://doi.org/10.1145/2470654.2481301. http://doi.acm.org/10.1145/2470654.2481301
71. Lau T (2009) Why programming-by-demonstration systems fail: lessons learned for usable
AI. AI Mag 30(4):65–67. http://www.aaai.org/ojs/index.php/aimagazine/article/view/2262
72. Lee C, Kim S, Han D, Yang H, Park YW, Kwon BC, Ko S (2020) Guicomp: a gui design
assistant with real-time, multi-faceted feedback. In: Proceedings of the 2020 CHI conference
on human factors in computing systems, CHI ’20, pp 1–13. ACM, New York, NY, USA.
https://doi.org/10.1145/3313831.3376327
73. Lee HY, Yang W, Jiang L, Le M, Essa I, Gong H, Yang MH (2020) Neural design net-
work: graphic layout generation with constraints. In: European conference on computer vision
(ECCV)
74. Lee TY, Dugan C, Bederson BB (2017) Towards understanding human mistakes of program-
ming by example: an online user study. In: Proceedings of the 22nd international conference
on intelligent user interfaces, IUI ’17, pp 257–261. ACM, New York, NY, USA. https://doi.
org/10.1145/3025171.3025203. http://doi.acm.org/10.1145/3025171.3025203
75. Leshed G, Haber EM, Matthews T, Lau T (2008) CoScripter: automating & sharing how-to
knowledge in the enterprise. In: Proceedings of the SIGCHI conference on human factors in
computing systems, CHI ’08, pp 1719–1728. ACM, New York, NY, USA. https://doi.org/10.
1145/1357054.1357323. http://doi.acm.org/10.1145/1357054.1357323
76. Li F, Jagadish HV (2014) Constructing an interactive natural language interface for relational
databases. Proc VLDB Endow 8(1):73–84. https://doi.org/10.14778/2735461.2735468
77. Li H, Wang YP, Yin J, Tan G (2019) Smartshell: automated shell scripts synthesis from natural
language. Int J Softw Eng Knowl Eng 29(02):197–220
78. Li I, Nichols J, Lau T, Drews C, Cypher A (2010) Here’s What I Did: sharing and reusing
web activity with ActionShot. In: Proceedings of the SIGCHI conference on human factors
in computing systems, CHI ’10, pp 723–732. ACM, New York, NY, USA. https://doi.org/10.
1145/1753326.1753432. http://doi.acm.org/10.1145/1753326.1753432
79. Li J, Yang J, Hertzmann A, Zhang J, Xu T (2019) Layoutgan: synthesizing graphic layouts
with vector-wireframe adversarial networks. IEEE Trans Pattern Anal Mach Intell
80. Li TJJ, Azaria A, Myers BA (2017) SUGILITE: creating multimodal smartphone automation
by demonstration. In: Proceedings of the 2017 CHI conference on human factors in comput-
ing systems, CHI ’17, pp 6038–6049. ACM, New York, NY, USA. https://doi.org/10.1145/
3025453.3025483. http://doi.acm.org/10.1145/3025453.3025483
81. Li TJJ, Chen J, Canfield B, Myers BA (2020) Privacy-preserving script sharing in gui-based
programming-by-demonstration systems. Proc ACM Hum-Comput Interact 4(CSCW1).
https://doi.org/10.1145/3392869
82. Li TJJ, Chen J, Xia H, Mitchell TM, Myers BA (2020) Multi-modal repairs of conversational
breakdowns in task-oriented dialogs. In: Proceedings of the 33rd annual ACM symposium on
user interface software and technology, UIST 2020. ACM. https://doi.org/10.1145/3379337.
3415820
83. Li TJJ, Hecht B (2014) WikiBrain: making computer programs smarter with knowledge from
wikipedia
84. Li TJJ, Labutov I, Li XN, Zhang X, Shi W, Mitchell TM, Myers BA (2018) APPINITE:
a multi-modal interface for specifying data descriptions in programming by demonstration
using verbal instructions. In: Proceedings of the 2018 IEEE symposium on visual languages
and human-centric computing (VL/HCC 2018)
85. Li TJJ, Labutov I, Myers BA, Azaria A, Rudnicky AI, Mitchell TM (2018) Teaching agents
when they fail: end user development in goal-oriented conversational agents. In: Studies in
conversational UX design. Springer
86. Li TJJ, Li Y, Chen F, Myers BA (2017) Programming IoT devices by demonstration using
mobile apps. In: Barbosa S, Markopoulos P, Paterno F, Stumpf S, Valtolina S (eds) End-user
development. Springer, Cham, pp 3–17
87. Li TJJ, Popowski L, Mitchell TM, Myers BA (2021) Screen2vec: semantic embedding of gui
screens and gui components. In: Proceedings of the 2021 CHI conference on human factors
in computing systems, CHI ’21. ACM
534 T. J.-J. Li et al.
88. Li TJJ, Radensky M, Jia J, Singarajah K, Mitchell TM, Myers BA (2019) PUMICE: a multi-
modal agent that learns concepts and conditionals from natural language and demonstrations.
In: Proceedings of the 32nd annual ACM symposium on user interface software and technol-
ogy (UIST 2019), UIST 2019. ACM. https://doi.org/10.1145/3332165.3347899
89. Li TJJ, Riva O (2018) KITE: building conversational bots from mobile apps. In: Proceedings
of the 16th ACM international conference on mobile systems, applications, and services
(MobiSys 2018). ACM
90. Li Y, He J, Zhou X, Zhang Y, Baldridge J (2020) Mapping natural language instructions to
mobile UI action sequences. In: Proceedings of the 58th annual meeting of the association for
computational linguistics, pp 8198–8210. ACL, Online. https://doi.org/10.18653/v1/2020.
acl-main.729. https://www.aclweb.org/anthology/2020.acl-main.729
91. Li Y, Li G, He L, Zheng J, Li H, Guan Z (2020) Widget captioning: generating natural
language description for mobile user interface elements. In: Proceedings of the 2020 con-
ference on empirical methods in natural language processing (EMNLP), pp 5495–5510.
ACL, Online. https://doi.org/10.18653/v1/2020.emnlp-main.443. https://www.aclweb.org/
anthology/2020.emnlp-main.443
92. Liang P (2016) Learning executable semantic parsers for natural language understanding.
Commun ACM 59(9):68–76
93. Liang P, Jordan MI, Klein D (2013) Learning dependency-based compositional semantics.
Comput Linguist 39(2):389–446
94. Lieberman H (2001) Your wish is my command: programming by example. Morgan Kauf-
mann
95. Lieberman H, Liu H (2006) Feasibility studies for programming in natural language. In: End
user development, pp 459–473. Springer
96. Lieberman H, Maulsby D (1996) Instructible agents: software that just keeps getting better.
IBM Syst J 35(3.4):539–556. https://doi.org/10.1147/sj.353.0539
97. Lin J, Wong J, Nichols J, Cypher A, Lau TA (2009) End-user programming of mashups with
vegemite. In: Proceedings of the 14th international conference on intelligent user interfaces,
IUI ’09, pp 97–106. ACM, New York, NY, USA. https://doi.org/10.1145/1502650.1502667.
http://doi.acm.org/10.1145/1502650.1502667
98. Liu EZ, Guu K, Pasupat P, Shi T, Liang P (2018) Reinforcement learning on web interfaces
using workflow-guided exploration. CoRR. http://arxiv.org/abs/1802.08802
99. Liu TF, Craft M, Situ J, Yumer E, Mech R, Kumar R (2018) Learning design semantics for
mobile apps. In: Proceedings of the 31st annual ACM symposium on user interface software
and technology, UIST ’18, pp 569–579. ACM, New York, NY, USA. https://doi.org/10.1145/
3242587.3242650
100. LlamaLab: Automate: everyday automation for Android (2016). http://llamalab.com/
automate/
101. Luger E, Sellen A (2016) “like having a really bad pa”: the gulf between user expectation and
experience of conversational agents. In: Proceedings of the 2016 CHI conference on human
factors in computing systems, CHI ’16, pp 5286–5297. ACM, New York, NY, USA. https://
doi.org/10.1145/2858036.2858288. http://doi.acm.org/10.1145/2858036.2858288
102. Maes P (1994) Agents that reduce work and information overload. Commun ACM 37(7):30–
40. https://doi.org/10.1145/176789.176792. http://doi.acm.org/10.1145/176789.176792
103. Mankoff J, Abowd GD, Hudson SE (2000) Oops: a toolkit supporting mediation techniques
for resolving ambiguity in recognition-based interfaces. Comput Graph 24(6):819–834
104. Marin R, Sanz PJ, Nebot P, Wirz R (2005) A multimodal interface to control a robot arm via
the web: a case study on remote programming. IEEE Trans Ind Electron 52(6):1506–1520.
https://doi.org/10.1109/TIE.2005.858733
105. Maués RDA, Barbosa SDJ (2013) Keep doing what i just did: automating smartphones by
demonstration. In: Proceedings of the 15th international conference on human-computer inter-
action with mobile devices and services, MobileHCI ’13, pp 295–303. ACM, New York,
NY, USA. https://doi.org/10.1145/2493190.2493216. http://doi.acm.org/10.1145/2493190.
2493216
Demonstration + Natural Language: Multimodal Interfaces … 535
106. McDaniel RG, Myers BA (1999) Getting more out of programming-by-demonstration. In:
Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’99,
pp 442–449. ACM, New York, NY, USA. https://doi.org/10.1145/302979.303127. http://doi.
acm.org/10.1145/302979.303127
107. McTear M, O’Neill I, Hanna P, Liu X (2005) Handling errors and determin-
ing confirmation strategies–an object-based approach. Speech Commun 45(3):249–
269. https://doi.org/10.1016/j.specom.2004.11.006. http://www.sciencedirect.com/science/
article/pii/S0167639304001426. Special Issue on Error Handling in Spoken Dialogue Sys-
tems
108. Menon A, Tamuz O, Gulwani S, Lampson B, Kalai A (2013) A machine learning frame-
work for programming by example, pp 187–195. http://machinelearning.wustl.edu/mlpapers/
papers/ICML2013_menon13
109. Mihalcea R, Liu H, Lieberman H (2006) NLP (Natural Language Processing) for NLP (Natural
Language Programming). In: Gelbukh A (ed) Computational linguistics and intelligent text
processing. Lecture notes in computer science. Springer, Berlin, Heidelberg, pp 319–330
110. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations
in vector space. arXiv:1301.3781 [cs]. http://arxiv.org/abs/1301.3781. ArXiv: 1301.3781
111. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of
words and phrases and their compositionality. In: Advances in neural information process-
ing systems, pp 3111–3119. http://papers.nips.cc/paper/5021-distributed-representations-of-
words-and-phrases-and-their-compositionality
112. Mohan S, Laird JE (2014) Learning goal-oriented hierarchical tasks from situated interactive
instruction. In: Proceedings of the twenty-eighth AAAI conference on artificial intelligence,
AAAI’14, pp 387–394. AAAI Press
113. Myers B, Malkin R, Bett M, Waibel A, Bostwick B, Miller RC, Yang J, Denecke M, Seemann
E, Zhu J et al (2002) Flexi-modal and multi-machine user interfaces. In: Proceedings of the
fourth IEEE international conference on multimodal interfaces, pp 343–348. IEEE
114. Myers BA (1986) Visual programming, programming by example, and program visualization:
a taxonomy. In: Proceedings of the SIGCHI conference on human factors in computing sys-
tems, CHI ’86, pp 59–66. ACM, New York, NY, USA. https://doi.org/10.1145/22627.22349.
http://doi.acm.org/10.1145/22627.22349
115. Myers BA, Ko AJ, Scaffidi C, Oney S, Yoon Y, Chang K, Kery MB, Li TJJ (2017) Mak-
ing end user development more natural. In: New perspectives in end-user development, pp
1–22. Springer, Cham. https://doi.org/10.1007/978-3-319-60291-2_1. https://link.springer.
com/chapter/10.1007/978-3-319-60291-2_1
116. Myers BA, McDaniel R (2001) Sometimes you need a little intelligence, sometimes you need
a lot. Your wish is my command: programming by example. Morgan Kaufmann Publishers,
San Francisco, CA, pp 45–60. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.2.
8085&rep=rep1&type=pdf
117. Myers C, Furqan A, Nebolsky J, Caro K, Zhu J (2018) Patterns for how users overcome
obstacles in voice user interfaces. In: Proceedings of the 2018 CHI conference on human
factors in computing systems, pp 1–7
118. Norman D (2013) The design of everyday things: revised and expanded edition. Basic Books
119. Oviatt S (1999) Mutual disambiguation of recognition errors in a multimodel architecture. In:
Proceedings of the SIGCHI conference on human factors in computing systems, pp 576–583.
ACM
120. Oviatt S (1999) Ten myths of multimodal interaction. Commun ACM 42(11):74–81 https://
doi.org/10.1145/319382.319398. http://doi.acm.org/10.1145/319382.319398
121. Oviatt S, Cohen P (2000) Perceptual user interfaces: multimodal interfaces that process what
comes naturally. Commun ACM 43(3):45–53
122. Pasupat P, Jiang TS, Liu E, Guu K, Liang P (2018) Mapping natural language commands
to web elements. In: Proceedings of the 2018 conference on empirical methods in natural
language processing, pp 4970–4976. ACL, Brussels, Belgium. https://doi.org/10.18653/v1/
D18-1540. https://www.aclweb.org/anthology/D18-1540
536 T. J.-J. Li et al.
123. Pasupat P, Liang P (2015) Compositional semantic parsing on semi-structured tables. In:
Proceedings of the 53rd annual meeting of the association for computational linguistics and
the 7th international joint conference on natural language processing. http://arxiv.org/abs/
1508.00305. ArXiv: 1508.00305
124. Porcheron M, Fischer JE, Reeves S, Sharples S (2018) Voice interfaces in everyday life. In:
Proceedings of the 2018 CHI conference on human factors in computing systems, CHI ’18.
ACM, New York, NY, USA. https://doi.org/10.1145/3173574.3174214
125. Price D, Rilofff E, Zachary J, Harvey B (2000) NaturalJava: a natural language interface for
programming in java. In: Proceedings of the 5th international conference on intelligent user
interfaces, IUI ’00, pp 207–211. ACM, New York, NY, USA. https://doi.org/10.1145/325737.
325845. http://doi.acm.org/10.1145/325737.325845
126. Qi S, Jia B, Huang S, Wei P, Zhu SC (2020) A generalized earley parser for human activity
parsing and prediction. IEEE Trans Pattern Anal Mach Intell
127. Ravindranath L, Thiagarajan A, Balakrishnan H, Madden S (2012) Code in the air: simplifying
sensing and coordination tasks on smartphones. In: Proceedings of the twelfth workshop on
mobile computing systems & applications, HotMobile ’12, pp 4:1–4:6. ACM, New York,
NY, USA. https://doi.org/10.1145/2162081.2162087. http://doi.acm.org/10.1145/2162081.
2162087
128. Reimers N, Gurevych I (2019) Sentence-bert: sentence embeddings using siamese bert-
networks. In: Proceedings of the 2019 conference on empirical methods in natural language
processing. ACL. http://arxiv.org/abs/1908.10084
129. Rodrigues A (2015) Breaking barriers with assistive macros. In: Proceedings of the 17th
international ACM SIGACCESS conference on computers & accessibility, ASSETS ’15, pp
351–352. ACM, New York, NY, USA. https://doi.org/10.1145/2700648.2811322. http://doi.
acm.org/10.1145/2700648.2811322
130. Sahami Shirazi A, Henze N, Schmidt A, Goldberg R, Schmidt B, Schmauder H (2013) Insights
into layout patterns of mobile user interfaces by an automatic analysis of android apps. In:
Proceedings of the 5th ACM SIGCHI symposium on engineering interactive computing sys-
tems, EICS ’13, pp 275–284. ACM, New York, NY, USA. https://doi.org/10.1145/2494603.
2480308. http://doi.acm.org/10.1145/2494603.2480308
131. Sap M, Le Bras R, Allaway E, Bhagavatula C, Lourie N, Rashkin H, Roof B, Smith NA, Choi
Y (2019) Atomic: an atlas of machine commonsense for if-then reasoning. Proc AAAI Conf
Artif Intell 33:3027–3035
132. Sereshkeh AR, Leung G, Perumal K, Phillips C, Zhang M, Fazly A, Mohomed I (2020) Vasta:
a vision and language-assisted smartphone task automation system. In: Proceedings of the
25th international conference on intelligent user interfaces, pp 22–32
133. She L, Chai J (2017) Interactive learning of grounded verb semantics towards human-robot
communication. In: Proceedings of the 55th annual meeting of the association for computa-
tional linguistics (volume 1: long papers), pp 1634–1644. ACL, Vancouver, Canada. https://
doi.org/10.18653/v1/P17-1150. https://www.aclweb.org/anthology/P17-1150
134. Shneiderman B (1983) Direct manipulation: a step beyond programming languages. Computer
16(8):57–69. https://doi.org/10.1109/MC.1983.1654471
135. Shneiderman B, Plaisant C, Cohen M, Jacobs S, Elmqvist N, Diakopoulos N (2016) Designing
the user interface: strategies for effective human-computer interaction, 6, edition. Pearson,
Boston
136. Srivastava S, Labutov I, Mitchell T (2017) Joint concept learning and semantic parsing from
natural language explanations. In: Proceedings of the 2017 conference on empirical methods
in natural language processing, pp 1527–1536
137. Su Y, Hassan Awadallah A, Wang M, White RW (2018) Natural language interfaces with fine-
grained user interaction: a case study on web apis. In: The 41st international ACM SIGIR
conference on research and development in information retrieval, SIGIR ’18, pp 855–864.
ACM, New York, NY, USA. https://doi.org/10.1145/3209978.3210013
138. Suhm B, Myers B, Waibel A (2001) Multimodal error correction for speech user interfaces.
ACM Trans Comput-Hum Interact 8(1):60–98. https://doi.org/10.1145/371127.371166.
http://doi.acm.org/10.1145/371127.371166
Demonstration + Natural Language: Multimodal Interfaces … 537
Human-Centered AI for Medical Imaging
Abstract Medical imaging is the primary data source most physicians refer to when making a diagnosis. However, examination of medical imaging data, due to its density and uncertainty, can be time-consuming and error-prone. The recent advent of data-driven artificial intelligence (AI) provides a promising solution, yet the adoption of AI in medicine is often hindered by its 'black box' nature. This chapter reviews how AI can distil new insights from medical imaging data and how a human-centered approach can transform AI's role into one that engages patients with self-assessment and personalized models, and one that enables physicians to comprehend and control how the AI performs a diagnosis and thus collaborate with it in making a diagnosis.
1 Introduction
Medical diagnosis and treatment are contingent upon our ability to see and understand
the human body. Medical imaging is a class of imaging technology that achieves
such purposes by non-invasively creating a visual representation of the interior of
the human body—everything from a skeletal map to the histology of cells.
Imaging is the primary data source in medicine, followed by clinical notes and
lab reports [43]. Physicians rely on medical imaging data to facilitate the differen-
tial diagnosis process [26]. For example, consider meningioma, the most common primary brain tumor among adults. Physicians might start by using radiographs to formulate
the initial hypotheses; then tissue samples are collected where histological data is
examined to further confirm the detection of the tumor and the subsequent grading to
1 We consider pathology as a type of medical imaging specialty, as the recent development in dig-
ital pathology has fostered a growing body of work on processing digitized whole slide images,
although traditionally images obtained from removed tissues for studying pathology are not con-
sidered medical imaging.
navigation, and pathological change quantification. For example, Alansary et al. pinpoint the right and left cerebellum and the cavum septum pellucidum from head ultrasound [1], and several works focus on producing masks of brain MRI for up to 133 functional regions [14, 40]. While U-Net-like architectures are the most well-known for anatomical localization [21, 32, 67], recent studies have shown that better accuracy and robustness can be achieved by incorporating anatomical invariances, e.g., organ shapes and adjacency relations, into the model as priors [56, 57]. Image reconstruction, which aims to form high-quality scans from low-dose and/or fast acquisitions, is another practical task made possible by recent AI advances. Typical works include Shan et al., who introduce a CNN to reduce the noise of low-dose CT and achieve fidelity comparable to commercial methods in a fraction of the time [83], and Song et al., who recover three-dimensional bone structures from a two-dimensional panoramic dental X-ray at low dose, using prior knowledge learned with a generative adversarial network (GAN) [90]. Image registration aims to spatially align a radiology scan to a specific coordinate system. It acts as an essential preprocessing step for many clinical studies, e.g., population analysis of anatomies from large-scale data and longitudinal analysis of temporal structural changes across a patient's multiple scans. Traditional optimization-based methods are mostly time-consuming, taking up to a few hours to register one three-dimensional scan [6]. Recent predictive CNN models, however, have shown the ability to reduce that time to a few seconds by directly estimating the deformation map for a target scan, while achieving alignment accuracy comparable to traditional methods [8, 64].
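To make the registration step concrete, below is a minimal sketch, assuming a CNN has already predicted a dense displacement field for a 3D scan; the scan, the field, and the helper warp_volume are illustrative placeholders, not code from the works cited above.

# A minimal sketch of applying a predicted deformation: the displacement field
# and scan are random placeholders; the warping itself uses SciPy interpolation.
import numpy as np
from scipy.ndimage import map_coordinates

def warp_volume(moving, displacement):
    """Warp a 3D volume with a per-voxel displacement field (shape: 3 x D x H x W)."""
    grid = np.meshgrid(*[np.arange(s) for s in moving.shape], indexing="ij")
    coords = np.stack(grid).astype(np.float32) + displacement  # sampling locations
    # Trilinear interpolation at the displaced coordinates
    return map_coordinates(moving, coords, order=1, mode="nearest")

moving_scan = np.random.rand(64, 64, 64).astype(np.float32)   # placeholder scan
pred_field = np.zeros((3, 64, 64, 64), dtype=np.float32)      # placeholder CNN output
aligned = warp_volume(moving_scan, pred_field)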
Given the ubiquitous use of visual information in diagnosis and treatment, many other medical imaging modalities exist in clinics. In this section, we cover three modalities that have been widely discussed in the AI research community: colonoscopy, dermatology, and eye imaging.
Colonoscopy images have been mainly studied for the automated detection of polyps, a clinical task that is not only laborious but also suffers from a high misdetection rate among endoscopists [52]. AI-enabled systems thus have the potential to function as a second observer that improves endoscopists' performance by indicating the presence of polyps in real time during colonoscopies. With CNN models trained on large-scale colonoscopy datasets with manual annotations, studies have shown that AI can achieve a per-image sensitivity of 92%, a per-polyp sensitivity of 100%, and a polyp tracking accuracy of 89% [95, 99]. However, the above studies treat images independently with CNNs, without considering that colonoscopy data in clinics mostly comes in the form of videos. As such, several recent works aim to further reduce the AI's false-positive detections by exploiting the temporal dependency between frames in a colonoscopy video: intuitively, a polyp should appear across consecutive frames with correlated positions and appearances [76, 111]. Moreover, considering the high variation among polyps, recent work also proposes transferring the knowledge learnt from a heterogeneous polyp dataset collected at multiple sites to improve polyp detection in site-specific colonoscopy videos via pseudo-labeling [112].
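As an illustration of the temporal-consistency idea, here is a minimal sketch (not the method of [76, 111]) that smooths per-frame polyp probabilities over a sliding window so that isolated single-frame detections, which are likely false positives, are suppressed; the probabilities are placeholder values standing in for a detector's outputs.

# Median-filter per-frame polyp probabilities and re-threshold them.
import numpy as np

def smooth_detections(frame_probs, window=5, threshold=0.5):
    """Suppress single-frame spikes while keeping detections that persist across frames."""
    pad = window // 2
    padded = np.pad(frame_probs, pad, mode="edge")
    smoothed = np.array([np.median(padded[i:i + window])
                         for i in range(len(frame_probs))])
    return smoothed >= threshold

# Placeholder detector outputs for 12 consecutive frames of a colonoscopy video
probs = np.array([0.1, 0.9, 0.1, 0.1, 0.7, 0.8, 0.9, 0.8, 0.2, 0.1, 0.1, 0.1])
print(smooth_detections(probs))  # the single-frame spike at index 1 is filtered out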
Dermatology. Since visual inspection can assist dermatologists in diagnosing many types of skin lesions, there is continuing interest in developing automated diagnostic algorithms that take photos of skin issues as input. For example, Vardell et al. introduce a clinical decision support tool for diagnosing multiple skin-related issues (e.g., drug reactions) by indexing archived studies using skin photos and metadata (e.g., body location and surface type) [96], and Wolf et al. explore classifying photos of benign and malignant lesions using detected lesion border information [105]. With the recent availability of large-scale clinical photo datasets with expert annotations for supervised training, a few studies have reported CNNs that outperform the average dermatologist in diagnosing skin malignancy for a single lesion [31, 38]. Despite this success, one challenge is the gap between the retrospective data used for model training/validation and real-world data: one study suggests that artifacts rarely seen during training, e.g., skin markings, can significantly reduce a CNN's accuracy by increasing its false-positive rate [104].
Eye imaging. Fundus photography is the most widely used imaging modality for the eye; it can help physicians detect many eye health conditions, e.g., glaucoma, neoplasms of the retina, macular degeneration, and diabetic retinopathy. Two anatomies of particular interest for diagnosis are the blood vessels and the optic disk. As such, many AI studies have focused on the automated extraction of these two structures, which can potentially assist physicians' diagnostic process. For example, existing research has achieved high accuracy for the detailed segmentation of vessel networks [33, 34, 63] and optic disks [63, 117] with CNNs. Regarding the automated detection of diseases, Gulshan et al. [37] examined the performance of a CNN for classifying the existence of referable diabetic retinopathy from a fundus photo, showing that accuracy comparable to a panel of at least seven ophthalmologists can be achieved.
Parallel to the development of data-driven AI, several limitations and challenges have arisen.
Explainability. Despite their ever-growing accuracy in identifying patterns of specific conditions, medical imaging AI models will, in the foreseeable future, continue to make mistakes due to the uncertain and inexact nature of the data and the limitations of the learning methods (e.g., overfitting to the training data hurts generalizability). However, being imperfect is not the deal-breaker. Humans are not perfect, either; even the most experienced physicians err from time to time. The breakdown, rather, is AI's inability to explain its findings the way human physicians explain theirs to each other. Unlike humans' mistakes, AI's mistakes are obscured by its 'black box' nature: physicians using AI-enabled diagnostic tools cannot tell when the AI makes mistakes, why it makes them, or how to fix them. Likewise, patients using an AI-enabled self-assessment app will hesitate to trust the AI's findings, especially when in doubt upon seeing the results. Caruana et al. reported a hospital choosing a rule-based system over a neural model: the neural model was considered too risky because it could not be understood, whereas rules allow physicians to see an error and prevent it by modifying the rules [19]. This is an example of trading off performance for explainability, a reflection of how high-stakes decision-making in medicine imposes requirements on an AI model beyond simply working. To enable human-centered AI for medical imaging, we need to enable physicians to comprehend the AI's findings so that they can trust and act on them, which we discuss in Sect. 4.1; explainability should also be addressed in designing AI-enabled self-assessment workflows for patients, which we discuss in Sect. 3.1.
Teachability. Most AI models are developed in a data-driven manner (i.e., trained with annotated datasets), a process only loosely connected with physicians. For example, it is typical to have just a few labels associated with a radiograph (e.g., keywords extracted from radiologists' reports); as a result, a learning method can only tell a model what condition an image is associated with, but not why or how such a conclusion is reached. To address this limitation, we can ask physicians to convey their knowledge by providing richer annotations than simple labels (e.g., pixel-level annotations that outline the contours of tumor cells); however, such approaches demand too much effort, making them unscalable for constructing the large datasets required to train most data-driven AI models. To enable human-centered AI for medical imaging, we should strive to make AI more teachable by (i) enabling physicians to cost-effectively express their domain knowledge, which is then translated into structured data learnable by a model or into alternative models that can be incorporated with the existing model; and (ii) allowing physicians to alter an AI's behavior (e.g., controlling which range of threshold values to use and which subset of features is more or less relevant given a patient's case). Both approaches are nascent research areas, which we discuss in Sect. 4.3.
Integrability. Researchers have long realized the limitation of using AI as a
‘Greek Oracle’. Miller and Masarie pointed out that a ‘mixed initiative system’
With the advent of diagnostic AI for medical imaging, AI-enabled tools provide a promising way for everyday users to detect health issues as a supplement to clinical visits. Moreover, photography, an imaging modality that patients can capture conveniently and cost-effectively, has become the most popular medium for self-assessment. As such, existing research has explored using the widely available and increasingly ubiquitous platform of smartphone cameras for health sensing [62]. To investigate the design of patient-centered diagnostic tools, we describe the example of OralCam [54], an interactive tool that enables the self-examination of oral health. OralCam targets five common oral conditions (diseases or early disease signals) that are visually diagnosable from images of one's oral cavity: periodontal disease, caries, soft deposit, dental calculus, and dental discoloration. Early detection of potential oral diseases is important, since it enables interventions that alter a disease's natural course and prevent the onset of adverse outcomes at minimal cost [27]. A routine dental visit is the most effective way to detect oral disease [23]. However, due to the lack of dental care resources and of awareness of oral health, many oral health issues are left unexamined and untreated, affecting about 3.5 billion people worldwide according to an estimate in [46]. From interviews with three dentists, three key requirements were identified for such a tool to be clinically valid and user-friendly.
Requirement #1: Leveraging additional modalities: In clinical diagnosis, den-
tists often rely on more than the visuals of a patient’s oral cavity. Other factors,
e.g., oral cleaning routines, smoking habits, and patients’ descriptions of symptoms,
also contribute to diagnosis and should be taken into consideration in tandem with
smartphone-captured images.
Requirement #2: Providing contextualized results: It is insufficient to simply
show whether a condition exists or not. To help users comprehensively understand
the self-examination results, it is useful to show the localization of oral diseases
as well as related background knowledge on demand, e.g., a layman description
with exemplar images of symptoms. Dentists also pointed out that such contextual information (location, background, etc.) can help gain users' trust, since it sheds light on the mechanism of the underlying detection model.
Requirement #3: Accurate detection: Automatic assessment should be accurate
so that users can be informed of possible oral conditions without being misled by false
detection. Given that automatic detection algorithms have imperfect accuracy, the
level of confidence needs to be clearly conveyed to the user to ensure a comprehensive
understanding of their oral conditions.
Guided by the requirements, a prototype of OralCam was designed and implemented as shown in Fig. 1. It was evaluated with (i) 3,182 oral photos with expert annotations for the accuracy of the AI model; (ii) 18 users over a one-week period for usability; and (iii) two board-certified dentists for the clinical validity of the design. We describe the implementation of OralCam from the three aspects of input, modeling, and output, and summarize the corresponding lessons learnt from the evaluation.
Input: OralCam allows users to provide multiple photos covering various locations of the oral cavity (Fig. 1b), hygiene habits and medical history via a questionnaire (Fig. 1a), and descriptions of symptoms (e.g., pain and bleeding) by drawing directly on the captured photos (Fig. 1c) (Requirement #1). The design of the information-collection mechanism takes usability into consideration: the questionnaire consists only of multiple-choice questions, and photo taking is guided with marks that help users align the camera. All the collected information is incorporated into the model as priors to enhance the AI's performance. In the evaluation, the information collection design was validated by dentists and found usable by the majority of the users.
Fig. 1 The key interaction design of OralCam for enabling layman users to perform self-examination of oral health
Modeling: OralCam formulates the diagnosis as a mix of object localization and image classification according to the clinical nature of the conditions: localization is used for periodontal disease, caries, and dental calculus, since their findings are sparsely located, while classification is used for soft deposit and dental discoloration, since their findings usually spread over the whole oral cavity. To provide contextualized results for the classification tasks, OralCam further localizes related regions by reasoning about the model's attention with activation maps (Requirement #2). To fuse the photo, questionnaire, and drawing information for prediction, OralCam encodes the questionnaire answers and drawings into one-hot feature maps and concatenates them channel-wise with the deep features extracted from the photos. The whole model is end-to-end trainable, and the feature fusion mechanism introduces only a small number of additional parameters. The evaluation showed high agreement between the AI's predicted localizations and the experts' opinions, although the activation maps can miss some regions with findings. Moreover, the information fusion from different modalities boosted the prediction accuracy for four of the conditions, by up to 4.80% in area under the curve (AUC).
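Below is a minimal sketch of the channel-wise fusion idea just described, under illustrative assumptions rather than OralCam's exact architecture: each categorical questionnaire answer is broadcast into a constant one-hot feature map and concatenated with the image features along the channel axis; all tensors are placeholders.

import numpy as np

def fuse_features(image_features, answers, options_per_question):
    """image_features: (C, H, W); answers: one categorical index per question."""
    _, h, w = image_features.shape
    extra_maps = []
    for answer, n_options in zip(answers, options_per_question):
        one_hot = np.zeros((n_options, h, w), dtype=image_features.dtype)
        one_hot[answer] = 1.0                      # constant map for the chosen option
        extra_maps.append(one_hot)
    return np.concatenate([image_features] + extra_maps, axis=0)

features = np.random.rand(64, 28, 28).astype(np.float32)   # placeholder backbone output
fused = fuse_features(features, answers=[2, 0], options_per_question=[3, 4])
print(fused.shape)   # (64 + 3 + 4, 28, 28), ready for the prediction head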
Output: To convey the model's confidence in a finding (Requirement #3), OralCam sets two operating points for the imperfect model to trade off miss rate against false positives: a high-specificity point with a higher probability threshold to highlight the AI's confident findings, and a high-sensitivity point with a lower probability threshold to reduce misses. Based on these thresholds, OralCam visualizes the likelihood of having each type of disease by grouping predictions into the following levels: (i) unlikely, (ii) likely, and (iii) very likely (Fig. 1d). To provide contextualized results (Requirement #2), OralCam highlights the related regions with either bounding boxes (Fig. 1e) or heatmaps (Fig. 1f) on the input image once a user clicks on a detected condition. Moreover, OralCam expands the disease label with hierarchical information, e.g., typical appearances of the disease, common symptoms, and background knowledge (Fig. 1g, h). All this information serves to contextualize the user's understanding of an oral condition beyond a simple textual label. With these designs, the evaluation showed that all 18 users had no trouble understanding the AI's results (i.e., confidence levels and condition visualizations). Most users also believed the results: 12 out of 18 came to believe they had oral conditions that they had not been aware of or could not confirm before the study.
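A minimal sketch of how two such operating points can map a condition's predicted probability to the three likelihood levels shown to users; the threshold values are placeholders that would, in practice, be chosen from validation data to reach the desired sensitivity and specificity.

def likelihood_level(prob, t_high_sensitivity=0.3, t_high_specificity=0.7):
    """Group a predicted probability into 'unlikely', 'likely', or 'very likely'."""
    if prob >= t_high_specificity:      # above the confident (high-specificity) point
        return "very likely"
    if prob >= t_high_sensitivity:      # above the miss-reducing (high-sensitivity) point
        return "likely"
    return "unlikely"

for p in (0.1, 0.5, 0.9):
    print(p, "->", likelihood_level(p))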
OralCam makes several attempts to improve the explainability of AI-enabled self-assessment tools: by visualizing where the AI model is 'looking', OralCam lets users glance into the process of result generation; by showing hierarchical information about a type of detection, OralCam further contextualizes the user's understanding; and by presenting predictions probabilistically, OralCam conveys richer information about model confidence. While the study showed the effectiveness of these measures, there were cases where users remained skeptical about the AI when its results conflicted with their beliefs. For future studies, we advocate that, besides showing the regions of conditions, it would help users' understanding to give reasons why the regions are flagged, e.g., what patterns trigger the prediction and how those patterns relate to medical knowledge. Challenging as this task is, recent progress in explainable AI [113], which aims to open up models by interpreting the learnt deep features, might provide a promising solution. Moreover, to gain users' trust, external evidence, e.g., confirmation by physicians, Food and Drug Administration approvals, and more comprehensive clinical trials, should be investigated.
Patient-centered care is defined as providing care that is respectful of, and responsive to, individual patient preferences, needs, and values, and ensuring that patient values guide all clinical decisions [18]. The concept has been recognized as a key dimension of quality within health care, of which patient-physician communication is a vital part. According to [68, 79, 91, 94], patients who understand their physicians and procedures are more likely to follow medication schedules, feel satisfied with treatments, and have better medical outcomes. Due to the educational barriers between patients and clinicians, conventional verbal instructions may not be effective for involving patients in the clinical process of diagnosis and treatment. As such, a few researchers have investigated visualization and HCI techniques to improve patients' understanding. The most common approach accompanies patient education with static 3D visualizations of anatomy templates; example systems include those for the abdomen [58], the cardiac system [71, 103], and more [102]. These systems have observed improved understanding from patients about clinical procedures and decisions with the aid of visualization. With recent advances in computer graphics and AI, construction and interactive manipulation of patient-specific 3D models have become possible, and have been shown to be helpful for physicians' decision-making [30, 75]. For example, Capuano et al. [17, 61] apply cardiovascular reconstruction from MRI combined with computational fluid dynamics to assist surgical planning by simulating post-operative pulmonary flow patterns; Yang et al. [110] show that preoperative simulation with heads reconstructed from CT can make condyle surgeries more accurate and more convenient; and Endo et al. [28] demonstrate a simulation system with detailed tissue and vein segmentations to navigate operations. Despite those efforts, only limited work has studied the application of AI for the purpose of involving patients in clinical diagnosis and treatment.
Fig. 2 OralViewer takes a patient's 2D X-ray (a) as input, and reconstructs the 3D teeth structure (b) with a novel deep learning model. The system then generates the complete oral cavity model (c) by registering pre-defined models of the jawbone and gum to the dental arch curve. Finally, a dentist can demonstrate the forthcoming surgeries to a patient by animating the steps with our virtual dental instruments (d)
Here, we take the example of OralViewer [55] to discuss how the combination of HCI and AI can help improve patient-physician communication. OralViewer is a web-based tool that enables dentists to virtually demonstrate dental surgeries on 3D oral cavity models. The need for dental surgery demonstration arises not only from patient-centered care but also from the need to relieve anxiety about oral surgeries: up to one in four adults reports dental fear [72], and an effective solution is to unveil the surgical steps through patient education to decrease patients' fear of the unknown [4, 44]. To inform the design of OralViewer, three dentists were interviewed, and two key requirements for the visualization were elicited from a clinical point of view.
Requirement #1: Providing a patient-specific 3D model from 2D imaging: Dental surgical steps, e.g., how a fractured tooth is extracted or repaired, often depend on an individual's teeth condition. Thus, a patient-specific teeth model should be provided to contextualize demonstrations to the patient's conditions. Moreover, 3D imaging of the oral cavity (e.g., cone-beam CT) is not standard practice in the clinical diagnosis of many common surgeries, e.g., apicoectomy and root canal treatment, due to its high radiation dose and cost. As such, it is preferable to generate a patient's 3D teeth model from his/her 2D X-ray image so that the tool is widely applicable.
Requirement #2: Modeling the complete oral cavity: Both the target oral structure of a surgery and its nearby anatomies need to be incorporated into a surgical demonstration. For example, when a dentist removes a root tip in an apicoectomy, procedures on the surrounding structures (e.g., the gum and jawbone) should be simulated as well. Thus, to help patients understand what to expect in a surgery, a complete oral cavity including teeth, gum, and jawbones should be modeled.
Informed by the aforementioned requirements, a prototype of OralViewer was designed and implemented as shown in Fig. 2. Overall, OralViewer consists of two cascaded parts: (i) a 3D reconstruction pipeline that generates a patient's oral cavity from a single 2D panoramic X-ray with the aid of AI (Fig. 2a→c), and (ii) a demonstration interface with which dentists animate surgical steps on the 3D model using virtual dental instruments (Fig. 2d). To meet the requirements of complete and patient-specific oral cavity modeling, OralViewer combines the recent technique of 3D reconstruction with deformable template modeling. In the first step, a novel deep CNN, shown in Fig. 3 (Requirement #1), is proposed to estimate the patient's 3D teeth structures from a single 2D panoramic X-ray (Fig. 3(1)). The task is more challenging than anatomy modeling in surgical planning scenarios, where detailed 3D medical imaging (e.g., CT and MRI) is available for extracting 3D virtual models. To tackle it, the model decomposes the problem into two easier sub-tasks, teeth localization (Fig. 3b) and patch-wise single tooth reconstruction (Fig. 3c); meanwhile, feature maps (Fig. 3a) are shared to increase the generalization of the model. A semi-automatic pipeline is applied to extract the dental arch curve (Fig. 3(5)) from occlusal surface photos of the patient, since such information is lost during the rotational scanning process of a panoramic X-ray. In the second step, 3D templates of the gum and jawbones are defined from existing 3D head CTs, and non-rigidly registered to the estimated teeth to tailor them to a specific patient's oral anatomy, since the 3D structures of the gum and jaw are not well reflected in an X-ray. Finally, the deformed gum and jawbone models can be assembled with the 3D reconstructed teeth to form the complete oral cavity model (Requirement #2). A technical evaluation with data from 10 patients showed that the CNN-based 3D teeth reconstruction achieved an average intersection over union (IoU) of 0.771 ± 0.062. An expert study with three board-certified dentists further confirmed that the reconstructed oral cavity models appropriately reflect patient-specific conditions and are clinically valid for patient education.
Fig. 3 OralViewer reconstructs 3D patient-specific tooth structures (6) from a single panoramic X-ray (1)
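As a side note on the metric just reported, the sketch below computes intersection over union on binary voxel grids; the two volumes are random placeholders standing in for a reconstructed tooth and its ground-truth segmentation.

import numpy as np

def voxel_iou(pred, gt):
    """IoU between two binary 3D occupancy grids of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return intersection / union if union > 0 else 1.0

pred = np.random.rand(32, 32, 32) > 0.5   # placeholder reconstruction
gt = np.random.rand(32, 32, 32) > 0.5     # placeholder ground truth
print(round(voxel_iou(pred, gt), 3))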
Fig. 4 A diagram that compares the prioritization matrices held by two different medical profes-
sionals (blue versus orange line) when making diagnostic decisions. While one doctor uses severity
as the primary parameter to weigh various data during diagnosis/treatment, the other cares most
about the calculation of risks and responsibilities
Fig. 5 To enable physicians to comprehend AI’s findings, we conducted three user-centered activ-
ities to formulate the design of CheXplain [106]—a system that enables physicians to explore and
understand AI-enabled chest X-ray analysis
system for physicians to interactively explore and understand CXR analysis generated
by a state-of-the-art AI [42].
Iteration #1: Survey. We conducted a paired survey of referring physicians (N = 39) and radiologists (N = 38) to learn how radiologists currently explain their analyses to referring physicians, and how referring physicians expect explanations from both human and (hypothetical) AI radiologists. The findings reveal whether, when, and what kinds of explanations are needed between referring physicians and radiologists. By juxtaposing referring physicians' responses to these questions with respect to human versus AI radiologists, we elicited system requirements that encompass current practices and a future of AI-enabled diagnosis.
Iteration #2: Low-fi prototype. Co-designed with three physicians, the low-fidelity prototype manifested the survey-generated system requirements as eight specific features (Fig. 6a–f): augmenting input CXR images with specific inquiries, mediating system complexity by the level of urgency, presenting explanations hierarchically, calibrating regional findings with contrastive examples, communicating results probabilistically, contextualizing impressions with additional information, and comparing with images from the past or from other patients.
Iteration #3: High-fi prototype. The high-fidelity prototype integrates the eight features into CheXplain (Fig. 7), a functional system front-end that allows physician users to explore real results generated by an AI [42], while all other explanatory information either comes from real clinical reports or is manually generated by a radiologist collaborator. An evaluation with six medical professionals provides summative insights on each feature. Participants provided more detailed and accurate explanations of the underlying AI after interacting with CheXplain. We summarize design and implementation recommendations for the future development of explainable medical AI.
One meaningful finding in this process is the distinction between explanation and
justification. Our research started with a focus on explanations of AI, which enables
Fig. 6 Low-fidelity prototypes of eight key system designs of CheXplain formulated by co-design
sessions with physicians from UCLA Health
Fig. 7 The high-fidelity prototype of CheXplain, which was evaluated in work sessions involving
eight physicians across the US
Use more explanation for medical data analysts (e.g., radiologists), and more justification for medical data consumers (e.g., referring physicians). We find that radiologists tend to expect more explanation of details such as low-level annotations of CXR images, while referring physicians are generally less concerned with the intrinsic details and care more about the extrinsic validity of the AI's results. Retrospectively, we can see that five of CheXplain's features are justifications (#3–7).
Enable explanation and justification at different levels of abstraction, similar
to how CheXplain employs the examination-observation-impression hierarchy to
scaffold both explanation and justification. Holistically, as a physician follows a
bottom-up or top-down path to understand an AI’s diagnosis of a patient, at any step
along the way, they should be able to seek both explanation and justification. To
achieve this, the XAI community needs to consider explanation regulated by a user-
specified level of abstraction; research on Content-Based Image Retrieval (CBIR)
should enable search criteria at multiple levels of abstraction, e.g., from a region of
specific visual features to a global image that presents a specific impression.
4.2.1 OralViewer
Requirement: The dentists consider it important to show, for each surgery step, (i) how the step is performed, illustrating the applied instruments, and (ii) what happens in the step, animating the dental structure changes upon the application of the instruments. Moreover, the demonstration should be carried out by dentists using simple interaction techniques, which is more important than achieving realistic, high-fidelity effects.
Guided by this requirement, a prototype of OralViewer's interface was developed, with the overall workflow shown in Fig. 8. OralViewer provides a web-based interface for dentists to demonstrate surgery steps with a set of virtual dental instruments. The instruments allow dentists to express what action is applied to the oral cavity and where, and to demonstrate the effect on the model in real time. A dentist starts by importing a reconstructed 3D oral cavity model (Fig. 8a), which can be viewed freely with rotation and scaling. To apply a virtual dental instrument, the dentist selects the instrument from a list (Fig. 8(1)). Upon selection, the corresponding instrument model (Fig. 8(4, 6)) is visualized and can be controlled with a mouse to move and operate on the oral cavity model. Moreover, dentists can use simple sliders to customize the animation effect of the instruments to better suit their particular needs and preferences (Fig. 8(3, 5)). The selected instrument can be applied directly to a dental structure to demonstrate effects through clicking, pressing, and dragging (Fig. 8d). Since a typical dental surgery consists of sequential steps using multiple dental instruments, the aforementioned steps of instrument selection, adjustment, and animation can be repeated to demonstrate the full procedure.
OralViewer was validated for the demonstration of two common but complex dental surgeries, crown lengthening and apicoectomy, each of which involves multiple steps. A user study involving 12 patients compared the effectiveness of patient education using OralViewer against the common practice of verbal description, and the results indicated that OralViewer led to a significantly better understanding of a forthcoming surgery (p < 0.05). In an expert study, three dentists who had used OralViewer showed a high preference for the tool and pointed out that such a tool is much needed given patients' recently growing expectations for an improved dental visit experience and their willingness to be involved in treatment planning. Regarding usability, the experts noted that controlling the virtual instruments with a mouse was unfamiliar to dentists: it differs from how dentists handle real dental instruments in surgery, which can lead to a steep learning curve. As such, an implementation on a touch screen, e.g., an iPad with stylus control, could be more intuitive for dentists.
4.2.2 SmartReporting
For every medical study (e.g., X-ray, CT, and MRI) performed, a physician is required to generate a report describing all the pertinent findings (i.e., both positive findings and pertinent negative ones). Since reports need to be comprehensive and accurate, generating them usually takes a large amount of a physician's time for each study performed. At the same time, with the large number of medical imaging studies conducted annually [87, 88], it is imperative that physicians make diagnoses from studies efficiently and accurately. Thus, there is a need for tools that reduce the time physicians spend generating reports. State-of-the-art systems let the physician dictate findings while reading a study; the dictation is converted to textual sentences automatically in real time with voice recognition algorithms [77, 93]. However, such systems can lead to fatigue, since the physician needs to speak into a microphone continually over a typical workday. Moreover, they are vulnerable to speech interpretation errors: the variety of speaker accents and uncommon medical terminology predispose them to mistakes, which from time to time require physicians to revise or re-enter findings.
Different from those systems, SmartReporting leverages medical knowledge and AI-enabled automated awareness of anatomies and findings to save physicians' effort in creating reports. First, a template is pre-defined for each finding type at each anatomy, which can include entries and options detailing different aspects of a finding. An example of such a template is shown in Fig. 9. The physician can fill in the template through one or more types of human-machine interaction, such as mouse clicks, screen taps, typing, and speaking into a microphone. By encoding medical knowledge, such templates reduce the physician's input from full narration of findings to selecting options or entering key values. Next, to enable fast retrieval of templates and their automated filling, SmartReporting defines two working modes: a semi-auto mode and an auto mode.
Semi-auto mode: This mode captures the physician's mouse click on a location of interest and prompts the physician to select, from a list, a template to fill in for describing the finding (Fig. 10a). The list is filtered to contain only templates related to the anatomies around the clicked location. To achieve this anatomy awareness, a CNN-based anatomical segmentation algorithm is applied, which incorporates structural invariances to enhance the model's robustness to possible pathologies [56]. Figure 11b demonstrates the segmentation of 65 anatomies for an input head CT as an example.
Auto mode: This mode makes use of computer-aided diagnosis of a medical scan. As shown in Fig. 10b, upon activation the physician is prompted with a list of templates related to the findings around the selected location. Moreover, the selected templates are pre-filled with information from the automated diagnosis for extra time saving. For example, the dimensions of a hemorrhage can be calculated from the AI's predicted region and the voxel spacing stored in the scan's metadata. The physician can then edit or confirm the template with the pre-filled text. To achieve the automated diagnosis, SmartReporting applies multiple CNN models known in the art [20], each detecting the existence of one type of finding and generating a mask for it. As an example, Fig. 11c illustrates the detection and segmentation of hemorrhage and edema from an input head CT.
Fig. 9 An example of the interactive template of SmartReporting, which is used to guide the physician in filling in descriptions of a finding
Fig. 10 SmartReporting's interactive study display interface with its response to the physician's selections in the a semi-auto mode and b auto mode. c An example of SmartReporting's report screen with an auto-generated readable report
Fig. 11 Given a head CT scan (a), SmartReporting's AI enables the segmentation of 65 anatomies (b) for both the semi-auto and auto modes, and the segmentation of multiple types of positive findings, e.g., hemorrhage and edema (c), for the auto mode
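To make the pre-filling step concrete, here is a minimal sketch, under illustrative assumptions rather than SmartReporting's actual implementation, that derives a finding's volume and rough extent from a binary AI mask and the voxel spacing read from the scan's metadata; the mask and spacing are placeholders.

import numpy as np

def finding_dimensions(mask, voxel_spacing_mm):
    """Return volume (mL) and bounding-box size (mm) of a binary 3D finding mask."""
    voxel_volume_mm3 = float(np.prod(voxel_spacing_mm))
    volume_ml = mask.sum() * voxel_volume_mm3 / 1000.0
    zs, ys, xs = np.nonzero(mask)
    extent_mm = [(axis.max() - axis.min() + 1) * sp
                 for axis, sp in zip((zs, ys, xs), voxel_spacing_mm)]
    return volume_ml, extent_mm

mask = np.zeros((40, 128, 128), dtype=bool)
mask[10:20, 50:70, 60:90] = True                    # placeholder hemorrhage region
volume, extent = finding_dimensions(mask, voxel_spacing_mm=(5.0, 0.5, 0.5))
print(f"volume: {volume:.1f} mL, extent (mm): {extent}")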
Once the physician finishes describing all the findings, SmartReporting converts the filled templates into a medical report, as in the example shown in Fig. 10c. This conversion uses a pre-defined mapping protocol that maps template entry-option pairs to readable sentences.
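The sketch below illustrates the kind of entry-option-to-sentence mapping such a protocol could use; the template fields and sentence patterns are illustrative assumptions, not SmartReporting's actual protocol.

filled_template = {
    "finding": "hemorrhage",
    "anatomy": "left frontal lobe",
    "size_ml": 7.5,
    "mass_effect": "no",
}

SENTENCE_PATTERNS = {
    "hemorrhage": ("There is a {size_ml:.1f} mL hemorrhage in the {anatomy} "
                   "with {mass_effect} associated mass effect."),
}

def template_to_sentence(template):
    """Map a filled entry-option template to a readable report sentence."""
    return SENTENCE_PATTERNS[template["finding"]].format(**template)

print(template_to_sentence(filled_template))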
SmartReporting has the potential to provide multiple advantages over state-of-the-art voice-based reporting systems: (i) it can reduce the time to complete a diagnosis by using pre-defined templates to describe findings, using image-based AI to accelerate template retrieval and filling, and automatically converting filled templates into reports; (ii) it can reduce interpretation mistakes, since the physician describes findings mainly by selecting among the options provided by templates rather than by free narration; (iii) it adapts more readily to multiple languages, since it mostly uses built-in entries and options for describing findings, stored with language-independent encodings; and (iv) it has the potential to ease the mining of medical records, a meaningful task for disease prediction and for improving clinical decision-making [45], since all generated findings are standardized in wording and structured in format.
Despite the achievements of the above-mentioned computer tools, they all share a relatively simple working mode: the AI's result is first presented as is, and the physician then diagnoses by accepting or declining the result, or by basing their practice on it. As such, the tools' performance can be largely affected by the AI's mispredictions. One study suggests that a physician can be more likely to miss a diagnosis in the presence of a false-negative AI detection than without the AI system [51]. However, errors are almost inevitable for current AI algorithms for multiple reasons, e.g., the ambiguity that exists in medical imaging,
by default, which could defeat the recommendation boxes’ purpose to accelerate the
pathologists’ examination process.
Lesson #2: Medical diagnosis is seldom a one-shot task; thus AI's recommendations need to continuously direct a medical user in filtering and prioritizing a large task space, taking into account new information extracted from the user's up-to-date input. Recommendation #2: make AI-generated suggestions always available (and constantly evolving) throughout the process of a (manual) examination. For example, in Impetus, a straightforward design idea is to show recommendation boxes one after another. We believe this is especially helpful when the pathologist is drawn to a local, zoomed-in region and neglects the rest of the WSI. The always-available recommendation boxes can serve as global anchors that inform pathologists of what might need to be examined elsewhere beyond the current view. This is an example of multi-shot diagnosis behavior, where each shot is an attempt to find tumor cells in a selected region.
Lesson #3: Medical tasks are often time-critical; thus the benefits of AI's guidance, suggestions, and recommendations need to be weighed against the amount of extra effort incurred and the actionability of the provided information. Recommendation #3: weigh the amount of extra effort by co-designing a system with the target medical users, as different physicians have different notions of time urgency. Emergency room doctors often deal with urgent cases by making decisions in a matter of seconds, internists often perform examinations in 15–20 min per patient, and oncologists or implant specialists might decide on a case over multiple meetings that span days. There is a sense of timeliness in all these scenarios, but the amount of time that can be budgeted differs from case to case. To address such differences, we further recommend modeling each interactive task in a medical AI system (i.e., how long it might take the user to perform each task) and providing a mechanism that allows physicians to 'filter out' interactive components that might take too much time (e.g., the attention map in Impetus). Importantly, the different levels of urgency should be modifiable (perhaps as a one-time setup) by physicians in different specialties.
Lesson #4: To guide the examination process with prioritization, AI should help a medical user narrow in on small regions of a large task space, as well as help them filter out information within specific regions. Recommendation #4: use visualization to filter out information, i.e., leverage the AI's results to reduce the information load on physicians. An example would be a spotlight effect that darkens parts of a WSI where the AI detects few or no tumor cells. Based on our observation that pathologists used the AI's results to confirm their examination of the original H&E WSI, such an overt visualization can help them filter out subsets of the WSI patches. Meanwhile, pathologists can also reveal a darkened region if they want to examine the AI's findings further (e.g., when they disagree with the AI, believing a darkened spot has signs of tumor).
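A minimal sketch of such a spotlight effect, assuming a per-pixel tumor-probability heatmap from the AI; the WSI tile, heatmap, and threshold are placeholders rather than Impetus's actual implementation.

import numpy as np

def spotlight(wsi_rgb, tumor_heatmap, threshold=0.3, dim_factor=0.35):
    """Dim pixels whose tumor probability is below the threshold."""
    keep = (tumor_heatmap >= threshold)[..., None]          # broadcast over RGB
    return np.where(keep, wsi_rgb, (wsi_rgb * dim_factor).astype(wsi_rgb.dtype))

wsi = np.random.randint(0, 256, size=(512, 512, 3), dtype=np.uint8)   # placeholder WSI tile
heatmap = np.random.rand(512, 512)                                    # placeholder AI output
highlighted = spotlight(wsi, heatmap)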
Lesson #5: It is possible for medical users to provide labels during their workflow
with acceptable extra effort. However, the system should provide explicit feedback
on how the model improves as a result, as a way to motivate and guide medical
users’ future inputs. Recommendation #5: when adapting the model on-the-fly,
show a visualization that indicates the model’s performance changes as the physician
Human-Centered AI for Medical Imaging 563
labels more data. There could be various designs of such information, from showing
low-level technical details (e.g., the model’s specificity versus sensitivity), high-level
visualization (e.g., charts that plot accuracy over WSIs read), and even actionable
items (e.g., ‘nudging’ the user to label certain classes of data to balance the training
set). There are two main factors to consider when evaluating a given design: (i) as we
observed in our study, whether the design could inform the physician of the model’s
performance improvement or degradation as they label more data, which can be
measured quantitatively as the amount of performance gain divided by the amount
of labeling work done; (ii) as we noted in Lesson #2, whether consuming the extra information incurs too much effort and slows down the agile labeling process, and whether the extra information about model performance changes is actionable.
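To make factor (i) concrete, the short Python sketch below computes such a 'gain per label' measure from a log of model accuracies recorded as the physician labels data; it is our own illustration under an assumed logging format, not part of Impetus.

def gain_per_label(accuracy_log):
    """Performance gain divided by labeling work.

    accuracy_log: list of (num_labels_provided, model_accuracy) tuples
    recorded after each model update, ordered by time.
    """
    if len(accuracy_log) < 2:
        return 0.0
    first_labels, first_acc = accuracy_log[0]
    last_labels, last_acc = accuracy_log[-1]
    labels_added = last_labels - first_labels
    if labels_added <= 0:
        return 0.0
    return (last_acc - first_acc) / labels_added


# Example: accuracy rises from 0.81 to 0.87 after 30 additional labels.
print(gain_per_label([(10, 0.81), (25, 0.85), (40, 0.87)]))  # 0.002 per label

A visualization built on such a measure could then be filtered out by the physician, per Recommendation #3, if consuming it costs more time than it saves.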
Lesson #6: Tasks treated equally by an AI might carry different weights for a medical user. Thus, for medically high-stakes tasks, AI should provide information to validate its confidence level. Recommendation #6: provide additional justification for a negative diagnosis of a high-stakes disease. For example, when Impetus
concludes a case as negative, the system can still display the top five regions wherein
AI finds the most likely signs of tumor (albeit below a threshold of positivity). In
this way, even if the result turned out to be a false negative, the physicians would be
guided to examine regions where the actual tumor cells are likely to appear. Beyond
such intrinsic details, it is also possible to retrieve extrinsic information, e.g., preva-
lence of the disease given the patient’s population, or similar histological images for
comparison. As suggested in [106], such extrinsic justification can complement the
explanation of a model’s intrinsic process, thus allowing physicians to understand
AI’s decision more comprehensively.
References
26. Croskerry P, Cosby K, Graber ML, Singh H (2017) Diagnosis: interpreting the shadows. CRC
Press, Boca Raton
27. Deep P (2000) Screening for common oral diseases. J-Can Dent Assoc 66(6):298–299
28. Endo K, Sata N, Ishiguro Y, Miki A, Sasanuma H, Sakuma Y, Shimizu A, Hyodo M, Lefor
A, Yasuda Y (2014) A patient-specific surgical simulator using preoperative imaging data: an
interactive simulator using a three-dimensional tactile mouse. J Comput Surg 1(1):1–8
29. Ertosun MG, Rubin DL (2015) Automated grading of gliomas using deep learning in digital
pathology images: a modular approach with ensemble of convolutional neural networks. In:
AMIA annual symposium proceedings, vol 2015. American Medical Informatics Association,
p 1899
30. Eschweiler J, Stromps JP, Fischer M, Schick F, Rath B, Pallua N, Radermacher K (2016) A
biomechanical model of the wrist joint for patient-specific model guided surgical therapy:
part 2. Proc Inst Mech Eng Part H: J Eng Med 230(4):326–334
31. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S (2017) Dermatologist-
level classification of skin cancer with deep neural networks. Nature 542(7639):115–118
32. Falk T, Mai D, Bensch R, Çiçek Ö, Abdulkadir A, Marrakchi Y, Böhm A, Deubner J, Jäckel
Z, Seiwald K et al (2019) U-net: deep learning for cell counting, detection, and morphometry.
Nat Methods 16(1):67–70
33. Fu H, Xu Y, Lin S, Wong DWK, Liu J (2016) Deepvessel: retinal vessel segmentation via
deep learning and conditional random field. In: International conference on medical image
computing and computer-assisted intervention. Springer, pp 132–139
34. Fu H, Xu Y, Wong DWK, Liu J (2016) Retinal vessel segmentation via deep learning network
and fully-connected conditional random fields. In: 2016 IEEE 13th international symposium
on biomedical imaging (ISBI). IEEE, pp 698–701
35. Gal Y, Ghahramani Z (2016) Dropout as a Bayesian approximation: representing model
uncertainty in deep learning. In: International conference on machine learning, pp 1050–1059
36. Gu H, Huang J, Hung L, Chen XA (2020) Lessons learned from designing an AI-enabled
diagnosis tool for pathologists. arXiv:2006.12695
37. Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, Venugopalan S, Widner
K, Madams T, Cuadros J et al (2016) Development and validation of a deep learning algorithm
for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316(22):2402–2410
38. Haenssle HA, Fink C, Schneiderbauer R, Toberer F, Buhl T, Blum A, Kalloo A, Hassen
ABH, Thomas L, Enk A et al (2018) Man against machine: diagnostic performance of a deep
learning convolutional neural network for dermoscopic melanoma recognition in comparison
to 58 dermatologists. Ann Oncol 29(8):1836–1842
39. Holzinger A, Malle B, Kieseberg P, Roth PM, Müller H, Reihs R, Zatloukal K (2017) Towards
the augmented pathologist: challenges of explainable-AI in digital pathology. arXiv preprint
arXiv:1712.06657
40. Huo Y, Xu Z, Xiong Y, Aboud K, Parvathaneni P, Bao S, Bermudez C, Resnick SM, Cutting
LE, Landman BA (2019) 3D whole brain segmentation using spatially localized atlas network
tiles. NeuroImage 194:105–119
41. Ilse M, Tomczak JM, Welling M (2018) Attention-based deep multiple instance learning.
arXiv preprint arXiv:1802.04712
42. Irvin J, Rajpurkar P, Ko M, Yu Y, Ciurea-Ilcus S, Chute C, Marklund H, Haghgoo B, Ball
R, Shpanskaya K et al (2019) CheXpert: a large chest radiograph dataset with uncertainty
labels and expert comparison. In: Thirty-Third AAAI conference on artificial intelligence.
http://aaai.org
43. Jiang F, Jiang Y, Zhi H, Dong Y, Li H, Ma S, Wang Y, Dong Q, Shen H, Wang Y (2017)
Artificial intelligence in healthcare: past, present and future. Stroke Vasc Neurol 2(4):230–243
44. Johnson S, Chapman K, Huebner G (1984) Stress reduction prior to oral surgery. Anesth Prog
31(4):165
45. Jothi N, Husain W et al (2015) Data mining in healthcare-a review. Procedia Comput Sci
72:306–313
65. Miller RA, Masarie FE (1990) The demise of the 'Greek Oracle' model for medical diagnostic systems. Methods Inf Med
66. Miller RA, Masarie FE (1990) The demise of the ‘Greek Oracle’ model for medical diagnostic
systems
67. Milletari F, Navab N, Ahmadi SA (2016) V-net: fully convolutional neural networks for
volumetric medical image segmentation. In: 2016 fourth international conference on 3D vision
(3DV). IEEE, pp 565–571
68. Mills I, Frost J, Cooper C, Moles DR, Kay E (2014) Patient-centred care in general dental
practice-a systematic review of the literature. BMC Oral Health 14(1):64
69. Nair T, Precup D, Arnold DL, Arbel T (2020) Exploring uncertainty measures in deep networks
for multiple sclerosis lesion detection and segmentation. Med Image Anal 59:101557
70. Nalisnik M, Amgad M, Lee S, Halani SH, Vega JEV, Brat DJ, Gutman DA, Cooper LA
(2017) Interactive phenotyping of large-scale histology imaging data with HistomicsML. Sci
Rep 7(1):14588
71. Olivieri LJ, Zurakowski D, Ramakrishnan K, Su L, Alfares FA, Irwin MR, Heichel J, Krieger
A, Nath DS (2018) Novel, 3D display of heart models in the postoperative care setting
improves CICU caregiver confidence. World J Pediatr Congenit Hear Surg 9(2):206–213
72. Oosterink FM, De Jongh A, Hoogstraten J (2009) Prevalence of dental fear and phobia relative
to other fear and phobia subtypes. Eur J Oral Sci 117(2):135–143
73. Pedoia V, Norman B, Mehany SN, Bucknor MD, Link TM, Majumdar S (2019) 3D con-
volutional neural networks for detection and severity staging of meniscus and PFJ cartilage
morphological degenerative changes in osteoarthritis and anterior cruciate ligament subjects.
J Magn Reson Imaging 49(2):400–410
74. Poplin R, Varadarajan AV, Blumer K, Liu Y, McConnell MV, Corrado GS, Peng L, Webster
DR (2018) Prediction of cardiovascular risk factors from retinal fundus photographs via deep
learning. Nat Biomed Eng 2(3):158
75. Prinz A, Bolz M, Findl O (2005) Advantage of three dimensional animated teaching over tra-
ditional surgical videos for teaching ophthalmic surgery: a randomised study. Br J Ophthalmol
89(11):1495–1499
76. Qadir HA, Balasingham I, Solhusvik J, Bergsland J, Aabakken L, Shin Y (2019) Improving
automatic polyp detection using CNN by exploiting temporal dependency in colonoscopy
video. IEEE J Biomed Health Inform 24(1):180–193
77. Rosenthal DF, Bos JM, Sokolowski RA, Mayo JB, Quigley KA, Powell RA, Teel MM (1997)
A voice-enabled, structured medical reporting system. J Am Med Inform Assoc 4(6):436–441
78. Roux L, Racoceanu D, Loménie N, Kulikova M, Irshad H, Klossa J, Capron F, Genestie C,
Le Naour G, Gurcan MN (2013) Mitosis detection in breast cancer histological images an
ICPR 2012 contest. J Pathol Inform 4
79. Rozier RG, Horowitz AM, Podschun G (2011) Dentist-patient communication techniques
used in the united states: the results of a national survey. J Am Dent Assoc 142(5):518–530
80. Rubin GD (2015) Lung nodule and cancer detection in CT screening. J Thorac Imaging
30(2):130
81. Schaekermann M, Cai CJ, Huang AE, Sayres R (2020) Expert discussions improve com-
prehension of difficult cases in medical image assessment. In: Proceedings of the 2020 CHI
conference on human factors in computing systems, pp 1–13
82. Settles B (2009) Active learning literature survey. University of Wisconsin-Madison Depart-
ment of Computer Sciences, Technical Report
83. Shan H, Padole A, Homayounieh F, Kruger U, Khera RD, Nitiwarangkul C, Kalra MK,
Wang G (2019) Competitive performance of a modularized deep neural network compared to
commercial algorithms for low-dose CT image reconstruction. Nat Mach Intell 1(6):269–276
84. Shiping X, Jean-Paul D, Yuan L (2020) System for generating medical reports for imaging
studies. US Patent App. 17/006,590
85. Shortliffe EH (1974) A rule-based computer program for advising physicians regarding
antimicrobial therapy selection. In: Proceedings of the 1974 annual ACM conference-volume
2, p 739
86. Shortliffe EH (1993) The adolescence of AI in medicine: will the field come of age in the
’90s? Artif Intell Med. https://doi.org/10.1016/0933-3657(93)90011-Q
87. Smith-Bindman R, Kwan ML, Marlow EC, Theis MK, Bolch W, Cheng SY, Bowles EJ,
Duncan JR, Greenlee RT, Kushi LH et al (2019) Trends in use of medical imaging in us health
care systems and in Ontario, Canada, 2000–2016. JAMA 322(9):843–856
88. Smith-Bindman R, Miglioretti DL, Larson EB (2008) Rising use of diagnostic medical imag-
ing in a large integrated health system. Health Aff 27(6):1491–1502
89. Sommer C, Straehle C, Koethe U, Hamprecht FA (2011) Ilastik: interactive learning and
segmentation toolkit. In: 2011 IEEE international symposium on biomedical imaging: from
nano to macro. IEEE, pp 230–233
90. Song W, Liang Y, Wang K, He L (2021) Oral-3D: reconstructing the 3D bone structure of
oral cavity from 2D panoramic X-ray. In: Proceedings of the AAAI conference on artificial
intelligence
91. Stewart MA (1995) Effective physician-patient communication and health outcomes: a review.
CMAJ: Can Med Assoc J 152(9):1423
92. Sutton RT, Pincock D, Baumgart DC, Sadowski DC, Fedorak RN, Kroeker KI (2020) An
overview of clinical decision support systems: benefits, risks, and strategies for success. NPJ
Digit Med 3(1):1–10
93. Teel MM, Sokolowski R, Rosenthal D, Belge M (1998) Voice-enabled structured medical
reporting. In: Proceedings of the SIGCHI conference on human factors in computing systems,
pp 595–602
94. Travaline JM, Ruchinskas R, D’Alonzo GE Jr (2005) Patient-physician communication: why
and how. J Am Osteopat Assoc 105(1):13
95. Urban G, Tripathi P, Alkayali T, Mittal M, Jalali F, Karnes W, Baldi P (2018) Deep learning
localizes and identifies polyps in real time with 96% accuracy in screening colonoscopy.
Gastroenterology 155(4):1069–1078
96. Vardell E, Bou-Crick C (2012) VisualDX: a visual diagnostic decision support tool. Med Ref
Serv Q 31(4):414–424
97. Veta M, Heng YJ, Stathonikos N, Bejnordi BE, Beca F, Wollmann T, Rohr K, Shah MA, Wang
D, Rousson M et al (2019) Predicting breast tumor proliferation from whole-slide images:
the TUPAC16 challenge. Med Image Anal 54:111–121
98. Veta M, Van Diest PJ, Willems SM, Wang H, Madabhushi A, Cruz-Roa A, Gonzalez F, Larsen
AB, Vestergaard JS, Dahl AB et al (2015) Assessment of algorithms for mitosis detection in
breast cancer histopathology images. Med Image Anal 20(1):237–248
99. Wang P, Xiao X, Brown JRG, Berzin TM, Tu M, Xiong F, Hu X, Liu P, Song Y, Zhang D et al
(2018) Development and validation of a deep-learning algorithm for the detection of polyps
during colonoscopy. Nat Biomed Eng 2(10):741–748
100. Wears RL, Berg M (2005) Computer technology and clinical work: still waiting for Godot.
JAMA 293(10):1261–1263. https://doi.org/10.1001/jama.293.10.1261
101. Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, Ellrott K, Shmulevich
I, Sander C, Stuart JM, Network CGAR et al (2013) The cancer genome atlas pan-cancer
analysis project. Nat Genet 45(10):1113
102. Welker SA (2014) Urogynecology patient education: visualizing surgical management of
pelvic organ prolapse. PhD thesis, Johns Hopkins University
103. Whitman RM, Dufeau D (2017) Visualization of cardiac anatomy: new approaches for medical
education. FASEB J 31(1_supplement), 736–8
104. Winkler JK, Fink C, Toberer F, Enk A, Deinlein T, Hofmann-Wellenhof R, Thomas L, Lallas
A, Blum A, Stolz W et al (2019) Association between surgical skin markings in dermo-
scopic images and diagnostic performance of a deep learning convolutional neural network
for melanoma recognition. JAMA Dermatol 155(10):1135–1141
105. Wolf JA, Moreau JF, Akilov O, Patton T, English JC, Ho J, Ferris LK (2013) Diagnostic
inaccuracy of smartphone applications for melanoma detection. JAMA Dermatol 149(4):422–
426
106. Xie Y, Chen M, Kao D, Gao G, Chen XA (2020) CheXplain: enabling physicians to explore
and understand data-driven, AI-enabled medical imaging analysis. In: Proceedings of the 2020
CHI conference on human factors in computing systems, CHI ’20. Association for Computing
Machinery, New York, pp 1–13. https://doi.org/10.1145/3313831.3376807
107. Xie Y, Gao G, Chen XA (2019) Outlining the design space of explainable intelligent systems
for medical diagnosis
108. Xu Y, Zhu JY, Chang E, Tu Z (2012) Multiple clustered instance learning for histopathol-
ogy cancer image classification, segmentation and clustering. In: 2012 IEEE conference on
computer vision and pattern recognition. IEEE, pp 964–971
109. Yang J, Liang Y, Zhang Y, Song W, Wang K, He L (2020) Exploring instance-level uncertainty
for medical detection
110. Yang X, Hu J, Zhu S, Liang X, Li J, Luo E (2011) Computer-assisted surgical planning and
simulation for condylar reconstruction in patients with osteochondroma. Br J Oral Maxillofac
Surg 49(3):203–208
111. Yu L, Chen H, Dou Q, Qin J, Heng PA (2016) Integrating online and offline three-dimensional
deep learning for automated polyp detection in colonoscopy videos. IEEE J Biomed Health
Inform 21(1):65–75
112. Zhan ZQ, Fu H, Yang YY, Chen J, Liu J, Jiang YG (2020) Colonoscopy polyp detection:
domain adaptation from medical report images to real-time videos
113. Zhang QS, Zhu SC (2018) Visual interpretability for deep learning: a survey. Front Inf Technol
Electron Eng 19(1):27–39
114. Zhou SK, Greenspan H, Davatzikos C, Duncan JS, van Ginneken B, Madabhushi A, Prince
JL, Rueckert D, Summers RM (2020) A review of deep learning in medical imaging: image
traits, technology trends, case studies with progress highlights, and future promises. arXiv
preprint arXiv:2008.09104
115. Zhou ZH (2004) Multi-instance learning: a survey. Department of Computer Science & Tech-
nology, Nanjing University, Technical Report
116. Zhu Y, Zhang S, Liu W, Metaxas DN (2014) Scalable histopathological image analysis via
active learning. In: International conference on medical image computing and computer-
assisted intervention. Springer, pp 369–376
117. Zilly J, Buhmann JM, Mahapatra D (2017) Glaucoma detection using entropy sampling and
ensemble learning for automatic optic cup and disc segmentation. Comput Med Imaging
Graph 55:28–41
3D Spatial Sound Individualization with
Perceptual Feedback
1 Introduction
User modeling is important in the design of interactive systems. Each user has their
own physical and cognitive characteristics or preference biases. We can significantly improve the user experience by adapting a system to individual users, taking such char-
K. Yamamoto (B)
Hamamatsu-si, Shizuoka-ken, Japan
e-mail: yamo_o@acm.org
T. Igarashi
Bunkyo, Tokyo, Japan
Fig. 1 We can recognize the localization of the sound source by spectral transforms of both ear’s
sounds. These two-channel transforms of the spectrums can be represented as finite impulse response
filters and are called human-related transfer functions (HRTFs)
head, the perceived direction of the sound naturally changes accordingly. This allows
the user to feel as if the sound source is fixed in a space regardless of the user’s
motion. How is it possible to change the direction of sound perception even though
the positional relationship between the headphone and the ears remains unchanged?
The human auditory system perceives the directions of incoming sounds using both ears. Depending on the direction from which a sound arrives at the head, an arrival time difference between the left and right ears can be determined. In addition, the sound is intricately diffracted by the shape of the person's head and ears. This diffraction effect depends on the frequency and incoming direction of the sound. Therefore, the spectra of the sounds that arrive at each ear are modulated (Fig. 1). The human auditory system recognizes the location of the sound from these modulations. These two-channel modulations of the spectra can be represented as finite impulse response filters and are called head-related transfer functions (HRTFs). 3D spatial sound manipulates the perceived direction of an incoming sound by convolving these HRTFs with the original source according to the relative position between the user and the virtual source.
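As a rough illustration of this rendering step, the following Python/NumPy sketch convolves a mono source with a pair of head-related impulse responses (the time-domain form of the HRTFs) for one fixed direction; the random 'HRIRs' are placeholders, and a real renderer would also switch and interpolate HRTFs as the relative direction changes.

import numpy as np

def render_binaural(mono, hrir_left, hrir_right):
    """Convolve a mono source with the left/right head-related impulse
    responses for one fixed direction, producing a two-channel signal."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=0)   # shape: 2 x (len(mono) + len(hrir) - 1)

# Toy example: a noise source and 200-tap placeholder impulse responses.
rng = np.random.default_rng(0)
mono = rng.standard_normal(44100)            # 1 s at 44.1 kHz
hrir_l = rng.standard_normal(200) * 0.01     # placeholder left-ear HRIR
hrir_r = rng.standard_normal(200) * 0.01     # placeholder right-ear HRIR
binaural = render_binaural(mono, hrir_l, hrir_r)
print(binaural.shape)                        # (2, 44299)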
HRTFs are highly specific to individuals because they depend considerably on the shape of each individual user's ears and head. We call the proper HRTFs for an individual user individualized HRTFs. Inappropriate HRTFs are known to lead to improper localization of the sound source, accompanied by an unexpected equalization of the timbre; such improper localization especially includes front-back and up-down confusions [1–3]. Because of this, we essentially must measure the individualized HRTFs for each user.
The traditional method for acquiring individualized HRTFs is acoustic measurement in an anechoic chamber. Loudspeaker arrays are spherically arranged around the subject's head and two small microphones are inserted into both ears. The subject sits with his or her head placed at the center of these spherical speaker arrays and is instructed to remain still during the long measurement periods. Specific test signals (such as sine sweep signals) are played one by one from the different loudspeakers
and the signals at the microphones are recorded. By comparing these recordings
with those obtained from a microphone placed at the center of the speaker arrays
(without the subject), the individual HRTFs can be computed. Many variants exist
for conducting these measurements.
However, because such measurements usually require expensive equipment as well as tedious procedures, conducting them for each end user is impractical, which prevents their widespread use. Consequently, most 3D spatial sound applications are forced to use averaged HRTFs (not optimized for any particular person), resulting in a sub-optimal user experience. This may explain why 3D spatial sound rendering has not been as popular as visual rendering.
This chapter introduces a method to obtain individualized HRTFs without acoustic
measurement by leveraging machine learning and perceptual feedback. To achieve
this, we first design a deep neural network-based HRTF generative model trained on a publicly available HRTF database. However, of course, the database does not contain the optimal HRTFs for the target user, and it could be difficult to produce the target HRTFs directly. To address this, we train the model in such a way as to extract the unique features of each individual's HRTFs, and it produces new, tailored HRTFs for the target user by blending these features at the calibration phase. This is done by optimizing the embedded user-adaptable parameters in a human-in-the-loop manner.
This subsection gives an outline of adapting a generative model to a specific user with perceptual feedback. The details are explained in Sects. 2 and 3. The generative model y = G(z) produces an output y with the latent variable z as input. For example, in this chapter, this output y becomes an HRTF. If we change the latent variable z, the output
y will also change accordingly. In other words, if we can find the latent variable
that produces the optimal output for the target user by exploration, we can expect
to be able to adapt the generative model. However, there are two problems with this
method.
The first problem is that a well-generalized pre-trained model, as described in Sect. 1.1, has difficulty producing a truly optimal output for any particular
target user. To address this problem, we first transform the generative model as
y = G(z, β), where β represents a vector that has the dimension of the number of
subjects in the training dataset. During training, β becomes a one-hot vector that
indicates the subject to which the input data belongs. Thus, if the current input x is the p-th subject's, the corresponding element of β becomes 1; otherwise, 0. In this way, the model can explicitly handle to whom the current input data belongs.
By using β, this model separates the characteristic factors that are common among
the different subjects from the factors that are unique to each individual subject.
Then, we train the model so that the factors that are common among subjects are
generalized, and the part that handles factors that are not common among subjects
but are unique to each individual subject is optimized and not generalized. By doing
this, the model is able to generate a tailored output for all the subjects in the training
data. At the calibration phase, in order to make it possible to generate the optimal output
for a certain target user who is not included in the training dataset, β is not a one-hot
vector but a vector of continuous values. Then, the model can generate the optimal
output for the target user by blending the feature factors unique to each subject that
was separated during training. In this way, if the variance of the subject data included
in the training data is sufficiently large, it would be possible to generate the optimal
output for the target user. This specific method is discussed in detail in Sect. 2.
The second problem for exploring the latent space to obtain the optimal output
for a specific user is that it is difficult for the user to directly correct the errors of
the output from the model. Since the target user does not know the specific form of
output that suits them, they cannot say specifically what is wrong with the output
from the model. For example, in this chapter, it is impractical to answer with concrete
values which parts of the spectrum of the generated HRTFs are suitable for oneself
and which parts are not. Therefore, it is necessary to use other indirect methods to
properly evaluate the errors in the model output. A solution to this is perceptual
feedback, in which the target user is asked to evaluate the output by answering a
simple perceptual question, rather than specifically pointing out the errors in the
model output. However, it is often very difficult for users to express their perceptual
score of the model output in absolute values (absolute assessment) [4, 5]. For this
reason, it is desirable to use relative comparative evaluation (relative assessment)
instead of absolute scoring [5–7]. In relative assessment, an evaluator is provided
multiple options and chooses the one that seems better among them. The number of
options may be two in the simplest form but can be more than two. In latent space
exploration, we repeatedly conduct this evaluation, and the model output is evaluated
based on the relationship of the evaluations among the queries. We explore the latent
space so that the evaluation is maximized. The method presented in this chapter also
introduces such relative assessment to explore the latent space of the model. This
will be explained in detail in Sect. 3.2.
There are several methods for exploring the latent space of the model from such perceptual feedback (e.g., genetic algorithms [18], preferential Bayesian optimization [7], and evolution strategies [8]). A common requirement for all of these is the
ability to complete the exploration with as few queries as possible. This is especially
important when we treat what are called expensive-to-evaluate problems, where the
user’s evaluation cost is high. For example, the cost for a person to listen/watch
and evaluate a sound/image is much higher than the cost for a computer to compute
one iteration of the optimization. In such a problem, it is not practical to make
the user evaluate many times because the cost of one query is too high. Therefore,
minimizing the number of queries is very important. In Sect. 3.3, we introduce a
hybrid combination of evolution strategy and gradient descent method as a way to
achieve this. This method combines evolution strategy, which is a sampling-based
optimization method, with gradient information to improve the convergence in a
smaller number of queries. However, since the user’s perceptual evaluation is done by
sampling discrete points in the latent space, we cannot directly observe the function that represents the user's evaluation score, and thus cannot obtain the gradient.
In the method introduced in this chapter, the gradient is estimated by inferring the
landscape of the function, assuming that the latent model's evaluation values are random variables with a certain posterior distribution (e.g., a Gaussian process).
Chapter Organization This chapter is organized as follows. Section 2 describes
how we can embed user-adaptable nonlinear parameters into a pre-trainable generative neural network model and extract the characteristic features of individual users' data. We discuss these techniques without limiting them specifically to the HRTF individualization problem. Then, Sect. 3 illustrates how the internal parameters can
be optimized for a specific user through perceptual feedback of the user: pairwise
comparison queries. We also introduce some techniques to accelerate the adaptation
by reducing the number of queries. Section 4 presents an actual example of 3D spatial
sound adaptation as an application of the methods introduced in this chapter. We
also present, as an example, a user study that shows how localization quality improves through our framework (Sect. 4). Finally, we discuss the possibilities
and some future directions of the introduced methods (Sect. 5), and then summarize
this chapter (Sect. 6).
Fig. 2 Conditional variational AutoEncoder. This model consists of an encoder and a decoder and reconstructs the output from the input so that the latent variable follows a prior distribution
The conditional log-likelihood of the model is bounded from below as

log pθ(c|x) ≥ −KL(qφ(z|x, c) ‖ pθ(z|x)) + E_{qφ(z|x,c)}[log pθ(c|x, z)],   (1)

where KL(·‖·) denotes the Kullback–Leibler divergence. Suppose that the latent variables follow a Gaussian distribution, z = gφ(x, ε), ε ∼ N(0, I), where gφ(·, ·) is a deterministic differentiable function; the empirical lower bound can then be written as

L(x, c; θ, φ) = −KL(qφ(z|x, c) ‖ pθ(z|x)) + (1/L) Σ_{l=1}^{L} log pθ(c|x, z^{(l)}),   (2)

where L is the number of samples. In addition, the first term on the right-hand side of Eq. (2) can be analytically computed as

−KL(qφ(z|x, c) ‖ pθ(z|x)) = (1/2) Σ_{j=1}^{J} (1 + log(σ_j²) − μ_j² − σ_j²),   (3)

where μ and σ² denote the variational mean and variance, and J is the dimensionality of z. This loss function is differentiable with respect to every parameter of the neural network, so we can train the model with stochastic gradient descent methods such as Adam [20].
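To make Eqs. (2) and (3) concrete, the NumPy sketch below evaluates the negative empirical lower bound for Gaussian latents with a Gaussian reconstruction term; the placeholder linear decoder and the dimensions are our own choices, and in practice both terms are computed inside an automatic-differentiation framework and minimized with Adam [20].

import numpy as np

rng = np.random.default_rng(0)
J, L, D = 8, 4, 16                          # latent dim, number of samples, data dim

mu = rng.standard_normal(J) * 0.1           # variational mean from the encoder
log_var = rng.standard_normal(J) * 0.1      # variational log-variance from the encoder
x = rng.standard_normal(D)                  # target to reconstruct
W_dec = rng.standard_normal((D, J)) * 0.1   # placeholder linear decoder

# Eq. (3): analytic KL between q(z|x, c) and a standard normal prior.
kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

# Eq. (2): Monte-Carlo reconstruction term with the reparameterization
# z = g(x, eps) = mu + sigma * eps, eps ~ N(0, I).
recon = 0.0
for _ in range(L):
    eps = rng.standard_normal(J)
    z = mu + np.exp(0.5 * log_var) * eps
    x_hat = W_dec @ z
    recon += -0.5 * np.sum((x - x_hat) ** 2)   # Gaussian log-likelihood (up to a constant)
recon /= L

elbo = -kl + recon
print("negative lower bound (training loss):", -elbo)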
In this subsection, we introduce an adaptive layer that separates the individual and
non-individual factors from a training dataset during training. This is a new type of layer; here, a layer in a neural network means a combination of a linear function and a nonlinear function. Generally, a neural network consists of several layers, and we can use arbitrary types of functions for a layer as long as they are differentiable.
The model we are aiming for in this chapter not only reconstructs the input x, but
also aims to extract the personality of the subjects in the database.
As described above, a common layer in a neural network can be written as a combination of a linear and a nonlinear function:

y = f(Linear(x)) = f(W^T x + b),   (4)

where x ∈ R^N and y ∈ R^M are the input and output of this layer, respectively, Linear() denotes a linear layer function of the neural network, W ∈ R^{N×M} is a weight matrix, b ∈ R^M is a bias vector, and f() is an arbitrary nonlinear function (e.g., a sigmoid). We introduce a blending vector β to represent individuality in the model. β has the same dimension as the number of subjects K in the training dataset. It is a one-hot vector during training, and each dimension corresponds to a subject in the training data; we train the model by setting β according to the training data. We now consider isolating the latent individualities from the weight matrix into a tensor in an unsupervised manner. The assumption here is that the subjects in the training data sufficiently cover a wide variety of individual factors, so that the target user's characteristics can be approximated by a combination of the separated individualities.
The adaptive layer is based on tensor factorization that employs stochastic gradient descent optimization [12]. By introducing the new parameter β, we decompose W and b as follows (Fig. 3):

W = A ⊗3 β,   b = B ⊗3 β,   (5)

where A and B are tensors that stack the per-subject weight matrices and bias vectors along a third (subject) axis, and ⊗3 denotes the mode-3 product with β.
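A minimal NumPy sketch of such an adaptive layer is given below; the tensor shapes, the sigmoid nonlinearity, and the example β values are illustrative assumptions rather than the chapter's exact implementation.

import numpy as np

def adaptive_layer(x, A, B, beta):
    """Adaptive layer y = f(x W + b) with W = A x3 beta and b = B x3 beta:
    the per-subject weights stacked along the last (subject) axis of the tensors
    A (N x M x K) and B (M x K) are blended by the K-dimensional vector beta."""
    W = np.einsum('nmk,k->nm', A, beta)        # blended N x M weight matrix
    b = B @ beta                               # blended M-dimensional bias
    return 1.0 / (1.0 + np.exp(-(x @ W + b)))  # sigmoid as the nonlinearity f

rng = np.random.default_rng(0)
N, M, K = 6, 4, 3                              # input dim, output dim, number of subjects
A = rng.standard_normal((N, M, K))
B = rng.standard_normal((M, K))
x = rng.standard_normal(N)

beta_train = np.array([0.0, 1.0, 0.0])         # one-hot: data of subject 2 during training
beta_user = np.array([0.2, 0.5, 0.3])          # continuous blend at calibration time
print(adaptive_layer(x, A, B, beta_train))
print(adaptive_layer(x, A, B, beta_user))

During training β selects exactly one subject's factors; at calibration the same layer blends the separated individualities, which is what allows a single set of trained tensors to serve users outside the training set.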
After training, we aim to calibrate the generator (the decoder of the neural network) to obtain individualized data for a target user. To this end, we assume that the
output optimized for an individual user can be expressed as a blending of the trained
individualities of features in the dataset in a nonlinear space.
Thus, we now reinterpret the binary subject label β in Eq. (5) as a continuous personalization weight vector. Each β_i takes a continuous value in [0, 1] and is constrained by Σ_{i=1}^{K} β_i = 1. Then, from Eq. (5), the individualization transformation matrices optimized for the user can be expressed as A ⊗3 β and B ⊗3 β, which are blends of the individualities of the subjects included in the trained data set. Similarly, the latent variables z, which are necessary for generating a new HRTF, are also transformed using β as
z̄ ∼ N(z̄_mean, (1/2) z̄_var),   with z̄_mean = Z_mean β and z̄_var = Z_var β,   (7)

where Z_mean ∈ R^{L×K} and Z_var ∈ R^{L×K} are matrices whose columns are the pre-computed latent vectors z_mean^1(c), . . . , z_mean^K(c) and z_var^1(c), . . . , z_var^K(c) of the corresponding subjects. Furthermore, z_mean^k(c) and z_var^k(c) are switched by the condition c.
Fig. 4 A comparison between the hidden-units interpolation approach (bottom left) and our adaptive layers (bottom right). We trained two networks on three nonlinear functions A, B, and C (top). The red, blue, and green lines in the bottom two graphs represent the reconstructed functions, respectively. The purple lines denote an equal blend of the three functions, and the black line denotes a blend of A and B. The hidden-units interpolation approach diminishes the details of each function, while our method preserves them
L denotes the dimensionality of the latent variables. We use the blended z̄ for
the latent variables in the individual feature vector of the user.
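A few lines of NumPy illustrate this latent blending; the matrices here are random stand-ins for the pre-computed per-subject latent vectors, and the 1/2 factor on the variance follows the reconstruction of Eq. (7) above, which should be treated as an assumption.

import numpy as np

rng = np.random.default_rng(0)
L, K = 8, 3                                    # latent dimensionality, number of subjects
Z_mean = rng.standard_normal((L, K))           # column k: precomputed z_mean of subject k
Z_var = rng.random((L, K)) + 0.1               # column k: precomputed z_var of subject k
beta = np.array([0.2, 0.5, 0.3])               # continuous personalization weights

z_bar_mean = Z_mean @ beta
z_bar_var = Z_var @ beta
z_bar = z_bar_mean + np.sqrt(0.5 * z_bar_var) * rng.standard_normal(L)   # sample per Eq. (7)
print(z_bar)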
The system optimizes the blending vector β for an individual user while fixing the other parameters A and B, as well as the matrices W and bias vectors b. This approach has the advantage of dramatically reducing the degrees of freedom (DoFs) of the design variables for optimization, because it eliminates the need for multiple optimization runs over all dimensions of the condition. This means that optimizing only the blending vector β covers the individualities of the user across all conditions.
This adaptive layer allows us to interpolate, emphasize, diminish, and blend each indi-
viduality in the training dataset by adjusting the blending vector β ∈ R^K at runtime as an additional input. Several approaches exist that continuously morph data into data of different categories by interpolating several sampled hidden units extracted with an AutoEncoder (e.g., procedural modeling of 3D meshes [13] and controlling and stylizing human character motion [14]). However, this corresponds to exploring only in the generalized space, as described in Sect. 1.1. These approaches are therefore limited in their ability to distinguish many nonlinear functions, which is crucial for solving our target problem.
Figure 4 shows a comparison of morphing three simple functions between the hidden-units interpolation approach and our approach after the same number of iterations (although this number is unfavorable to our approach). We use a dual-stacked AutoEncoder for this experiment. (Note that for hidden-units interpolation, we replace each adaptive layer with a common fully connected layer.) The hidden-units interpolation approach diminishes the characteristic features of each function. This is critical when the target function to reconstruct is complex and has sharp peaks and dips. In contrast, our adaptive layer successfully reconstructs the details of the characteristic features and blends them.
The optimization procedure for the blending vector β is interactive with the user.
The user gives relative scores for two blending vectors βi and βj through pairwise
comparison as described in Sect. 1.3. The user is presented with two test outputs
of the model, and rates the relative score (evaluates which is better and how much
better). With this input, our optimization problem is reformulated into a minimization
problem argmin_β Q(β), where the absolute cost values Q(β) are computed from the relative scores as described in Sect. 3.2. By running this procedure iteratively, the system optimizes β for the target user.
To optimize this black-box system, we use a hybrid optimization scheme [15] that combines an evolution strategy (CMA-ES [8]) and a gradient descent approach (the BFGS quasi-Newton method). The optimization procedure is shown in Algorithm 1. We introduce local Gaussian process regression (GPR) [16] to accelerate the optimization; it estimates the local landscape of the cost function from discrete samples to obtain the gradients.
Figure 5 shows a comparison between optimization with and without the gradient information estimated by GPR. We minimize the EggHolder function for this evaluation under four conditions (N = 8, 32). CMA-ES requires N function evaluations (pairwise comparisons) at each iteration. In our target problem, the number of samplings should be small because the sampling size is proportional to the user's effort. However, when the number of samplings N seeded by CMA-ES is small, the optimization without GPR tends to be trapped in bad local minima. As shown in
Fig. 5 Comparison of convergence curves for CMA-ES with and without GPR. We use the EggHolder function for this evaluation. When the number of samplings seeded by CMA-ES is small, the optimization without GPR falls into bad local minima, while our technique converges to a better solution with far fewer iterations
Fig. 5, our technique addresses this problem and converges to a better solution with
considerably fewer iterations than in previous methods.
In a single iteration of the optimization, we first sample N blending vectors β1, . . . , βN using the common CMA-ES procedure. Let P be the set of index pairs (1, 2), . . . , (N − 1, N). For each (i, j) ∈ P, the system generates two test outputs S(βi) and S(βj), presents the pair side by side to the user, and requests the user to rate them to generate a relative score (N − 1 pairwise-comparison queries). After collecting the user feedback for all pairs, the system stores N − 1 pairs of different βs and their relative scores. After computing the absolute cost q for each β sampling from these relative scores, we estimate the local landscape of the continuous cost function Q(β) from the discrete q values; we write q = (q1, . . . , qN) for the costs and Q for the set of indices of the sampling points. Using this estimated cost function, the system computes the gradients and updates the covariance matrix in CMA-ES with a quasi-Newton step. We detail this procedure in the next subsection.
The system requests that the user provide feedback regarding the two test outputs of the model that correspond to each β pair. The system then has the set of β pairs P and the corresponding relative scores. Given these relative scores, we compute the absolute value of the sampling cost q for each sampled β. Our formulation is derived from Koyama et al. [7], which estimates a consistent goodness field over high-dimensional parameters from unreliable crowdsourced rating tasks.
Following their formulation, we compute the costs q by minimizing

E(q) = E_relative(q) + ω E_continuous(q),   (8)

where ω > 0 balances the two constraints (we set ω = 5.0). E_relative(q) is the relative score-based constraint and is represented as

E_relative(q) = Σ_{(i,j)∈P} ‖q_i − q_j + d_{i,j}‖²,   (9)
where d_{i,j} denotes the offset determined by the rating between the i-th and j-th samples. Here, we set d_{i,j} = (3 − RelativeScore)/2, where RelativeScore ∈ {1, 2, 3, 4, 5}. We treat the rating as a linear scale, but this depends on the target problem. Note that the sign of d_{i,j} is opposite to that in Koyama et al. [7] because we assume a minimization problem here. In addition, we enforce the continuity of the cost function by E_continuous(q):
E_continuous(q) = Σ_{i∈Q} Σ_{j≠i} ( 1 − |β_i − β_j| / Σ_{k≠i} |β_i − β_k| ) (q_i − q_j)².   (10)
In this equation, we constrain the absolute costs of two sampled βs to become closer as the distance between the two βs diminishes. The minimization problem of Eq. (8) can be solved as a linear least-squares problem.
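A minimal NumPy sketch of this least-squares step is shown below: it builds one row per pairwise rating (Eq. 9) and one weighted row per continuity constraint (Eq. 10), and uses the minimum-norm solution because the ratings determine only differences between costs. The function and variable names, and the Euclidean norm for the distance between blending vectors, are our assumptions.

import numpy as np

def estimate_costs(betas, pairs, ratings, omega=5.0):
    """betas: (N, K) sampled blending vectors; pairs: list of (i, j) index pairs;
    ratings: relative scores in {1, ..., 5} for each pair. Returns q of length N."""
    N = len(betas)
    rows, rhs = [], []

    # Relative-score constraints (Eq. 9): q_i - q_j = -d_ij, d_ij = (3 - rating) / 2.
    for (i, j), r in zip(pairs, ratings):
        d = (3.0 - r) / 2.0
        row = np.zeros(N)
        row[i], row[j] = 1.0, -1.0
        rows.append(row)
        rhs.append(-d)

    # Continuity constraints (Eq. 10): costs of nearby betas should stay close.
    for i in range(N):
        denom = sum(np.linalg.norm(betas[i] - betas[k]) for k in range(N) if k != i)
        for j in range(N):
            if j == i:
                continue
            w = 1.0 - np.linalg.norm(betas[i] - betas[j]) / denom
            row = np.zeros(N)
            row[i], row[j] = 1.0, -1.0
            rows.append(np.sqrt(omega * w) * row)
            rhs.append(0.0)

    # Minimum-norm least-squares solution (only cost differences are constrained).
    q, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return q

betas = np.random.default_rng(0).random((4, 5))
print(estimate_costs(betas, [(0, 1), (1, 2), (2, 3)], [1, 4, 3]))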
Finally, we estimate the local landscape of the cost function Q(β) using multidimensional GPR. We feed the discrete q samplings obtained from Eq. (8) into Gaussian process regression, a non-parametric Bayesian approach to regression. The resulting approximate function can be used to estimate the gradients required by the quasi-Newton method described below (Sect. 3.3). Note that although GPR is expensive in high dimensions, this is not a serious problem in our case because the dimensionality of the design parameter does not grow that large.
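The sketch below illustrates this step with plain NumPy Gaussian process regression under an RBF kernel, whose posterior-mean gradient is available in closed form; the kernel choice and hyperparameters are our assumptions, not the chapter's settings.

import numpy as np

def gpr_fit(X, q, length=0.3, noise=1e-3):
    """Fit a GP to the sampled betas X (N x K) and their estimated costs q."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-0.5 * d2 / length**2)                      # RBF kernel matrix
    alpha = np.linalg.solve(K + noise * np.eye(len(X)), q)
    return alpha

def gpr_mean_grad(x, X, alpha, length=0.3):
    """Gradient of the GP posterior mean at x, usable by the quasi-Newton step."""
    k = np.exp(-0.5 * np.sum((X - x) ** 2, axis=1) / length**2)   # k(x, X_i)
    return ((X - x) / length**2 * k[:, None] * alpha[:, None]).sum(axis=0)

rng = np.random.default_rng(0)
X = rng.random((6, 3))        # six sampled blending vectors (K = 3)
q = rng.random(6)             # their estimated absolute costs
alpha = gpr_fit(X, q)
print(gpr_mean_grad(X.mean(axis=0), X, alpha))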
3.3 Optimization
In CMA-ES, the offspring at generation g are sampled around the current estimate q^(g) as y_i^(g+1) = q^(g) + ρ^(g) B^(g) D^(g) z_i^(g), where y^(g), g, and ρ^(g) are the design parameters, the iteration step, and a global step size, respectively. z^(g) ∼ N(0, I) are independent realizations of a normally distributed random vector with zero mean and covariance matrix equal to the identity matrix I. The columns of the orthogonal matrix B^(g) are the normalized eigenvectors of the covariance matrix C^(g), and D^(g) is a diagonal matrix with the square roots of the corresponding eigenvalues. C^(g) can be computed as
C^(g+1) = (1 − α) C^(g) + α (1/τ) p^(g) (p^(g))^T + α (1 − 1/τ) Σ_{i=1}^{μ} (ω_i / ρ^(g)²) (y_i^(g+1) − q^(g)) (y_i^(g+1) − q^(g))^T,   (14)
where α and τ are learning rates, and ω_i denotes the weight determined by the rank of offspring i. p^(g) is also a learning rate, but not a hyperparameter: it is computed using q^(g) at each iteration (please see [15] for details). QuasiNewtonUpdate() represents the gradient descent part. We use the quasi-Newton method, but one can alternatively use any other gradient-based method such as Adam [20].
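A compressed NumPy sketch of one CMA-ES generation under this scheme is shown below. It covers only the sampling of offspring and the covariance update of Eq. (14), with illustrative values for α, τ, ρ and the rank weights ω_i; the evolution path p and the quasi-Newton correction are left as placeholders (see [15]).

import numpy as np

rng = np.random.default_rng(0)
K, N_off, mu_best = 3, 8, 4             # dimension, offspring count, selected offspring
q = np.full(K, 1.0 / K)                 # current mean estimate (a blending vector)
rho, alpha, tau = 0.3, 0.2, 2.0         # step size and learning rates (illustrative)
C = np.eye(K)
p = np.zeros(K)                         # evolution path (updated elsewhere, see [15])

# Sample offspring y_i = q + rho * B D z_i, z_i ~ N(0, I).
eigval, B = np.linalg.eigh(C)
D = np.diag(np.sqrt(np.maximum(eigval, 0.0)))
Z = rng.standard_normal((N_off, K))
Y = q + rho * (Z @ (B @ D).T)

# Stand-in costs from pairwise feedback; keep the mu_best best offspring.
costs = rng.random(N_off)
best = Y[np.argsort(costs)[:mu_best]]
w = np.log(mu_best + 0.5) - np.log(np.arange(1, mu_best + 1))
w /= w.sum()                            # rank-based weights omega_i

# Covariance update of Eq. (14): rank-one term plus weighted rank-mu term.
rank_mu = sum(w[i] * np.outer(best[i] - q, best[i] - q) for i in range(mu_best)) / rho**2
C = (1 - alpha) * C + alpha * (1.0 / tau) * np.outer(p, p) + alpha * (1 - 1.0 / tau) * rank_mu
print(C)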
For our initial guess, we first randomly generate the offspring y^(g) within a user-defined range by means of a Gaussian distribution, and then enforce the constraint that the elements of y^(g) sum to 1 by dividing all the offspring by the sum of y^(g). This constraint can be considered as a portion of the user feedback function. In addition, in most cases, the optimal β for a specific user is sparse, because when too many personal factors are blended together, individuality disappears. Thus, the optimizer randomly sets elements below the average to zero in the β offspring of CMA-ES (we dropped them with 30% probability). In most cases, this constraint reduces the training error.
To construct the HRTF input data structure x, we include not only the impulse response of the HRTF at the exact sampled direction c, but also several of its surrounding neighboring impulse responses (Fig. 6). In total, we sample 25 directions on a 5 × 5 rectangular grid whose center is the HRTF at direction c. The stride of the grid is ±0.08π rotations for both yaw and pitch on a unit sphere of directions from the head. In addition, we obtain the power spectra and left/right (LR) time-domain impulse responses at each direction using bilinear interpolation in the frequency domain on the unit sphere [22, 23]. To reconstruct the interpolated time signals, we use both the interpolated power spectra and phases. We call this 5 × 5 grid that stores HRTF information a patch. This patch representation is expected to encode the correlations with surrounding directions. Finally, the input data structure becomes a 3D voxel patch with 5 × 5 × 128 dimensions (128 power spectrum bins or 128 time-signal samples), and each voxel has four color-like channels (the power spectra and time signals of the L and R channels), as shown in Fig. 6. This becomes the input x of our neural network.
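The shape of the resulting input tensor can be sketched in a few lines of NumPy; random values stand in for the interpolated spectra and time signals, and the channel ordering is our assumption.

import numpy as np

YAW, PITCH, BINS, CHANNELS = 5, 5, 128, 4    # grid x grid x frequency/time x channels
patch = np.zeros((YAW, PITCH, BINS, CHANNELS), dtype=np.float32)

# Assumed channel layout: 0/1 = left/right power spectrum, 2/3 = left/right time signal.
rng = np.random.default_rng(0)
patch[..., 0:2] = rng.random((YAW, PITCH, BINS, 2))                   # interpolated power spectra
patch[..., 2:4] = rng.standard_normal((YAW, PITCH, BINS, 2)) * 0.01   # interpolated time signals
print(patch.shape)   # (5, 5, 128, 4) -> the network input x for one direction c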
Our neural network reconstructs the HRTF x by minimizing the difference between the input and output HRTFs in an AutoEncoder manner. However, solving a regression problem for a signal with large fluctuations (e.g., a time-domain audio signal or a power spectrum) with a generative neural network is difficult, because the network smooths its output over the training data. The trained result therefore tends to be an "averaged signal," which causes a fatal error. To address this, we use a quantized format similar to WaveNet [24].
Fig. 6 The input data structure of our neural network (we call it the HRTF patch). The HRTF patch has a voxel-like data structure, which encodes the spatial correlations of HRTFs. Each voxel has four properties (the LR channels of the power spectra and time signals), like the color channels of an image
WaveNet predicts time-domain audio signals using an image-like quantized format (width: time, height: amplitude), which successfully solves the regression problem for largely fluctuating signals. Similarly, we quantize the power spectra and time signals of HRTFs into 256 steps using μ-law compression. As a result, the output format becomes an image-like representation. Unlike WaveNet, we use neither one-hot vectors for the final layer nor the softmax function. The softmax function normalizes all the outputs of the neural network into [0, 1] probabilities, which is equivalent to solving an unconstrained optimization problem that requires extensive training data. However, our training data is considerably smaller than WaveNet's, which can lead to optimization failure.
Instead, we construct an array of normal distributions over each quantized vector, in which the mean values are equal to the quantized values. In addition, we set
all the variances to 5 and minimize the mean squared error of these multiple normal
distributions. This addresses the aforementioned problem because it is equivalent
to constraining the value range of the solution. To generate a final HRTF, we first
compute each quantized value by maximum likelihood estimation from the output
and then obtain the result by decoding the quantized values using inverse μ-law
compression.
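A minimal NumPy sketch of the 256-step μ-law quantization and its inverse is given below; the exact companding constant is not specified in the chapter, so the standard μ = 255 is assumed.

import numpy as np

MU = 255.0

def mu_law_encode(x):
    """Map signals in [-1, 1] to 256 integer steps via mu-law compression."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((y + 1.0) / 2.0 * MU).astype(np.int32)      # values 0 .. 255

def mu_law_decode(steps):
    """Inverse mu-law: recover an approximate signal from the quantized steps."""
    y = 2.0 * steps.astype(np.float64) / MU - 1.0
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

x = np.linspace(-1.0, 1.0, 5)
print(mu_law_encode(x))                   # [  0  16 128 239 255]
print(mu_law_decode(mu_law_encode(x)))    # approximately recovers x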
Fig. 7 3D convolutional layer for the HRTF patch, which packs multiple neighboring HRTFs around the target direction c
(Fig. 7). This is because the spectral correlation of an HRTF with its surroundings generally has a different structure in the lower and higher frequencies, as a result of the frequency-dependent diffraction by the subject's head and ears.
We use two convolutional layers for each channel (four channels in total). We set the kernel size of each convolution to 3×3×3 (yaw × pitch × frequency axis) and add zero padding to the frequency axis only. For all directions, we set the stride to 1. Thus, the first convolutional layer transforms each channel of a patch from 5×5×128 to 3×3×128, and the second layer further transforms it to 1×1×128. Note that we did not add bias parameters to these convolutional layers in our experiment.
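These two convolution stages can be sketched in PyTorch as follows, treating one of the four channels independently; the module and variable names are ours, not the chapter's implementation.

import torch
import torch.nn as nn

# One channel of the 5 x 5 x 128 patch, shaped (batch, channels, yaw, pitch, freq).
x = torch.randn(1, 1, 5, 5, 128)

conv1 = nn.Conv3d(1, 1, kernel_size=3, stride=1, padding=(0, 0, 1), bias=False)
conv2 = nn.Conv3d(1, 1, kernel_size=3, stride=1, padding=(0, 0, 1), bias=False)

h = conv1(x)   # -> (1, 1, 3, 3, 128): zero padding applied to the frequency axis only
y = conv2(h)   # -> (1, 1, 1, 1, 128)
print(h.shape, y.shape)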
4.1.4 Dataset
For the training dataset, we used the publicly available CIPIC dataset [17], which contains HRTFs of both ears measured in an anechoic chamber for 45 subjects at 25 azimuths and 50 elevations. In total, it includes 1250 sampled HRTF directions per subject and ear. The data for each direction, subject, and ear is recorded as an impulse response of 200 samples in a 44.1 kHz audio file. Impulse signals are played by a loudspeaker array spherically arranged around the head and recorded using two small microphones inserted into the ears of each subject.
We designed a GUI application for calibration (Fig. 8). The application first presents a pair of test signals and their intended direction. The user then plays a test sound by pressing the A/B selection button. Each of the two test signals is generated from a different HRTF (personalization weight vector) but has the same intended direction. We randomly select an audio source from 10 predefined test sounds (e.g., speech, helicopter, a short music phrase) and then filter the audio using the generated HRTFs. The intended direction continuously moves spherically around the head and
Fig. 8 The user interface for gathering user feedback. It runs in a web browser. When the user pushes the A (red) or B (blue) button, one of the test signals of the pair is played. The 3D graphics show the intended direction from the side and top views. The user rates the pair on a 5-point scale with radio buttons and submits it. Finally, the user exports his or her individualized HRTF data by pressing the export button
is shown as a moving sphere from the side and top views. The user listens to the test signals and provides feedback by selecting one of the 5-point-scale options that indicate which sound is perceptually closer to the intended direction: "1" means that test signal A is definitely better, "5" means that B is definitely better, and "3" means neutral. By iterating these simple pairwise comparisons (approximately 150–200 times), the system automatically individualizes the HRTFs by optimizing the blending vector for the target user. The user can stop the calibration at any time (usually when the user is satisfied or can no longer distinguish the two test signals).
Fig. 9 The result of the user study. The third column shows how many times each option was selected as the better HRTF for each participant, comparing the best-fitted CIPIC HRTF and the HRTF optimized by our system
participant to find the best HRTF six times. Unfortunately, this was not perfectly repeatable, but we observed that the participant selected the same HRTF three times in the six explorations. We consider this small variation in the result not to be a critical problem in our experiment, because it means that the selected HRTFs are equally good.
We next requested that the participants calibrate their HRTFs using our system. We asked each participant to answer at least 100 pairwise comparisons. We did not fix a maximum number of comparisons; when a participant indicated that he or she was satisfied, we stopped the calibration. We measured the calibration time and the number of mouse clicks (number of pairwise comparisons) for each participant. The participants used the UI described in the previous section. The calibration took 20–35 min for each participant, and the number of pairwise comparisons was 109–202. This calibration time was much shorter than an actual acoustic measurement for obtaining fitted HRTFs.
After each calibration, we conducted a blind listening test to compare the HRTFs
obtained using our method and the best-fitted CIPIC HRTFs. In this step, we showed
each participant 100 pairs of test sounds. One of the test sounds in each pair was convolved with the optimized HRTF and the other with the best CIPIC HRTF for the participant. The test sound to convolve was randomly selected from
10 prepared sounds (e.g., short music, helicopter, and speech) and played 100 times.
We requested that each participant use the same GUI as during the calibration to
select one test sound from each pair that showed better spatialization. We requested
that each participant select only either 1 or 5 from among the option buttons. The
order (A or B) in which the test sounds convolved with the optimized HRTF and the best CIPIC HRTF were played in each presented pair was random. We did not inform the participants
whether the selection was the optimized HRTF or the best CIPIC HRTF.
Figure 9 shows the number of times each HRTF was selected as the better HRTF by each participant. These results show that the optimized HRTFs were significantly better than the best CIPIC HRTFs for almost all participants (p < 0.05 by a Chi-squared test for 18 of 20 subjects), indicating that our system successfully optimizes HRTFs for individual users.
5 Discussions
This chapter has discussed techniques for adapting the output of neural network-
based generative models to specific users not included in the training data, by sepa-
rating out individuality through tensor decomposition in an unsupervised manner and using perceptual feedback (pairwise comparisons) to optimize its mixture. To minimize the number of queries, we estimated the landscape of the perceptual function of the target user's preference from the relative assessments and used it for gradient estimation, which improves the convergence of the optimization. This
section discusses the possibilities and some future directions of these methods.
As described in Sect. 3.2, we used Gaussian process regression (GPR) [16] to estimate the landscape of the perceptual function of the user's preference. The estimated landscape is used for computing the gradient, which improves the efficiency of the black-box optimization. Although there are many regression algorithms (e.g., support vector machines, neural networks, lasso, and polynomial regression), GPR has the advantage that it works well with relatively small amounts of data and for non-convex functions. Such gradient estimation from relative assessments is broadly useful for genetic algorithms [18], Bayesian optimization [7], and other black-box optimization problems. Especially in problems with high evaluation costs, where the evaluation of the objective function is provided by a human, it is valuable because it reduces the number of queries as much as possible. However, the method introduced in this chapter is just a starting point in the field of estimating gradients from perceptual relative assessments. We expect that, in the future, more efficient algorithms will be able to solve the adaptation problem with perceptual feedback even more effortlessly.
Speech synthesis and voice conversion techniques [32–35] can basically only output the voices of people included in the training dataset, leaving little room for the user to edit the output voice to their preference. Our approach could address such a problem by blending the decomposed characteristics in the training dataset to generate a voice in a new style. This is similar to the concept of disentangling [36–38] in the machine learning field, which has been increasingly studied in the last few years. The idea presented here can be considered a more aggressive exploration of such techniques.
In addition, the methods introduced in this chapter would also be useful in other domains, especially in situations where users want to explore for a more preferable result: they can judge perceptually whether the output of a machine learning-based model is good or bad, but cannot specifically determine what is good and what is bad. For creative applications, many machine learning models that generate pictures and music to support artists have been proposed [39–43]. However, in reality, most artists use them as if they were playing a lottery over and over again until they happen to produce an output that they like, and it has been difficult for them to actively explore for more favorable output on their own. This is partly because the artists themselves might not understand specifically what they want from the system and are vaguely looking for better results through their senses. To address this problem, several methods have been proposed that use perceptual relative assessments to efficiently explore the learned latent space for the output the user wants [44–46]. However, again, since these methods still explore in generalized spaces, it is difficult to obtain extreme results (which is what artists often want), so they tend to produce only dull results for artists. In such situations, the methods presented in this chapter could give artists more benefit from machine learning models by manipulating a more distinctive set of the individualities in the training dataset, and provide a more human-machine collaborative creative environment.
6 Conclusion
This chapter introduced a method for adapting the output of a generative model to a specific user using perceptual feedback. We noted that adaptation to a specific individual is incompatible with machine learning, which essentially aims at generalization. However, by using tensor decomposition to separate the generalization part from the adaptation part, we showed that the problem can be transformed into optimizing a minimal set of adaptation parameters while utilizing the versatility of machine learning. In addition, to collect perceptual feedback efficiently, we showed how to use not only sampling-based search but also inference of the perceptual function landscape of a specific user. After describing the algorithm from a mathematical viewpoint, this chapter demonstrated the effectiveness of the method by adapting 3D spatial sound to a specific user as a practical example. As this shows, perceptual feedback is often far less labor intensive for the user than most physical measurements. Finally, we discussed the possibilities and some future directions of the introduced methods, especially in HCI. We showed that many HCI problems requiring online calibration or personalization could be addressed by the introduced approach. We believe that personal adaptation of machine learning models using perceptual feedback will, in many fields, bring users experiences that have so far been available only to a limited number of people.
References
1. Wenzel EM, Arruda DJ, Kistler DJ (1993) Localization using non-individualized head-related
transfer functions. J Acoust Soc Amer 94
2. Moller H, Sorensen MF, Jensen CB, Hammershoi D (1996) Binaural technique: do we need
individual recordings? J Audio Eng Soc 44:451–469
3. Middlebrooks JC (1999) Virtual localization improved by scaling non-individualized external-
ear transfer functions in frequency. J Acoust Soc Amer 106
4. Brochu E, Brochu T, de Freitas N (2010) A Bayesian interactive optimization approach to
procedural animation design. In: Proceedings of the SCA, pp 103–112
5. Tsukida K, Gupta MR (2011) How to analyze paired comparison data. Technical Report UWEETR-2011-0004
6. Ledda P, Chalmers A, Troscianko T, Seetzen H (2005) Evaluation of tone mapping operators using a high dynamic range display. ACM Trans Graph
7. Koyama Y, Sakamoto D, Igarashi T (2014) Crowd-powered parameter analysis for visual design
exploration. In: Proceedings of ACM UIST, pp 56–74
8. Hansen N, Muller SD, Koumoutsakos P (2003) Reducing the time complexity of the derandom-
ized evolution strategy with covariance matrix adaptation (CMA-ES). Evolut Comput 11:1–18
9. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio
Y (2014) Generative adversarial nets. In: Advances in neural information processing systems
10. Kingma DP (2014) Semi-supervised learning with deep generative models. In: Advances in neural information processing systems
11. Sohn K, Lee H, Yan X (2015) Learning structured output representation using deep conditional generative models. In: Advances in neural information processing systems
12. Koren Y, Bell R, Volinsky C (2009) Matrix factorization techniques for recommender systems.
IEEE Comput 42(8)
13. Yehuda K, Robert B, Chris V (2015) Procedural modeling using autoencoder networks. In: Proceedings of ACM UIST
14. Holden D, Saito J, Komura T (2016) A deep learning framework for character motion synthesis and editing. ACM Trans Graph (SIGGRAPH)
15. Chen X, Liu X, Jia Y (2009) Combining evolution strategy and gradient descent method for discriminative learning of Bayesian classifiers. Proc Gen Evolut Comput 8:507–514
16. Matheron G (1963) Principles of geostatistics. Econ Geol 1246–1266
17. Algazi VR, Duda RO, Thompson DM, Avendano C (2001) The CIPIC HRTF database. In:
IEEE Workshop on applications of signal processing to audio and electroacoustics, pp 99–102
18. John H (1992) Adaptation in natural and artificial systems. MIT Press, Cambridge
19. Takahama R, Kamishima T, Kashima H (2016) Progressive comparison for ranking estimation.
In: Proceedings of IJCAI
20. Kingma D, Ba JP (2014) Adam: a method for stochastic optimization. arXiv:1412.6980
21. Yamamoto K, Igarashi T (2017) Fully perceptual-based 3D spatial sound individualization with an adaptive variational AutoEncoder. ACM Trans Graph (SIGGRAPH Asia)
22. Wenzel EM, Foster SH (1993) Perceptual consequences of interpolating head-related transfer functions during spatial synthesis. In: Proceedings of the workshop on applications of signal processing to audio and acoustics
48. Liu J, Liu C, Belkin NJ (2020) Personalization in text information retrieval: a survey. J Assoc Inf Sci Technol
49. Helten T, Baak A, Bharaj G, Muller M, Seidel H-P, Theobalt C (2013) Personalization and
evaluation of a real-time depth-based full body tracker. In: 3DV-Conference
50. Tkach A, Tagliasacchi A, Remelli E, Pauly M, Fitzgibbon AW (2017) Online generative model personalization for hand tracking. ACM Trans Graph (SIGGRAPH Asia)