Abstract
The aim of this master's thesis is to evaluate possible improvements of the perceived
behavior of an Autonomous Mobile Robot (AMR) system using an RGB-D camera
for Human Pose Estimation (HPE). This topic is approached by conducting an
initial study of the current challenges and needs of humans working in local
environments shared with AMR systems. A study of how different HPE methods
perform in an AMR system has also been conducted, which shows that different
methods can have very different performance. Based on the results of the studies,
a prototype implementation of some interaction concepts is presented, including the
understanding of human intentions as well as how an AMR should respond to
these intentions.
The results of this thesis show that there are common issues for human users who
work alongside current AMR systems. A proof of concept for how some of these
issues could be solved is proposed, which could potentially improve the perceived
behavior of the AMR significantly. Suggestions of HPE methods that have shown
the most promising performance in tests are also given.
Acknowledgements
We wish to extend our deepest gratitude to Kollmorgen Automation and specifi-
cally the AMR team for having us during the work on this master's thesis. Simon
Bokesand has assisted us in solutions and Mikael Bohman has been of great help
in anything software related. Johanna Turesson has helped us with analyzing the
market and Zihua Yang has assisted us with her UX-expertise. Their expertise and
guidance have been of great assistance all the way from the formulation of the prob-
lem to the end result.
We also would like to thank our supervisor at Chalmers, Kristofer Bengtsson, for his
invaluable insight and guidance during this project. It has been a pleasure having
him as a supervisor.
Contents
List of Acronyms

1 Introduction
  1.1 Background
  1.2 Aim and research questions
  1.3 Method
  1.4 Scope
  1.5 Previous work
  1.6 Ethical implication
2 Theory
  2.1 Automated guided vehicles
    2.1.1 History of Automated Guided Vehicles (AGVs)
    2.1.2 Pros and cons with Automated Guided Vehicles (AGVs)
    2.1.3 Autonomous Mobile Robots
  2.2 Computer vision
    2.2.1 Applications of CV
    2.2.2 3D Computer Vision
  2.3 Robot Operating System
    2.3.1 Nodes
    2.3.2 Node to node communication
    2.3.3 ROS Navigation Stack
  2.4 Deep learning
    2.4.1 DL on CPU vs GPU
    2.4.2 Convolutional neural networks
    2.4.3 The evolution of Deep Learning
    2.4.4 Machine learning inference
    2.4.5 Optimization
  2.5 Human Pose Estimation
    2.5.1 Methods for HPE
    2.5.2 Distinctions between current approaches
    2.5.3 Training and evaluating HPE
  2.6 Human Robot Interaction
    2.6.1 Human Aware Navigation
  2.7 Inertial Measurement Units
3 User study
  3.1 Interviews
  3.2 Interview Results
    3.2.1 Interview 1
    3.2.2 Interview 2
    3.2.3 Interview 3
  3.3 Final interaction concepts
    3.3.1 Stopping
    3.3.2 Move idle AMR by pointing
    3.3.3 Reduce speed around humans
    3.3.4 Follow person
  3.4 Discussion of interaction concepts
6 Conclusion
  6.1 Future work
Bibliography
A Interviews
  A.1 Interview 1
  A.2 Interview 2
  A.3 Interview 3
List of Acronyms
1 Introduction
Due to the vast development of AI and microprocessors during the last decade, AGV
systems can now sense their surrounding environment and interpret their sensor data
in new ways that allow them to distinguish humans from other surrounding objects.
The potential benefit of using this ability is however not yet fully understood.
If AGVs were more adaptable to human behavior and gave people the possibility to
interact with them in an intuitive way, it could significantly ease the working situation
for human users who work in the same local environment. This thesis project
therefore investigates the possibility of an interface that enables AGVs to adapt to
human presence and lets people perform basic interactions with the AGV.
1.1 Background
Kollmorgen Automation AB is a company that has been developing world-leading
software solutions for AGVs during the last decades [2]. In parallel with their
development in this field, they have recently also started looking at solutions for
Autonomous Mobile Robots (AMRs). The major difference compared to the AGV
is that the AMR is controlled completely by hardware and software located on the
robot itself, while the AGV requires supporting features in order to navigate, e.g.
reflectors placed in its surroundings. This gives AMRs increased freedom and
flexibility in path planning compared to AGVs, which only drive in predefined lanes.
AGVs have mainly been using a two-dimensional laser scanner for localization and
have not had the flexibility to change their route if there is an obstacle obstructing
it. Instead, AGVs rely on the surrounding environment not putting obstacles in
their way, and if it does, the AGV simply stops. AMRs are however flexible enough
to detect and avoid obstacles in real-time, which requires sensors in addition to the
laser scanner, since the scanner only detects objects located at the same height as
the one on which it is mounted. To be able to see obstacles of different heights, an
RGB-D camera can be used as a complement for obstacle detection.
Due to the large progress in Deep Learning (DL) during the last decade, as men-
tioned above, the use of an RGB-D camera enables additional possibilities since it
can be used for detecting more specific objects than just obstacles in general. This
opens up the potential of adjusting the behavior of the vehicle depending on what
kind of obstacle is in the way, which is a potential that Kollmorgen would like to
examine.
1.2 Aim and research questions
The research questions to be answered in this thesis were the following:
Q1: From the perspective of a person working in the same area as AMRs, what
features could improve the behavior of the AMR system using HPE?
Q2: Among available Deep Learning alternatives for online Human Pose Estimation,
what are the strengths and weaknesses of the different approaches? What
method is suitable to be used in real-time on an AMR system?
Q3: Given a functional HPE system, how should the interactions according to Q1
be designed, and is the AMR response intuitive from a human’s perspective?
1.3 Method
To gain an understanding of the challenges of working in environments shared with
AMRs, an assessment was conducted with people who work in such environments.
The aim was to determine whether people have encountered problematic situations
with AMRs, whether they have had any thoughts so far on what would be good
features to have regarding HMI, and finally what kind of features they would like
to see in the future if they were given the task of designing them. Additionally,
inputs and ideas have been gathered from different people, such as the supervisors
at both Chalmers and Kollmorgen, and also the sales department at Kollmorgen,
which knows what their competitors have been working on. These inputs resulted
in a set of interaction concepts to be implemented and further examined.
The interaction concepts formed the basis of the requirements on the performance
of the HPE methods. A study of different HPE methods was then conducted to
examine which ones fulfilled those requirements best. Since the hardware of the
AMR system consisted of a CPU only, the methods were primarily evaluated when
run on a CPU. Additionally, a perspective on how much the performance could
improve if a GPU were added to the system has been given by a comparison between
methods run on a GPU. This gives guidance on which kinds of HPE models are most
suitable for real-time estimation applications, and also on the possible benefits of
adding a GPU to the AMR hardware.
Based on the input about the problems experienced by humans in AMR environments
and on the capabilities of the examined methods, a proposed implementation of the
interaction concepts has been developed. The implementation has been evaluated
based on the resulting changes in the behavior of the AMR system. A final
recommendation has then been presented to Kollmorgen on how the use of HPE
could improve the current AMR solution.
1.4 Scope
The scope of this thesis was to evaluate what potential improvements the use of HPE
could give to an AMR system from a user's perspective. The initial study was mainly
based on interviews with people who in different ways have experience of working
with AGVs or AMRs. Due to the ongoing Covid-19 pandemic, the study was limited
to activities that could be carried out remotely. Therefore, no visits have been made
to sites using AGV or AMR systems, and the interviews have been conducted online.
The study was also limited to inputs and experiences from people at Kollmorgen or
from the use of products delivered by Kollmorgen.
The thesis also includes a study of which DL methods for HPE exist and how well
they are suited to run in a real-time application on the AMR system. No design or
training of DL models has been done during this thesis. The study is therefore
limited to DL methods for HPE that have already been designed by others and are
available online as open-source software. It is also limited to pre-trained model
graphs that are available along with the open-source software for each DL method.
The study is done from the perspective of which methods are most suitable for a
real-time application on the AMR system at Kollmorgen, and it is therefore based
on the current hardware available on this particular AMR system specifically.
To gather data about the humans in the robot's surroundings, the camera will be
used to estimate the pose of each person in the robot's field of view. The task of
interpreting human poses is called Human Pose Estimation (HPE), which is a young
field with many available methods to choose from. A second focus of this thesis is
an evaluation of which of these methods to use.
1.5 Previous work
In the field of Human Aware Navigation, research has been done on improving path
planning around moving humans, both using Light Detection and Ranging (LiDAR)
scanners for tracking whole-body movement [4] and using eye-gaze glasses for
predicting the movement of a whole body [5]. Additionally, work has been done
toward incorporating detected humans into the navigation stack in the Robot
Operating System (ROS) [6].
When a user wants to communicate with a drone, one study shows that the primary
mode of communication without instruction is by using gestures [7]. One important
gesture used in this thesis is the pointing gesture. An earlier work successfully gave
directions to a robot in real-time using hand gestures; in that implementation, the
pointed direction was defined by the direction of a detected hand silhouette [8].
Another work used an HPE method within ROS to detect what object a user is
pointing at; there, the object pointed at is the object that coincides with the 2D line
formed by the forearm, drawn on each frame [9].
Not much research on the usage of HPE for Human-Robot Interaction has been
found. Two relevant studies that used gesture control are [8] and [9], which used
different ways of estimating the point a user was directing a robot to.
1.6 Ethical implication
The United Nations has defined 17 sustainability goals [11] that Chalmers is in-
terested in working towards. One of these goals is to ensure healthy lives and
promote well-being for people of all ages. A target for reaching this is to reduce
the total number of road accidents [11], and all advancements in the interaction
between humans and autonomous mobile robots work towards this goal. Further,
an advantage of the AMR system is its usability in non-factory environments, such
as healthcare facilities. The automation of tasks in these areas could provide more
affordable healthcare, which is another of the goals [11]. Lastly, a third relevant goal
is to promote sustained, inclusive, and sustainable growth. A target for reaching this
is fostering increased productivity in labor-intensive sectors [11], of which logistics
is one.
2 Theory
The topic of this thesis spans many relevant areas. In this chapter, we will describe
the relevant theory to understand and motivate our methods and conclusions.
2.1 Automated guided vehicles
There are many different types of and applications for AGVs. They are in general
used in industrial applications for transporting materials in a systematic way. They
can be deployed in many different environments, for example warehouses, offices,
and factories. It is common for AGVs to be equipped with some kind of additional
feature such as a lifting mechanism. A common application is the AGV version of
a forklift, which does the same thing as a manually driven forklift but is controlled
autonomously instead.
The second era of AGV systems took place from the 1970s until the early 1990s,
when electronics were introduced in the form of simple onboard computers and
control cabinets for block section control. This was a time when big advances
were made in electronics and sensor technology, which the manufacturers of AGV
systems took advantage of. Technical innovations such as high-performance
electronics, microprocessors, and more powerful batteries increased the possibilities
for the manufacturers and improved the performance of the systems even more [13].
During the 1980s, breakthroughs in laser technology were also made, which
revolutionized the AGV industry. This vastly increased the ability to change the
layout of the factory site without major changes in the software of the AGV
systems [14]. It was also in this era that the company Netzler & Dahlgren Co AB
started working on their first AGV systems, the company that has evolved into
what Kollmorgen Automation is today [15].
The third era lasted from the mid-1990s to 2010. At this time, standards for AGV
systems were set, and the devices had electronic guidance and contact-free sensors.
In this era, AGV systems became acknowledged as reliable and powerful au-
tonomous solutions that gave economic profit over time for the customers. AGVs
were developed to be able to handle almost any kind of load, which broadened the
applications from mainly the automotive industry to a big variety of customers.
A problem with deploying an AGV system is that it comes with a high initial
investment, and it takes time for the investment to pay off. Depending on the
quality of the system there could also be maintenance and repair costs from
time to time. AGVs are also, along with most robotic systems, not suitable for
non-repetitive tasks that are outside of what the AGV is programmed to do [16].
AGV systems usually require very precise placement. It also enables algorithms and
applications of Artificial Intelligence for optimizing the performance of the AMRs;
for example, an AMR could learn to drive in spaces that are not used as much or
where it minimizes the traffic. A downside of not having external support for the
sensors, such as reflectors, is that the measurement errors that occur in the sensors
are bigger, which leads to decreased accuracy in the positioning of the AMR in
relation to surrounding objects. This limits the ability of AMRs to perform tasks
that require high precision and accuracy when approaching a load or target.
Typical applications of AMRs are transporting material or load within factories and
warehouses. This is a time-consuming and simple task for workers, which easily
can be automated. While AGV systems are effective, they are typically labor, floor
space, and capital-intensive. This can then be solved by AMR systems instead,
which are designed to fill the efficiency gap in functionality [17]. The flexibility of
AMRs also enables more collaborative behavior with human workers. Order picking
is an expensive task performed in most warehouses and factories, where humans
have to walk between shelves to fetch material, which is very time-consuming. With
AMR systems, the robots could instead bring the shelves to the worker and
then put them back into the storage structure, which increases productivity significantly.
Another solution for order picking could be that the AMR follows next to the human
to assist with storage possibilities or tools needed to perform the task effectively.
When this is done to pick parts for a kit, the process is called kitting.
2.2.1 Applications of CV
There are applications of CV in several fields, where different kinds of information
can be retrieved. It might for example be how far an object is from a camera, whether
a vehicle drives in the middle of its lane, or how many people are in the view of a camera.
These kinds of questions and many more are challenges that are included in the
field and should be answered based on the information given by digital images and
videos [21]. Apart from the field of autonomous vehicles, there are also applications
of CV in other areas like medicine and the military [22].
Typical tasks to solve in CV are for example object detection, classification, and
identification. In object detection, the challenge is to identify whether there are ob-
jects, such as obstacles, in an image. In object classification, the task is to recognize
certain pre-specified object classes in images, for example, cars, houses, humans,
etc. In object identification, an individual instance of an object is recognized, for
example by identifying a specific person.
2.3.1 Nodes
A ROS system generally consists of a set of nodes, which are computational processes
written in for example C++ or Python. A complete robot control system will often
consist of several nodes that control different components of the system. This means
that the nodes can work separately, regardless of programming language. Communi-
cation between nodes is handled through topics. Topics are communication channels
where data is sent in the form of messages. Messages are simplified descriptions of
the data structures that are sent over a topic, consisting of typed fields.
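To make the node and topic concepts concrete, the following is a minimal sketch of a ROS 1 node written in Python with the standard rospy client library. The node name, topic name, and message rate are chosen purely for illustration and do not correspond to anything in the AMR system described in this thesis.

```python
# Minimal illustrative ROS 1 node in Python (rospy). Names and rates are
# placeholders chosen for the example only.
import rospy
from std_msgs.msg import String

def main():
    # Register this process as a node with the ROS master
    rospy.init_node("example_publisher")
    # Advertise a topic that other nodes can subscribe to by name
    pub = rospy.Publisher("chatter", String, queue_size=10)
    rate = rospy.Rate(1)  # publish once per second
    while not rospy.is_shutdown():
        pub.publish(String(data="hello from example_publisher"))
        rate.sleep()

if __name__ == "__main__":
    main()
```

A subscribing node would instead create rospy.Subscriber("chatter", String, callback) and process each incoming message in the callback function.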
2.4 Deep learning
Deep learning is based on artificial neural networks, which are computing systems
inspired by the biological neural networks that form animal brains [28]. Deep
learning can be divided into three categories: supervised, semi-supervised, and
unsupervised learning [29]. Supervised learning is used when one has a big set of
data that is used to train the system, for example to predict a certain object in an
image. When training a model in a supervised way, there is always a ground truth
for the model to compare with. This ground truth is used to see how the model
performs and to tune the model based on it [29].
Unsupervised learning is used when one wants a computer system to learn a behavior
or skill from scratch [30], based on a given environment that it interacts with, for
example a computer game. The system will then have some reward system that
gives it a positive reward when it does something considered good, and vice versa.
Semi-supervised learning is a mix between the previously described methods where
the model is given some data with a ground truth included, and then tries to learn
by itself from that point [31].
Because of this, a GPU can however not fully replace a CPU since it lacks the
task scheduling abilities and the ability to run multiple general-purpose computing
tasks simultaneously that the CPU has. The GPU must therefore be considered
as a complement to the CPU [32], which from a user’s perspective is an additional
expense that has to be taken into consideration when designing a system.
object classification [35]. It is, therefore, a very appropriate tool for this project.
Connectionism [39] was another field of research that became popular in the 1980s;
it is an approach in cognitive science that tries to explain intellectual abilities
using artificial neural networks. It was during this time that the concept of hidden
layers in neural networks was introduced, which are layers of neurons in between
the input and the output of a network. Significant improvements were also made in
back-propagation and other concepts that are still key components in DL. In the
mid-1990s a wave of startups based on AI occurred, but due to the lack of
computational resources it was hard to deliver sufficient products, which led to a
dip in interest [36].
Another breakthrough in DL was made in 2006 when Geoffrey Hinton showed that
many-layered feedforward neural networks could be trained one layer at a time [40].
This discovery enabled researchers to train neural networks with more and more
hidden layers in them, which led to the popularisation of the term Deep Learning.
In parallel, during this time a large increase in computational power was achieved,
which improved the ability to train the networks by using larger and larger datasets
[36]. With a continuously increasing number of people using online services in the
world, more and more data is produced which could be used for training DL net-
works.
For a long time, deep learning was a field of research that made significant advances
but was not very popular in the industry due to unreliability and limitations in
what results it could achieve. By 2011 the computing power of GPUs had increased
significantly, which made it possible to train deep neural networks more effectively
[41]. It was therefore first during this time that the combination of deep CNNs and
GPUs made progress on computer vision [42]. Until 2011, CNNs were not performing
any better than other, shallower machine learning methods, but in 2012 it was shown
that CNNs run on GPUs could improve the performance of DL in CV dramatically
[43]. Since that time, DL has outperformed alternative methods, and many vision
benchmark records have been set and are continuously broken as better and better
approaches come up.
2.4.5 Optimization
Machine learning models designed for difficult tasks can consist of dozens or hun-
dreds of layers and millions or billions of weights connecting them. They are then
very large and complex to work with, and the larger the model is, the more com-
putation time, memory, and energy are consumed. Depending on the hardware the
model is running on, this can cause a delay in the response time (also called latency)
from input data to output. During training it can be useful to tune all the parameters
and weights, but when performing inference it might be required that the model runs
at low power or on another type of hardware. Then it can instead be useful to
simplify the model in order to decrease the number of calculations and reduce the
latency, although this might cause a reduction in estimation performance [45]. In
industrial applications, the software is often running on industrial PCs that can
have limitations in the available hardware and power supply. This means that
optimization might be needed to get a model to run at a sufficient frequency to be
able to perform its intended task.
There are different ways to optimize machine learning models to maximize their
performance in relation to power consumption and latency. Pruning is a method
where each artificial neuron's individual contribution to the network is evaluated.
If a neuron is rarely or never used, it can be removed from the network without
any significant difference in the output. This way the size and complexity of the
model can be reduced and the latency improved. Quantization is another method,
which reduces the numerical precision of the weights, for example from 32-bit
floating-point to 8-bit, which also reduces the size of the model and speeds up the
computation [45].
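As a rough illustration of the quantization idea, the sketch below applies post-training dynamic quantization to a small stand-in network using PyTorch; the model, layer sizes, and data type are assumptions made for the example and are not taken from any of the HPE models discussed later.

```python
# Illustrative post-training dynamic quantization in PyTorch.
# The network below is a stand-in, not one of the evaluated HPE models.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 34),  # e.g. x/y coordinates for 17 keypoints
)

# Replace the weights of the selected layer types with 8-bit integer versions;
# activations are quantized on the fly during inference, reducing model size
# and typically speeding up CPU inference at some cost in precision.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 256)
print(quantized(x).shape)  # torch.Size([1, 34])
```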
2.5 Human Pose Estimation
The human body has many attributes, ranging from how its joints are positioned to
the texture and color of the body surface. A model of a human does not need to
carry all available information, only the information needed for the specific task [46].
Three common models are the skeleton-based model [46], the contour model [46],
and the volume-based model [46]. If only the pose of the body is of interest,
then the skeleton-based model is simple and flexible [46]. In the case of the skeleton
model, the HPE task is accomplished by breaking down the full human body into
distinct parts, called keypoints, such as the joints of the arms and legs, and then
trying to estimate their positions in the frame.
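To make the skeleton-based representation concrete, the sketch below shows one possible way to represent an estimated pose in Python, using the keypoint names of the COCO convention mentioned in the following sections; the structure is illustrative and not taken from any specific HPE package.

```python
# Illustrative skeleton-based pose representation using COCO-style keypoints.
# Each keypoint is an (x, y, confidence) triple in image coordinates.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

pose = {name: (0.0, 0.0, 0.0) for name in COCO_KEYPOINTS}  # placeholder values

# A skeleton is then a set of edges (limbs) between keypoints, for example:
SKELETON = [
    ("left_shoulder", "left_elbow"), ("left_elbow", "left_wrist"),
    ("right_shoulder", "right_elbow"), ("right_elbow", "right_wrist"),
    ("left_hip", "left_knee"), ("left_knee", "left_ankle"),
    ("right_hip", "right_knee"), ("right_knee", "right_ankle"),
]
```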
There is no guarantee that only one person is in front of the camera. The HPE
algorithms therefore need to handle multiple people being present in the image.
There are two main ways of tackling this problem: the top-down and the bottom-up
approach [47].
The top-down approach generally starts by estimating a bounding box around each
present person [47]. After this, a single-person pose estimator is run on each person.
The run-time of this approach tends to increase linearly with each added person,
since the pose estimation algorithm needs to be run once on each found person. This
is problematic when running in real-time. Another problem with the top-down
method is its early commitment [48]. This means that if no bounding box of a person
can be found, despite some or all of the pose being in the frame, the next step of
estimating the keypoints within that box cannot be performed either [48].
The bottom-up method on the other hand starts by estimating the position of every
keypoint present in the frame [48]. After this is done, the algorithm links the
keypoints together appropriately to form each pose. This solves the early commitment
problem of the top-down approach while introducing new challenges in knowing
which keypoints correspond to which pose [48].
HPE is most commonly performed in 2D; however, there are also methods for 3D
HPE. This additional information is useful for 3D-based interactions as well as for
getting more useful data for analyzing an estimated pose and its feasibility. Previous
methods have done this by using 2D pose estimates and from these regressing the
depth data using DL methods [47], [49], [50]. In [50], for example, the model is
trained for a monocular RGB camera with ground-truth data for keypoints in 3D
and is robust against occlusions. Another possible method is to take the RGB 2D
HPE and fuse it with depth data to transform it into a 3D HPE.
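The fusion of 2D keypoints with depth data can be illustrated with the standard pinhole camera model: given the pixel coordinates of a keypoint, the depth at that pixel, and the camera intrinsics, the keypoint can be back-projected to a 3D point in the camera frame. The sketch below uses placeholder intrinsics, not the calibration of the camera used in this thesis.

```python
# Back-project a 2D keypoint (u, v) with depth z to a 3D point in the camera
# frame using the pinhole model. Intrinsics are placeholder values.
def deproject(u, v, z, fx, fy, cx, cy):
    """Return the 3D point (X, Y, Z) in metres for pixel (u, v) at depth z."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return x, y, z

# Example: a wrist keypoint detected at pixel (412, 238) with a depth of 1.8 m
point_3d = deproject(412, 238, 1.8, fx=615.0, fy=615.0, cx=320.0, cy=240.0)
```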
Each solution is often trained on publicly available datasets, and there are compe-
titions each year for the most accurate solution. One difference is which keypoints
are included, and therefore which keypoints can be estimated. Two commonly used
datasets are the Microsoft COCO dataset and the MPII dataset, where one dif-
ference is that the Microsoft COCO dataset has keypoints for eyes and ears while
MPII only has one keypoint representing the entire head [47], [51]. These additional
keypoints can be useful when for example head-pose or eye position is of interest.
There are many metrics for evaluating the performance of a model on a dataset. One
common metric is the Average Precision (AP), which defines a true positive as a
prediction within a certain threshold around the ground truth [47].
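As a simplified illustration of such a threshold-based criterion, the sketch below counts a predicted keypoint as a true positive if it lies within a fixed pixel radius of the ground truth. The actual COCO AP metric is based on a more elaborate object keypoint similarity score, so this is only meant to convey the idea.

```python
import math

def is_true_positive(pred, gt, threshold_px=10.0):
    """Count a predicted keypoint as correct if it lies within a fixed pixel
    distance of the ground-truth keypoint (simplified PCK-style criterion)."""
    return math.dist(pred, gt) <= threshold_px

def keypoint_precision(preds, gts, threshold_px=10.0):
    """Fraction of predicted keypoints that fall within the threshold."""
    hits = sum(is_true_positive(p, g, threshold_px) for p, g in zip(preds, gts))
    return hits / len(preds) if preds else 0.0
```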
Not only is perceived safety of importance; for the AMR to be integrated into an
environment populated by people, there is also a need for socially compliant
navigation. In a very basic scenario, the human is not distinguished from other objects
and the navigation of the robot is simply optimized around this human, if possible.
However, when people navigate around each other, the interactions adhere to some
sort of cooperation which enables smooth avoidance of collisions [52]. An interface
for humans to communicate with an AMR would allow for easier cooperation, which
in turn would allow for more socially compliant behavior.
Human-Robot Interactions will in this thesis be divided into two main categories:
active interactions, when a user actively gives commands to the AMR, and passive
interactions, when the AMR perceives a person and acts accordingly. Further, a
similar distinction will be made between pedestrians and users: pedestrians are not
actively trying to interact with the AMR, while users are actively trying to engage
the AMR.
When making a robot behave like a human, the approach is to identify the pa-
rameters of the desired behavior [3]. These parameters could be the movement
speed and the time between action and reaction. There are higher-level decisions as
well, such as which side of a person to pass on, which should also adhere to human
convention.
3 User study
When determining functions that are intended to support the usage of technology for
humans, one has to start from the end user's perspective. Therefore, there
must be an understanding of what problems and needs end-users have today
in order to develop as useful functionality as possible. This understanding was
initially retrieved through a market analysis in collaboration with Kollmorgen to
understand what similar functionalities exist on the market today. Given an initial
understanding of what interactions are feasible to perform with the used technology,
a first set of conceptual interaction ideas was then put together:
1. The possibility to tell the AMR to stop with a gesture.
2. The possibility to direct the robot to take a detour when going to a position.
3. The possibility to direct an idle robot to go to a new position.
4. The possibility for the robot to detect if it is moving close to a pedestrian that
has its back turned to the robot and is therefore unaware of the robot. When
the robot perceives this, it will reduce its speed.
These interactions were used as initial candidates for interaction concepts to imple-
ment. They were extracted based on inspiration from the market analysis, inputs
given from people with experience in the business, and also the creativity and
personal experience of the authors.
3.1 Interviews
To get a better insight into what new functions are wanted in the industry and what
makes for a good HMI, three interviews were conducted. One of the interviewees
is a user experience designer for AMRs, and two are working on sites that utilize
AGVs. In preparation for the interviews, a video with demonstrations of the initial
interactions was prepared. The reason for this was to give the interviewees a rough
idea of what concepts are realistic to achieve with the currently used technology. In
the video, an AMR was manually controlled to simulate the desired behavior when
interacting with a user or a pedestrian.
3. The interviewee was asked about scenarios in working with AGVs that are
actually dangerous or perceived as dangerous.
4. The interviewee was asked about annoyances in working with AGVs.
5. The interviewee was asked about what new functionality he or she would like
to see in an AMR.
6. The interviewee was then shown a video of a user interacting with an AMR in
a set of scenarios consisting of what the authors have come up with as well as
scenarios inspired by previous interviews.
7. During the demonstration the interviewee was asked if he or she would find
the functions demonstrated in the interactions useful, and about possible improve-
ments of them.
8. With the hypothesis of the interviewee getting inspiration from the proposed
interactions he or she was once again asked about new interactions that would
be useful.
The results of the interviews, together with a feasibility analysis, led to a set of
functions and concepts to be implemented.
3.2 Interview Results
3.2.1 Interview 1
Based on the answer to question 7, it emerged that users might want to minimize
the amount of work and knowledge needed when handling an AMR. For this reason,
the possibility to redirect an AMR to take a detour can be seen as a feature that
is not worth the extra effort for a user to learn and might not be used in a real
application. It could therefore be an unnecessary feature in comparison to the
potential benefits it could give. This feature was not included in the set of demos
in the subsequent interviews. By similar reasoning, more passive features such as
slowing down when behind a person's back could be considered rewarding, since
they do not take any effort from the user or pedestrian but could still improve the
perceived behavior. A source of inspiration for what intuitive behavior is could be
normal car traffic, where the vehicles are driven by humans.
3.2.2 Interview 2
After the interview in section A.2, the need for passive human-aware navigation was
reinforced. AGVs could sometimes drive too close to a human, at too high a speed
for it to feel comfortable. The desire for functionality where the AGV acknowledges
that it has perceived pedestrians was also brought up. Additionally, new functionality
was mentioned. When a person is performing kitting, where a user is gathering
items from multiple places, an AMR should be able to follow the user from bin to
bin so that the user can put the items on the AMR. This new function was also
incorporated into the video for the next interview.
3.2.3 Interview 3
From the interview in section A.3, it was understood that the speed of the AGV
should be moderated according to the pedestrians in its surroundings. It happens
that AGVs suddenly show up behind a person's back, which is perceived as
uncomfortable behavior. There are times when there is a lot of traffic with both AGVs
and manual trucks, where humans need to be able to know what an AGV is about
to do. There are occasions when AGVs get in the way and trap manual trucks, so
that people have to interrupt their work and manually move the AGVs to solve the
situation. The concept of stopping an AGV with a hand gesture would be useful if
it worked reliably. This would be especially useful in a case where a person, for
example, is driving a manual truck and sees that an AGV is about to get in the way
or trap the manual truck.
3.3 Final interaction concepts
3.3.1 Stopping
The ability to stop the AMR with a hand gesture while it is running could be useful
in multiple situations. Since it will only make the AMR stay in its current position, it
will not affect anything else around it or have any specific prerequisites for performing
the action. As mentioned in A.3, this could give a human the possibility to stop
a running AMR if it is predicted that there will be an issue or clash if it continues.
This way, the manual worker also does not need to go to the controller of the AMR
to stop it, but can instead, in a simpler way, continue with the work he or she is
currently doing.
This behavior could be particularly useful if the AMR notices that the human has
his or her back turned towards it, and it is therefore less likely that the person is
aware that the AMR is heading towards them. If the AMR instead detects the eyes
of a human obstacle, the person is more likely to be aware of the approaching vehicle.
This could reduce the number of uncomfortable situations of surprise.
The behavior could also be different depending on whether it is one person or a group
of people that is in the way of the AMR. If it is one person who is believed to be aware
of the vehicle, it could be considered more likely that the person will move out of the
way. Then one could give the AMR a bit more speed and a smaller distance to the
human before it has to stop. If it instead is a group of people blocking the way, it
could be considered less likely that they will move out of the way, since they could
be having a conversation or be working with something and might not want to be
disturbed. Then the AMR could instead keep a bit more safety distance and a slower
pace when approaching the group.
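As an illustration of how such rules could be expressed on top of an HPE output, the sketch below selects a speed limit and stopping distance based on whether face keypoints are visible (a rough proxy for awareness) and on how many people are detected. All thresholds and values are invented for the example and are not part of the implemented prototype.

```python
def speed_and_distance(poses):
    """Pick a speed limit (m/s) and stopping distance (m) from detected poses.

    `poses` is a list of dicts mapping keypoint names to (x, y, confidence);
    the rules and numbers below are illustrative only.
    """
    def facing_camera(pose, conf=0.3):
        # If eye or nose keypoints are confidently detected, the person is
        # likely facing the robot and therefore aware of it.
        return any(pose.get(k, (0, 0, 0))[2] > conf
                   for k in ("nose", "left_eye", "right_eye"))

    if not poses:
        return 1.5, 0.5            # nobody detected: nominal speed
    if len(poses) > 1:
        return 0.5, 1.5            # group of people: slow down, keep extra distance
    if facing_camera(poses[0]):
        return 1.0, 0.8            # single, likely aware person: moderate speed
    return 0.5, 1.2                # back turned: slow down, larger margin

# Example usage with a single person whose face is not visible
print(speed_and_distance([{"left_shoulder": (200, 150, 0.9)}]))  # (0.5, 1.2)
```

In a real system these values would have to be tuned against the safety requirements of the site.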
people who have experience of working with AGVs or AMRs in different ways.
This gives a good insight into what problems can occur and helps when trying to
identify what specific problems features using HPE could improve or solve.
An important note is that the interviewees were found through the network of Koll-
morgen. Since AMRs are a relatively new field for them, their main customer base
is in the AGV business. The contact with the second and third interviewees was
therefore made through this network. The experiences and thoughts that were given
in those interviews were therefore based on the usage of AGVs and not AMRs. The
experiences were however considered similar enough to be used also for AMRs,
since the daily usage and many of the issues are believed to be similar between the
two. Since the interviewees were found through Kollmorgen, there could potentially
be experiences that only apply to their particular products and do not represent
general usage of AGVs made by other manufacturers.
The given interaction concepts were intentionally a slightly scattered mix of some
different use cases of HPE where the behavior of the AMR could be improved. The
reason for this was to be able to return a general assessment of what kind of in-
teractions were interpreted as the most valuable improvements by end-users. An
important distinction between interactions is between those that are considered active
and those that are passive from a human's perspective, as described in subsection 2.6.1.
In this case, the interactions that stop the AMR and move an idle AMR by pointing
are considered active interactions since they require an input performed by a human
to which the AMR is supposed to respond in a certain way. The interactions
where the AMR reduces its speed and where it follows a person are instead consid-
ered as passive interactions since they could be performed continuously without any
specific action by a human. The follow person interaction would however probably
need some kind of initial action by a user to be activated, since this kind of function
normally is not the primary task of an AMR.
4 Evaluation of HPE methods
4.1 Metrics
Each model is evaluated based on the following metrics:
• Frequency when the model is running on a CPU and/or a GPU, in frames per
second (fps)
• Latency when the model is running on a CPU and/or a GPU in milliseconds
(ms)
• Precision of the model according to their respective research papers given in
Average Precision (AP)
• Occlusion handling of each model, which is an observed estimation of how
well the model estimates a pose when parts of the body are missing
• False positive detection, which refers to the extent to which the model returns
poses that are observed to be false detections, i.e. not actual human poses
To be able to deploy HPE on a real-time system, the system must be able to esti-
mate poses at a sufficient frequency to interpret the poses intended by a human. It
is also important not to have too large a latency, so that an intended pose is
interpreted within a sufficient time from when it was performed. Good precision of
a model increases the certainty that an estimated pose is correct, but it could also
be an indication of a model that is computationally heavy to run on light systems
and might therefore require a GPU to run sufficiently well.
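A simple way to measure the frequency and latency metrics for a model is sketched below: inference is run in a loop on frames from the camera and the per-frame processing time is averaged. The estimate_pose and get_frame functions are placeholders for whichever HPE package and camera interface is used, so this is an assumed setup and not the exact procedure used in the tests.

```python
import time

def benchmark(estimate_pose, get_frame, n_frames=100):
    """Measure average latency (ms) and frequency (fps) of a pose estimator.

    `estimate_pose` and `get_frame` are placeholders for the HPE model call
    and the camera/image source, respectively.
    """
    latencies = []
    for _ in range(n_frames):
        frame = get_frame()
        start = time.perf_counter()
        estimate_pose(frame)            # run inference on one frame
        latencies.append(time.perf_counter() - start)
    avg = sum(latencies) / len(latencies)
    return avg * 1000.0, 1.0 / avg      # latency in ms, frequency in fps
```

In practice, the latency experienced on the robot also includes camera capture and communication overhead, so a loop like this mainly captures the inference part.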
Since the camera has a limited Field of View (FoV) from where it is placed on the
AMR, it is likely that only parts of a body are visible in an image. Therefore, it would
be good if the model is able to estimate human poses even if an entire body is not
visible. This is what the metric occlusion handling refers to. It is also crucial that
the poses given by the model correspond to actual human poses, and not to poses
found in other features of the input image. This is what false positive detection
refers to, which is an observed estimation of how many false poses are given by a
model.
All these metrics are factors that have to be taken into account to determine which
model suits the application the best. It is challenging to find a good balance between
sufficient precision and reliability of the poses. At the same time, the model cannot
be too computationally heavy for the system to run in real-time.
The RGB-D camera that has been used during the tests is an Intel Realsense D435i.
The performance of the DL models that run on a CPU has been retrieved with a Dell
Latitude 7280 PC with a 7th generation Intel i7 CPU running the Linux Ubuntu
18.04 LTS operating system. The performance when running on a GPU has been
retrieved with an NVIDIA Jetson AGX Xavier industrial PC with a 512-core Volta
GPU, also running the Linux Ubuntu 18.04 LTS operating system.
4.3.1 OpenPose
With this method, a bottom-up approach is taken where each keypoint is estimated
as a confidence map while Part Affinity Fields (PAFs) are estimated in parallel. The
PAFs estimate the direction from each joint to the next [48]. For example, all left
shoulders and elbows are estimated in terms of a heat-map. Simultaneously, a PAF
is estimated linking the correct left shoulder to the correct left elbow, associating
the keypoints of each person together.
The implementation used is a package called trt_pose, which is given out by NVIDIA
to run real-time pose estimation on the NVIDIA Jetson modules [55]. This is also
the model used for HPE in the NVIDIA ISAAC SDK, which is a toolkit for the
deployment of robots powered by AI [56]. It is therefore assumed that this model is
optimized for running inference in real-time, which is positive for this use case.
When this OpenPose implementation is run on the system using a GPU, it runs
with a frequency of 12 fps and a latency of 100 ms, which is a really good
performance for real-time applications. The model has an average precision of 63.5%
according to the research report [48], which is a good number. It is also very good
at detecting humans even if an entire body is not visible, and it does not give many
false estimations, which makes it seem like a very reliable model to use.
4.3.2 Lightweight OpenPose
The implementation of Lightweight OpenPose used is a model given out by the Intel
Open Visual Inference and Neural network Optimization (OpenVINO) toolkit, which
is a DL toolkit for CPU usage that supports quick development of many different
types of applications, for example human vision, speech recognition, natural language
processing, and recommendation systems [58]. It includes scripts for extracting and
interpreting the data given by the model and also for plotting the estimated keypoints
on the original image. Since Lightweight OpenPose, similar to OpenPose described
above, is given out by a toolkit for applications of DL, it is assumed that this model
is optimized for running inference in real-time.
The model has been benchmarked with a frequency of 13 fps and a latency of
160 ms when run in the simulation environment on a CPU, which is a really good
performance. The model has a precision of 42.8% according to its research
paper [57]. Similar to the OpenPose model running on the GPU, this model is also
very good at occlusion handling and understands when only parts of a body are
present. It has been noted to give some false pose estimations when run on the
real robot, which is something that has to be dealt with before use in a real-time
application. Since the model is created for performing inference on CPU hardware,
which requires lighter models and fewer calculations, the performance is considered
good due to its high frequency, low latency, and perceived reliability despite having
a relatively low precision.
4.3.3 FastPose
The model called FastPose is a solution given by an open-source library that per-
forms 2D pose detection on input images. It is a top-down method where it uses
the model called You Only Look Once (YOLO) [59] for object detection. Based on
the bounding boxes of humans found by the YOLO model, a custom Mobilenet v1
using 50% of the parameters is used as a feature extractor [60]. On top of this, a
basic bounding box tracking algorithm is used to keep track of a person for as long
as the person stays in the image [61].
The FastPose package is designed to be easy to use and implement in Python. The
package includes scripts that simplify the use such that the extracted keypoints
are given directly, and there are support scripts that plot the keypoints on the
original image. When benchmarking the model it gives a performance of 25 fps and
a latency of 90 ms when it is run on a CPU, which is extraordinarily good. There
is no precision value found for this model, but it seems somewhat unstable since it
gives quite inconsistent estimations, which indicates a lower precision. It is also
noted that the model seems to always expect an entire body when given a bounding
box interpreted as a human by the classifier, which makes it squeeze in legs quite
randomly for a body that is only visible from the waist up, even if it detects
the upper body correctly. It does however give good estimations when a full body is
visible, but considering the limited FoV of the camera, it is common for only
a part of a body to be visible. False pose estimations occur with a relatively high
frequency when a human is not present in the image, which makes the model slightly
difficult to work with and requires post-processing to filter out the false estimations.
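One simple form of such post-processing, sketched below under the assumption that each keypoint comes with a confidence score, is to discard poses that have too few confidently detected keypoints or too low an average confidence; the thresholds are illustrative and would have to be tuned for the actual model.

```python
def filter_poses(poses, min_keypoints=6, min_mean_conf=0.4, kp_conf=0.2):
    """Discard likely false-positive poses.

    `poses` is a list of poses, each a list of (x, y, confidence) keypoints.
    Thresholds are illustrative and would need tuning for a real system.
    """
    kept = []
    for pose in poses:
        confident = [c for (_, _, c) in pose if c > kp_conf]
        if len(confident) < min_keypoints:
            continue  # too few visible keypoints to trust the detection
        if sum(confident) / len(confident) < min_mean_conf:
            continue  # detected keypoints are individually too uncertain
        kept.append(pose)
    return kept
```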
4.3.4 HRNet
HRNet is the official implementation of the research paper Deep High-Resolution
Representation Learning for Human Pose Estimation [62]. The novelty of this
approach is to maintain a high-resolution representation of the image while also
running high-to-low-resolution sub-networks, keeping the computational complexity
and the number of parameters efficient. The model is run with the top-down
method, where it uses bounding boxes of persons produced by the FasterRCNN
network [63]. The network is a top competitor in many pose estimation challenges,
and has been further used as a backbone for other high performing models [64].
4.3.5 Higher-HRNet
The model Higher-HRNet is a novel bottom-up approach for HPE. As with the
previously presented HRNet, Higher-HRNet uses high-resolution representations of
the images within its network [67]. This model however uses a bottom-up approach,
which makes it fundamentally different from the previous HRNet. The idea behind
this is to be able to execute the HPE task also on small objects in an image.
As for the previously presented HRNet model, there is a similar simplified open-
source package called simple-HigherHRNet, which contains support scripts that sim-
plify the implementation of the model [68]. When run on a CPU the model estimates
keypoints with a frequency of 0.51 fps and a latency of 2300 ms, which is not usable
at all. When running the model on a GPU instead, the performance increases to
1.8 fps and a 2200 ms latency, which is better, but not quite good enough for use in
a real-time application. According to the authors of the model it has a precision of
70.5%, which is very good [67]. It is also very good at occlusion handling, and no
false pose estimations have been observed when testing the model.
4.3.6 PoseNet
PoseNet is a bottom-up approach for pose estimation and instance segmentation of
multiple persons in an image using an efficient single-shot model [69]. The model is
used as a demo example for pose estimation by TensorFlow, which is a commonly
used machine learning library in Python that is developed and maintained by Google.
When run on a CPU, the model estimates keypoints at a frequency of 8 fps with a
latency of 220 ms (see Table 4.1), which is a very good performance for running on
a CPU only and makes it well suited for a real-time application. When running the
model on a GPU instead, the frequency increases to 13 fps and the latency drops to
140 ms, which is even better. The model has a precision of 68.7% according to the
authors of the model, which is considered good [69]. It is also good at occlusion
handling and no false pose estimations have been observed, which gives a robust
impression.
4.3.7 Cubemos
There is a Skeleton Tracking SDK for Intel Realsense Cameras given out by Cube-
mos, which is a paid-for package for real-time estimation on CPU hardware [71].
The package is designed to be simple to implement and to use in applications. Since
this package is not open source, it requires a valid license to run. This package is
included in the study since it is considered a good reference to compare against the
other solutions considered in this project.
A test version of the package has been tried out. It should be noted that it was
not fully incorporated in the simulation environment like the rest of the models
when tested, so the retrieved performances might be slightly misleading. When run
in a real-time demo on a CPU the model had a frequency of 10 fps and a very
small latency. It is good at detecting poses even if parts of the body are missing.
Some false pose estimations occur, but overall it has very good performance and
reliability.
Objective metrics

Models               | FPS CPU | Latency CPU (ms) | FPS GPU | Latency GPU (ms) | Precision AP (%)
OpenPose             | -       | -                | 12      | 100              | 65.3
FastPose             | 25      | 90               | -       | -                | -
Lightweight OpenPose | 13      | 160              | -       | -                | 42.8
HRNet                | 1.85    | 1100             | 2.5     | 1350             | 76.3
Higher-HRNet         | 0.51    | 2300             | 1.8     | 2200             | 70.5
PoseNet              | 8       | 220              | 13      | 140              | 68.7
Cubemos*             | 10      | Very small       | -       | -                | -

Table 4.1: Table summarizing the objective results of the HPE model comparison
Observational metrics

Models               | Occlusion handling | False positive detection | Implementation language | Implementation simplicity
OpenPose             | Good               | Never                    | Python                  | Slightly tricky
FastPose             | Bad                | Often                    | Python                  | Straight forward
Lightweight OpenPose | Good               | Sometimes                | C++                     | Slightly tricky
HRNet                | Sufficient         | Never                    | Python                  | Straight forward
Higher-HRNet         | Good               | Never                    | Python                  | Straight forward
PoseNet              | Good               | Never                    | Python                  | Slightly tricky
Cubemos*             | Good               | Sometimes                | Python or C++           | Simple

Table 4.2: Table summarizing the observational results of the HPE model comparison.
Here it should be noted that the model called Cubemos is not tested on the same
simulation environment as the others, which is why it has a * behind it. If a model
has not been run on certain hardware or if certain information has not been found,
it is represented with a - in the table.
The set of DL models that has been retrieved and tested on the system is only
a small selection of the models available online as open-source software. It does however
contain models with significant differences compared to each other, where there are
both top-down and bottom-up models represented as well as models that are run on
both CPU and GPU. A key factor in this application is that the AMR system is cur-
rently being run on a CPU only, which makes models that run on CPUs particularly
interesting to explore. As described in subsection 2.4.3, the main reason why DL
has made such a breakthrough during the last decade is the increased computational
power that is offered by modern GPUs, which makes it possible to train deep net-
works. Therefore most models are developed on GPUs, and there are significantly
fewer alternatives that run on CPUs.
In the results shown in subsection 4.3.8, one can see a clear increase in frequency
and decrease in latency for the models that have been run on both types of hardware
when they are run on the GPU. It is therefore a clear advantage to have a GPU
available in the AMR system if possible. But since industrial GPUs are relatively
expensive compared to CPUs, it is still relevant to examine whether a CPU-only
solution can be sufficient.
However, an important note is that the methods have been tested in a simulation
environment running on a PC for the CPU tests and on an NVIDIA Jetson AGX
Xavier for the GPU tests. This is not the same hardware setup that would be used if
it were run on the real AMR system, and it is therefore likely that the retrieved
frequency would be somewhat lower and the latency somewhat higher due to
differences in computational power between the CPUs. The results should therefore
be interpreted as general guidance on how each model could potentially perform
rather than as an expectation of the same particular performance as retrieved in
these tests.
From what the tests have shown in this thesis project, it is considered enough to
use a model that only requires a CPU to run. From running tests on the system, it
is estimated that a frequency of at least 2-3 fps is required to be able to interpret
possible intended interactions from a person in a stable way. A latency of more than
1000-1500 ms is also difficult to work with from a human's perspective, since it causes
a slow response time that might lead to confusion about whether the AMR has
interpreted an interaction or not.
When running real-time inference on a CPU, relatively light models are required
to be able to run them at a sufficient frequency. This is because of the reduced
computational power available in CPUs compared to GPUs as described in subsec-
tion 2.4.1. As subsection 2.4.5 explains, the drawback with this is that the pose
estimation performance of the models is likely to be reduced. This is therefore a
trade-off that has to be made when choosing which model to use in the application.
This drawback is also something that has been noted from the results when testing
the models. The FastPose method is a good example of when a model is too light,
which affects the pose estimation performance significantly. The model runs on the
system with a very high frequency and gives sufficient estimations when there are
full human bodies in the image. The big problem is when there is only half a body
visible as described in subsection 4.3.3, which is a common case on the AMR system
due to the limited FoV of the camera. It is also giving some false estimations of
poses when people are not visible, which makes the model too unreliable to use in
the intended application.
On the other end of this spectrum are the HRNet and Higher-HRNet models, which
both give reliable estimations with no false pose estimations, but run at a much
lower frequency and a much higher latency than FastPose, both on a CPU and on
a GPU. When running them on the AMR system they therefore do not quite
interpret the local environment in a sufficient way for a real-time application, mainly
due to the delay caused by the latency.
The models that have given the best performance when run on a CPU are the
Lightweight OpenPose and PoseNet models, which can run at good frequencies and
do not have a latency that affects the behavior of the system. They are also good
at producing pose estimations even if only parts of a body are visible and do not
give too many false estimations, which are very good properties for this kind of
application. They are therefore considered good options to use if only a CPU is
available. On the GPU, both the OpenPose and the PoseNet model run with a
convincing performance, both in frequency and reliability.
The open-source models also perform similarly to, if not better than, the paid-for Cubemos model described in subsection 4.3.7. The implementation of the open-source models is also relatively simple, so there is no strong reason to use the paid-for model from that aspect either. It is therefore considered unnecessary to pay for a licensed model such as Cubemos when there are open-source options with similar performance and simplicity.
Another note is that the bottom-up methods seem to be better at occlusion handling than the top-down methods. The top-down methods FastPose and HRNet have shown slightly worse flexibility when only a part of a body is visible. HRNet does not give any false-positive estimations, but it seems to require more of the body to be visible than the bottom-up methods in order to detect the pose. A reason could be that the bottom-up methods, as described in subsection 2.5.2, detect body parts individually before connecting them into a full-body context, which might make them more suitable for interpreting humans in the local environment when the camera range is limited.
The models that have been examined during this thesis project are considered to give a good overall representation of the different general approaches available as open source for performing HPE. An important note, however, is that the tested models do not necessarily represent the best options available, even though some of them are considered sufficient for this kind of application. As mentioned in section 1.4, the model packages were selected because they include support scripts for interpreting the output, which is a limiting factor when choosing which models to examine. The reason for this is that it simplifies the implementation of the models significantly, since different models commonly have different kinds of raw output depending on the HPE approach. Interpreting the output would otherwise require a more extensive theoretical understanding of each method individually, which is time-consuming and was considered unnecessary for this more general study. It could, however, be worth looking into if one decides to proceed with the HPE concept to develop a final functionality for the AMR as a product, since there are methods with good performance that are not as well documented as the ones tested in this project.
The observational metrics that have been retrieved should be interpreted as general guidelines, since they are subjective observations of how well each model seems to perform on each metric. An important note here is that the Covid-19 pandemic has been ongoing during this thesis project, which has constrained the tests and the collection of results to an isolated home environment only. Due to this, the results have been obtained from a simulation environment only and not on the real AMR system, which would otherwise have been preferred. It also means that the models have not been tested in a warehouse- or factory-like environment, which could potentially reveal other behaviors or issues with the models that have not been observed under these circumstances.
5 Design and Implementation
The implementation of each interaction from chapter 3 has two main parts. Firstly, the poses need to be interpreted, and secondly, the AMR needs to respond in a certain way. Additionally, both the interpretation and the response need to be robust and intuitive. Human Pose Estimation methods can perceive keypoints as detailed as joints, eyes, ears, and noses, which limits the level of detail at which a human pose can be perceived. For an interaction to be intuitive, the gesture used should come naturally to the user when a corresponding response from the robot is desired, and the AMR should in turn respond as expected from the user's point of view.
Due to this design, the robot can rotate around the z-axis and move in the x-direction. The robot can therefore be controlled by sending commands for linear and angular velocity independently, and the transformation from these commands to the actual actuation of the robot is handled automatically.
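As a minimal sketch, such commands can be sent from ROS as geometry_msgs/Twist messages, where only linear.x and angular.z are used; the topic name /cmd_vel below is an assumption for illustration and not necessarily the topic used on the actual system.

#!/usr/bin/env python
import rospy
from geometry_msgs.msg import Twist

rospy.init_node("motion_command_example")
pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)  # assumed topic name

cmd = Twist()
cmd.linear.x = 0.3    # forward velocity [m/s]
cmd.angular.z = 0.2   # rotation around the z-axis [rad/s]

rate = rospy.Rate(10)
while not rospy.is_shutdown():
    pub.publish(cmd)  # repeatedly publish the velocity command
    rate.sleep()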
Since the color sensor's FoV is smaller than that of the depth sensor, the FoV of any combined image is that of the color sensor.
The camera is also used for collision avoidance, so the horizontal FoV needs to be maximized. Because of this, the camera is mounted in landscape mode, with the horizontal FoV being the widest. The camera is also tilted upwards so that the lower edge of the field of view runs along the floor, while still keeping part of the floor in view. A certain distance is then needed to perceive the full pose of a 1.8 m tall human. As mentioned, the camera is mounted at a height of 0.18 m, which puts the theoretical distance needed at slightly more than 2 m. This is illustrated in Figure 5.2.
Figure 5.2: Camera placement and its effect on the distance necessary to perceive the full pose of a human.
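As a rough sanity check of this distance, assuming the lower edge of the vertical field of view is parallel to the floor, the required distance follows from simple trigonometry; the vertical FoV value below is an assumed placeholder, since the exact value depends on the camera and mode used.

import math

CAMERA_HEIGHT = 0.18     # camera mounting height [m]
PERSON_HEIGHT = 1.80     # height of the person to perceive [m]
VERTICAL_FOV_DEG = 39.0  # assumed vertical FoV in landscape mode

# Distance at which the top of the head just fits inside the field of view.
distance = (PERSON_HEIGHT - CAMERA_HEIGHT) / math.tan(math.radians(VERTICAL_FOV_DEG))
print("Required distance: %.2f m" % distance)  # roughly 2 m with these values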
Additionally, the further away from the camera a human pose is, the less accurately it is perceived. The interactions that will be implemented are all relevant when the human is in close range of the robot, so not much useful information can be derived from poses that are more than 5 m away from the camera. Therefore, the range of view is cut off at 5 m.
Figure 5.3: Images of the robot and the camera used in this thesis.
5.2 3D poses
As described in chapter 3, many of the interactions require estimating the human pose in 3D space. The 2D poses were transformed to 3D by a simple sensor fusion method. To enable 3D HPE from 2D HPE and depth data from an RGB-D camera, the depth image is first aligned with the color image. This process enables a pixel-by-pixel match between the color and depth images. The 2D pose estimation can then be fused with the depth data. Using the focal length as well as the principal point of the frame results in a pose estimation in 3D space. The rotation of the camera is unknown due to uncertainty in mounting, and the poses need to be rotated so that they are aligned with the floor. How this is done is described in section 5.3.
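As a minimal sketch of this fusion step under the pinhole camera model, each 2D keypoint can be back-projected using the depth value at its pixel; the intrinsic values below are placeholders, the depth image is assumed to be aligned to the color image and given in meters, and the function name is hypothetical.

import numpy as np

# Placeholder intrinsics of the color camera (focal lengths and principal point).
fx, fy, cx, cy = 610.0, 610.0, 320.0, 240.0

def keypoints_to_3d(keypoints_2d, depth_image):
    # keypoints_2d: list of (u, v) pixel coordinates from the 2D pose estimator
    # depth_image: depth aligned to the color image, in meters
    points = []
    for u, v in keypoints_2d:
        z = depth_image[int(v), int(u)]
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        points.append(np.array([x, y, z]))
    return points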
Not only the individual keypoints of each pose are of interest for the interactions; some data related to the full pose is also of interest. When building the 3D poses, four such measurements are calculated by averaging over the keypoints.
For the subsequent estimates, both the gyroscope and the accelerometer are used to update the rotation, which improves the initial estimate. First, the change in rotation measured by the gyroscope is added to the current rotation estimate, and then the rotation measured by the accelerometer is filtered into this estimate. The output is a rotation matrix that is used to rotate each pose in 3D space.
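This kind of blending can be sketched as a simple complementary filter; the sketch below estimates a single tilt angle only, and the axis convention, variable names, and filter weight are illustrative assumptions rather than the actual implementation.

import math

ALPHA = 0.98  # weight given to the gyro-integrated estimate

def update_tilt(tilt_prev, gyro_rate, accel_x, accel_z, dt):
    # 1) Propagate the previous tilt estimate with the gyroscope rate [rad/s].
    tilt_gyro = tilt_prev + gyro_rate * dt
    # 2) Absolute but noisy tilt from the gravity direction measured by the
    #    accelerometer (axis convention assumed for illustration).
    tilt_accel = math.atan2(accel_x, accel_z)
    # 3) Blend the two estimates into the new tilt estimate.
    return ALPHA * tilt_gyro + (1.0 - ALPHA) * tilt_accel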
One of the remaining 3D poses is then chosen as the main user. The main user is the one that is actively communicating with the AMR, and the choice of user is not trivial. In this thesis, the user choice was set to be based partially on how centered the pose is in the image and partially on its distance in meters from the camera. The centeredness of a pose is expressed in radians away from the center of view of the camera. The ratio of importance between how centered the pose is and its distance from the robot is set to 1 : 10.
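One possible reading of this weighting is sketched below, where each candidate pose is scored by a weighted sum of its angular offset and its distance, and the lowest score wins; the data layout and the exact interpretation of the 1:10 ratio are assumptions.

W_CENTER, W_DIST = 1.0, 10.0  # assumed interpretation of the 1:10 ratio

def choose_main_user(poses):
    # poses: list of dicts such as {"angle_offset": 0.1, "distance": 2.3},
    # where angle_offset is in radians and distance in meters.
    if not poses:
        return None
    return min(poses, key=lambda p: W_CENTER * abs(p["angle_offset"])
                                    + W_DIST * p["distance"])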
The flow of data was first implemented within one node, where everything happened sequentially in one process. To increase the modularity, the node was instead split into three parts: a camera interfacer node, a pose estimator node, and a robot interactor node.
The resulting first architecture of the nodes is illustrated in Figure 5.6, where the robot interactor node is responsible for 2D- to 3D pose fusion.
Figure 5.6: Architecture of the node structure with 2D- to 3D pose fusion in the robot interactor node.
The reason for this architecture choice was to put the camera interfacer and robot interactor nodes on the robot and the pose estimator node on separate hardware. This minimizes the data sent between the hardware units, since such transfers could otherwise introduce latency.
However, the inference time of the pose estimator varies. Since the 2D pose needs to be fused with the depth data captured at the same time, the time synchronization of these two ROS messages was unstable. If the pose estimator node is instead made responsible for the 3D pose estimates and the proper rotation of the estimates, the messages are sent more sequentially and the time synchronization is simplified. Additionally, in the second architecture, all interactions are done in 3D, so that turned backs are detected by the robot interactor node. This second architecture is illustrated in Figure 5.7.
Figure 5.7: Architecture of the node structure with 2D- to 3D pose fusion in the
pose estimation node.
Since the pose estimator node is the bottleneck in this sequence, a camera interfacer node used only for this purpose should have its publishing rate set similar to the pose-estimation rate. A higher rate than necessary would otherwise lead to unnecessary computational cost from publishing unused images.
The ability to follow a user requires the user to stay in the field of view of the camera. Therefore, a controller was implemented as two P-controllers: one that controls the rotation to keep the user in the center of the image, and one that keeps a distance of 0.5 m from the user. Due to the drive configuration of the robot described in section 5.1, the rotation and linear velocity can be controlled separately. As mentioned in chapter 3, the speed needs to be within a comfortable range, and the speed should also not be large enough to pose a danger to the motors. Therefore, the output linear and angular velocities are saturated at the same maximum values as when using the Navigation Stack: 0.5 m/s and 0.5 rad/s. If no pose is detected, the linear and rotational velocities are instead set to zero. An illustration of the controller can be seen in Figure 5.8.
Figure 5.8: The P-controllers for follower mode, where r_x,ref and Θ_z,ref are the reference distance to the perceived main pose and the reference angular deviation from having the pose in the frame center, v_x and ω_z are the linear and rotational movement commands, and r_x,error and Θ_z,error are the corresponding distance and angle errors.
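A minimal sketch of these two P-controllers is given below; the gains and sign conventions are illustrative assumptions, while the 0.5 m reference distance and the 0.5 m/s and 0.5 rad/s saturations follow the description above.

K_DIST, K_ANG = 0.8, 1.5     # illustrative proportional gains
REF_DIST = 0.5               # desired distance to the main user [m]
MAX_LIN, MAX_ANG = 0.5, 0.5  # saturation limits [m/s], [rad/s]

def clamp(value, limit):
    return max(-limit, min(limit, value))

def follow_command(main_pose):
    # main_pose: {"distance": meters, "angle_offset": radians}, or None
    if main_pose is None:
        return 0.0, 0.0  # stop if no pose is detected
    v_x = clamp(K_DIST * (main_pose["distance"] - REF_DIST), MAX_LIN)
    w_z = clamp(-K_ANG * main_pose["angle_offset"], MAX_ANG)  # assumed sign convention
    return v_x, w_z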
This interaction requires detecting whether a person has their back turned towards the robot or not. Having the back turned towards the robot is defined by the shoulder, hip, elbow, or feet pair being "flipped". This means, for example, that if a person's anatomical left shoulder is to the left of the right shoulder from the robot's point of view, the back is turned towards the robot. This is illustrated in Figure 5.10. As can be seen in Figure 5.10, the Human Pose Estimator can distinguish between left and right limb pairs. When a back is detected, the speed is scaled down to a set threshold velocity.
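A minimal sketch of this "flipped pair" check is given below, assuming the keypoints are available as image x-coordinates from the robot's point of view; the keypoint names and data layout are illustrative assumptions.

# Pairs of anatomical left/right keypoints checked for being "flipped".
PAIRS = [("left_shoulder", "right_shoulder"),
         ("left_hip", "right_hip"),
         ("left_elbow", "right_elbow"),
         ("left_ankle", "right_ankle")]

def back_turned(keypoints_x):
    # keypoints_x: dict mapping keypoint name -> image x-coordinate, if detected
    for left, right in PAIRS:
        if left in keypoints_x and right in keypoints_x:
            # The anatomical left side appearing to the left of the anatomical
            # right side in the image means the person is facing away.
            if keypoints_x[left] < keypoints_x[right]:
                return True
    return False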
The ROS Navigation Stack sends velocity commands to the AMR that control its movement towards a given destination. When the ROS Navigation Stack is used together with this feature, the calculated velocity commands are filtered and altered before being sent to the robot. This is achieved by remapping the output control command message from the Navigation Stack to an intermediate message. An intermediate node then listens to this message, scales it, and sends the altered movement message to the robot.
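A minimal sketch of such an intermediate node is shown below; the topic names and the scale factor are assumptions for illustration, not the actual configuration of the system.

#!/usr/bin/env python
import rospy
from geometry_msgs.msg import Twist

SCALE = 0.5  # applied when a turned back is detected (set by the detector)

def on_nav_cmd(msg):
    # Scale the Navigation Stack command before forwarding it to the robot.
    msg.linear.x *= SCALE
    msg.angular.z *= SCALE
    cmd_pub.publish(msg)

rospy.init_node("velocity_scaler")
cmd_pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)      # assumed robot topic
rospy.Subscriber("/nav_cmd_vel", Twist, on_nav_cmd)             # assumed remapped topic
rospy.spin()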
Figure 5.10: Illustration of the perceived front and back of a human body and its corresponding estimated pose. The turquoise limb is the anatomical right arm of a person and the green limb is the left arm. The red limb corresponds to the right leg and the yellow limb to the left leg.
The pointing gesture is interpreted when a hand is raised above its elbow. The desired point is then the intersection of the line through the eye and the hand with the floor plane. The floor plane is set as a horizontal plane at a set distance below the camera. This distance is the height at which the camera is mounted, which can vary between different AMR setups and is therefore given in the specifications of each individual AMR system. This can be done since the 3D pose is accurately rotated according to the IMU data. The desired yaw angle of the robot is such that it directs its front towards the user. These definitions are illustrated in Figure 5.12.
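A minimal sketch of this projection is given below, assuming the 3D keypoints are expressed in a frame where the z-axis is vertical, the camera is at the origin, and the floor lies at z equal to minus the mounting height; the names and frame convention are assumptions for illustration.

import numpy as np

CAMERA_HEIGHT = 0.18  # camera mounting height above the floor [m]

def pointed_floor_position(eye, hand):
    # eye, hand: np.array([x, y, z]) keypoints of the main user
    direction = hand - eye
    if direction[2] >= 0.0:
        return None  # the arm points upwards, no intersection with the floor
    # Solve eye + t * direction for the point where z equals the floor height.
    t = (-CAMERA_HEIGHT - eye[2]) / direction[2]
    return eye + t * direction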
5.7 Results
The performance of each interaction was analyzed with regard to stability and usability in the intended usage scenario. The main HPE method used when evaluating the performance was OpenVINO's implementation of Lightweight OpenPose.
When the right hand is pointing across the body, the projected point is less accurate but still within around 20 cm of the point pointed at. The performance is illustrated in Figure 5.13, where the user is directing the robot to the position of a small wooden horse.
Figure 5.13: Illustration of the performance when directing the robot by a gesture. The user is pointing at the position of a small wooden horse, shown more clearly in Figure 5.13(a). In Figure 5.13(b) the small wooden horse can be seen as a red blur to the left. The point pointed at, as perceived by the robot, is illustrated as a red point, and the resulting vector describing the position and rotation of the movement command is illustrated as a red arrow.
The navigation to the point pointed at is poor when using the Navigation Stack, especially if the point is close to the user. The Navigation Stack is sometimes slow to navigate to the point, and in some cases it completely fails at finding a path. This means that the implementation loses some robustness due to these problems with the Navigation Stack.
As the results show, the interactions from the user study in chapter 3 were successfully implemented. The intended use of each interaction is potentially useful according to the conducted interviews; however, this was not verified by actually testing the final result on the interviewees. The resulting implementation also uses separate nodes to make it fit better in the context of other uses of the camera and the rest of the AMR system.
The problem of discarding poor poses and choosing the main pose to listen to is a large source of instability. For many pose detection methods, some sort of score that describes the estimated accuracy of each pose can be extracted. However, this score was found not to be a reliable metric on its own for discarding detected poses. Sorting based on feasibility works better, and the added 3D data can be used to sort out bad poses. Not only is the discarding of poor poses improved in 3D; due to the instability of simple 2D poses, the selection of which pose is the main user was improved as well. This could still lead to instability in the main-user choice, so some object tracking based on the midpoint of each pose could be implemented as well. This would give more accuracy in more complex scenarios, but the need for it was not tested.
Similarly to the problem of poses not being tracked, the implementation interprets the poses of each frame separately and not in sequence. This limits the possible interpretations and is the reason why the "point by hand" gesture is difficult to implement in an intuitive way that avoids false-positive readings. A more intricate state machine would solve this problem and enable a more intuitive interaction.
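As a minimal sketch of the idea, a gesture could be required to persist over several consecutive frames before it triggers a response; the class and the frame threshold below are illustrative assumptions, not part of the actual implementation.

HOLD_FRAMES = 6  # e.g. roughly 2 s at 3 fps

class PointingGestureFilter:
    def __init__(self):
        self.count = 0

    def update(self, gesture_detected):
        # Call once per frame; returns True only when the gesture has been
        # observed in HOLD_FRAMES consecutive frames.
        self.count = self.count + 1 if gesture_detected else 0
        return self.count >= HOLD_FRAMES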
The camera needs to be placed at a reasonable height, meaning not higher than the robot and not lower than the floor. The placement that minimizes the distance needed between robot and human is then at floor level, tilted upward so that the limit of the field of view is parallel to the floor. Since this is not very realistic, the mounting height of 0.18 m is good enough, and for the floor to be visible the camera needs to be slightly tilted down. On the other hand, if the camera were rotated 90 degrees, the distance needed in a similar scenario would be only 0.7 m. This latter scenario would however limit the horizontal FoV, which would have negative implications for interactions such as the follower mode, as well as for other possible parallel usages of the camera such as obstacle avoidance.
The implementation is designed as separate ROS nodes, which opens up the possibility of using a pose estimator written in a different language than the other functions, such as Python, in which many of the HPE methods are implemented. The publishing rate of the camera interfacer was set to not be faster than the pose estimator, and this rate depends on which pose estimator is used. The camera would however fill multiple purposes when used on the full AMR system, so in that case the publishing rate should simply not be lower than the pose-estimation rate, in order to maximize the rate of the full solution.
In a first iteration, the Navigation Stack, described in subsection 2.3.3, was used. This resulted in no control for keeping the user in the horizontal field of view, leading to poor behavior in a real usage scenario. When a P-controller was used instead, this problem was solved, with the consequence that the user is responsible for not steering the robot into an obstacle. P-controllers were chosen over more advanced ones because they are simple to implement and they worked adequately for the purpose. Further, from the interviews it became apparent that the robot should not appear too quick to react when around a pedestrian or user. The problem of the instantly reacting controller could be mitigated by only moving once the user has changed position by more than a certain threshold.
In the Lightweight OpenPose implementation, the legs only get correctly linked together when the hip is detected. This requires the robot to perceive almost half of the body, and within a range of 0.5 m this is not possible. Using the Navigation Stack, the robot could continue navigating to a point as close as 0.5 m to the user, but there are limitations to this as well. In the final implementation, the robot stops as soon as it cannot perceive the user. There is a theoretical risk of it picking up a new user, the previous second-best candidate in the user choice, but this has not been seen in the simulations yet. One alternative is to use the Navigation Stack to autonomously continue after losing track of the user, but this assumes things about a user who is no longer in the field of view. Another possible solution is to use LiDAR tracking for this application instead, or at least use the LiDAR when the robot is too close to the user. This latter solution was not tested, but there are previous studies tracking users with only a LiDAR, which show that this could be done.
The robustness of the back-detection implementation depends on how well poses are detected. If the poses are filtered correctly, the current implementation works well. Because of limitations in the field of view, the legs of a person might not be linked to each other, which leads to that person not being labeled as having their back turned to the robot. The consequence is that a turned back is not detected and the robot continues to move at full speed.
5.8.5 Summary
The implementation of the final interaction concepts has been tested and run in a simulation environment where the AMR system successfully responds with the desired behavioral changes. This shows that the concepts work as a proof of concept. It would however be preferable to test them on a real AMR system in a realistic environment, which has not been possible due to the pandemic. This would give much better overall feedback on how the system responds and how much the behavior has improved. To bring the functionality into the product, it would be preferable to have an iterative process with end-user tests, where users could give feedback on if and how it improves their work environment and what additional changes should be made. This kind of contact would be very valuable in order to design improvements that are as useful as possible.
6 Conclusion
In section 1.2 three research questions were presented which have been the basis of
what has been analyzed throughout the thesis. In this chapter, these questions are
answered and briefly discussed. The first question to answer was:
Q1: From the perspective of a person working in the same area as AMRs, what
features could improve the behavior of the AMR system using HPE?
As presented in chapter 3, four interaction concepts have been identified which are believed to have the potential of improving the perceived behavior of the AMR system. There is, for example, the concept of reducing the speed of the AMR when a human is identified nearby, and the function where the AMR follows a person to assist when picking material in a warehouse. These two concepts are mainly considered passive interactions, since they do not require any particular actions from a person in order to be executed. There is also the concept of pointing an idle AMR to a certain location, and the function of stopping an AMR with a gesture. These interactions are instead considered active, since they involve an active gesture from a human in order to be executed. In order to know to what extent these four interaction concepts could be helpful, they would have to be tested and evaluated at multiple worksites using the AMR systems, which has not been possible during this thesis project.
Another part of the study was to get an idea of how the performance of the interaction concepts could improve if a GPU were available in the hardware of the AMR system. This difference was assessed by testing some methods on both CPU and GPU, which unsurprisingly showed a clear improvement in frequency. However, methods run on CPU only have also shown promising results, which indicates that a GPU might not be necessary in order to add Human-Robot Interaction functionality to the AMR system.
In order to run the intended functionality on CPU hardware only, the most important trade-off is to have a model that is sufficiently robust while not being too computationally expensive to run at a sufficient frequency. In the set of HPE methods examined, there are examples at both ends of this spectrum. Based on the current hardware of the AMR system, where only a CPU is available, the methods called Lightweight OpenPose and PoseNet are the ones that have shown the best performance overall and are considered good enough to run in this kind of real-time application.
A Interviews
A.1 Interview 1
Following are the main points from an interview with a user experience (UX) designer who works with the usability and design of an AMR system. The designer has no practical experience of working with the application of AMRs, but is experienced in conducting user tests and in the typical input users give that developers generally do not think about.
• Users do not want to use new functions if they are too difficult to use; they want them to be as easy as possible.
• Extended functionality needs to be reliable, otherwise it will not be used.
• A robot should not be “annoying”; if it is coming up from behind it should not signal the user to move. Instead it should wait or take another route.
• A user wants acknowledgement from a robot that they have been seen.
• The behavior of an AMR is more intuitive if it acts more human-like.
• Intuitive vehicle behavior can be found in normal car traffic, since cars are driven by humans and are something that people are used to.
• The size and perceived threat of an AMR decide what speeds are interpreted as comfortable and suitable for a pedestrian.
A.2 Interview 2
Following are the main points from an interview with a worker at a site with 8-10
AGVs that handle pallets. The AGVs are mostly in contact with warehouse workers.
• The robots can sometimes get uncomfortably close to people.
• When they do get close, they give off an audio signal.
• Accidents have happened where AGVs have driven into manual trucks.
• The AGV does not wait for you when you leave; instead it drives off directly after you.
• It should be clearer where the AGV is going.
• The AGV should not start moving as soon as it can after a person has moved.
• One idea is that the AGV could follow the person performing a kitting task.
• If there is a problem during kitting, the person should be able to point the robot in another direction to make it leave.
• The AGV cannot stack pallets.
• A set of AGVs should be able to move like a caravan.
A.3 Interview 3
Following are the main points from an interview with a worker at a site with five AGVs and also a set of manual trucks. The facility where the AGVs operate is described as having narrow passages, such that it can sometimes be difficult for a pedestrian to pass an AGV without being very close to it.
• AGVs are sometimes hit by manual trucks.
• The AGV is often in the way; however, you learn its route.
• The speed of the AGVs has been reduced in production areas, while a higher speed is maintained in the warehouse.
• There are no reported serious accidents.
• It has been reported that an AGV has hit a pedestrian.
• The AGV uses sound and visual signaling to communicate with pedestrians.
• The sound signals are perceived as irritating by the interviewee.
• The AGV slows down when an obstacle in its way is detected.
• The presence of AGVs can sometimes give a feeling of insecurity, e.g. when they suddenly show up directly behind one’s back.
• When shown the demo of stopping the AMR by hand gesture, the response was positive. It is possible for a manual truck to get trapped by an AGV, and it is then necessary to manually turn off and move the AGV to get out. This concept could solve that issue.
• It has happened that an AGV has not detected the forks of a manual truck and has driven into them, resulting in damage.
• When shown the demo of directing an AMR by hand gesture, the interviewee was worried that the AMR would lose its positioning.
• Using an AMR should be as simple as possible if many people are using it. For example, how do you get it working again if it gets stuck?
• An AGV should be able to find a pallet itself within a certain area, and not only rely on it being placed at an exact position.