The Hanabi Challenge: A New Frontier for AI Research
From the early days of computing, games have been important testbeds for studying
how well machines can do sophisticated decision making. In recent years, machine
learning has made dramatic advances with artificial agents reaching superhuman
performance in challenge domains like Go, Atari, and some variants of poker. As with
their predecessors of chess, checkers, and backgammon, these game domains have
driven research by providing sophisticated yet well-defined challenges for artificial
intelligence practitioners. We continue this tradition by proposing the game of Hanabi
as a new challenge domain with novel problems that arise from its combination of
purely cooperative gameplay with two to five players and imperfect information. In
particular, we argue that Hanabi elevates reasoning about the beliefs and intentions of
other agents to the foreground. We believe developing novel techniques for such
theory of mind reasoning will not only be crucial for success in Hanabi, but also in
broader collaborative efforts, especially those with human partners. To facilitate
future research, we introduce the open-source Hanabi Learning Environment, propose
an experimental framework for the research community to evaluate algorithmic
advances, and assess the performance of current state-of-the-art techniques.
Keywords: Multi-agent learning; Challenge paper; Reinforcement learning; Games; Theory of mind; Communication; Imperfect information; Cooperative
1. Introduction
Throughout human societies, people engage in a wide range of activities with a
diversity of other people. These multi-agent interactions are integral to everything
from mundane daily tasks, like commuting to work, to operating the organisations that
underpin modern life, such as governments and economic markets. With such
complex multi-agent interactions playing a pivotal role in human lives, it is desirable
for artificially intelligent agents to also be capable of cooperating effectively with
other agents, particularly humans.
While such complex interactions make inferring the behaviour of others a daunting challenge for AI practitioners, humans routinely make such inferences in their social interactions using theory of mind [1], [2]: reasoning about others as agents with their own mental states – such as perspectives, beliefs, and intentions – to explain and predict their behaviour. Alternatively, one can think of theory of mind as the human ability to
imagine the world from another person's point of view. For example, a simple real-
world use of theory of mind can be observed when a pedestrian crosses a busy street.
Once some traffic has stopped, a driver approaching the stopped cars may not be able
to directly observe the pedestrian. However, they can reason about why the other
drivers have stopped, and infer that a pedestrian is crossing.
In this work, we examine the popular card game Hanabi, and argue for it as a new
research frontier that, at its very core, presents the kind of multi-agent challenges
where humans employ theory of mind. Hanabi won the prestigious Spiel des
Jahres award in 2013 and enjoys an active community, including a number of sites
that allow for online gameplay [4], [5]. Hanabi is a cooperative game of imperfect
information for two to five players, best described as a type of team solitaire. The
game's imperfect information arises from each player being unable to see their own
cards (i.e. the ones they hold and can act on), each of which has a colour and rank. To
succeed, players must coordinate to efficiently reveal information to their teammates; however, players can only communicate through grounded hint actions that point out all of a player's cards of a chosen rank or colour. Importantly, performing a hint action
consumes the limited resource of information tokens, making it impossible to fully
resolve each player's uncertainty about the cards they hold based on this grounded
information alone. For AI practitioners, this restricted communication structure also
prevents the use of “cheap talk” communication channels explored in previous multi-
agent research [6], [7], [8]. Successful play involves communicating extra information
implicitly through the choice of actions themselves, which are observable by all
players.
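To make these mechanics concrete, the toy Python sketch below (our illustration, with hypothetical Card and give_hint names; it is not code from the paper or from the Hanabi Learning Environment) shows how a grounded hint points out every matching card in a teammate's hand and consumes one of the team's limited information tokens.

```python
from dataclasses import dataclass

@dataclass
class Card:
    colour: str  # one of "R", "Y", "G", "W", "B"
    rank: int    # 1..5

def give_hint(hand, attribute, value, information_tokens):
    """Reveal every card in a teammate's hand matching the hinted attribute.

    A hint is grounded: it must name a colour or rank actually present in the
    hand, it points out all matching positions, and it costs one of the team's
    limited information tokens.
    """
    if information_tokens == 0:
        raise ValueError("no information tokens left; must play or discard")
    matches = [i for i, card in enumerate(hand)
               if getattr(card, attribute) == value]
    if not matches:
        raise ValueError("hints must refer to at least one card in the hand")
    return matches, information_tokens - 1

# Example: hinting "rank 1" to a teammate holding these cards reveals
# positions 0 and 3 and spends one token.
hand = [Card("R", 1), Card("G", 3), Card("B", 5), Card("W", 1)]
positions, tokens_left = give_hint(hand, "rank", 1, information_tokens=8)
print(positions, tokens_left)  # [0, 3] 7
```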
Hanabi is different from the adversarial two-player zero-sum games where computers have reached superhuman skill, e.g., chess [9], checkers [10], Go [11], backgammon [12] and two-player poker [13], [14]. In those games, agents typically
compute an equilibrium policy (or equivalently, a strategy) such that no single player
can improve their utility by deviating from the equilibrium. While two-player zero-
sum games can have multiple equilibria, different equilibria are interchangeable: each
player can play their part of different equilibrium profiles without impacting their
utility. As a result, agents can achieve a meaningful worst-case performance guarantee
in these domains by finding any equilibrium policy. However, since Hanabi is neither
(exclusively) two-player nor zero-sum, the value of an agent's policy depends
critically on the policies used by its teammates. Even if all players manage to play
according to the same equilibrium, there can be multiple locally optimal equilibria
that are relatively inferior. For algorithms that iteratively train independent agents,
such as those commonly used in the multi-agent reinforcement learning literature,
these inferior equilibria can be particularly difficult to escape and so even learning a
good policy for all players is challenging.
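The following toy common-payoff matrix game (our illustration, not an example from the paper) makes the non-interchangeability point concrete: both joint actions on the diagonal are equilibria, but they have different values, and playing one player's part of the good equilibrium against the other player's part of the inferior one scores zero.

```python
import numpy as np

# Toy common-payoff matrix game: both players choose convention A (action 0)
# or convention B (action 1) and receive the same payoff.
payoff = np.array([[10.0, 0.0],   # rows: player 1's action
                   [ 0.0, 1.0]])  # columns: player 2's action

def is_equilibrium(a1, a2):
    """No player can improve the common payoff by deviating unilaterally."""
    best_for_1 = payoff[:, a2].max()
    best_for_2 = payoff[a1, :].max()
    return payoff[a1, a2] >= best_for_1 and payoff[a1, a2] >= best_for_2

print(is_equilibrium(0, 0), payoff[0, 0])  # True 10.0  (good equilibrium)
print(is_equilibrium(1, 1), payoff[1, 1])  # True 1.0   (inferior equilibrium)
print(payoff[0, 1])                        # 0.0: parts of different equilibria
                                           # are not interchangeable
```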
The presence of imperfect information in Hanabi creates another challenging
dimension of complexity for AI algorithms. As has been observed in domains like
poker, imperfect information entangles how an agent should behave across multiple
observed states [17], [18]. In Hanabi, we observe this when thinking of the policy as
a communication protocol between players, where the efficacy of any given protocol
depends on the entire scheme rather than how players communicate in a particular
observed situation. That is, how the other players will respond to a chosen signal will
depend upon what other situations use the same signal. Due to this entanglement, the
type of single-action exploration techniques common in reinforcement learning
(e.g., ϵ-greedy, entropy regularisation) can incorrectly evaluate the utility of such
exploration steps as they ignore their holistic impact.
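A toy signalling game (again our illustration, not from the paper) shows this entanglement: against a fixed partner, changing the signal sent in a single situation can only look harmful, even though swapping the entire protocol, together with a partner that adapts, is exactly as good as the original.

```python
from itertools import product

# Toy signalling game: a sender observes a card in {0, 1}, sends a signal in
# {0, 1}, and the receiver guesses the card from the signal; reward is 1 for
# a correct guess, averaged over a uniformly random card.
cards, signals = [0, 1], [0, 1]

def expected_reward(sender, receiver):
    """Average reward of a (sender, receiver) protocol."""
    return sum(1.0 for c in cards if receiver[sender[c]] == c) / len(cards)

receiver = {0: 0, 1: 1}   # fixed partner: signal i means card i
sender = {0: 0, 1: 1}     # current protocol: perfectly coordinated
print(expected_reward(sender, receiver))    # 1.0

# A single-action exploration step: change the signal sent for card 0 only.
explored = {**sender, 0: 1}
print(expected_reward(explored, receiver))  # 0.5, so it looks strictly worse

# Yet the fully swapped protocol, evaluated with a partner that adapts, is
# just as good: the value of one signal depends on the whole scheme.
swapped_sender = {0: 1, 1: 0}
best = max(expected_reward(swapped_sender, dict(zip(signals, guesses)))
           for guesses in product(cards, repeat=2))
print(best)                                 # 1.0
```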
Humans appear to approach Hanabi differently from most multi-agent reinforcement learning approaches. Even beginners with no experience will start
signalling playable cards, reasoning that their teammates' perspective precludes them
from knowing this on their own. Furthermore, beginners will confidently play cards
that are only partially identified as playable, recognising that the intent in the partial
identification is sufficient to fully signal its playability. This all happens on the first
game, suggesting players are considering the perspectives, beliefs, and intentions of
the other players (and expecting the other players are doing the same thing about
them). While hard to quantify, it would seem that theory of mind is a central feature in
how the game is first learned. We can see further evidence of theory of mind in the
descriptions of advanced conventions used by experienced players. The descriptions themselves often include the rationale behind each “agreement”, explicitly including reasoning about other players' beliefs and intentions.
C should assume that D is going to play their yellow card. C must do something, and
so they ask themselves: “Why did B give that clue?”. The only reason is that C can
actually make that card playable. [19]
Such conventions then enable further reasoning about other players' beliefs and
intentions. For example, the statement that “C should assume that D is going to play
their yellow card”, is itself the result of reasoning that partial identification of a
playable card is sufficient to identify it as playable.
From human play we can also see that the goal itself is multi-faceted. One challenge is
to learn a policy for the entire team that has high utility. Most of the prior AI research
on Hanabi has focused on this challenge, which we refer to as the self-play setting.
Human players will often strive toward this goal, pre-coordinating their behaviour
either explicitly using written guides or implicitly through many games of experience
with the same players. As one such guide states, though, “Hanabi is very complicated, so it is impossible to write a guide on how to best solve each individual situation” [20]. Even if such a guide existed, it would be impractical for human Hanabi players to memorise nuanced policies or expect others to do the same. However,
humans also routinely play with ad-hoc teams that may have players of different skill
levels and little or no pre-coordination amongst everyone on the team. Even without
agreeing on a complete policy or a set of conventions, humans are still able to achieve
a high degree of success. It appears that human efforts in both goals are aided by
theory of mind reasoning, and AI agents with similar capabilities — playing well in
both pre-coordinated self-play and in uncoordinated ad-hoc teams — would signal a
useful advance for the field.
The combination of cooperation, imperfect information, and limited communication
make Hanabi an ideal challenge domain for learning in both the self-play and ad-hoc
team settings. In Section 2 we describe the details of the game and how humans
approach it. In Section 3 we present the Hanabi Learning Environment open source
code framework (Section 3.1) and guidelines for evaluating both the self-play
(Section 3.2) and ad-hoc team (Section 3.3) settings. We evaluate the performance of
current state-of-the-art reinforcement learning methods in Section 4. Our results show
that although these learning techniques can achieve reasonable performance in self-
play, they generally fall short of the best known hand-coded agents (Section 4.3).
Moreover, we show that these techniques tend to learn extremely brittle policies that
are unreliable for ad-hoc teams (Section 4.4). These results suggest that there is still
substantial room for technical advancements in both the self-play and ad-hoc settings,
especially as the number of players increases. Finally, we highlight connections to
prior work in Section 5.
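As a preview of the framework introduced in Section 3, the snippet below sketches a minimal self-play loop that takes uniformly random legal actions. It assumes the Python rl_env interface distributed with the open-source Hanabi Learning Environment; the field and move names are taken from that interface but may differ slightly between versions.

```python
import random
from hanabi_learning_environment import rl_env

# A minimal random-agent self-play loop (a sketch against the rl_env
# interface shipped with the Hanabi Learning Environment).
env = rl_env.make(environment_name='Hanabi-Full', num_players=2)
observations = env.reset()
done = False
episode_return = 0

while not done:
    current = observations['current_player']
    obs = observations['player_observations'][current]
    # legal_moves holds move dicts such as PLAY, DISCARD, REVEAL_COLOR,
    # and REVEAL_RANK; a random one is always a legal action.
    action = random.choice(obs['legal_moves'])
    observations, reward, done, _ = env.step(action)
    episode_return += reward

print('episode return:', episode_return)
```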
2.1. Basic strategy
There are too few information tokens to provide complete information (i.e., the rank and colour) for each of the 25 cards that can be played using only the grounded information revealed by hints. While the quantity of information provided by a hint
can be improved by revealing information about multiple cards at once, the value of
information in Hanabi is very context dependent. To maximise the team's score at the
end of the game, hints need to be selected based on more than just the quantity of
information conveyed. For example in Fig. 1, telling Player 3 that they hold four blue
cards reveals more information than telling Player 2 that they hold a single rank-
1 card, but lower-ranked cards are more important early on, as they can be played
immediately. A typical game therefore begins by hinting to players which cards
are 1s, after which those players play those cards; this both “unlocks” the ability to
play the same-colour 2s and makes the remaining 1s of that colour useful for
recovering information tokens as players can discard the redundant cards.
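The shortage of grounded information can be seen with a back-of-the-envelope count. The sketch below is our own rough illustration for a two-player game under the standard rules (a 50-card deck, 8 information tokens, one token regained per discard and per completed colour stack); it ignores that discards also spend turns and that the endgame cuts play short, so it overstates the hints actually available.

```python
# Rough upper bound on hints available versus attribute reveals needed for
# fully grounded identification of every played card (two-player game).
deck_size = 50            # 5 colours x (three 1s, two 2s, two 3s, two 4s, one 5)
hand_cards = 2 * 5        # two players holding five cards each
cards_for_max_score = 25  # one card of each colour and rank must be played

initial_tokens = 8
max_discards = deck_size - cards_for_max_score - hand_cards  # each regains a token
completed_stacks = 5      # playing each 5 regains a token
max_hints = initial_tokens + max_discards + completed_stacks
print("hints available (upper bound):", max_hints)            # 28

# Naming both the colour and the rank of every played card would take two
# attribute reveals per card, even before considering that early hints
# rarely pin down several cards at once.
attribute_reveals_needed = 2 * cards_for_max_score
print("attribute reveals needed:", attribute_reveals_needed)  # 50
```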
Players are incentivised to avoid unsuccessful plays in two ways: first, losing all three
lives results in the game immediately ending with zero points; second, the card itself
is discarded. Generally speaking, discarding all cards of a given rank and colour is a
bad outcome, as it reduces the maximum achievable score. For example, in Fig.
1 both green 2s have been discarded, an effective loss of four points as no higher rank
green cards will ever be playable. As a result, hinting to players that are at risk of
discarding the only remaining card of a given rank and colour is often prioritised. This
is particularly common for rank-5 cards since there is only one of each colour and
they often need to be held for a long time before the card can successfully be played.
2.2. Implicit communication
While explicit communication in Hanabi is limited to the hint actions, every action
taken in Hanabi is observed by all players and can also implicitly communicate
information. This implicit information is not conveyed through the impact that an
action has on the environment (i.e., what happens) but through the very fact that a
player decided to take this action (i.e., why it happened). This requires that players
can reason over the actions that another player would have taken in a number of
different situations, essentially reasoning over the intent of the agent. Human players
often exploit such reasoning to convey more information through their actions.
Consider the situation in Fig. 1 and assume the active player (Player 0) knows nothing
about their own cards, and so they choose to hint to another player. One option would
be to tell Player 1 about the 1s in their hand. However, that information is not
particularly actionable, as the yellow 1 is not currently playable. Instead, they could
tell Player 1 about the red card, which is a 1. Although Player 1 would not explicitly
know the card is a 1, and therefore playable, they could infer that it is playable as
there would be little reason to tell them about it otherwise, especially when Player 2
has a blue 1 that would be beneficial to hint. They may also infer that, because Player 0 chose to hint with the colour rather than the rank, one of their other cards is a non-playable 1.
An even more effective, though also more sophisticated, tactic commonly employed
by humans is the so-called “finesse” move. To perform the finesse in this situation,
Player 0 would tell Player 2 that they have a 2. By the same pragmatic reasoning as
above, Player 2 could falsely infer that their red 2 is the playable white 2 (since
both green 2s were already discarded). Player 1 can see Player 2's red 2 and realise
that Player 2 will make this incorrect inference and mistakenly play the card, leading
Player 1 to question why Player 0 would have chosen this seemingly irrational hint.
Even without established conventions, players could reason about this hint assuming
others are intending to communicate useful information. Consequently, the only
rational explanation for the choice is that Player 1 themselves must hold the red 1 (in
a predictable position, such as the most recently drawn card) and is expected to rescue
the play. Using this tactic, Player 0 can reveal enough information to get two cards
played using only a single information token. There are many other moves that rely on
this kind of reasoning about intent to convey useful information (e.g., bluff, reverse
finesse) [19], [20]. We will use finesse to broadly refer to this style of move.